Python - Regex - Match Characters Between Certain Characters

- 1 answer

I have a text file and I want to match (findall/parse) all characters that sit between certain characters ([\n"text to match"\n]). The texts differ a lot from each other in structure and in the characters they contain (they can contain any possible character).

I posted this question before (sorry for the duplicate), but so far the problem couldn't be solved, so now I am trying to be more precise about it.

The text in the file is built up like this:

    test =""" 
        [
        "this is a text and its supposed to contain every possible char."
        ], 
        [
        "like *.;#]§< and many "" more."
        ], 
        [
        "plus there are even
newlines

in it."
        ]"""

My desired output is a list (for example) with each text between the separators as an element, like the following:

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']

I tried to solve it with regex. Here are the two solutions I came up with, each followed by its output:

my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']

Well, this one was close. It lists the first two elements as it is supposed to, but unfortunately not the third one, because that one has newlines in it.

my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)

['this is a text and its supposed to contain every possible char."\n        ], \n        [\n        "like *.;#]§< and many "" more."\n        ], \n        [\n        "plus there are even\nnewlines\n        \n        in it.']

Okay, this time every text is included, but the list has only one element, and the lookahead doesn't seem to work the way I thought it would.

So what's the right regex to get my desired output? Why does the second approach seem to ignore the lookahead?

Or is there an even cleaner, faster way to get what I want (BeautifulSoup or other methods)?

I am very thankful for any help and hints.

I am using Python 3.6.

Answer

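You should use the re.DOTALL flag so that . also matches newlines, together with a non-greedy .*? so that each match stops at the first closing delimiter: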

print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))

Output

['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
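
On the second question in the post: the lookahead in the second attempt does take effect, but [\s\S]* is greedy, so it runs as far as it can and only stops at the last place where the lookahead still succeeds. The sketch below is a rough illustration, assuming the file uses exactly 8 spaces of indentation as the patterns in the question imply. The sample string and the variable names are rebuilt for this example and are not taken from the real file. The last step collapses whitespace runs into single spaces to reproduce the desired output, which is an assumption about how the embedded newlines should be handled.

```python
import re

# Sample rebuilt from the question; the 8-space indentation is an assumption.
test = (
    ' \n'
    '        [\n'
    '        "this is a text and its supposed to contain every possible char."\n'
    '        ], \n'
    '        [\n'
    '        "like *.;#]§< and many "" more."\n'
    '        ], \n'
    '        [\n'
    '        "plus there are even\nnewlines\n\nin it."\n'
    '        ]'
)

# Greedy: .* runs to the LAST closing delimiter, so everything lands in one match.
greedy = re.findall(r'\[\n\s+"(.*)"\n\s+\]', test, re.DOTALL)
print(len(greedy))  # 1

# Lazy: .*? stops at the FIRST closing delimiter, giving one match per block.
lazy = re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL)
print(len(lazy))  # 3

# The lookaround pattern from the question behaves the same once it is made lazy.
lookaround = re.findall(r'(?<=\[\n {8}")[\s\S]*?(?="\n {8}\])', test)

# Optional: collapse newlines and repeated whitespace into single spaces.
cleaned = [' '.join(text.split()) for text in lazy]
print(cleaned)
```

If the embedded newlines should be kept, skip the last step and use the lazy result directly.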
source: stackoverflow.com