Splitting A Unique String - Python

I'm trying to find the best way to parse this type of string:

Operating Status: NOT AUTHORIZED Out of Service Date: None

I need the output to be like this:

['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']

Is there an easy way of doing this? I am parsing hundreds of strings like this. The text itself isn't fixed, but it's always in the above format.

Other string examples:

MC/MX/FF Number(s): None  DUNS Number: -- 
Power Units: 1  Drivers: 1 

Expected Output:

['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']

Answer

There are two ways. Both are super kludgy and depend on the original strings varying very little. However, you can modify the code to offer a little more flexibility.

Both of the options depend on the line meeting these characteristics... The grouping in question must...

  1. Start with a letter or slash, probably capitalized
  2. Have a title of interest that is followed by a colon (":")
  3. Have a one-word value, because the code grabs ONLY the first word after the colon.

Method 1: regex. This can only grab TWO blocks of data. The second group is "everything else" because I can't get the search pattern to repeat properly :P

code:

import re

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

pattern = ''.join([
    r"(",          # Start capturing group 1
    r"\s*[A-Z/]",  # Optional whitespace, then the first capital letter or forward slash
    r".+?:",       # Any characters (non-greedy) up to and including the colon
    r"\s*",        # Zero or more whitespace characters
    r"\w+\s*",     # One or more word characters, i.e. [a-zA-Z0-9_], plus trailing whitespace
    r")",          # End capturing group 1
    r"(.*)",       # Group 2: everything else
])

for s in l:
    m = re.search(pattern, s)
    print("----------------")
    if m:  # There are only two groups, so there is no group(3) to print
        print(m.group(1))
        print(m.group(2))

Output:

----------------
MC/MX/FF Number(s): None 
DUNS Number: -- 
----------------
Power Units: 1 
Drivers: 1 
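
If the only goal is to get the Method 1 search to repeat, `re.findall` with a simpler pattern may be enough. This is a minimal sketch rather than a drop-in fix: the pattern and the `split_pairs` name are assumptions of mine, not part of the answer above, and it keeps the same limitation (only the first word after the colon is captured, so a value like "NOT AUTHORIZED" gets cut after "NOT").

```python
import re

# Assumed pattern (not from the original answer):
#   [A-Z/]   a capital letter or slash starts the header
#   [^:]*:   everything up to and including the next colon
#   \s*\S+   optional whitespace, then ONE run of non-space characters as the value
PAIR = re.compile(r"[A-Z/][^:]*:\s*\S+")

def split_pairs(s):
    """Return every 'Header: value' block found in s, in order."""
    return PAIR.findall(s)

for s in ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']:
    print(split_pairs(s))
```

On the two sample strings this prints `['MC/MX/FF Number(s): None', 'DUNS Number: --']` and `['Power Units: 1', 'Drivers: 1']`. Unlike `\w+`, `\S+` also accepts a value such as `--`.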

Method 2: parse the string word by word. This method has the same basic characteristics as the regex, but can do more than two blocks of interest. It works like this...

  1. Start parsing each string word by word, loading each word into newstring.
  2. When it hits a word containing a colon, set a flag.
  3. Add the first word from the next loop to newstring. You could change this to the first 2, first 3, or first n words if you wanted. You could also just have it keep adding words after colonflag is set until some criterion is met, like a word with a capital... although that could break on words like "None." You could go until it meets a word that is ALL capitals, but then a not-all-capital header would break it.
  4. Add newstring to newlist, reset the flag, and keep parsing words.

code:

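l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

for s in l:
    newlist = []       # Completed "Header: value" blocks for this string
    newstring = ""     # Block currently being built
    colonflag = False  # True when the previous word contained a colon
    for w in s.split():
        newstring += " " + w
        if colonflag:
            # w is the first word after the colon, so the block is complete
            newlist.append(newstring.strip())
            newstring = ""
            colonflag = False

        if ":" in w:
            colonflag = True
    print(newlist)

Output:

['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']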
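
The "first n words" variation mentioned in step 3 could look roughly like the sketch below. The function name `split_blocks` and the `n_value_words` parameter are made up for illustration. A fixed word count is still fragile: it only works if every value in a string has the same number of words.

```python
def split_blocks(s, n_value_words=1):
    """Word-by-word split that keeps n_value_words words after each colon."""
    blocks = []
    current = []   # Words of the block being built
    remaining = 0  # Value words still to collect for the current block
    for w in s.split():
        current.append(w)
        if remaining:
            remaining -= 1
            if remaining == 0:
                blocks.append(" ".join(current))
                current = []
        elif ":" in w:
            # Header finished; start counting value words
            remaining = n_value_words
    if current:
        # Unlike the loop above, keep whatever is left over at the end
        blocks.append(" ".join(current))
    return blocks

print(split_blocks('MC/MX/FF Number(s): None DUNS Number: -- '))
print(split_blocks('Operating Status: NOT AUTHORIZED Out of Service Date: None', 2))
```

With `n_value_words=2` the "Operating Status" string splits as the question asks, but the same setting would mangle 'Power Units: 1 Drivers: 1 ', where the values are one word each.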

Third option: Create a list of all the expected headers, like header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:", ] and have it split/parse based on those.
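
A rough sketch of that header-list approach, assuming the headers are all known in advance. The `split_on_headers` helper is hypothetical, and `header_list` only contains the headers from the examples in the question. Because it cuts at the start of each known header, multi-word values such as "NOT AUTHORIZED" survive intact.

```python
import re

# Assumption: every header that can appear is listed here
header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):",
               "DUNS Number:", "Power Units:", "Drivers:"]

# Escape each header (they contain regex metacharacters such as parentheses)
# and try longer headers first so a shorter one cannot pre-empt a longer one
header_pattern = re.compile(
    "|".join(re.escape(h) for h in sorted(header_list, key=len, reverse=True))
)

def split_on_headers(s):
    """Cut s at the start of every known header and return the pieces."""
    matches = list(header_pattern.finditer(s))
    blocks = []
    for i, m in enumerate(matches):
        # A block runs from this header to the next header, or to the end of s
        end = matches[i + 1].start() if i + 1 < len(matches) else len(s)
        blocks.append(s[m.start():end].strip())
    return blocks

print(split_on_headers('Operating Status: NOT AUTHORIZED Out of Service Date: None'))
print(split_on_headers('MC/MX/FF Number(s): None  DUNS Number: -- '))
```

The first call prints `['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']`. Any text before the first known header is ignored, and a header missing from the list simply stays attached to the previous block.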

Fourth option

Use Natural Language Processing and Machine Learning to actually figure out where the logical sentences are ;)

source: stackoverflow.com