Splitting A Unique String - Python
I'm trying to find the best way to parse this type of string:
Operating Status: NOT AUTHORIZED Out of Service Date: None
I need the output to be like this:
['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
Is there an easy way of doing this? I am parsing hundreds of strings like this. The text is not deterministic, but it's always in the above format.
Other string examples:
MC/MX/FF Number(s): None DUNS Number: -- Power Units: 1 Drivers: 1
['MC/MX/FF Number(s): None', 'DUNS Number: --'] ['Power Units: 1', 'Drivers: 1']
There are two ways. Both are super kludgy and depend heavily on the original string varying very little. However, you can modify the code to offer a little more flexibility.
Both of the options depend on the line meeting these characteristics. Each block of interest must...
- Start with a letter or slash, probably capitalized
- Have its title followed by a colon (":")
- Have a one-word value: the code grabs ONLY the first word after the colon.
Method 1, regex. This can only grab TWO blocks of data. The second group is "everything else" because I can't get the search pattern to repeat properly :P
    import re

    l = [
        'MC/MX/FF Number(s): None DUNS Number: -- ',
        'Power Units: 1 Drivers: 1 '
    ]

    pattern = ''.join([
        r"(",          # Start capturing group
        r"\s*[A-Z/]",  # Optional whitespace, then the first capital letter or forward slash
        r".+?:",       # Any characters (non-greedy) up to and including the colon
        r"\s*",        # Zero or more whitespace characters
        r"\w+\s*",     # One or more word characters, i.e. [a-zA-Z0-9_], then optional whitespace
        r")",          # End capturing group
        r"(.*)"        # Second group: everything else
    ])

    for s in l:
        m = re.search(pattern, s)
        print("----------------")
        if m:  # The pattern only has two groups, so there is no group(3)
            print(m.group(1))
            print(m.group(2))
Output:

    ----------------
    MC/MX/FF Number(s): None
    DUNS Number: --
    ----------------
    Power Units: 1
    Drivers: 1
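If the goal is just to make the pattern repeat, `re.findall` may be enough. This is a minimal sketch under the same assumptions as above (header starts with a capital letter or slash, value is a single word). The names `BLOCK` and `split_blocks` are made up for illustration, and the value is matched with `\S+` rather than `\w+` so that a value like `--` is still picked up.

```python
import re

# Hypothetical pattern: a header that starts with a capital letter or slash,
# runs up to the first colon, and is followed by exactly one non-space token.
BLOCK = re.compile(r"[A-Z/][^:]*:\s*\S+")


def split_blocks(s):
    """Return every 'Header: value' block in s (single-token values only)."""
    return BLOCK.findall(s)


samples = [
    'MC/MX/FF Number(s): None DUNS Number: -- ',
    'Power Units: 1 Drivers: 1 ',
]
for s in samples:
    print(split_blocks(s))
```

Like the original regex, this breaks on multi-word values: with 'Operating Status: NOT AUTHORIZED ...' the word AUTHORIZED would be treated as the start of the next header.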
Method 2, parsing the string word by word. This method has the same basic characteristics as the regex, but can do more than two blocks of interest. It works by...
- Start parsing each string word for word, and loading that into newstring.
- When it hits a colon, mark a flag.
- Add the first word from the next loop to newstring. You could change this to the 1-2, 1-3, or 1-n word if you wanted. You could also just have it keep adding words after colonflag is set until some criteria is met, like a word with a capital...although that could break on words like "None." You could go until a word is met that is ALL capitals, but then a not-all-capital header would break it.
- Append newstring to newlist, reset the flag, and keep parsing words.
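    # Same sample strings as in Method 1
    l = [
        'MC/MX/FF Number(s): None DUNS Number: -- ',
        'Power Units: 1 Drivers: 1 '
    ]

    for s in l:
        newlist = []
        newstring = ""
        colonflag = False
        for w in s.split():
            newstring += " " + w
            if colonflag:
                # This word is the value after a header, so the block is complete
                newlist.append(newstring)
                newstring = ""
                colonflag = False
            if ":" in w:
                colonflag = True
        print(newlist)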
Output:

    [' MC/MX/FF Number(s): None', ' DUNS Number: --']
    [' Power Units: 1', ' Drivers: 1']
Create a list of all the expected headers, like
header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:", ]
and have it split/parse based on those.
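Splitting on a known header list could look roughly like the sketch below. It assumes every header that can occur is in `header_list` and that header text never shows up inside a value. The function name `split_on_headers` is hypothetical.

```python
import re

header_list = [
    "Operating Status:",
    "Out of Service Date:",
    "MC/MX/FF Number(s):",
    "DUNS Number:",
    "Power Units:",
    "Drivers:",
]


def split_on_headers(s, headers):
    """Cut s into 'Header: value' chunks, starting a new chunk at each known header."""
    # Escape the headers (they contain regex metacharacters such as parentheses)
    # and try longer headers first so they win over shorter overlapping ones.
    ordered = sorted(headers, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(h) for h in ordered))
    starts = [m.start() for m in pattern.finditer(s)]
    ends = starts[1:] + [len(s)]
    return [s[a:b].strip() for a, b in zip(starts, ends)]


print(split_on_headers(
    "Operating Status: NOT AUTHORIZED Out of Service Date: None", header_list))
print(split_on_headers(
    "MC/MX/FF Number(s): None DUNS Number: -- Power Units: 1 Drivers: 1", header_list))
```

Unlike the two methods above, this handles multi-word values such as NOT AUTHORIZED, at the cost of having to maintain the header list by hand.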
Use Natural Language Processing and Machine Learning to actually figure out where the logical sentences are ;)