
Python: Splitting a Sentence Based on Several Tokens


I want to split a sentence based on several keywords:

import re
p = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
re.split(p, '10-methyl-Hexadecanoic acid of total fatty acids')

This outputs:

['10-methyl-Hexadecanoic acid', 'of', 'total fatty acids']

Expected output: ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']

I am not sure why the regular expression does not split on the token 'total'.


Answer

You may use

import re
p = r'(?<!\S)(standard|of|total|sum)(?!\S)'
s = '10-methyl-Hexadecanoic acid of total fatty acids'
print([x.strip() for x in re.split(p,s) if x.strip()])
# => ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
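
If the keyword list changes often, the same approach can be wrapped in a small helper. This is only a sketch: the function name split_on_keywords and its parameters are made up for illustration, and re.escape is added on the assumption that a keyword might contain regex metacharacters.

```python
import re

def split_on_keywords(text, keywords):
    """Split text on whole-word keywords, keeping the keywords in the result."""
    # Escape each keyword so characters like '.' or '+' are matched literally.
    alternation = '|'.join(re.escape(k) for k in keywords)
    # Same lookarounds as in the answer: no non-whitespace on either side.
    pattern = rf'(?<!\S)({alternation})(?!\S)'
    # Drop blank pieces and trim whitespace from the remaining ones.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

result = split_on_keywords(
    '10-methyl-Hexadecanoic acid of total fatty acids',
    ['standard', 'of', 'total', 'sum'],
)
print(result)
# => ['10-methyl-Hexadecanoic acid', 'of', 'total', 'fatty acids']
```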


Details

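  • (?<!\S)(standard|of|total|sum)(?!\S) matches one of the listed words and captures it into Group 1, but only when the word is surrounded by whitespace or sits at the start/end of the string.
  • The original pattern fails on 'total' because (?:\s|$) consumes the space after 'of', so (?:^|\s) has no whitespace left to match before 'total'. The lookarounds (?<!\S) and (?!\S) are zero-width: they check the neighboring characters without consuming them, so adjacent keywords can both match.
  • The list comprehension drops blank items (if x.strip()) and trims whitespace from each remaining item (x.strip()).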
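
To see the difference between the two patterns, the following sketch lists which keywords each one finds in the sample string. The helper name matched_words is hypothetical and used only for this illustration; the patterns and the string are the ones from the question and the answer.

```python
import re

s = '10-methyl-Hexadecanoic acid of total fatty acids'

# Pattern from the question: the surrounding whitespace is part of the match.
consuming = r'(?:^|\s)(standard|of|total|sum)(?:\s|$)'
# Pattern from the answer: lookarounds only inspect the neighbors.
lookaround = r'(?<!\S)(standard|of|total|sum)(?!\S)'

def matched_words(pattern, text):
    """Return the keywords captured in Group 1, in order of appearance."""
    return [m.group(1) for m in re.finditer(pattern, text)]

consuming_hits = matched_words(consuming, s)
lookaround_hits = matched_words(lookaround, s)

print(consuming_hits)   # => ['of']
print(lookaround_hits)  # => ['of', 'total']
```

With the consuming pattern, the match for 'of' also swallows the space that follows it, so the search resumes directly at 'total' with no leading whitespace available.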
source: stackoverflow.com