Remove Every Word After Hitting An Integer In A List Of Strings (including The Number)

The following is a subset of a list I'm having trouble with:

array(['DORFLEX 10 CP AV CH', 'CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C',
       'ADVIL MULHER 400MG AVULSO', 'SPIDUFEN MENTA 600MG C/10 SACHES L',
       'PONSTAN 500MG 8X3 CP', 'TANDRILAX 30 CP',
       'PARACETAMOL 750MG 20 CP NEOQ GEN',
       'DICLOFENACO SOD 50MG 20 CP MEDL GEN C','DORFLEX 30CP',
       'BENLYSTA 200MG/ML SOL INJ 4 SER PRE 1ML GELAD'], dtype=object)

The problem is that many product names appear repeatedly in the list with the same name but a different number after it (different sizes). For example, the first element, "DORFLEX 10 ...", also appears as "DORFLEX 15 ...". I'm trying to keep only the word "DORFLEX". Dropping everything after the first space would solve this specific case, but I have a lot of compound names like "DICLOFENACO SOD 50MG ...". That's why I want to drop everything from the first number onward (including the number), in order to reduce the number of products that are repeated but appear in different sizes.

So far I haven't found anything that gets me close to that. Any help is welcome. Thank you very much in advance.

Answer

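You can use a regex lookahead, which lazily matches everything up to the first space followed by a digit (or up to the end of the string if there is no number):

import re
re.search(r'.*?(?=( \d)|$)', some_string).group()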

Or apply this to the entire list:

[re.search(r'.*?(?=( \d)|$)', line).group() for line in lines]

Then you get all your names in one go:

['DORFLEX',
 'CLOR.CICLOBENZAPRINA',
 'ADVIL MULHER',
 'SPIDUFEN MENTA',
 'PONSTAN',
 'TANDRILAX',
 'PARACETAMOL',
 'DICLOFENACO SOD',
 'DORFLEX',
 'BENLYSTA']
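
For reference, here is a self-contained sketch of the approach above, run on the sample data from the question. The names NAME_PATTERN, strip_from_first_number and unique_names are made up for illustration and are not part of the original answer. The final deduplication step is an assumption about what the asker wants to do next. The sketch assumes single-line strings in which the number is always preceded by a space.

```python
import re

# Lazily match everything up to the first " <digit>" or the end of the string.
# Assumption: each product name is a single line with no embedded newlines.
NAME_PATTERN = re.compile(r'.*?(?= \d|$)')


def strip_from_first_number(line):
    """Return the part of `line` before the first space-then-digit sequence."""
    return NAME_PATTERN.match(line).group()


lines = [
    'DORFLEX 10 CP AV CH',
    'CLOR.CICLOBENZAPRINA 5MG 30 CP EMS GEN C',
    'ADVIL MULHER 400MG AVULSO',
    'SPIDUFEN MENTA 600MG C/10 SACHES L',
    'PONSTAN 500MG 8X3 CP',
    'TANDRILAX 30 CP',
    'PARACETAMOL 750MG 20 CP NEOQ GEN',
    'DICLOFENACO SOD 50MG 20 CP MEDL GEN C',
    'DORFLEX 30CP',
    'BENLYSTA 200MG/ML SOL INJ 4 SER PRE 1ML GELAD',
]

# One cleaned name per input line, in the original order.
names = [strip_from_first_number(line) for line in lines]

# Collapse the different sizes of the same product into one entry.
unique_names = sorted(set(names))
```

Note that a digit only cuts the name when it directly follows a space, so a name with an embedded digit (a hypothetical "B12", for instance) would be kept whole.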
source: stackoverflow.com