Extracting Substring From A Pandas Column And Creating A New Column

I have a column in my dataframe that contains clinical trial identifiers (NCT ids). Each id starts with \nNCT and ends with \n.

Example:

  Old column
0 209629\nCTR20191933\nNCT04136145\nTrialTroveID...
1 54767414ALZ001\nDARZAD\nNCT04070378\nTrialTrov...
2 D5495C00005\nNCT04024501\nTrialTroveID-353576

etc

I want to extract only the NCT numbers and create a new column in the dataframe with them.

Expected output:

  Old column                                          New column
0 209629\nCTR20191933\nNCT04136145\nTrialTroveID...   NCT04136145
1 54767414ALZ001\nDARZAD\nNCT04070378\nTrialTrov...   NCT04070378
2 D5495C00005\nNCT04024501\nTrialTroveID-353576       NCT04024501

Answer

Use str.extract:

df['New column'] = df['Old column'].str.extract(r'(NCT\d+)')
print(df)

# Output
                                          Old column   New column
0  209629\nCTR20191933\nNCT04136145\nTrialTroveID...  NCT04136145
1  54767414ALZ001\nDARZAD\nNCT04070378\nTrialTrov...  NCT04070378
2      D5495C00005\nNCT04024501\nTrialTroveID-353576  NCT04024501

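Note: the regex matches the literal string 'NCT' followed by one or more digits. The parentheses form the capture group that str.extract returns, and only the first match in each row is kept.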
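
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The first row is copied from the question; the other two rows are made up for illustration (one with no NCT id, one with two), and the column name 'All NCT ids' is likewise an assumption, not part of the question. It shows what str.extract does when a row has no match, and how str.findall could be used if a row may hold more than one id.

```python
import pandas as pd

df = pd.DataFrame({
    "Old column": [
        "D5495C00005\nNCT04024501\nTrialTroveID-353576",  # row taken from the question
        "ABC123\nTrialTroveID-000000",                    # hypothetical row: no NCT id
        "NCT00000001\nXYZ\nNCT00000002",                  # hypothetical row: two NCT ids
    ]
})

# expand=False makes str.extract return a Series instead of a one-column DataFrame.
# Rows without a match get NaN; rows with several matches keep only the first.
df["New column"] = df["Old column"].str.extract(r"(NCT\d+)", expand=False)

# str.findall returns a list with every match per row (an empty list if none).
df["All NCT ids"] = df["Old column"].str.findall(r"NCT\d+")

print(df[["New column", "All NCT ids"]])
```

If rows without an id should not end up as NaN, something like .fillna('') on the new column is one option, depending on what the downstream code expects.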

source: stackoverflow.com