
Remove Last Occurrence Of One Of Many Possible Substrings From A String In R


This is a problem I would like to solve using R.

Suppose I have the following representative artificial example:

messy_addresses = c("12 Vancouver StreetVancouver",
                    "3 Victoria StreetVancouver",
                    "45 Vancouver StreetVictoria",
                    "678 New York AvenueNew York")

locations = c("Vancouver", "Victoria", "New York")

I would like to remove the last occurrence of any one of the entries in the 'locations' vector from the 'messy_addresses' vector.

The desired output should look like the following:

# Desired output
[1] "12 Vancouver Street" "3 Victoria Street" "45 Vancouver Street" "678 New York Avenue" 

I can do it manually for small data sets:

# ok for small data sets, not good for large ones:
library(stringi)
stri_replace_last_fixed(messy_addresses, "Vancouver", '')
stri_replace_last_fixed(messy_addresses, "Victoria", '')

Is there a nice way to do this for a large vector with many candidate substrings, possibly even using regular expressions?


Answer

You can collapse the locations vector into a single |-separated regex and pass it to stri_replace_last_regex.

stringi::stri_replace_last_regex(messy_addresses, paste0(locations, collapse = '|'), '')
[1] "12 Vancouver Street" "3 Victoria Street"   "45 Vancouver Street" "678 New York Avenue"
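
If the real list of locations is large, two things may be worth guarding against: names that contain regex metacharacters (a "." in "St. John's", for example) and names that are prefixes of other names. The following is only a sketch of one way to handle both. The helper name remove_last_location is made up for illustration, and the \Q...\E quoting assumes that no location itself contains a literal \E.

```r
library(stringi)

messy_addresses <- c("12 Vancouver StreetVancouver",
                     "3 Victoria StreetVancouver",
                     "45 Vancouver StreetVictoria",
                     "678 New York AvenueNew York")

locations <- c("Vancouver", "Victoria", "New York")

# Hypothetical helper (not part of stringi): builds one alternation
# pattern from the candidates and removes the last match in each string.
remove_last_location <- function(x, candidates) {
  # Put longer candidates first so that, at a given position, a longer
  # name is tried before a shorter name that is its prefix.
  candidates <- candidates[order(nchar(candidates), decreasing = TRUE)]
  # Wrap each candidate in \Q...\E so it is matched literally
  # (assumes no candidate contains the two characters \E).
  pattern <- paste0("\\Q", candidates, "\\E", collapse = "|")
  stri_replace_last_regex(x, pattern, "")
}

cleaned <- remove_last_location(messy_addresses, locations)
cleaned
```

For the example data this should give the same result as the one-liner above. The sorting step only matters when one location is a prefix of another, such as "Vancouver" and "Vancouver Island".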
source: stackoverflow.com