How to remove all occurrences of a word pattern, excluding a particular pattern, using str_remove in R

I want to go through a vector and look for a particular string pattern (e.g. 'an'). If a match is found, remove the whole word, but only if that word is not a particular string pattern (e.g. 'orange').

So far I have come up with the following. In this example, I'm looking for the pattern 'an'; if a match is found, the whole word containing it should be removed.

library(stringr)
# Create a small example vector
my_vec <- fruit[str_detect(fruit, "an")]

# remove all words that contain the pattern 'an'
str_remove(my_vec, "\\w*an\\w*" )

The output shows that most elements are removed (because they contain the pattern 'an'), but keeps the words "blood", "melon", and "purple" (which is as expected).

Next, I want to extend the str_remove call so that it does not remove the word 'orange'. All words that contain "an" should still be removed, except when the word is 'orange'. The expected output is: "blood orange", "melon", and "orange".

I believe that '!' means to exclude a particular pattern, but I have not managed to get this to work. Any tips and insights are much appreciated.

Answer

You can do that in several ways:

str_remove_all(my_vec, "\\b(?!orange\\b)\\w*an\\w*" )
str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1" )
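
Both calls use the _all variants. As a minimal sketch of why that matters (the input string here is a made-up example, not from the question), str_remove drops only the first match in each element, while str_remove_all drops every match:

```r
library(stringr)

# Hypothetical input with two words that contain "an"
x <- "banana mango"

first_only  <- str_remove(x, "\\w*an\\w*")      # removes only the first match
all_matches <- str_remove_all(x, "\\w*an\\w*")  # removes every match

first_only   # " mango"
all_matches  # " "
```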

See an R test:

library(stringr)
my_vec <- c("man,blood,melon,purple,orange.")
str_remove_all(my_vec, "\\b(?!orange\\b)\\w*an\\w*" )
# => [1] ",blood,melon,purple,orange."
str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1" )
# => [1] ",blood,melon,purple,orange."

Details:

  • \b - a word boundary
  • (?!orange\b) - a negative lookahead: the match fails if the text immediately to the right is orange as a whole word
  • \w*an\w* - zero or more word characters, then the literal an, then zero or more word characters.
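
To illustrate the pieces above, here is a small sketch on a hand-written vector (the sample values and the clean-up step are assumptions for illustration, not part of the original answer). Note that a bare ! has no special meaning in a regex; it only negates as part of the lookahead syntax (?!...). The trailing \b inside the lookahead is what restricts the exclusion to the exact word orange, so "oranges" is still removed:

```r
library(stringr)

# Hypothetical sample; "oranges" is included to show the effect of the \b after orange
x <- c("blood orange", "canary melon", "orange", "mango", "oranges")

removed <- str_remove_all(x, "\\b(?!orange\\b)\\w*an\\w*")
removed
# "blood orange" " melon" "orange" "" ""

# Optional clean-up (an addition, not in the original answer):
# trim leftover spaces and drop elements that became empty
cleaned <- str_trim(removed)
cleaned <- cleaned[cleaned != ""]
cleaned
# "blood orange" "melon" "orange"
```

If plurals or other derived forms should also be protected, the lookahead would need to be loosened accordingly, for example by dropping the \b inside it.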

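In str_replace_all(my_vec, "\\b(orange)\\b|\\w*an\\w*", "\\1"), the regex first tries to match orange as a whole word and captures it into Group 1; otherwise it matches any whole word containing an. The replacement is \1, the backreference to Group 1: when orange matched, it is put back unchanged, and when the second alternative matched, Group 1 is empty, so the word is removed.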
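
The capture-group variant can be checked the same way. This is a sketch on a hand-written vector (an assumption for illustration); on this sample it gives the same result as the lookahead version:

```r
library(stringr)

# Hypothetical sample vector
x <- c("blood orange", "canary melon", "orange", "mango")

# "orange" is captured and written back via \1; other words with "an" are replaced by an empty Group 1
kept <- str_replace_all(x, "\\b(orange)\\b|\\w*an\\w*", "\\1")
kept
# "blood orange" " melon" "orange" ""

# Compare with the lookahead approach on the same input
same <- identical(kept, str_remove_all(x, "\\b(?!orange\\b)\\w*an\\w*"))
same
# TRUE
```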

source: stackoverflow.com