How Do I Get The Number Of Occurrences Of A List Of Words (substrings) In A Pandas Dataframe?
I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words (which are all known) in a certain column. This works for a single word:
d = df["Content"].str.contains("word").value_counts()
But I want to count the occurrences of multiple known words, such as "word1" and "word2", taken from a list. A word can also have variant spellings (word2 may appear as word2 or wordtwo) that should be counted together, like so:
word1 40
word2/wordtwo 120
How do I accomplish this?
Answer
IMO one of the most efficient approaches is to use sklearn.feature_extraction.text.CountVectorizer and pass it a vocabulary (the list of words that you want to count).
Demo:
In [19]: import numpy as np
In [20]: import pandas as pd
In [21]: text = """
...: I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words in a certain column. This works for a single word. But I want to find out the occurrences of multiple, known words like "word1", "word2" from a list. Also word2 could be word2 or wordtwo, like so"""
In [22]: df = pd.DataFrame(text.split('. '), columns=['Content'])
In [23]: df
Out[23]:
Content
0 \nI have a pandas data frame with approximatel...
1 I want to find the number of occurrences of sp...
2 This works for a single word
3 But I want to find out the occurrences of mult...
4 Also word2 could be word2 or wordtwo, like so
In [24]: from sklearn.feature_extraction.text import CountVectorizer
In [25]: vocab = ['word', 'words', 'word1', 'word2', 'wordtwo']
In [26]: vect = CountVectorizer(vocabulary=vocab)
In [27]: res = pd.Series(np.ravel((vect.fit_transform(df['Content']).sum(axis=0))),
...: index=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
In [28]: res
Out[28]:
word 1
words 2
word1 1
word2 3
wordtwo 1
dtype: int64
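The demo above counts each vocabulary entry separately, so word2 and wordtwo end up on different rows. One possible way to get a combined "word2/wordtwo" total is sketched below using only pandas and a regular expression per group. The sample frame, the `groups` mapping, and the `count_groups` helper are hypothetical names made up for illustration, and matching is assumed to be case-insensitive and on whole words.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the real 1.5M-row frame.
df = pd.DataFrame({"Content": [
    "word1 appears here, and word2 as well",
    "wordtwo is the same thing as word2",
    "nothing relevant in this row",
    "WORD1 again, plus wordtwo",
]})

# Each output label maps to the spellings that should be counted together.
groups = {
    "word1": ["word1"],
    "word2/wordtwo": ["word2", "wordtwo"],
}


def count_groups(series, groups):
    """Return the total number of whole-word matches per label, ignoring case."""
    totals = {}
    for label, variants in groups.items():
        # \b keeps "word1" from matching inside a longer token such as "word10".
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        per_row = series.str.count(pattern, flags=re.IGNORECASE)
        totals[label] = int(per_row.sum())
    return pd.Series(totals)


counts = count_groups(df["Content"], groups)
print(counts)
```

Unlike `str.contains(...).value_counts()`, which reports how many rows contain a match, `str.count` adds up every occurrence within each row. If true substring matching is wanted rather than whole words, dropping the `\b` anchors from the pattern should do it.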
source: stackoverflow.com