How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

We can use zip to build a dict from the list of words and the list of their counts:

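import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)  # sparse document-term matrix
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word_list = cv.get_feature_names_out().tolist()
count_list = cv_fit.toarray().sum(axis=0)  # total count per word across all documents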

The output is as follows:

>>> print(word_list)
['bird', 'cat', 'dog', 'fish']
>>> print(count_list)
[2 3 2 2]
>>> print(dict(zip(word_list, count_list.tolist())))
{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}

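cv_fit.toarray().sum(axis=0) gives the correct result, but it first builds the full dense matrix. It is much faster, and uses far less memory on a large corpus, to sum the sparse matrix directly and convert only the result to an array. The sparse sum has shape (1, n_features), so flatten it with ravel():

np.asarray(cv_fit.sum(axis=0)).ravel()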

cv.vocabulary_ in this instance is a dict whose keys are the words (features) found in the corpus and whose values are their column indices, which is why they're 0, 1, 2, 3. It's just bad luck that they looked similar to your counts :)
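
To make the difference concrete, here is a small sketch (the names vocab and cat_count are just for illustration) that prints the index mapping and then uses one index to look up the real count for a word:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# vocabulary_ maps each word to its column index, not to a count.
vocab = {word: int(idx) for word, idx in cv.vocabulary_.items()}
print(sorted(vocab.items()))
# [('bird', 0), ('cat', 1), ('dog', 2), ('fish', 3)]

# Use the index to select that word's column, then sum it for the count.
cat_count = int(cv_fit[:, vocab["cat"]].sum())
print(cat_count)
# 3
```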

You need to work with the cv_fit object to get the counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())
print(cv_fit.toarray())
# ['bird' 'cat' 'dog' 'fish']
# [[0 1 1 1]
#  [0 2 1 0]
#  [1 0 0 1]
#  [1 0 0 0]]

Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. If you sum each column, you get the total count for each word:

print(cv_fit.toarray().sum(axis=0))
# [2 3 2 2]

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
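
For comparison, a minimal collections.Counter sketch on the same texts might look like this. The whitespace tokenization is an assumption made for illustration, not what CountVectorizer does internally.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Naive tokenization: lowercase each document and split on whitespace.
word_counts = Counter(word for text in texts for word in text.lower().split())

print(word_counts.most_common())
# [('cat', 3), ('dog', 2), ('fish', 2), ('bird', 2)]
```

Note that CountVectorizer's default settings also lowercase the text but drop single-character tokens and punctuation, so the two approaches can disagree on real-world text.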
