How to efficiently use CountVectorizer to get ngram counts for all files in a directory combined?

You can build a solution using the following flow:

1) Loop through your files and create a set of all tokens in your files. In the example below this is done with a Counter, but you can use Python sets to achieve the same result. The bonus is that a Counter will also give you the total number of occurrences of each term.

2) Fit CountVectorizer on the set/list of tokens. You can instantiate CountVectorizer with ngram_range=(1, 4); this is avoided below in order to limit the number of features in df_new_data, but a sketch of that variant follows the example output.

3) Transform new data as usual.

The example below works on small data. I hope you can adapt the code to suit your needs.

import glob
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Create a list of file names
pattern = 'C:\\Bytes\\*.csv'
csv_files = glob.glob(pattern)

# Instantiate Counter and loop through the files chunk by chunk
# to create a dictionary of all tokens and their number of occurrences
counter = Counter()
c_size = 1000
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size, index_col=0, header=None):
        counter.update(chunk[1])

# Fit the CountVectorizer to the counter keys
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(list(counter.keys()))

# Loop through your files chunk by chunk and accumulate the counts
counts = np.zeros((1, len(vectorizer.get_feature_names_out())))
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size,
                             index_col=0, header=None):
        new_counts = vectorizer.transform(chunk[1])
        counts += new_counts.toarray().sum(axis=0)

# Generate a data frame with the total counts
df_new_data = pd.DataFrame(counts, columns=vectorizer.get_feature_names_out())

df_new_data
Out[266]: 
      00     01     0A     0B     10     11     1A     1B     A0     A1  \
0  258.0  228.0  286.0  251.0  235.0  273.0  259.0  249.0  232.0  233.0   

      AA     AB     B0     B1     BA     BB  
0  248.0  227.0  251.0  254.0  255.0  261.0  
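
As mentioned in step 2, ngram_range=(1, 4) is left out above to keep the number of features small. If you do want the n-gram counts, the only line that changes is the vectorizer instantiation; a minimal sketch, assuming character-level n-grams are what you are after for short tokens like '1A' (the analyzer='char' choice is my assumption, not part of the example above):

# Count character n-grams of length 1 to 4 inside each token
# (analyzer='char' is an assumption; drop it to stay with word-level tokens)
vectorizer = CountVectorizer(lowercase=False, analyzer='char', ngram_range=(1, 4))
vectorizer.fit(list(counter.keys()))

The rest of the loop (transform each chunk, sum the rows, accumulate) stays exactly the same.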

Code for generating the data:

import numpy as np
import pandas as pd

def gen_data(n): 
    numbers = list('01')
    letters = list('AB')
    numlet = numbers + letters
    x = np.random.choice(numlet, size=n)
    y = np.random.choice(numlet, size=n)
    df = pd.DataFrame({'X': x, 'Y': y})
    return df.sum(axis=1)

n = 2000
df_1 = gen_data(n)
df_2 = gen_data(n)

df_1.to_csv('C:\\Bytes\\df_1.csv')
df_2.to_csv('C:\\Bytes\\df_2.csv')

df_1.head()
Out[218]: 
0    10
1    01
2    A1
3    AB
4    1A
dtype: object

The sklearn documentation states that fit_transform can take an iterable which yields either str, unicode or file objects (which of these is expected is controlled by the input parameter). So you can create a generator which yields your file paths one by one and pass it to the fit method, together with CountVectorizer(input='filename') so that each path is actually opened and read. You can create such a generator from the path to your directory as shown below:

import os

def gen(path):
    # Yield the full path of every file in the directory, one at a time
    for name in os.listdir(path):
        yield os.path.join(path, name)

Now you can create your generator and pass it on to CountVectorizer as follows:

q = gen("/path/to/your/file/")

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(input='filename', ngram_range=(1, 4))
cv.fit_transform(q)
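
If, as the question asks, you want the counts combined over all files rather than one row per file, you can sum the rows of the matrix that fit_transform returns; a minimal sketch (the numpy import is mine, everything else follows from the snippet above):

import numpy as np

X = cv.fit_transform(gen("/path/to/your/file/"))  # one row of counts per file
combined = np.asarray(X.sum(axis=0)).ravel()      # total count of each n-gram over all files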

Hope this helps you!


By using a generator instead of a list, your code won't hold the contents of all your files in memory at once. Instead, it yields one value, forgets it, then yields the next, and so on. Here I take your code and make a simple tweak to turn the list into a generator expression: you can just use () instead of [].

cv = CountVectorizer(ngram_range=(1, 4))
# The generator expression reads one file at a time instead of building a list of all contents
temp = cv.fit_transform((open(file).read() for file in files))
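
If you would rather not open the files yourself at all, a further variant (my suggestion, not part of the answer above) is CountVectorizer's input='filename' option, which accepts the paths directly and reads each file only when it is processed:

cv = CountVectorizer(input='filename', ngram_range=(1, 4))
temp = cv.fit_transform(files)  # files is the list of file paths; the vectorizer opens each one itself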