How to use spaCy efficiently on a large dataset with short sentences?

You can use spaCy's nlp.pipe to build a fast tokenization and data ingestion pipeline: it processes your texts as a stream, in batches, instead of calling nlp on each text one at a time.

Rewriting your code to use the nlp.pipe method would look something like this:

import spacy
nlp = spacy.load('en')  # in recent spaCy versions, load a model package such as 'en_core_web_sm' instead

docs = df['text'].tolist()

def token_filter(token):
    # Keep only tokens that are not punctuation, whitespace, or stop words,
    # and that are longer than four characters. Use `or` rather than `|`:
    # the bitwise `|` binds more tightly than `<=` and breaks the length check.
    return not (token.is_punct or token.is_space or token.is_stop
                or len(token.text) <= 4)

filtered_tokens = []
for doc in nlp.pipe(docs):
    tokens = [token.lemma_ for token in doc if token_filter(token)]
    filtered_tokens.append(tokens)

This puts all of your filtering into the token_filter function, which takes a spaCy token and returns True only if the token is not punctuation, whitespace, or a stop word, and is longer than four characters. You then apply this function to each token in each document, keeping the lemma only for tokens that pass the filter. The result, filtered_tokens, is a list of your tokenized documents.

Some helpful references for customizing this pipeline would be:

  • Token attributes
  • Language.pipe
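
For example, the token_filter above could be extended using other Token attributes. The extra checks below (like_num and like_url) are illustrative additions, not part of the original answer:

def token_filter(token):
    # As above, but also dropping number-like and URL-like tokens
    # via the Token attributes like_num and like_url.
    return not (token.is_punct or token.is_space or token.is_stop
                or token.like_num or token.like_url
                or len(token.text) <= 4)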

You should filter out tokens after parsing, not before. That way the trained model gives better tagging (unless it was trained on text filtered in a similar way, which is unlikely). Filtering afterwards also lets you use nlp.pipe, which is documented to be fast. See the nlp.pipe example at http://spacy.io/usage/spacy-101#lightning-tour-multi-threaded.
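
If your downstream task does not need the dependency parse or named entities, a common additional speed-up (not something the original answer prescribes) is to disable those pipeline components and batch the stream. A minimal sketch, assuming an installed en_core_web_sm model and made-up example texts:

import spacy

nlp = spacy.load('en_core_web_sm')  # model name is an assumption

def token_filter(token):
    # Same filter as above.
    return not (token.is_punct or token.is_space or token.is_stop
                or len(token.text) <= 4)

docs = ["A short example sentence about spaCy.", "Another tiny document."]

# Language.pipe accepts batch_size and disable; dropping the parser and the
# named-entity recognizer skips work the filter never uses, while the
# components that produce token.lemma_ are kept.
# In newer spaCy versions you can also pass n_process to use multiple processes.
filtered_tokens = []
for doc in nlp.pipe(docs, batch_size=1000, disable=['parser', 'ner']):
    filtered_tokens.append([token.lemma_ for token in doc if token_filter(token)])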

Tags: python, nlp, spacy