Computing TF-IDF on the whole dataset or only on training data?

According to the scikit-learn documentation, fit() is used to

Learn vocabulary and idf from training set.

On the other hand, fit_transform() is used to

Learn vocabulary and idf, return term-document matrix.

while transform()

Transforms documents to document-term matrix.

On the training set you need to apply both fit() and transform() (or just fit_transform(), which combines the two operations). On the test set, however, you only need to call transform() on the test instances (i.e. the documents), so that they are encoded with the vocabulary and idf weights learned from the training data.
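As a minimal sketch of that workflow with scikit-learn's TfidfVectorizer (the toy documents and variable names are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for this example.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the bird sat on the fence"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and idf weights from the training documents only,
# and encode the training documents in the same step.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary and idf weights; nothing is re-fitted here.
X_test = vectorizer.transform(test_docs)

# Both matrices have one column per term seen during fit().
print(X_train.shape, X_test.shape)

# Terms that only occur in the test documents are not in the vocabulary,
# so they are simply ignored when transforming the test set.
print("bird" in vectorizer.vocabulary_)
```

Because the vectorizer is fitted once, the training and test matrices share the same columns, which is what a downstream classifier expects.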

Remember that the training set is used for learning (which is achieved through fit()), while the test set is used to evaluate whether the trained model generalises well to new, unseen data points.

For more details, you can refer to the article "fit() vs transform() vs fit_transform()".

The author passes all of the text data to the function before splitting it into train and test sets. Is that correct, or must we split the data first, then call TF-IDF fit_transform on the training set and transform on the test set?

I would consider this to be leaking information about the test set into the training set, because the vocabulary and idf weights would then be computed from the test documents as well.

As a rule, I always split the data and create a hold-out set first, before doing any pre-processing.
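One way to follow that rule, sketched here under some assumptions: the toy documents, the labels, and the choice of LogisticRegression as classifier are all illustrative and not taken from the question. The data is split first, and the vectorizer sits inside a scikit-learn pipeline so that it is only ever fitted on the training portion.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy labelled corpus, invented for this example (1 = positive, 0 = negative).
docs = [
    "great film with a wonderful cast",
    "loved the story and the acting",
    "a truly enjoyable experience",
    "wonderful and moving film",
    "terrible plot and poor acting",
    "boring film with a weak story",
    "a truly awful experience",
    "poor script and terrible cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 1: create the hold-out set before any pre-processing.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# Step 2: fit the vectorizer and the classifier on the training split only.
# The pipeline calls fit_transform() on the vectorizer during fit() and
# transform() during predict() and score().
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs_train, y_train)

# Step 3: evaluate on the untouched hold-out documents.
accuracy = model.score(docs_test, y_test)
print(accuracy)
```

Wrapping both steps in a pipeline also keeps cross-validation honest, since the vectorizer is re-fitted on the training part of each fold.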
