Understanding min_df and max_df in scikit CountVectorizer

I would add this point also for understanding min_df and max_df in tf-idf better.

If you go with the default values, meaning considering all terms, you have generated definitely more tokens. So your clustering process (or any other thing you want to do with those terms later) will take a longer time.

BUT the quality of your clustering should NOT be reduced.

One might think that allowing all terms (e.g. too frequent terms or stop-words) to be present might lower the quality but in tf-idf it doesn't. Because tf-idf measurement instinctively will give a low score to those terms, effectively making them not influential (as they appear in many documents).

So to sum it up, pruning the terms via min_df and max_df is to improve the performance, not the quality of clusters (as an example).

And the crucial point is that if you set the min and max mistakenly, you would lose some important terms and thus lower the quality. So if you are unsure about the right threshold (it depends on your documents set), or if you are sure about your machine's processing capabilities, leave the min, max parameters unchanged.

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.

min_df is used for removing terms that appear too infrequently. For example:

min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
min_df = 5 means "ignore terms that appear in less than 5 documents".

The default min_df is 1, which means "ignore terms that appear in less than 1 document". Thus, the default setting does not ignore any terms.

Understanding min_df and max_df in scikit CountVectorizer

Tags:

Python

Nlp

Machine Learning

Scikit Learn

Related

Recent Posts