Find the N most frequent words in a file

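That's pretty much the most common way of finding the "N most common things", except you're missing a sort, and you've got a gratuitous cat:

tr -cs '[:alnum:]' '[\n*]' < test.txt | sort | uniq -c | sort -nr | head -n 10

If you don't put in a sort before the uniq -c you'll probably get a lot of false singleton words. uniq only collapses runs of adjacent identical lines; it doesn't check for uniqueness across the whole input. The -s on tr squeezes each run of non-alphanumeric characters into a single newline, so empty lines don't end up counted as the most frequent "word".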
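A quick way to see the difference, using a made-up three-line input where "a" appears twice but not on adjacent lines:

```shell
# "a" occurs twice, but the two occurrences are separated by "b".
unsorted=$(printf 'a\nb\na\n' | uniq -c)
sorted=$(printf 'a\nb\na\n' | sort | uniq -c)

# Without sort, uniq sees three separate runs and reports "a" twice
# with a count of 1 each; with sort, the two "a" lines are adjacent
# and collapse into one entry with a count of 2.
printf 'without sort:\n%s\n\nwith sort:\n%s\n' "$unsorted" "$sorted"
```

The exact padding in front of the counts varies between uniq implementations, so don't rely on it when parsing the output.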

EDIT: I forgot a trick, "stop words". If you're looking at English text (sorry, monolingual North American here), words like "of", "and", "the" almost always take the top two or three places. You probably want to eliminate them. The GNU Groff distribution has a file named eign in it which contains a pretty decent list of stop words. My Arch distro has /usr/share/groff/current/eign, but I think I've also seen /usr/share/dict/eign or /usr/dict/eign in old Unixes.

You can use stop words like this:

tr -cs '[:alnum:]' '[\n*]' < test.txt |
fgrep -v -w -f /usr/share/groff/current/eign |
sort | uniq -c | sort -nr | head -n 10
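One thing to watch for: grep matches case-sensitively by default, so a lowercase stop list lets "The" at the start of a sentence through, and "The" and "the" are counted as different words in any case. Here is a minimal sketch that folds everything to lowercase first. top_words is a hypothetical helper name, and the stop list path is passed as an argument rather than assumed; grep -F is the same thing as fgrep, which recent GNU grep releases flag as obsolescent.

```shell
# top_words N STOPLIST: read text on stdin and print the N most frequent
# words that are not listed in STOPLIST (one lowercase word per line).
# Hypothetical helper for illustration, not part of any package.
top_words() {
    tr -cs '[:alnum:]' '[\n*]' |    # one word per line, separator runs squeezed
    tr '[:upper:]' '[:lower:]' |    # fold case so "The" and "the" are one word
    grep -v '^$' |                  # drop a possible leading empty line
    grep -F -v -w -f "$2" |         # remove stop words, whole-word matches only
    sort | uniq -c | sort -nr | head -n "$1"
}
```

You would call it as top_words 10 /usr/share/groff/current/eign < test.txt, assuming that is where your system keeps the eign file.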

My guess is that most human languages need similar "stop words" removed to get meaningful word frequency counts, but I don't know where to suggest getting stop word lists for other languages.

EDIT: fgrep should use the -w option, which enables whole-word matching. This avoids false positives on words that merely contain short stop words, like "a" or "i".
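To see what -w buys you, here is a small sketch with a made-up two-word stop list and five sample words. The temporary file and the sample words are purely illustrative; grep -F is used as the equivalent of fgrep.

```shell
# Hypothetical stop list containing just "a" and "i".
stoplist=$(mktemp)
printf 'a\ni\n' > "$stoplist"

words='a
apple
idea
tree
i'

# Without -w, any word containing the letter "a" or "i" is thrown away,
# so only "tree" survives.
loose=$(printf '%s\n' "$words" | grep -F -v -f "$stoplist")

# With -w, only the standalone words "a" and "i" are removed.
strict=$(printf '%s\n' "$words" | grep -F -v -w -f "$stoplist")

printf 'without -w: %s\n' "$(printf '%s' "$loose" | tr '\n' ' ')"
printf 'with -w:    %s\n' "$(printf '%s' "$strict" | tr '\n' ' ')"
rm -f "$stoplist"
```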


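This works better with UTF-8 text, since it only splits on whitespace and leaves multibyte characters intact (needs GNU sed for \s and \n). Note that punctuation stays attached to the words:

$ sed -e 's/\s\+/\n/g' < test.txt | sort | uniq -c | sort -nr | head -n 10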

Let's use AWK!

This function reads text on standard input and lists the frequency of each word in descending order:

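wordfrequency() {
  awk '
     # Split each line on runs of non-letters, so every field is a word.
     BEGIN { FS="[^a-zA-Z]+" }
     {
         for (i=1; i<=NF; i++) {
             if ($i == "") continue   # skip empty fields at line edges
             words[tolower($i)]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}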

You can call it on your file like this:

$ wordfrequency < your_file.txt

and for the top 10 words:

$ wordfrequency < your_file.txt | head -n 10

Source: AWK-ward Ruby