How can I count the number of different characters in a file?

The following should work:

$ sed 's/\(.\)/\1\n/g' text.txt | sort | uniq -c

First, we insert a newline after every character, putting each character on its own line. (The \n in the replacement text is a GNU sed extension; with other sed implementations you can use a backslash followed by a literal newline instead.) Then we sort it, so that identical characters end up on adjacent lines. Then we use the uniq command to collapse each run of duplicates into a single line, prefixed with the number of occurrences of that character.
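For example, given a hypothetical text.txt containing just the word hello, the output looks something like this (the count with no character after it on the first line is the newline):

$ printf 'hello\n' > text.txt
$ sed 's/\(.\)/\1\n/g' text.txt | sort | uniq -c
      1 
      1 e
      1 h
      2 l
      1 o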

To sort the list by frequency, pipe this all into sort -nr.
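Putting it all together on the same hypothetical sample (the order among characters with equal counts may vary):

$ sed 's/\(.\)/\1\n/g' text.txt | sort | uniq -c | sort -nr
      2 l
      1 o
      1 h
      1 e
      1 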


Steven's solution is a good, simple one. It doesn't perform so well for very large files (files that don't fit comfortably in about half your RAM) because of the sorting step: sort has to handle one line per input character. Here's an awk version, which only needs memory proportional to the number of distinct characters. It's also a little more complicated because it tries to do the right thing for a few special characters (newline, ', \, :).

awk '
  # count each character of the line, plus the record separator (the newline)
  {for (i=1; i<=length; i++) ++c[substr($0,i,1)]; ++c[RS]}
  # escape the characters that would break the quoting or the :-separated output
  function chr (x) {return x=="\n" ? "\\n" : x==":" ? "\\072" :
                           x=="\\" || x=="'\''" ? "\\" x : x}
  END {for (x in c) printf "'\''%s'\'': %d\n", chr(x), c[x]}
' text.txt | sort -t : -k 2 -n -r | sed 's/\\072/:/'
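On the same hypothetical text.txt as above, this prints something like the following (characters with equal counts may come out in any order, since for (x in c) iterates in an unspecified order):

'l': 2
'o': 1
'h': 1
'e': 1
'\n': 1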

Here's a Perl solution based on the same principle. Perl has the advantage of being able to sort internally, so no external sort pass is needed. It will also correctly not count an extra newline if the file does not end in a newline character.

perl -ne '
  # tally every character, including the newline terminating each line
  ++$c{$_} foreach split //;
  # print the counts in decreasing order, escaping quotes, backslashes and newlines
  END { printf "'\''%s'\'': %d\n", /[\\'\'']/ ? "\\$_" : /./ ? $_ : "\\n", $c{$_}
        foreach (sort {$c{$b} <=> $c{$a}} keys %c) }' text.txt