How can I find the most frequent word in a .csv file, ignoring duplicates on each line?
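
For example, given a file.csv along these lines (a made-up sample for illustration; the real data may differ):

red,green,blue
red,red,yellow
green,blue,brown

Here red should count only once for the second line even though it appears there twice.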

With GNU grep or compatible:

$ grep -nEo '\w+' file.csv|sort -u|cut -d: -f2-|sort|uniq -c|sort -k1rn|head
      2 blue
      2 green
      2 red
      1 brown
      1 yellow
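
To see why this works with the hypothetical sample above: grep -n prefixes every extracted word with its line number, so sort -u drops the repeated 2:red and each word survives at most once per line, before cut strips the numbering and the usual sort | uniq -c count runs.

$ grep -nEo '\w+' file.csv
1:red
1:green
1:blue
2:red
2:red
2:yellow
3:green
3:blue
3:brown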

I would probably reach for perl

  • Use uniq from the List::Util module to de-duplicate each row.
  • Use a hash to count the resulting occurrences.

For example

perl -MList::Util=uniq -F, -lnE '
  map { $h{$_}++ } uniq @F 
  }{ 
  foreach $k (sort { $h{$b} <=> $h{$a} } keys %h) {say "$h{$k}: $k"}
' file.csv
2: red
2: green
2: blue
1: yellow
1: brown

If you have no option except the sort and uniq coreutils, then you can implement a similar algorithm with the addition of a shell loop

while IFS=, read -ra words; do
  printf '%s\n' "${words[@]}" | sort -u
done < file.csv | sort | uniq -c | sort -rn
  2 red
  2 green
  2 blue
  1 yellow
  1 brown

However, please refer to Why is using a shell loop to process text considered bad practice?


You can use awk with an associative array and a simple logic check.

awk -F, '
  {split("", c); for (i=1; i<=NF; i++) 
      if (!c[$i]){c[$i]++;wds[$i]++}}
  END{for (wd in wds) print wds[wd], wd}' file

Output

1 brown
2 red
1 yellow
2 blue
2 green

Walkthrough

Set the field separator to ,

awk -F, '

You are going to use c to count whether more than one occurrence of a word is on a line, so make sure that count is null at the beginning of each line with delete c or split("", c), and then iterate over the fields

      {split("", c); for (i=1; i<=NF; i++) 

or

      {delete c; for (i=1; i<=NF; i++) 

If you have not seen the word $i yet on this line (!c[$i]), then increment the per-line counter for that word, c[$i]++ (to 1, so the condition test fails if it occurs again on the same line), and increment the overall count for that word, wds[$i]++

      if (!c[$i]){c[$i]++;wds[$i]++}}

When the file is done, just iterate over the wds array and print the count wds[wd] and the word wd

      END{for (wd in wds) print wds[wd], wd}' file

Just for fun

A hacky one with no awk associative array bits

awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file | 
    sort | uniq | awk '{print $2}' | sort | uniq -c | sort -nr

awk out the fields so they are preceded by their line numbers, then sort | uniq to lose the per-line duplicates, awk again to lose the numbering, and then revert to your original counting pipeline.
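
With the same hypothetical sample contents in file, the first awk stage looks like this, so the subsequent sort | uniq drops the repeated 2 red before anything is counted:

$ awk -F, '{for (i=1; i<=NF; i++) print NR, $i}' file
1 red
1 green
1 blue
2 red
2 red
2 yellow
3 green
3 blue
3 brown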