Count the number of rows with a string occurring n times in multiple columns

Yes, you can do this in awk:

awk '{ 
       k=0; 
       for(i=2;i<=NF;i++){ 
         if($i == 0){
             k++
         }
       }
       if(k==3){
         tot++
       }
      }
      END{
          print tot+0
      }' file 

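And also with (GNU) sed and wc (note that this regex matches lines with three or more 0 fields, and would also count a 0 in the first column):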

$ sed -nE '/\b0\b.*\b0\b.*\b0\b/p' file | wc -l
7
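
To make the difference between "three or more" and "exactly three" concrete, here is a minimal sketch on made-up data. The sample() helper and its four rows are assumptions for illustration, not the input file from the question; GNU sed is assumed for \b.

```shell
# Hypothetical sample: numeric ID in column 1, tab-separated values after it.
sample() {
  printf '1\t0\t0\t0\t5\n'   # three zeros
  printf '2\t0\t1\t0\t2\n'   # two zeros
  printf '3\t0\t0\t0\t0\n'   # four zeros
  printf '4\t3\t0\t0\t0\n'   # three zeros
}

# GNU sed: prints any line containing three or more standalone 0s.
at_least=$(sample | sed -nE '/\b0\b.*\b0\b.*\b0\b/p' | wc -l)

# awk: counts only lines with exactly three 0 fields after the first column.
exactly=$(sample | awk '{ k=0; for(i=2;i<=NF;i++) if($i == 0) k++; if(k==3) tot++ } END{ print tot+0 }')

echo "at least three: $at_least, exactly three: $exactly"
```

On this sample the row with four zeros is picked up by the sed pattern but not by the awk loop, so the two counts only agree when no line has more than three zeros.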

But, personally, I would do it in perl instead:

$ perl -ale '$tot++ if (grep{$_ == 0 } @F) == 3 }{ print $tot' file 
7

Or, the slightly less condensed:

$ perl -ale 'if( (grep{$_ == 0 } @F) == 3 ){
                  $tot++ 
              }
              END{
                  print $tot
              }' file 
7

And the same thing, for the golfers among you:

$ perl -ale '(grep{$_==0}@F)==3&&$t++}{print$t' file
7

Explanation

  • -ale: -a makes perl behave like awk: it reads each line of the input file and splits it on whitespace into the array @F (since Perl 5.20, -a implies -n, the loop over every input line). -l removes the trailing newline from each input line and adds a \n to each call of print. -e introduces the script that is applied to each line of input.
  • $tot++ if (grep{$_ == 0 } @F) == 3 : increment $tot by one for every line that has exactly 3 fields equal to 0. In the numeric comparison, grep returns the number of matching fields. Since the 1st field starts from 1, we know it will never be 0, so we don't need to exclude it.
  • }{: this is shorthand for an END{} block, i.e. a block of code that is executed after the whole file has been processed. It works because -n wraps the script in a while loop: the } closes that loop and the { opens a new block. So, }{ print $tot prints the total number of lines with exactly three fields with a value of 0.

With GNU grep or ripgrep

$ LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' ip.txt 
7

$ rg -c '\t0\b.*\b0\b.*\b0\b' ip.txt
7

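where $'\t' expands to a literal tab character (ANSI-C quoting, as supported by bash, ksh and zsh). Requiring a tab before the first 0 means the first column is never counted, so this works even if the first column is 0.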
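
The effect of the leading tab can be checked on a single made-up row whose first column is 0. This is only a sketch: the row is hypothetical, GNU grep is assumed for \b, and printf is used to produce the tab so that it also runs in shells without $'\t'.

```shell
# Hypothetical row: first column is 0, but only two of the later fields are 0.
row=$(printf '0\t0\t0\t5\t7')
tab=$(printf '\t')    # portable stand-in for $'\t'

# Without the tab anchor the 0 in column 1 is counted, so the row matches.
unanchored=$(printf '%s\n' "$row" | grep -c '\b0\b.*\b0\b.*\b0\b' || true)

# With the tab anchor the first 0 must follow a tab, so column 1 is skipped.
anchored=$(printf '%s\n' "$row" | grep -c "${tab}0\b.*\b0\b.*\b0\b" || true)

echo "unanchored: $unanchored, anchored: $anchored"
```

The || true is there only because grep exits with status 1 when nothing matches, which would otherwise abort a script running under set -e.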


Sample run with large file:

$ perl -0777 -ne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
92M f1

$ time LC_ALL=C grep -c $'\t''0\b.*\b0\b.*\b0\b' f1 > f2
real    0m0.416s

$ time rg -c '\t0\b.*\b0\b.*\b0\b' f1 > f3  
real    0m1.271s

$ time LC_ALL=C awk 'gsub(/\t0/,"")==3{c++} END{print c+0}' f1 > f4
real    0m8.645s

$ time perl -ale '$tot++ if (grep{$_ == 0 } @F) == 3 }{ print $tot' f1 > f5
real    0m14.349s

$ time LC_ALL=C sed -n 's/\t0\>//4;t;s//&/3p' f1 | wc -l > f6
real    0m14.075s
$ time LC_ALL=C sed -n 's/\t0\>/&/3p' f1 | wc -l > f8    
real    0m6.772s

$ time LC_ALL=C awk '{ 
       k=0; 
       for(i=2;i<=NF;i++){ 
         if($i == 0){
             k++
         }
       }
       if(k==3){
         tot++
       }
      }
      END{
          print tot
      }' f1 > f7 
real    0m10.675s

Remove LC_ALL=C if the file can contain non-ASCII characters. ripgrep is usually faster than GNU grep, but in this test run GNU grep was faster. According to ripgrep's author, (?-u:\b) can be used to avoid the Unicode-aware word boundary, but that gave a similar time for the above case.


$ awk 'gsub(/\t0/,"")==3{c++} END{print c+0}' file
7
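
This one-liner relies on gsub returning the number of replacements it made, so deleting every tab-then-0 and comparing the result with 3 counts the zeros after the first column. A minimal sketch on made-up rows follows; the sample() and tricky() helpers are assumptions for illustration. It also shows one caveat: /\t0/ matches any field that merely begins with 0, such as 05.

```shell
# Hypothetical sample: numeric ID in column 1, tab-separated values after it.
sample() {
  printf '1\t0\t0\t0\t5\n'
  printf '2\t0\t1\t0\t2\n'
  printf '3\t0\t0\t0\t0\n'
  printf '4\t3\t0\t0\t0\n'
}
count=$(sample | awk 'gsub(/\t0/,"")==3{c++} END{print c+0}')

# A row with only two 0 fields plus a field that starts with 0.
tricky() { printf '5\t0\t0\t05\t7\n'; }
loose=$(tricky | awk 'gsub(/\t0/,"")==3{c++} END{print c+0}')
strict=$(tricky | awk '{ k=0; for(i=2;i<=NF;i++) if($i == 0) k++; if(k==3) tot++ } END{ print tot+0 }')

echo "count=$count loose=$loose strict=$strict"
```

If the data can contain zero-padded values, the field-by-field comparison is the safer of the two; if every field is either 0 or starts with a non-zero digit, the gsub version gives the same answer.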