Find files that contain multiple keywords anywhere in the file

awk 'FNR == 1 { f1=f2=f3=0; };   # reset the flags at the start of each file

     /one/   { f1++ };           # increment a counter for each pattern seen
     /two/   { f2++ };
     /three/ { f3++ };

     # once all three patterns have been seen in the current file, print the
     # filename and move on to the next input file (nextfile is a GNU awk
     # extension, also supported by most other modern awks)
     f1 && f2 && f3 {
       print FILENAME;
       nextfile;
     }' *

If you want to automatically handle gzipped files as well, either run this in a loop with zcat (slow and inefficient, because you'll be forking awk many times, once for each filename), or rewrite the same algorithm in Perl using the IO::Uncompress::AnyUncompress module, which can decompress several different kinds of compressed files (gzip, zip, bzip2, lzop). Or do it in Python, which also has modules for handling compressed files.
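
For illustration, a minimal sketch of that zcat loop might look like this (zcat -f passes uncompressed files straight through; the filename is printed by the shell because awk only sees the pipe, not the original file):

for f in *; do
  zcat -f "$f" |
    awk '/one/   { f1++ };
         /two/   { f2++ };
         /three/ { f3++ };
         f1 && f2 && f3 { found=1; exit };
         END { exit !found }' &&
    printf '%s\n' "$f"
done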


Here's a perl version that uses IO::Uncompress::AnyUncompress to allow for any number of patterns and any number of filenames (containing either plain text or compressed text).

All args before -- are treated as search patterns. All args after -- are treated as filenames. Primitive but effective option handling for this job. Better option handling (e.g. to support a -i option for case-insensitive searches) could be achieved with the Getopt::Std or Getopt::Long modules.

Run it like so:

$ ./arekolek.pl one two three -- *.gz *.txt
1.txt.gz
4.txt.gz
5.txt.gz
1.txt
4.txt
5.txt

(I won't list the files {1..6}.txt.gz and {1..6}.txt here; they just contain some or all of the words "one", "two", "three", "four", "five" and "six" for testing. The files listed in the output above DO contain all three of the search patterns. Test it yourself with your own data.)

#! /usr/bin/perl

use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);
my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  #my $lc=0;
  my %s = ();
  my $z = new IO::Uncompress::AnyUncompress($f)
    or die "IO::Uncompress::AnyUncompress failed: $AnyUncompressError\n";

  while ($_ = $z->getline) {
    #last if ($lc++ > 100);
    my @matches=( m/($pattern)/og);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      last;
    }
  }
}

The hash %patterns contains the complete set of patterns, each of which a file has to contain at least once. $p_string is a string containing the sorted keys of that hash, and $pattern is a pre-compiled regular expression also built from the %patterns hash.

$pattern is compared against each line of each input file (using the /o modifier to compile $pattern only once as we know it won't ever change during the run), and map() is used to build a hash (%s) containing the matches for each file.

Whenever all of the patterns have been seen in the current file (i.e. when $m_string, the sorted keys of %s joined together, is equal to $p_string), the filename is printed and we skip to the next file.

This is not a particularly fast solution, but is not unreasonably slow. The first version took 4m58s to search for three words in 74MB worth of compressed log files (totalling 937MB uncompressed). This current version takes 1m13s. There are probably further optimisations that could be made.

One obvious optimisation is to use this in conjunction with xargs's -P (aka --max-procs) to run multiple searches on subsets of the files in parallel. To do that, count the number of files and divide by the number of cores/CPUs/threads your system has (rounding up by adding 1). For example, there were 269 files being searched in my sample set, and my system has 6 cores (an AMD 1090T), so:

patterns=(one two three)
searchpath='/var/log/apache2/'
cores=6
filecount=$(find "$searchpath" -type f -name 'access.*' | wc -l)
filespercore=$((filecount / cores + 1))

find "$searchpath" -type f -print0 | 
  xargs -0r -n "$filespercore" -P "$cores" ./arekolek.pl "${patterns[@]}" --

With that optimisation, it took only 23 seconds to find all 18 matching files. Of course, the same could be done with any of the other solutions. NOTE: the order of filenames listed in the output will be different, so it may need to be sorted afterwards if that matters.
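
For example, to get sorted output you could just pipe the combined output through sort:

find "$searchpath" -type f -name 'access.*' -print0 |
  xargs -0r -n "$filespercore" -P "$cores" ./arekolek.pl "${patterns[@]}" -- |
  sort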

As noted by @arekolek, multiple zgreps with find -exec or xargs can do it significantly faster, but this script has the advantage of supporting any number of patterns to search for, and is capable of dealing with several different types of compression.
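
For comparison, I'm assuming the multiple-zgrep approach looks something like the following sketch (not @arekolek's exact command; zgrep handles uncompressed files too):

for f in *; do
  zgrep -q one "$f" && zgrep -q two "$f" && zgrep -q three "$f" &&
    printf '%s\n' "$f"
done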

If the script is limited to examining only the first 100 lines of each file, it runs through all of them (in my 74MB sample of 269 files) in 0.6 seconds. If this is useful in some cases, it could be made into a command line option (e.g. -l 100) but it has the risk of not finding all matching files.


BTW, according to the man page for IO::Uncompress::AnyUncompress, the compression formats supported are:

  • zlib RFC 1950,
  • deflate RFC 1951 (optionally),
  • gzip RFC 1952,
  • zip,
  • bzip2,
  • lzop,
  • lzf,
  • lzma,
  • xz

One last (I hope) optimisation. By using the PerlIO::gzip module (packaged in Debian as libperlio-gzip-perl) instead of IO::Uncompress::AnyUncompress, I got the time down to about 3.1 seconds for processing my 74MB of log files. There were also some small improvements from using a simple hash rather than Set::Scalar (which also saved a few seconds with the IO::Uncompress::AnyUncompress version).

PerlIO::gzip was recommended as the fastest Perl gunzip in https://stackoverflow.com/a/1539271/137158 (found with a Google search for "perl fast gzip decompress").
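
If you want to try it, the module should be easy to install: on Debian and derivatives it's the package mentioned above, and elsewhere it's on CPAN:

sudo apt-get install libperlio-gzip-perl   # Debian/Ubuntu
cpan PerlIO::gzip                          # other systems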

Using xargs -P with this didn't improve it at all. In fact it even seemed to slow it down by anywhere from 0.1 to 0.7 seconds. (I tried four runs and my system does other stuff in the background which will alter the timing)

The price is that this version of the script can only handle gzipped and uncompressed files. Speed vs flexibility: 3.1 seconds for this version vs 23 seconds for the IO::Uncompress::AnyUncompress version with an xargs -P wrapper (or 1m13s without xargs -P).

#! /usr/bin/perl

use strict;
use warnings;
use PerlIO::gzip;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);
my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  open(F, "<:gzip(autopop)", $f) or die "couldn't open $f: $!\n";
  #my $lc=0;
  my %s = ();
  while (<F>) {
    #last if ($lc++ > 100);
    my @matches=(m/($pattern)/og);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      last;
    }
  }
  close(F);
}

Set the record separator to . so that awk will treat the whole file as a single record (this assumes the files don't contain any . characters):

awk -v RS='.' '/one/&&/two/&&/three/{print FILENAME}' *

Similarly with perl:

perl -ln00e '/one/&&/two/&&/three/ && print $ARGV' *

For compressed files, you could loop over each file and decompress first. Then, with a slightly modified version of the other answers, you can do:

for f in *; do 
    zcat -f "$f" | perl -ln00e '/one/&&/two/&&/three/ && exit(0); }{ exit(1)' && 
        printf '%s\n' "$f"
done

The Perl script will exit with status 0 (success) if all three strings were found. The }{ closes the implicit while loop that -n wraps around the code, so anything following it is executed after all the input has been processed, much like an END{} block. The script therefore exits with a non-zero status if not all the strings were found, and so the && printf '%s\n' "$f" prints the file name only if all three were found.

Or, to avoid loading the file into memory:

for f in *; do 
    zcat -f "$f" 2>/dev/null | 
        perl -lne '$k++ if /one/; $l++ if /two/; $m++ if /three/;  
                   exit(0) if $k && $l && $m; }{ exit(1)' && 
    printf '%s\n' "$f"
done

Finally, if you really want to do the whole thing in a script, you could do:

#!/usr/bin/env perl

use strict;
use warnings;

## Get the target strings and file names. The first three
## arguments are assumed to be the strings, the rest are
## taken as target files.
my ($str1, $str2, $str3, @files) = @ARGV;

FILE:foreach my $file (@files) {
    my $fh;
    my ($k,$l,$m)=(0,0,0);
    ## only process regular files
    next unless -f $file ;
    ## Open the file in the right mode
    $file =~ /\.gz$/ ? open($fh, "-|", "zcat", $file) : open($fh, "<", $file)
        or die "couldn't open $file: $!\n";
    ## Read through each line
    while (<$fh>) {
        $k++ if /$str1/;
        $l++ if /$str2/;
        $m++ if /$str3/;
        ## If all 3 have been found
        if ($k && $l && $m){
            ## Print the file name
            print "$file\n";
            ## Move to the next file
            next FILE;
        }
    }
    close($fh);
}

Save the script above as foo.pl somewhere in your $PATH, make it executable and run it like this:

foo.pl one two three *