Search for text files where two different words exist (any order, any line)

With GNU tools:

find . -type f -exec grep -lZ FIND {} + | xargs -r0 grep -l ME

With standard (POSIX) tools, you can do:

find . -type f -exec grep -q FIND {} \; -exec grep -l ME {} \;
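As a quick sanity check of that standard command (a sketch: the scratch directory and file contents are made up), only the file containing both words should be printed:

```shell
# Hypothetical test directory: "both" contains FIND and ME, "one" only FIND.
d=$(mktemp -d)
printf 'FIND\nME\n' > "$d/both"
printf 'FIND only\n' > "$d/one"
# The first -exec acts as a test; -exec grep -l only runs (and prints
# the file name) when the first grep succeeded.
find "$d" -type f -exec grep -q FIND {} \; -exec grep -l ME {} \;
rm -rf "$d"
```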

But that runs up to two greps per file. To avoid that many grep invocations while staying portable and still allowing any character in file names, you could do:

convert_to_xargs() {
  sed "s/[[:blank:]\"\']/\\\\&/g" | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
}

export LC_ALL=C
find .//. -type f |
  convert_to_xargs |
  xargs grep -l FIND |
  convert_to_xargs |
  xargs grep -l ME

The idea is to convert the output of find into the format xargs expects: a list of words separated by blanks (SPC/TAB/NL in the C locale, YMMV in other locales), where single quotes, double quotes and backslashes can escape blanks and each other.
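To see that input format in action (a minimal sketch with made-up input): a backslash in xargs input escapes the following blank, so the two lines below are read as exactly two arguments, the first containing a space:

```shell
# The line "a\ b" is read by xargs as the single argument "a b";
# "c" becomes a second argument. -n1 runs printf once per argument.
printf 'a\\ b\nc\n' | xargs -n1 printf '%s\n'
# prints:
# a b
# c
```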

Generally, you can't post-process the output of find -print, because it separates the file names with a newline character and doesn't escape the newline characters that are found in file names. For instance, if we see:

./a
./b

We've got no way to know whether it's one file called b in a directory called a<NL>. (that is, the single path ./a<NL>./b) or the two files a and b in the current directory.

We use .//. because // cannot otherwise appear in a file path as output by find (there's no such thing as a directory with an empty name, and / is not allowed in a file name). So if we see a line that contains //, we know it's the first line of a new file name, and the awk command can escape every newline character except the ones that precede such lines.

If we take the example above, find would output in the first case (one file):

.//a
./b

Which awk escapes to:

.//a\
./b

So that xargs sees it as one argument. And in the second case (two files):

.//a
.//b

Which awk would leave as is, so xargs sees two arguments.
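The whole trick can be exercised end to end (a sketch: the file name a<NL>b is made up). A file whose name contains a newline comes out of the awk stage as a single, backslash-continued word that xargs will read back as one argument:

```shell
# Create a file whose name is "a<NL>b", then run the escaping stage
# on the output of find .//. — the embedded newline gets a trailing
# backslash, the real separator between file names does not.
d=$(mktemp -d)
(
  cd "$d" && touch 'a
b'
  find .//. -type f | awk '
    {
      if (NR > 1) {
        printf "%s", line
        if (!index($0, "//")) printf "\\"
        print ""
      }
      line = $0
    }
    END { print line }'
)
rm -rf "$d"
```

(Here find prints the path as .//./a<NL>b, so the output is the two physical lines `.//./a\` and `b`, which xargs joins into one argument.)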

You need LC_ALL=C so that sed, awk (and some implementations of xargs) work on arbitrary sequences of bytes (even ones that don't form valid characters in the user's locale), to simplify the definition of blank to just SPC and TAB, and to avoid problems with characters whose encoding contains the encoding of backslash being interpreted differently by the different utilities.


If the files are in a single directory and their names don't contain space, tab, newline, *, ? or [ characters and don't start with - or ., this will get a list of files containing ME, then narrow that down to the ones that also contain FIND.

grep -l FIND `grep -l ME *`

With awk you could also run:

find . -type f  -exec awk 'BEGIN{cx=0; cy=0}; /FIND/{cx++}
/ME/{cy++}; END{if (cx > 0 && cy > 0) print FILENAME}' {} \;

It uses cx and cy to count the lines matching FIND and ME respectively. In the END block, if both counters are > 0, it prints the FILENAME.
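On a single made-up sample file (a sketch; uninitialised awk variables default to 0, so the BEGIN block is just there for clarity), the script prints the file name because both patterns occur:

```shell
# Both FIND and ME appear, so cx > 0 and cy > 0 and the END block
# prints the file's name.
f=$(mktemp)
printf 'FIND here\nand ME too\n' > "$f"
awk '/FIND/{cx++}; /ME/{cy++}; END{if (cx > 0 && cy > 0) print FILENAME}' "$f"
rm -f "$f"
```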
That find/awk command would be faster/more efficient with GNU awk, which can process several files per awk invocation thanks to its BEGINFILE/ENDFILE blocks:

find . -type f  -exec gawk 'BEGINFILE{cx=0; cy=0}; /FIND/{cx++}
/ME/{cy++}; ENDFILE{if (cx > 0 && cy > 0) print FILENAME}' {} +

Tags: search, grep, find