Optimizing GNU grep

No, there's no such thing. Generally, the cost of starting grep (forking a new process, loading the executable and shared libraries, doing the dynamic linking...) is much greater than that of compiling the regexps, so this kind of optimisation would make little sense.
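
If you want a rough idea of the relative costs on your system, you could time something like this (assuming the same patterns file as in the examples below; with /dev/null as input, grep exits straight away, so you're mostly measuring start-up and pattern compilation):

time grep -E -f patterns /dev/null   # start-up + compilation of all the patterns
time grep -E -e x /dev/null          # start-up + one trivial pattern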

Though see Why is matching 1250 strings against 90k patterns so slow? about a bug in some versions of GNU grep that made it particularly slow with a large number of regexps.

Possibly here, you could avoid running grep several times by feeding your chunks to the same grep instance, for instance by using it as a co-process and using a marker to detect the end of each chunk's output. The marker is passed as an extra pattern (-e '^@@MARKER@@$') so grep lets that line through, and --line-buffered makes sure it comes out straight away. With zsh and with GNU grep and awk implementations other than mawk:

coproc grep -E -f patterns -e '^@@MARKER@@$' --line-buffered
process_chunk() {
  # feed the chunk followed by the end marker to grep in the background
  { cat; echo @@MARKER@@; } >&p &
  # read grep's output back until the marker shows up
  awk '$0 == "@@MARKER@@" {exit}; 1' <&p
}
process_chunk < chunk1 > chunk1.grepped
process_chunk < chunk2 > chunk2.grepped

Though it may be simpler to do the whole thing with awk or perl instead.
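
For instance, a minimal sketch doing it all in awk (assuming the patterns are EREs, one per line in the patterns file, and that joining them into a single alternation gives a regexp awk can handle):

awk '
  # first file: build one big alternation out of all the patterns
  NR == FNR { re = (re == "" ? "" : re "|") "(" $0 ")"; next }
  # remaining files: print the lines that match it
  $0 ~ re
' patterns chunk1 chunk2 > output

grep -E patterns and awk dynamic regexps are both POSIX EREs, so the same patterns file can usually be reused as-is.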

But if you don't need the grep output to go into different files for different chunks, you can always do:

{
  cat chunk1
  wget -qO- ...   # or whatever you use to fetch those chunks
  ...
} | grep -Ef patterns > output
