Why is tail file | tr (pipeline) faster than sed or perl with many lines?

It boils down to the amount of work being done.

Your tail | tr command ends up doing the following:

  • in tail:
    • read until a newline;
    • output everything remaining, without caring about newlines;
  • in tr, read, without caring about newlines, and output everything apart from ‘"’ (a fixed character).
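
For concreteness, the pipeline being described is something like:

tail -n +2 file.txt | tr -d '"'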

Your sed command ends up doing the following, after interpreting the given script:

  • read until a newline, accumulating input;
  • if this is the first line, delete it;
  • replace all double quotes with nothing, after interpreting the regular expression;
  • output the processed line;
  • loop until the end of the file.
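
For concreteness, a sed command matching those steps would be something along the lines of:

sed '1d; s/"//g' file.txt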

Your Perl command ends up doing the following, after interpreting the given script:

  • read until a newline, accumulating input;
  • replace all double quotes with nothing, after interpreting the regular expression;
  • if this is not the first line, output the processed line;
  • loop until the end of the file.
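
And a Perl command matching those steps would be something like:

perl -ne 's/"//g; print if $. > 1' file.txt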

Looking for newlines ends up being expensive on large inputs.


Mainly because perl and sed process each line separately.

If you let perl process the input in larger blocks, and simplify it a bit (see note), you can make it much faster -- yet nowhere near as fast as tr:

time perl -ne ' { s/"//g; print if $. > 1 }' file.txt 1> /dev/null

real    0m0.617s
user    0m0.612s
sys     0m0.005s

time perl -pe 'BEGIN{<>;$/=\40960} s/"//g' file.txt >/dev/null

real    0m0.186s
user    0m0.177s
sys     0m0.009s

time tail -n +2 file.txt | tr -d \" 1> /dev/null

real    0m0.033s
user    0m0.031s
sys     0m0.023s

note: don't use perl -ne '... if $. > 1' or awk 'NR == 1 { ... } /foo/ { ... }'.

Use BEGIN{<>} and BEGIN{getline} instead.

After you have read the first line, you can be pretty darn sure that no subsequent line will be the first line anymore: no need to check again and again.
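
As a sketch of the difference with this question's task in awk (my own illustration, not taken from the question):

awk 'NR > 1 { gsub(/"/, ""); print }' file.txt
awk 'BEGIN { getline } { gsub(/"/, ""); print }' file.txt

The first evaluates NR > 1 for every single record; the second consumes the header once in BEGIN and then never looks back (gawk, for one, starts reading the main input when getline is used in BEGIN).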


tail_lines() from tail.c:

      /* Use file_lines only if FD refers to a regular file for
         which lseek (... SEEK_END) works.  */

      if ( ! presume_input_pipe
           && S_ISREG (stats.st_mode)
           && (start_pos = lseek (fd, 0, SEEK_CUR)) != -1
           && start_pos < (end_pos = lseek (fd, 0, SEEK_END)))

This end_pos = lseek (fd, 0, SEEK_END) is where the contents of the file are skipped over. In file_lines() there is a backwards scan counting the newlines.

lseek() is quite a simple system call; it just repositions the file offset used for the next read or write.
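
Perl's sysseek() is a thin wrapper around lseek(), so the cheapness is easy to see from the command line; this illustrative one-liner (not something tail itself does, of course) reports the position of the end of the file without reading a single byte of its contents:

perl -MFcntl=:seek -le 'open my $fh, "<", shift or die $!; print sysseek($fh, 0, SEEK_END)' file.txt

The number it prints is the file size, obtained purely by repositioning the offset.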


Oh it seems I missed the subtlety in this Q ;) It is all about reading linewise vs. blockwise. Normally it is a good idea to combine several passes into one complex pass. But here the algorithm only needs the very first newline.

Ole's two-part perl script with sysread() illustrates how he switches from searching for the first newline(s) to reading a maximum-size block.
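
His script isn't reproduced here, but the shape of the idea is roughly this (my own simplified sketch, not Ole's actual code):

#!/usr/bin/env perl
# Sketch: handle the header line-wise, then switch to block-wise reads.
use strict;
use warnings;

# Part 1: look at one byte at a time until the first newline has been
# seen, i.e. throw away the header line.
my $c;
do {
    sysread(STDIN, $c, 1) or exit;     # empty input: nothing to do
} until $c eq "\n";

# Part 2: newlines no longer matter; read large blocks and strip the
# double quotes (the block size is arbitrary).
while (sysread(STDIN, my $buf, 65536)) {
    $buf =~ tr/"//d;
    print $buf;
}

Reading the header byte by byte is wasteful, but it only happens once; everything after the first newline is handled block-wise.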

When tail works backwards, as it normally does, it reads the last block and counts the newlines. It either prints from there or reads in the next-to-last block.
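
A toy version of that backwards pass (only an illustration of the approach, not coreutils code -- the real tail also handles blocks that start mid-line, short reads, and so on):

#!/usr/bin/env perl
# Sketch: jump to the end of a regular file, read the last block and
# count the newlines in it, which is how the backwards scan starts.
use strict;
use warnings;

my $file  = shift or die "usage: $0 file\n";
my $block = 8192;                      # arbitrary block size for the sketch

open my $fh, '<', $file or die "open $file: $!";
my $size = sysseek($fh, 0, 2) or die "sysseek: $!";   # 2 == SEEK_END: one lseek(), no reads
my $pos  = $size > $block ? $size - $block : 0;

sysseek($fh, $pos, 0);                 # 0 == SEEK_SET
my $buf = '';
sysread($fh, $buf, $block);
printf "last %d bytes contain %d newlines\n", length $buf, ($buf =~ tr/\n//);

# Not enough newlines yet?  tail steps back one more block
# ($pos -= $block) and counts again, until it has seen enough.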