cat line X to line Y on a huge file

I suggest the sed solution, but for the sake of completeness:

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file

To stop reading the file once the last wanted line has been printed (a big win on a huge file):

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file

Speed test (here on macOS, YMMV on other systems):

  • 100,000,000-line file generated by seq 100000000 > test.in
  • Reading lines 50,000,000-50,000,010
  • Tests in no particular order
  • real time as reported by bash's builtin time
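
For reference, each figure below is the real time from one run along these lines (bash's time keyword times the whole pipeline):

time tail -n+50000000 test.in | head -n10
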
 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in

These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.

*: Except between the first two, sed -n 'p;q' and tail -n+ | head, which seem to be essentially the same.


If you want lines X to Y inclusive (starting the numbering at 1), use

tail -n "+$X" /path/to/file | head -n "$((Y-X+1))"

tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
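
You can watch this early-exit mechanism in isolation with a producer that would otherwise never stop; this is just an illustration of the SIGPIPE behavior, not part of the solution:

# yes prints "y" forever; once head has printed 3 lines and exits,
# yes dies on SIGPIPE instead of running on:
yes | head -n 3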

Alternatively, as gorkypl suggested, use sed:

sed -n -e "$X,$Y p" -e "$Y q" /path/to/file
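
For example, with the line numbers from the question (X=57890000, Y=57890010):

sed -n -e '57890000,57890010 p' -e '57890010 q' /path/to/file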

The sed solution is significantly slower, though (at least for GNU and BusyBox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks under Linux: the data was generated by seq 100000000 >/tmp/a, the environment is Linux/amd64, /tmp is tmpfs, and the machine was otherwise idle and not swapping.

real  user  sys    command
 0.47  0.32  0.12  </tmp/a tail -n +50000001 | head -n 10 #GNU
 0.86  0.64  0.21  </tmp/a tail -n +50000001 | head -n 10 #BusyBox
 3.57  3.41  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
11.91 11.68  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
 1.04  0.60  0.46  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
 7.12  6.58  0.55  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
 9.95  9.54  0.28  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
23.76 23.13  0.31  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox

If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position; for lines, though, there is no shortcut: you have to read from the beginning and count newlines. To extract blocks x (inclusive) to y (exclusive), counting from 0, with a block size of b:

dd bs="$b" skip="$x" count="$((y-x))" </path/to/file
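
For instance, with illustrative numbers (b=4096, x=16, y=32), this pulls out the 64 KiB starting at byte offset 65536:

# blocks 16 (inclusive) to 32 (exclusive) of 4096 bytes each:
dd bs=4096 skip=16 count=16 </path/to/file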

The head | tail approach is one of the best and most "idiomatic" ways to do this:

X=57890000
Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X"

As pointed out by Gilles in the comments, a faster way is

< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))"

This is faster because the first X - 1 lines don't have to travel through the pipe, as they do in the head | tail approach.
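
A quick way to convince yourself that both pipelines select the same lines is to run them on a small throwaway file:

seq 100 > small.txt
X=5; Y=8
< small.txt head -n "$Y" | tail -n +"$X"              # prints 5 6 7 8
< small.txt tail -n +"$X" | head -n "$((Y - X + 1))"  # same output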

Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings towards this approach.

  • You say you have to calculate A, B, C, D but as you can see, the line count of the file is not needed, and at most one calculation is necessary, which the shell can do for you anyway (see the sketch after this list).

  • You worry that piping will read more lines than necessary. In fact this is not true: head | tail is about as efficient as you can get in terms of file I/O. First, consider the minimum amount of work necessary: to find the Xth line of a file, the only general way is to read every byte and stop after counting X newlines, since there is no way to divine the file offset of the Xth line. Once you reach the Xth line, you have to read all the lines up to the Yth in order to print them, so no approach can get away with reading fewer than Y lines. Now, head -n $Y reads no more than Y lines (rounded up to the nearest buffer size, but buffering, used correctly, improves performance, so that overhead is nothing to worry about). In addition, tail never reads more than head passes along, so head | tail reads the fewest lines possible (again, plus some negligible buffering that we are ignoring). The only efficiency advantage of a single-tool, pipeless approach is fewer processes (and thus less overhead).
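
As a minimal sketch of that single calculation (print_lines is a hypothetical name, not something from the question), the shell does the Y - X + 1 arithmetic itself:

# hypothetical helper: print lines $1 through $2 (inclusive) of file $3
print_lines () {
    tail -n "+$1" -- "$3" | head -n "$(($2 - $1 + 1))"
}

print_lines 57890000 57890010 /path/to/file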