Should I care about unnecessary cats?

The "definitive" answer is of course brought to you by The Useless Use of cat Award.

The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.

Instantiating cat just so your code reads differently makes for just one more process and one more set of input/output streams that are not needed. Typically the real hold-up in your scripts is going to be inefficient loops and actual processing. On most modern systems, one extra cat is not going to kill your performance, but there is ~~almost~~ always another way to write your code.

Most programs, as you note, are able to accept an argument for the input file. However, there is always the shell's < redirection operator, which can be used wherever a STDIN stream is expected. It saves you one process because the file is opened by the shell process that is already running.

You can even get creative with WHERE you write it. Normally it would be placed at the end of a command before you specify any output redirects or pipes like this:

sed s/blah/blaha/ < data | pipe

But it doesn't have to be that way. It can even come first. For instance your example code could be written like this:

< data \
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla
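
As a quick check that the position of the redirection does not change the result, here is a minimal sketch. The temporary file and its contents are made up for illustration; the three spellings should produce identical output.

```shell
# Hypothetical sample data, written to a temporary file.
data=$(mktemp)
printf '%s\n' 'bla one' 'two' 'bla three' > "$data"

# Redirection after the command, redirection before it, and the cat version.
after=$(sed 's/bla/blaha/' < "$data" | grep -n blaha)
before=$(< "$data" sed 's/bla/blaha/' | grep -n blaha)
with_cat=$(cat "$data" | sed 's/bla/blaha/' | grep -n blaha)

# All three variables hold the same text; print one of them.
printf '%s\n' "$after"
rm -f "$data"
```

Only the cat variant starts an extra process; the other two differ purely in where the redirection is written.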

If script readability is your concern and your code is messy enough that adding a line for cat is expected to make it easier to follow, there are other ways to clean up your code. One that I use a lot, and that helps make scripts easy to figure out later, is breaking up pipes into logical sets and saving them in functions. The script code then becomes very natural, and any one part of the pipeline is easier to debug.

fix_blahs () {
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla
}

fix_blahs < data

You could then continue with fix_blahs < data | fix_frogs | reorder | format_for_sql. A pipeline that reads like that is really easy to follow, and the individual components can be debugged easily in their respective functions.
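
A minimal sketch of such a chain follows. The stages here (a simplified fix_blahs, plus number_lines and shout) and the sample input are hypothetical stand-ins, not the functions named above; each stage simply reads stdin and writes stdout.

```shell
# Hypothetical stages; each one is a filter from stdin to stdout.
fix_blahs() {
    sed 's/bla/blaha/' |
    grep blaha
}

number_lines() {
    grep -n .
}

shout() {
    tr '[:lower:]' '[:upper:]'
}

# Made-up input standing in for the "data" file.
data=$(mktemp)
printf '%s\n' 'bla one' 'two' 'bla three' > "$data"

# The pipeline reads left to right without needing cat.
result=$(fix_blahs < "$data" | number_lines | shout)
printf '%s\n' "$result"
rm -f "$data"
```

Because each function is an ordinary filter, any stage can be tested on its own by redirecting a small file into it.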

Here's a summary of some of the drawbacks of:

cat $file | cmd

over

< $file cmd

First, a note: there are (intentionally for the purpose of the discussion) missing double quotes around $file above. In the case of cat, that's always a problem except for zsh; in the case of the redirection, that's only a problem for bash or ksh88 and, for some other shells (including bash in POSIX mode) only when interactive (not in scripts).
The most often cited drawback is the extra process being spawned. Note that if cmd is a builtin, that's even 2 processes in some shells like bash.
Still on the performance front, except in shells where cat is builtin, that's also an extra command being executed (and of course loaded and initialised, along with the libraries it's linked to).
Still on the performance front, for large files, that means the system will have to alternately schedule the cat and cmd processes and constantly fill up and empty the pipe buffer. Even if cmd does 1GB read() system calls at a time, control will have to go back and forth between cat and cmd because a pipe can only hold a limited amount of data at a time (a few kilobytes to a few tens of kilobytes; 64 KiB by default on Linux).
Some cmds (like wc -c) can do some optimisations when their stdin is a regular file which they can't do with cat | cmd as their stdin is just a pipe then. With cat and a pipe, it also means they cannot seek() within the file. For commands like tac or tail, that makes a huge difference in performance as that means that with cat they need to store the whole input in memory.
The cat $file, and even its more correct version cat -- "$file", won't work properly for some specific file names like - (or --help or anything starting with - if you forget the --). If one insists on using cat, they should probably use cat < "$file" | cmd instead for reliability.
If $file cannot be opened for reading (access denied, doesn't exist...), < "$file" cmd will report a consistent error message (by the shell) and not run cmd, while cat $file | cmd will still run cmd but with its stdin looking like it's an empty file. That also means that in things like < file cmd > file2, file2 is not clobbered if file can't be opened.

Or in other words you can choose the order in which the input and output files are opened as opposed to cmd file > file2 where the output file is always opened (by the shell) before the input file (by cmd), which is hardly ever preferable.

Note, however, that this won't help in cmd1 < file | cmd2 > file2, where cmd1 and cmd2 and their redirections are set up concurrently and independently. To avoid file2 being clobbered and cmd1 and cmd2 being run when file can't be opened, you'd need to write it as { cmd1 | cmd2; } < file > file2 or (cmd1 | cmd2 > file2) < file, for instance.
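
The difference in failure behaviour can be observed with a small sketch like the one below. The file names are made up, the input is assumed not to exist, and wc -l stands in for cmd.

```shell
# Scratch directory so nothing real is touched.
dir=$(mktemp -d)
missing=$dir/no-such-file     # hypothetical input that does not exist

# Seed both output files so we can see whether they get clobbered.
echo old > "$dir/out_cat"
echo old > "$dir/out_redir"

# cat fails, but wc still runs on an empty stdin and overwrites its output file.
cat "$missing" 2>/dev/null | wc -l > "$dir/out_cat"

# The input cannot be opened, so wc is never run and the output
# redirection to its right is never performed.
if { < "$missing" wc -l > "$dir/out_redir"; } 2>/dev/null; then
    redir_ran=yes
else
    redir_ran=no
fi

read -r cat_result < "$dir/out_cat"
read -r redir_result < "$dir/out_redir"
printf 'cat | cmd : output file now holds "%s"\n' "$cat_result"
printf '< file cmd: cmd ran? %s, output file still holds "%s"\n' "$redir_ran" "$redir_result"
rm -rf "$dir"
```

The braces with 2>/dev/null are only there to keep the shell's error message out of the demo output; in a real script you would normally want to see it.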

Putting <file on the end of a pipeline is less readable than having cat file at the start. Natural English reads from left to right.

Putting <file at the start of the pipeline is also less readable than cat, I would say. A word is more readable than a symbol, especially a symbol which seems to point the wrong way.

Using cat preserves the command | command | command format.
