Distributing a script: Should I use /bin/gawk or /usr/bin/gawk for shebang?

If you don't need to pass arguments to the command, then #!/usr/bin/env gawk is the way to go. However, many kernels (including Linux) pass everything after the interpreter path to it as a single argument, so a line such as #!/usr/bin/env gawk -f does not work there.
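The single-argument behaviour can be imitated by hand. This is only a sketch of what Linux effectively does with #!/usr/bin/env gawk -f: env receives the words "gawk -f" as one argument and looks for a program with that literal name. script.awk is a hypothetical file name; it is never opened because env fails first.

```shell
# Imitate the Linux shebang handling: "gawk -f" arrives as ONE argument.
status=0
env "gawk -f" script.awk 2>/dev/null || status=$?
# env reports "utility not found" with exit status 127.
echo "exit status: $status"
```

Some env implementations (GNU coreutils 8.30 and later, FreeBSD) offer a -S option that splits the string itself, so #!/usr/bin/env -S gawk -f works there, but that is not portable either.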

Otherwise, you can make a polyglot program that is both a shell wrapper and the awk script. Here's one for awk.

#!/bin/sh
true + /; exec gawk -f "$0"; exit; / {}
# awk script starts here

Shell parsing:

  • true + /; — the command true (which does nothing) with two inert arguments + and /.
  • The call to gawk. This could be any shell snippet that doesn't contain newlines and where slashes are written \/ (the shell doesn't mind except inside quotes).
    The call uses exec to replace the shell with gawk instead of executing gawk as a subprocess.
  • exit; — exit the shell, in case gawk was not found. Anything after that is ignored, except that it should be valid shell syntax in case the shell tries to parse the whole line before starting to execute it.

Awk parsing:

  • The bit between slashes is a regular expression.
  • true + /REGEX/ — a condition. true is an undefined variable so its numeric value is 0, not that it matters.
  • {} — If said condition holds, do nothing.
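To check both readings of that line, here is a minimal sketch that writes the polyglot to a temporary file and runs it. Assumptions: plain awk stands in for gawk so the demo runs where gawk is not installed, and the file is started with sh rather than executed directly, in case the temporary directory is mounted noexec.

```shell
# Write the polyglot to a temporary file.
tmp=$(mktemp) || exit 1
cat > "$tmp" <<'EOF'
#!/bin/sh
true + /; exec awk -f "$0"; exit; / {}
# awk script starts here
BEGIN { print "hello from awk"; exit }
EOF

# The shell runs line 2 and replaces itself with awk, which then
# reads the same file as an awk program.
out=$(sh "$tmp")
echo "$out"
rm -f "$tmp"
```

The shell never reaches the BEGIN line, and awk treats line 2 as a pattern with an empty action, so the only output comes from the BEGIN block.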

Shebang wasn't meant to be that flexible. There may be some cases where having a second parameter works; I think FreeBSD is one of them.

gawk and most utilities that come with the OS are expected to be in /usr/bin/.

In the older UNIX days, it was common to have /usr/ mounted over NFS or on some less expensive medium to save local disk space and cost per workstation. /bin/ was supposed to have everything needed to boot in single-user mode. Since /usr/ wasn't mounted on a reliable medium, /bin/ included enough utilities to make it friendly enough for general administration and troubleshooting.

Linux initially inherited this layout, but as disk space is no longer an issue and in most cases /usr/ is on the root filesystem, the current trend is to move everything into /usr/bin (at least in the Linux world). So most utilities installed by a distro are expected to be found there. Eventually that will include even the most basic utilities, like cp, rm, ls, etc. (well, not yet).

Regarding the shebang choice: traditionally, this is something the admins or users have to edit according to their environment. For all a developer knows, on other people's systems the interpreter could be anywhere in the filesystem (e.g. /usr/local/bin, /opt/gawk-4.0.1/bin). Properly packaged scripts (rpm, deb, etc.) come with either a dependency on a distro package (i.e. the interpreter has a known location) or a config script that sets up the proper shebang during installation.
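Such an install-time step could look roughly like the following. This is a hypothetical sketch, not how any particular package does it: it looks up the interpreter with command -v (falling back to plain awk so the sketch runs without gawk) and rewrites only the first line of a throwaway script.

```shell
# Find the interpreter on this system; fall back to awk if gawk is absent.
interp=$(command -v gawk || command -v awk) || exit 1

# A stand-in for the script being installed.
src=$(mktemp) || exit 1
printf '%s\n' '#!/usr/bin/gawk -f' 'BEGIN { print "installed"; exit }' > "$src"

# Rewrite line 1 only. Using "|" as the sed delimiter avoids having to
# escape the slashes in the interpreter path.
sed "1s|^#!.*|#!$interp -f|" "$src" > "$src.new" && mv "$src.new" "$src"

first_line=$(head -n 1 "$src")
echo "$first_line"
rm -f "$src"
```

The -f stays a single argument after the interpreter path, so the result remains within what Linux accepts on a shebang line.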


Gilles' proposed solution is indeed a very good approach (I finally have the reputation to vote on his post :) ).

In any case, as far as I understand the exec command, it makes the exit right after it unnecessary, and in fact unreachable, because the shell process is replaced by awk.

In addition, in order to allow the awk script to access its invocation parameters, I would suggest some changes in the proposed solution:

#!/bin/sh
true + /; exec -a "$0" gawk -f "$0" -- "$@"; / {}
# awk script starts here

The -a "$0" gives the script access to its invocation name; otherwise it will always get awk or gawk when reading ARGV[0]. Note that exec -a is an extension found in bash, ksh93 and zsh, not a POSIX sh feature, so this line fails on systems where /bin/sh is a shell such as dash. Similarly, the "$@" lets the script access the remaining parameters in ARGV[1...N], and the -- preceding it lets the script receive -<something> arguments without gawk interpreting them as meant for itself.

One thing to remember is to add an exit(0); statement at the end of the BEGIN { ... } block of the awk script; otherwise awk will treat all parameters passed to the script as input files. (Please note that this has nothing to do with the exit statement we removed from the true + ... line: that was an unreachable shell statement, while the exit suggested here is in the awk code.)
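A small sketch of the argument handling described above. Assumptions: plain awk stands in for gawk, and the -a "$0" part is left out so the demo also runs where /bin/sh does not support exec -a. The file is started with sh in case the temporary directory is mounted noexec.

```shell
tmp=$(mktemp) || exit 1
cat > "$tmp" <<'EOF'
#!/bin/sh
true + /; exec awk -f "$0" -- "$@"; / {}
# awk script starts here
BEGIN {
    for (i = 1; i < ARGC; i++)
        printf "arg %d: %s\n", i, ARGV[i]
    exit 0   # stop here so the arguments are not opened as input files
}
EOF

# -x would look like an awk option without the "--" in the polyglot line.
args_out=$(sh "$tmp" -x foo)
echo "$args_out"
rm -f "$tmp"
```

This should print "arg 1: -x" and "arg 2: foo". Without the exit in BEGIN, awk would go on to treat -x and foo as input files once the BEGIN block finishes.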
