'inf' in awk not working the way '-inf' does

The actual task is best solved by initializing your max/min values not with an imaginary "smallest" or "greatest" number (which may not be implemented in the framework you are using, in this case awk), but with actual data. That way, the result is guaranteed to be meaningful.

In your case, you can use the very first value you encounter (i.e. the entry in the first line) to initialize max and min, respectively, by adding a rule

NR==1{min=$1}

to your awk script. Then, if the first value is already the minimum, the subsequent test will not overwrite it, and in the end the correct result will be produced. The same holds for searches of the maximum value, so in combined searches, you can state

NR==1{max=min=$1}
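
Put together, a complete script along these lines might look as follows. This is only a minimal sketch: the function name minmax and the sample numbers are made up for illustration, and it assumes one number per line in the first field.

```shell
# Hypothetical helper: print the minimum and maximum of the first column.
minmax() {
    awk '
        NR == 1  { min = max = $1 }   # seed both values from the first record
        $1 < min { min = $1 }
        $1 > max { max = $1 }
        END      { print min, max }
    '
}

# Sample data (invented for this example), including a negative value.
printf '%s\n' 3 -7 12 5 | minmax
```

With this sample input the script should print -7 12, which an initial value of 0 for max or min could not guarantee for all-negative or all-positive data.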

As for the reason why your approach with inf didn't work with awk whereas -inf seemed to, @steeldriver has provided a good explanation in a comment to your question, which I will also summarize for the sake of completeness:

  • In awk, variables are "dynamically typed", i.e. everything can be a string or a number depending on use (but awk will "remember" what it was last used as and keep that information along for use in the next operation).
  • Whenever arithmetic operations involving a variable are found in the code, awk will try to interpret the content of that variable as a number and perform the operation, from where on the variable is typed as numerical if successful.
  • The default value for any variable that has not yet been assigned anything is the empty string, which is interpreted as 0 in arithmetic operations.
  • The variable name(*) inf has no special meaning in awk, hence when used just so, it is an empty variable that will evaluate to 0 in an arithmetic expression such as -inf. Therefore, the "maximum search" with the max variable initialized to -inf works if your data is all positive, because -inf is simply 0 (and as such, the smallest non-negative number).
  • In the "minimum search" problem, however, initializing min to inf will initialize the variable to the empty string, as no arithmetic operation is present that would warrant an automatic conversion of that empty string to a number.
  • Therefore, in the later comparisons

    if ($1<min) min=$1
    

    the input, $1, is compared with a string value, which is why awk treats $1 as a string, too, and performs a lexicographical comparison rather than a numerical one.

  • However, lexicographically, nothing is "smaller" than the empty string, and so min never gets assigned a new value. Therefore, in the END section, the statement

    print min
    

    prints the (still) empty string.

(*) see Stephen Kitt's answer on how a string with content "inf" can actually have a meaning in awk.
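
The two faces of an unset variable can be checked directly. The following is a small sketch wrapped in a hypothetical function show_unset; it only assumes that the variable inf has never been assigned in the program.

```shell
show_unset() {
    awk 'BEGIN {
        # inf has never been assigned, so it is an ordinary unset variable.
        printf "string value: [%s]\n", inf
        printf "numeric value: %d\n", inf + 0
        # An unset variable compares equal to both the empty string and zero.
        printf "equals empty string: %d, equals zero: %d\n", inf == "", inf == 0
    }'
}
show_unset
```

The first line should show an empty pair of brackets and the second a plain 0, which is why initializing a variable from inf gives no useful starting value.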


Your approach doesn’t work because inf doesn’t have a special meaning in GNU AWK in its default non-POSIX mode. As a result, it’s interpreted as a variable name, and since the variable hasn’t been set to anything, its value is 0 in an arithmetic context, and the empty string in a string context. Thus your code will only find the maximum value if it’s positive (since max is initialised in an arithmetic context), and won’t find the minimum value (since min is initialised in a string context); see AdminBee’s answer for details.

To determine the minimal and/or maximal values in a file (or stream), you should follow the advice given in AdminBee’s answer.

However, if you’re using GNU AWK, you can calculate log(0) to initialise your variables with positive or negative infinity, and use that in a manner similar to your approach:

BEGIN { max = log(0) }
$1 > max { max = $1 }
END { print max }

BEGIN { min = -log(0) }
$1 < min { min = $1 }
END { print min }

The only advantage of this approach over initialising the values from the first line is that it gives distinctive results when no values are processed: positive or negative infinity is then a reliable indicator that no value was seen. (There are other ways to determine this, including checking for an empty string as opposed to 0 when initialising from the first line.)
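
One such alternative is to test NR in the END rule, since NR is still 0 there when no record was read. A minimal sketch, in which the function name and the message text are hypothetical:

```shell
minmax_or_note() {
    awk '
        NR == 1  { min = max = $1 }
        $1 < min { min = $1 }
        $1 > max { max = $1 }
        END {
            if (NR == 0) {
                print "no input"      # nothing was read at all
            } else {
                print min, max
            }
        }
    '
}

printf '' | minmax_or_note            # empty stream
printf '%s\n' 4 9 | minmax_or_note    # two values
```

This keeps the portable first-line initialisation while still making the empty case distinguishable.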

With GNU AWK in POSIX mode (POSIXLY_CORRECT=1), or other POSIX-compliant AWK interpreters such as mawk, providing "inf" as a string in an arithmetic context produces infinity, thanks to strtod:

BEGIN { max = "-inf" + 0 }
$1 > max { max = $1 }
END { print max }

BEGIN { min = "+inf" + 0 }
$1 < min { min = $1 }
END { print min }

There are, in fact, three spellings of infinity to consider: -inf, +inf and inf. To add more complexity to an issue that should be easy, awk also treats the quoted and unquoted forms differently.

To show what I mean, try this shell code (awk here is version 4.2.1, current in Debian 10):

for cmd in original-awk "busybox awk" mawk nawk awk; do
    printf '%-6.5s' "$cmd"
    $cmd 'BEGIN {
        a="-inf";b="+inf";c="inf";
        d= -inf ;e= +inf; f= inf;
        printf "-∞%4s%4s +∞%4s%4s ∞%4s%4s | -∞%4s%4s +∞%4s%4s ∞%4s%4s\n",a,a+0,b,b+0,c,c+0,d,d+0,e,e+0,f,f+0}
    ' file
done

To get:

bawk  -∞-inf-inf +∞+inf inf ∞ inf inf | -∞   0   0 +∞       0 ∞       0
busyb -∞-inf-inf +∞+inf inf ∞ inf inf | -∞   0   0 +∞   0   0 ∞       0
mawk  -∞-inf-inf +∞+inf inf ∞ inf inf | -∞   0   0 +∞   0   0 ∞       0
nawk  -∞-inf-inf +∞+inf inf ∞ inf   0 | -∞   0   0 +∞   0   0 ∞       0
gawk  -∞-inf-inf +∞+inf inf ∞ inf   0 | -∞   0   0 +∞   0   0 ∞       0

The table shows quoted and unquoted assignments to the variables a to f.
For each variable, it gives the value as stored by awk and the value after conversion to a number (var+0).

This shows that a quoted "-inf" stays -inf even when converted to a number, that a quoted "+inf" is converted to a numeric inf (without sign), and that a quoted "inf" might become either inf or 0 depending on the implementation (it's 0 in nawk and gawk).

When unquoted, both -inf and +inf become 0 (except in bawk where +∞ is understood as the empty string "" and converts to 0).

Oddly enough, when unquoted, all inf are interpreted as the empty string.

But all unquoted -inf, +inf and inf become 0 when used as var+0.

So, for what you meant to do, you need quoted "-inf" and "+inf", never inf:

cat file | awk  '  BEGIN { max = "-inf"+0; min = "+inf"+0 }
                         { if ($1>max) max=$1
                           if ($1<min) min=$1
                         } 
                   END   { print min, max }
                '
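
Because the result depends on the implementation, it can be worth probing the awk at hand before relying on this. The following is a rough sketch: the function name probe_inf and the two messages are invented for illustration, and the check simply relies on infinity being unchanged by adding 1.

```shell
probe_inf() {
    # "$1" is the awk command to test, e.g. awk, gawk or mawk.
    "$1" 'BEGIN {
        hi = "+inf" + 0
        if (hi + 1 == hi) {
            print "infinity"        # inf + 1 is still inf
        } else {
            print "no infinity"     # the string was converted to 0 instead
        }
    }'
}
probe_inf awk
```

If the probe reports no infinity, initialising from the first line remains the safe choice.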

Perhaps an easier (though not portable) way to understand it is to execute:

gawk 'BEGIN{
               a="-inf";b="+inf";c="inf";
               d= -inf ;e= +inf; f= inf;

               print a,   typeof(a),   b,   typeof(b),   c,   typeof(c)
               print a+0, typeof(a+0), b+0, typeof(b+0), c+0, typeof(c+0)

               print d,typeof(d),e,typeof(e),f,typeof(f)
               print d+0,typeof(d+0),e+0,typeof(e+0),f+0,typeof(f+0)
      }'

Which will print:

-inf string +inf string inf string
-inf number inf number 0 number
0 number 0 number  unassigned
0 number 0 number 0 number

Of course, the correct and portable solution is to give value to the max and min variables right from the start:

cat file | awk  '  NR==1 { min = max = $1 }
                         { if ($1>max) max=$1
                           if ($1<min) min=$1
                         } 
                   END   { print min, max }
                '

---

The description from the awk manual is:

  • With the --posix command-line option, gawk becomes “hands off.” String values are passed directly to the system library’s strtod() function, and if it successfully returns a numeric value, that is what’s used. By definition, the results are not portable across different systems. They are also a little surprising:
$ echo influence | gawk --posix '{ print $1 + 0 }'
  -| inf
$ echo 0xDeadBeef | gawk --posix '{ print $1 + 0 }'
  -| 3735928559
  • Without --posix, gawk interprets the four string values ‘+inf’, ‘-inf’, ‘+nan’, and ‘-nan’ specially, producing the corresponding special numeric values. The leading sign acts a signal to gawk (and the user) that the value is really numeric. Hexadecimal floating point is not supported (unless you also use --non-decimal-data, which is not recommended). For example:
$ echo nanny | gawk '{ print $1 + 0 }'
  -| 0
$ echo +nan | gawk '{ print $1 + 0 }'
  -| +nan
$ echo 0xDeadBeef | gawk '{ print $1 + 0 }'
  -| 0

gawk ignores case in the four special values. Thus, ‘+nan’ and ‘+NaN’ are the same.

Besides handling input, gawk also needs to print “correct” values on output when a value is either NaN or infinity. Starting with version 4.2.2, for such values gawk prints one of the four strings just described: ‘+inf’, ‘-inf’, ‘+nan’, or ‘-nan’. Similarly, in POSIX mode, gawk prints the result of the system’s C printf() function using the %g format string for the value, whatever that may be.
