How does awk '!a[$0]++' work?

Here is an "intuitive" answer; for a more in-depth explanation of awk's mechanism, see @Cuonglm's answer.

In the expression !a[$0]++, the post-increment ++ can be set aside for a moment, because it does not change the value of the expression. So look only at !a[$0]. Here:

a[$0]

uses the current line $0 as key to the array a, taking the value stored there. If this particular key was never referenced before, a[$0] evaluates to the empty string.
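A quick way to check this is the sketch below. It assumes a POSIX awk is available as awk; the function name unset_demo and the key strings are made up for illustration. An element that was never assigned compares equal to both the empty string and zero.

```shell
# Hypothetical helper: show how a never-assigned array element behaves.
unset_demo() {
    awk 'BEGIN {
        if (a["never seen"] == "") print "equals empty string"
        if (a["other key"] == 0)   print "equals zero"
        print length(a["third key"])    # length of the empty string
    }'
}

unset_demo
```

This should print the two "equals" lines followed by 0.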

!a[$0]

The ! negates the value from before. If it was empty or zero (false), we now have a true result. If it was non-zero (true), we have a false result. If the whole expression evaluated to true, meaning that a[$0] was not set to begin with, the whole line is printed as the default action.

Also, regardless of the old value, the post-increment operator adds one to a[$0], so the next time the same key is accessed, the stored value will be positive and the whole condition will fail.
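To see the effect described above, here is a small sketch. The function names dedupe and dedupe_verbose, the variable names and the sample input are invented for illustration. The second function spells out the same logic in separate steps and is meant to behave the same way as the one-liner.

```shell
# The one-liner under discussion: print each line only the first time it appears.
dedupe() {
    awk '!a[$0]++'
}

# A longer form intended to be equivalent, with each step written out.
dedupe_verbose() {
    awk '{
        old = seen[$0]          # value stored before this line was counted
        seen[$0] = old + 1      # what the post-increment does as a side effect
        if (!old) print $0      # print only when the old value was empty or zero
    }'
}

printf 'apple\nbanana\napple\ncherry\nbanana\n' | dedupe
```

With this input, the output is apple, banana and cherry, each on its own line and in first-seen order.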


Here is the processing:

  • a[$0]: look at the value of key $0, in associative array a. If it does not exist, automatically create it with an empty string.

  • a[$0]++: increment the value of a[$0], return the old value as value of expression. The ++ operator returns a numeric value, so if a[$0] was empty to begin with, 0 is returned and a[$0] incremented to 1.

  • !a[$0]++: negate the value of the expression. If a[$0]++ returned 0 (a false value), the whole expression evaluates to true, and awk performs the default action, print $0. Otherwise, the whole expression evaluates to false and no further action is taken.
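The three steps above can be made visible with a sketch like the following. The function name trace_seen, the variable old and the output format are illustrative assumptions. For every input line it prints the value returned by a[$0]++, the negation of that value, and the count stored afterwards.

```shell
# Hypothetical tracer: show old value, negated value and new count per line.
trace_seen() {
    awk '{
        old = a[$0]++    # post-increment: old receives the value before the increment
        printf "line=%s old=%d negated=%d new=%d\n", $0, old, !old, a[$0]
    }'
}

printf 'x\ny\nx\n' | trace_seen
# line=x old=0 negated=1 new=1
# line=y old=0 negated=1 new=1
# line=x old=1 negated=0 new=2
```

Only the lines where negated is 1 would be printed by the original one-liner.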

References:

  • Expression in awk
  • gawk - Increment and Decrement Operators

With gawk, we can use dgawk (or gawk --debug with newer versions) to debug a gawk script. First, create a gawk script named test.awk:

BEGIN {
    a = 0;
    !a++;
}

Then run:

dgawk -f test.awk

or:

gawk --debug -f test.awk

In the debugger console:

$ dgawk -f test.awk
dgawk> trace on
dgawk> watch a
Watchpoint 1: a
dgawk> run
Starting program: 
[     1:0x7fe59154cfe0] Op_rule             : [in_rule = BEGIN] [source_file = test.awk]
[     2:0x7fe59154bf80] Op_push_i           : 0 [PERM|NUMCUR|NUMBER]
[     2:0x7fe59154bf20] Op_store_var        : a [do_reference = FALSE]
[     3:0x7fe59154bf60] Op_push_lhs         : a [do_reference = TRUE]
Stopping in BEGIN ...
Watchpoint 1: a
  Old value: untyped variable
  New value: 0
main() at `test.awk':3
3           !a++;
dgawk> step
[     3:0x7fe59154bfc0] Op_postincrement    : 
[     3:0x7fe59154bf40] Op_not              : 
Watchpoint 1: a
  Old value: 0
  New value: 1
main() at `test.awk':3
3           !a++;
dgawk>

You can see that Op_postincrement was executed before Op_not.

You can also use si or stepi instead of s or step to see this more clearly:

dgawk> si
[     3:0x7ff061ac1fc0] Op_postincrement    : 
3           !a++;
dgawk> si
[     3:0x7ff061ac1f40] Op_not              : 
Watchpoint 1: a
  Old value: 0
  New value: 1
main() at `test.awk':3
3           !a++;
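If no debugger is at hand, the same ordering can be observed from the results alone. This is a minimal sketch, assuming any POSIX awk; the function name order_check and the variables b and r are made up for illustration. The negation is applied to the value that the post-increment returns, while the variable itself ends up one higher.

```shell
# Hypothetical check: !x++ negates the old value of x, and x is incremented.
order_check() {
    awk 'BEGIN {
        a = 0; r = !a++; print r, a    # a++ yields 0, so r is 1; a is now 1
        b = 5; r = !b++; print r, b    # b++ yields 5, so r is 0; b is now 6
    }'
}

order_check
# 1 1
# 0 6
```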
