Compare two files on specific columns only line by line

You are comparing two files line by line on specific columns while excluding others. This can be done with GNU awk, which supports the word-boundary anchors \< & \>:

awk -F, -v skip='2,4,7' 'BEGIN{ filetwo=ARGV[1]; ARGV[1]=""; };{
    getline lf2 <filetwo; split(lf2, arr, ",");
    for (i=1; i<=NF; i++) {
        if ( (skip !~ "\\<"i"\\>") && $i!=arr[i] ) {
            print "Line#"FNR, "Column#" i " is different in two files."; mismatch=1; };
    };
}; mismatch { print $0; print lf2; mismatch=0; };' file2 file1

Or, in any awk version:

awk -F, -v skip_cols='2,4,7' '
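    # split() stores the values in cols[1..n]; index skip[] by those values so that "i in skip" tests the column number
    BEGIN{ filetwo=ARGV[1]; ARGV[1]=""; n=split(skip_cols, cols, ","); for (k=1; k<=n; k++) skip[cols[k]]; };{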
    getline lf2 <filetwo; split(lf2, arr, ",");
    for (i=1; i<=NF; i++) {
        if ( !(i in skip) && $i!=arr[i] ) {
            print "Line#"FNR, "Column#" i " is different in two files."; mismatch=1; };
    };
}; mismatch { print $0; print lf2; mismatch=0; };' file2 file1
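
To see what the report looks like, here is a minimal sketch that runs the same kind of comparison on two made-up three-line files. The sample data, the temporary directory, and the choice to skip only column 2 are assumptions for illustration, not part of the original question.

```shell
# Hypothetical sample data, created only for this illustration.
dir=$(mktemp -d)
printf '%s\n' 'id,name,city' '1,ann,oslo' '2,bob,rome' > "$dir/file1"
printf '%s\n' 'id,name,city' '1,ANN,oslo' '2,bob,ROME' > "$dir/file2"

report=$(awk -F, -v skip_cols='2' '
    BEGIN {
        filetwo = ARGV[1]; ARGV[1] = ""        # keep the name, hide it from the main loop
        n = split(skip_cols, cols, ",")
        for (k = 1; k <= n; k++) skip[cols[k]] # index skip[] by column number
    }
    {
        getline lf2 < filetwo; split(lf2, arr, ",")
        for (i = 1; i <= NF; i++)
            if (!(i in skip) && $i != arr[i]) {
                print "Line#" FNR, "Column#" i " is different in two files."
                mismatch = 1
            }
    }
    mismatch { print $0; print lf2; mismatch = 0 }
' "$dir/file2" "$dir/file1")

printf '%s\n' "$report"
rm -rf "$dir"
```

Line 2 differs only in the skipped column 2, so nothing is reported for it; line 3 differs in column 3, so the report line is followed by the line from file1 and then the line from file2.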

Explaining the code:

  • The BEGIN { ... } block:
    this executes once, at the very start, before awk reads any input.

    • Using ARGV, filetwo=ARGV[1];:
      this reads ARGV[1], the first file name on the command line (file2), and saves it in the filetwo variable; ARGV[0] is awk itself and ARGV[2] is file1.
    • After reading that value, ARGV[1]="" blanks it, so awk skips that argument and does not read file2 as ordinary input.
    • -v skip='2,4,7':
      this defines a variable skip (see Assignment Expressions) holding the column numbers to ignore later; it is set on the command line, so it is already available when BEGIN runs.
  • The getline command (see Using getline into a Variable from a File):
    this reads one line from file2 and assigns it to the variable lf2 (as noted above, filetwo now holds the file name taken from ARGV[1]).

  • split() function:
    this splits the line read from file2 (held in lf2) on the comma character and stores the pieces in an array called arr; the fields of that line are then addressed as arr[1] (first field), arr[2] (second field), arr[3] (third), etc.

  • Within the for loop, two things are checked:

    • The value of i, the column number, does not match (!~) anything in the skip variable: skip !~ "\\<"i"\\>". \< and \> are word-boundary anchors, specific to GNU awk, so i=2 will not match inside 22.
    • The column from file1 differs from the column with the same index in file2: $i!=arr[i]. If they differ, the line number FNR and the column index i are printed, and a control variable is set with mismatch=1.
  • mismatch { print ... }: prints the line from file1 followed by the line from file2 (held in lf2), but only if a mismatch was detected and the mismatch variable was set within the if statement; it then resets mismatch=0 for the next line.
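
The word-boundary point above can be checked in isolation. The sketch below contrasts a plain regex match, which wrongly treats column 2 as listed when skip contains 22, with a comma-wrapped index() test; the latter is one possible way to get an exact match without the GNU-only anchors. The variable names naive and exact are made up for this illustration.

```shell
demo=$(awk -v skip='22,4,7' 'BEGIN {
    i = 2
    # Plain regex match: "2" is found inside "22", a false positive.
    naive = (skip ~ i) ? "skipped" : "compared"
    # Wrap list and index in commas so only whole entries can match.
    exact = index("," skip ",", "," i ",") ? "skipped" : "compared"
    print "naive: " naive
    print "wrapped: " exact
}')
printf '%s\n' "$demo"
```

With skip set to 22,4,7, the plain match reports column 2 as skipped, while the wrapped test correctly leaves it to be compared.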


If I understand correctly:

  • you want to loop over all fields: for(i=1;i<=NF;i++) { ... }
  • and inside the loop, you want to skip when i is one of 4 values (in awk, "continue" bypasses the rest of the current iteration and moves on to the next one)

A simple way to skip fields is the following technique:

BEGIN { skip[2]++; skip[3]++; skip[22]++; skip[23]++ }

....
 for(i=1;i<=NF;i++) {
   if (i in skip) { continue }   # skip the columns whose numbers are indexes of the skip array
   ...

Instead of defining "skip" in a BEGIN section, you could also keep the 4 indexes to be skipped in a file (one per line) and read that file first, using the NR==FNR condition to populate the skip array. Then, when NR!=FNR (while reading the source file), you use the above method to skip those fields.
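
A minimal sketch of that file-driven variant follows. The file names skip.txt and data.csv, their contents, and the action taken on the remaining fields (printing them) are assumptions standing in for the parts elided above.

```shell
# Hypothetical sample files, created only for this illustration.
dir=$(mktemp -d)
printf '%s\n' 2 4 > "$dir/skip.txt"
printf '%s\n' 'a,b,c,d' 'e,f,g,h' > "$dir/data.csv"

result=$(awk -F, '
    # First file: NR==FNR holds only here, so load the column numbers.
    NR == FNR { skip[$1]; next }
    # Second file: rebuild each line from the columns not in skip.
    {
        out = ""; sep = ""
        for (i = 1; i <= NF; i++) {
            if (i in skip) continue
            out = out sep $i
            sep = ","
        }
        print out
    }
' "$dir/skip.txt" "$dir/data.csv")

printf '%s\n' "$result"
rm -rf "$dir"
```

Note that NR==FNR only identifies the first file when that file has at least one line; with an empty skip file the condition would also hold for the data file.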