How to remove duplicate values based on multiple columns

Remove lines in which columns 3, 4, and 5 are all the same:

awk '!($3==$4&&$4==$5)' data_file

Remove lines whose columns 3, 4, and 5 match those of an earlier line (the first occurrence is kept):

awk '!seen[$3,$4,$5]++' data_file
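
To make the difference between the two commands concrete, here is a small sketch run against made-up input. The sample rows and the variable names (data_file as a temporary file, not_all_equal, first_only) are assumptions for illustration only; the awk programs are the ones shown above.

```shell
# Hypothetical sample data: five whitespace-separated columns.
data_file=$(mktemp)
cat > "$data_file" <<'EOF'
a 1 x x x
b 2 x y z
c 3 x y z
d 4 p q r
EOF

# Drop lines whose columns 3, 4 and 5 are all equal (only line "a" here).
not_all_equal=$(awk '!($3==$4&&$4==$5)' "$data_file")

# Keep the first line for each (col3, col4, col5) combination;
# line "c" repeats the combination of line "b" and is dropped.
first_only=$(awk '!seen[$3,$4,$5]++' "$data_file")

printf '%s\n---\n%s\n' "$not_all_equal" "$first_only"
rm -f "$data_file"
```

With this input the first command keeps lines b, c and d, and the second keeps lines a, b and d.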

Update for n columns

Remove lines in which columns 3, 4, ..., n are all the same:

awk 'v=0;{for(i=4;i<=NF;i++) {if($i!=$3) {v=1; break;}}} v' data_file
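  • v=0 resets v to 0 for every record. Used as a pattern, it evaluates to 0 (false), so it prints nothing by itself.
  • for(i=4;i<=NF;i++) {if($i!=$3) {v=1; break;}} loops from the 4th column to the last one; it sets v to 1 and breaks as soon as a column differs from the 3rd.
  • v prints the line if v is not 0.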
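
The same logic can be written as a multi-line program, which may be easier to read. This is only a sketch: the sample rows and the names differs, data_file and kept are made up for illustration.

```shell
# Hypothetical sample data with a varying number of columns.
data_file=$(mktemp)
cat > "$data_file" <<'EOF'
a 1 x x x x
b 2 x x y x
c 3 z z
d 4 q
EOF

kept=$(awk '{
    differs = 0                      # reset for every record
    for (i = 4; i <= NF; i++)        # compare columns 4..NF with column 3
        if ($i != $3) { differs = 1; break }
}
differs                              # print only if some column differed
' "$data_file")

printf '%s\n' "$kept"
rm -f "$data_file"
```

Only line b survives here. Note that line d, which has just three fields, is dropped as well: the loop never runs, so the flag stays 0.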

Remove lines whose columns 3, 4, ..., n match those of an earlier line (the first occurrence is kept):

awk '(l=$0) && ($1=$2=""); !seen[$0]++ {print l}' data_file
  • (l=$0) && ($1=$2="") backs up the original line in l, then empties the 1st and 2nd columns, which makes awk rebuild $0. The expression always evaluates to false, because ($1=$2="") yields the empty string, so it prints nothing. Note that && has higher precedence than =, which is why both assignments need parentheses.
  • !seen[$0]++ {print l} is the usual seen trick: print the original line if the rebuilt $0 has not been seen before.
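
If you would rather not modify $0, one possible alternative is to build the key from columns 3 to NF in a loop. This is a sketch of a different technique, not the command above; the sample rows and the names key, data_file, by_rebuild and by_key are assumptions for illustration.

```shell
# Hypothetical sample data: lines a, b and d share columns 3..n.
data_file=$(mktemp)
cat > "$data_file" <<'EOF'
a 1 x y z
b 2 x y z
c 3 x y
d 4 x y z
e 5 p q
EOF

# The one-liner from above: blank columns 1 and 2, use the rebuilt $0 as key.
by_rebuild=$(awk '(l=$0) && ($1=$2=""); !seen[$0]++ {print l}' "$data_file")

# Alternative sketch: join columns 3..NF with SUBSEP and leave $0 untouched.
by_key=$(awk '{
    key = ""
    for (i = 3; i <= NF; i++) key = key SUBSEP $i
}
!seen[key]++' "$data_file")

printf '%s\n' "$by_key"
rm -f "$data_file"
```

On this input both variants keep lines a, c and e. The loop version prints $0 exactly as it was read, so no backup copy of the line is needed.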