awk + How do I find duplicates in a column?

This prints the duplicated codes (column 5), once for every repeat after the first occurrence:

awk -F, 'a[$5]++{print $5}'
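
If a code repeats more than once, the command above prints it once per repeat. Here is a minimal sketch that reports each duplicated code only once, assuming that is what you want. The `XX` row is made up to get a third occurrence of code 1; the other rows come from the sample data further down.

```shell
# Sample rows; the XX row is hypothetical and only adds a third "1".
data='AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
XX,0,0,Example,1
AO,-11.202692,17.873887,Angola,355'

# a[$5]++ yields the count before incrementing, so "== 1" is true
# only on the second occurrence of a code.
once=$(printf '%s\n' "$data" | awk -F, 'a[$5]++ == 1 {print $5}')
printf '%s\n' "$once"
```

With these rows it prints 1 and 355, one per line.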

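If you're only interested in the count of duplicate entries (every repeat after the first occurrence is counted; `count+0` makes awk print 0 instead of an empty line when there are none):

awk -F, 'a[$5]++{count++} END{print count+0}'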
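
If you would rather see how often each duplicated code occurs, something along these lines might do. It is only a sketch: the array name `n` and the made-up `XX` row are my own, and the output is piped through `sort` because awk does not guarantee the order of a `for (key in array)` loop.

```shell
# Sample rows; the XX row is hypothetical and only adds a third "1".
data='AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
XX,0,0,Example,1
AO,-11.202692,17.873887,Angola,355'

# Count every code, then report only the ones seen more than once.
counts=$(printf '%s\n' "$data" |
  awk -F, '{n[$5]++} END {for (code in n) if (n[code] > 1) print code, n[code]}' |
  sort)
printf '%s\n' "$counts"
```

Each output line is a code followed by the number of rows that carry it.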

To print the duplicated rows (each repeat together with the previous row that has the same code), try this:

awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}'
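
One caveat: when a code appears three or more times, the command above prints the middle rows twice, because every repeat prints the previously stored row again. A common alternative is to read the file twice, counting on the first pass and printing on the second. The sketch below assumes the input is a regular file (not a pipe) that can be read twice; the `XX` row is made up for illustration.

```shell
# awk reads the file twice, so the sample goes into a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
XX,0,0,Example,1
AO,-11.202692,17.873887,Angola,355
EOF

# Pass 1 (NR==FNR): count each code.
# Pass 2: print every row whose code was counted more than once.
dups=$(awk -F, 'NR==FNR {count[$5]++; next} count[$5] > 1' "$tmp" "$tmp")
printf '%s\n' "$dups"
rm -f "$tmp"
```

This keeps the original row order and prints each duplicated row exactly once, at the cost of reading the file twice.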

This prints the whole row for each duplicate found in column 5 (the second and later occurrences only; the first row with that code is not printed):

awk -F, 'a[$5]++{print $0}'

This is the least memory-hungry approach I can think of:

$ cat infile
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AE,23.424076,53.847818,United Arab Emirates,971
AF,33.93911,67.709953,Afghanistan,93
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AM,40.069099,45.038189,Armenia,374
AN,12.226079,-69.060087,Netherlands Antilles,599
AO,-11.202692,17.873887,Angola,355

$ awk -F\, '$NF in a{if (a[$NF]!=0){print a[$NF];a[$NF]=0}print;next}{a[$NF]=$0}' infile
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AO,-11.202692,17.873887,Angola,355

NOTE: I have added another duplicate to the sample data (Angola's code is set to 355, the same as Albania's) for testing purposes.
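
For readability, the one-liner above can be spread over several lines. This is a sketch of the same logic, not a different method: the array name `first` is my own, and it clears the stored row with an empty string where the original uses 0. Memory stays low because only the first row of each code is kept, and it is dropped as soon as it has been printed.

```shell
# Same sample data as infile above, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
country,latitude,longitude,name,code
AD,42.546245,1.601554,Andorra,376
AE,23.424076,53.847818,United Arab Emirates,971
AF,33.93911,67.709953,Afghanistan,93
AG,17.060816,-61.796428,Antigua and Barbuda,1
AI,18.220554,-63.068615,Anguilla,1
AL,41.153332,20.168331,Albania,355
AM,40.069099,45.038189,Armenia,374
AN,12.226079,-69.060087,Netherlands Antilles,599
AO,-11.202692,17.873887,Angola,355
EOF

out=$(awk -F, '
$NF in first {
    # Code seen before: print the remembered first row once, then clear it.
    if (first[$NF] != "") { print first[$NF]; first[$NF] = "" }
    print    # the current row is a duplicate
    next
}
{ first[$NF] = $0 }    # first time this code is seen: remember the row
' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```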
