Sort not sorting lines with a pipe '|' in them correctly

sort is locale-aware, so depending on your LC_COLLATE setting (which falls back to LANG when it is not set explicitly, and is overridden by LC_ALL) you may get different results:

$ LANG=C sort sort_fail.csv 
241|212|20810378
241|213|20810376
24|121|2810172
column_a|column_b|column_c

$ LANG=en_US sort sort_fail.csv
241|212|20810378
24|121|2810172
241|213|20810376
column_a|column_b|column_c

This can cause problems in scripts, because you may not be aware of what the calling locale is set to, and so may get different results.
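
A script can at least report which locale sort will collate with. Here is a minimal sketch, assuming the usual POSIX precedence (LC_ALL, then LC_COLLATE, then LANG, with an empty value treated as unset). effective_collate is a hypothetical helper name, not a standard command, and the fallback to C when nothing is set is an assumption that holds on glibc systems:

```shell
# effective_collate: hypothetical helper that prints the locale name
# collation will use, following POSIX precedence:
#   LC_ALL overrides LC_COLLATE, which overrides LANG.
# An empty value counts as unset; with nothing set, assume the C locale.
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

The locale command reports the same thing on its LC_COLLATE line, so this is mostly useful when a script wants the value in a variable, for example to log it before sorting.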

It's not uncommon for scripts to force the setting they need, e.g.:

$ grep 'LC.*sort' /bin/precat
      LC_COLLATE=C sort -u | prezip-bin -z "$cmd: $2"

What's interesting here is that the handling of the | character looks odd.

But that's because the default collation rule for en_US, which derives from ISO 14651, says:

$ grep 007C /usr/share/i18n/locales/iso14651_t1_common
<U007C> IGNORE;IGNORE;IGNORE;<j> # 142 |

This means the | character is ignored at the first three collation levels, so the lines sort as if the character were not there:

$ tr -d '|' < sort_fail.csv | LANG=C sort
24121220810378
241212810172
24121320810376
column_acolumn_bcolumn_c

And that matches the "unexpected" sorting you are seeing.
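
The same check can be done while keeping the separators in the output. This is only a rough sketch: it mimics the "ignore |" rule by sorting on a pipe-free copy of each line in the C locale, and is not a reimplementation of en_US collation. The sample file is re-created here, in an arbitrary order, just to keep the snippet self-contained:

```shell
# Re-create the sample data used in the examples above.
printf '%s\n' \
    'column_a|column_b|column_c' \
    '241|213|20810376' \
    '24|121|2810172' \
    '241|212|20810378' > sort_fail.csv

tab=$(printf '\t')

# Decorate each line with a pipe-free key, sort byte-wise on that key,
# then drop the key again so the original lines come out.
emulated=$(
    awk '{ key = $0; gsub(/[|]/, "", key); print key "\t" $0 }' sort_fail.csv |
        LC_ALL=C sort -t "$tab" -k1,1 |
        cut -f2-
)
printf '%s\n' "$emulated"
```

With this data the lines come out in the same order as the LANG=en_US run, even though the sort itself ran in the C locale.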

The workarounds are to use -n (to force a numeric sort), to use the field separator (as you did), or to use the C locale.
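
As a sketch of two of these, using the sample data from above (re-created here so the snippet stands alone; the choice of sorting on the first two fields is just an assumption about what order is wanted):

```shell
# Re-create the sample data used in the examples above.
printf '%s\n' \
    'column_a|column_b|column_c' \
    '241|213|20810376' \
    '24|121|2810172' \
    '241|212|20810378' > sort_fail.csv

# C locale for this one command only: plain byte-wise comparison.
# LC_ALL is used because it overrides both LANG and LC_COLLATE.
c_sorted=$(LC_ALL=C sort sort_fail.csv)

# Split on '|' and compare fields 1 and 2 as numbers.
# The header has no leading digits, so -n treats it as 0 and it sorts first.
field_sorted=$(sort -t '|' -k1,1n -k2,2n sort_fail.csv)

printf '%s\n' "$c_sorted" '---' "$field_sorted"
```

The first form reproduces the LANG=C listing at the top. The second puts 24 before 241, which is probably what was wanted in the first place, and keeps the header line at the top.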
