Issues with using sort and comm

Per the comm manual, "Before `comm' can be used, the input files must be sorted using the collating sequence specified by the `LC_COLLATE' locale."

And the sort manual: "Unless otherwise specified, all comparisons use the character collating sequence specified by the `LC_COLLATE' locale."

Therefore, as a quick test confirms, the LC_COLLATE order that comm expects is what sort produces by default, as long as both commands run under the same locale. In many locales that default looks like dictionary sort.
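A minimal sketch of such a test, assuming a POSIX shell with GNU coreutils. The file names inside the temporary directory are made up for illustration, and LC_ALL is pinned to C only so that sort and comm are guaranteed to agree on the collating sequence:

```shell
# Work in a throwaway directory; a.txt and b.txt are illustrative names.
tmp=$(mktemp -d)
printf '%s\n' banana apple cherry > "$tmp/a.txt"
printf '%s\n' cherry date apple > "$tmp/b.txt"

# Sort both inputs with sort's default order, under one fixed locale.
LC_ALL=C sort "$tmp/a.txt" > "$tmp/a.sorted"
LC_ALL=C sort "$tmp/b.txt" > "$tmp/b.sorted"

# Run comm under the same locale; -12 keeps only lines common to both files.
common=$(LC_ALL=C comm -12 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$common"

rm -rf "$tmp"
```

If sort and comm were run under different locales, comm could complain that its input is not in sorted order.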

sort can sort files in a variety of manners:

  • -d: Dictionary order - considers only whitespace and alphanumerics.
  • -g: General numeric - alpha, then negative numbers, then positive numbers.
  • -h: Human-readable numeric - negative, alpha, positive. n < nk = nK < nM < nG
  • -n: Numeric - negative, alpha, positive. Suffixes such as k, M, G are not special.
  • -V: Version - positive, uppercase, lowercase, negative. 1 < 1.2 < 1.10
  • -f: Case-insensitive - lowercase is folded to uppercase before comparing.
  • -R: Random - orders lines by a random hash, so identical lines stay together.
  • -r: Reverse - usually combined with one of d, g, h, n, V.

There are other options, of course, but these are the ones you're likely to see or need.
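A few of the options above can be tried on small inputs. This is only a sketch, assuming GNU sort (-h and -V are not part of POSIX); the variable names are made up for illustration, and the locale is pinned to C so the results do not depend on the environment:

```shell
# -n compares by numeric value, so 9 sorts before 10.
numeric=$(printf '%s\n' 10 9 2 | LC_ALL=C sort -n)

# -h understands size suffixes: 512 < 1K < 1M < 1G.
human=$(printf '%s\n' 1G 512 1M 1K | LC_ALL=C sort -h)

# -V treats dotted fields as version components: 1 < 1.2 < 1.10.
version=$(printf '%s\n' 1.10 1.2 1 | LC_ALL=C sort -V)

# -r reverses whichever comparison is in effect, here the numeric one.
reversed=$(printf '%s\n' 10 9 2 | LC_ALL=C sort -nr)

printf '%s\n' "$numeric" "$human" "$version" "$reversed"
```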

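Your test shows that the default sort order in your locale probably behaves like -d, dictionary order. Strictly speaking, the default is the LC_COLLATE collating sequence, so it can differ under another locale such as C.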
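Whether the default really matches -d depends on the locale. One way to probe it, sketched here assuming GNU sort, is to compare the two on input where punctuation decides the order. The variable names and the two sample lines are made up for illustration:

```shell
# Two sample lines whose order depends on whether '-' is considered.
input=$(printf '%s\n' ab a-c)

# Default order in the C locale: plain byte comparison, so '-' sorts before 'b'.
default_c=$(printf '%s\n' "$input" | LC_ALL=C sort)

# -d in the C locale: '-' is ignored, so the keys compared are "ab" and "ac".
dict_c=$(printf '%s\n' "$input" | LC_ALL=C sort -d)

# Default order in whatever locale is currently active, for comparison.
default_here=$(printf '%s\n' "$input" | sort)

printf 'C default:\n%s\nC with -d:\n%s\nactive locale default:\n%s\n' \
    "$default_c" "$dict_c" "$default_here"
```

If the last listing matches the -d listing, the active locale's default behaves like dictionary order for this input. In the C locale the two differ.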

  d   |   g   |   h   |   n   |   V 
------+-------+-------+-------+-------
  1   |  a    | -1G   | -10   |  1
 -1   |  A    | -1k   | -5    |  1G
  10  |  z    | -10   | -1    |  1g
 -10  |  Z    | -5    | -1g   |  1k
  1.10| -10   | -1    | -1G   |  1.2
  1.2 | -5    | -1g   | -1k   |  1.10
  1g  | -1    |  a    |  a    |  5
  1G  | -1g   |  A    |  A    |  10
 -1g  | -1G   |  z    |  z    |  A
 -1G  | -1k   |  Z    |  Z    |  Z
  1k  |  1    |  1    |  1    |  a
 -1k  |  1g   |  1g   |  1g   |  z
  5   |  1G   |  1.10 |  1G   | -1
 -5   |  1k   |  1.2  |  1k   | -1G
  a   |  1.10 |  5    |  1.10 | -1g
  A   |  1.2  |  10   |  1.2  | -1k
  z   |  5    |  1k   |  5    | -5
  Z   |  10   |  1G   |  10   | -10
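A comparison like the table above can be regenerated with a short loop. This is a sketch assuming GNU sort; `sample` and `show_orders` are names made up for illustration. The output depends on the active locale, so it may not match the table exactly:

```shell
# Sample values taken from the table above.
sample=$(printf '%s\n' 1 -1 10 -10 1.10 1.2 1g 1G -1g -1G 1k -1k 5 -5 a A z Z)

# Print the sample as ordered by each option, one space-separated line per flag.
show_orders() {
    for flag in d g h n V; do
        printf '%s: ' "-$flag"
        printf '%s\n' "$sample" | sort "-$flag" | tr '\n' ' '
        printf '\n'
    done
}

show_orders
```

Prefixing the call with LC_ALL=C (or another locale) shows how much of each ordering comes from the collating sequence rather than from the option itself.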