Sorting in Ruby using the Unicode collation algorithm

I got very close using this algorithm with the icunicode gem.

require 'icunicode'

def database_sort_key(key)
  key.gsub(/\s+/,'').unicode_sort_key
end

array.sort_by { |v|
  [database_sort_key(v), v.unicode_sort_key]
}

First we sort using the unicode sort key with whitespace removed. Then if those are the same we sort by the unicode sort key of the original value.

This works around a weakness in unicode_sort_key: it doesn't consider spaces to be weak.

2.4.4 :007 > "fo p".unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "4", "47", "1", "8", "1", "8"] 
2.4.4 :008 > "foo".unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "45", "1", "7", "1", "7"]

Note that the space in fo p is as important as any other character. This results in 'fo p' < 'foo' which is incorrect. We work around this by first stripping out spaces before generating the key.

2.4.4 :011 > "fo p".gsub(/\s+/, '').unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "47", "1", "7", "1", "7"] 
2.4.4 :012 > "foo".gsub(/\s+/, '').unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "45", "1", "7", "1", "7"]

Now 'foo' < 'fo p' which is correct.

But because of the normalization we might have values which appear to be the same after whitespace has been stripped, fo o should be less than foo. So if the database_sort_keys are the same, we compare their plain unicode_sort_keys.

There's a few edge cases where it's wrong. foo should be less than fo o but this gets it backwards.

Here's it is as Enumerable methods.

module Enumerable
  # Just like `sort`, but tries to sort the same as the database does
  # using the proper Unicode collation algorithm. It's close.
  #
  # Differences in spacing, cases, and accents are less important than
  # character differences.
  #
  # "foo" < "fo p" o vs p is more important than the space difference
  # "Foo" < "fop" o vs p is more important than is case difference
  # "föo" < "fop" o vs p is more important than the accent difference
  #
  # It does not take a block.
  def sort_like_database(&block)
    if block_given?
      raise ArgumentError, "Does not accept a block"
    else
      # Sort by the database sort key. Two different strings can have the
      # same keys, if so sort just by its unicode sort key.
      sort_by { |v| [database_sort_key(v), v.unicode_sort_key] }
    end
  end

  # Just like `sort_by`, but it sorts like `sort_like_database`.
  def sort_by_like_database(&block)
    sort_by { |v|
      field = block.call(v)
      [database_sort_key(field), field.unicode_sort_key]
    }
  end

  # Sort by the unicode sort key after stripping out all spaces. This provides
  # a decent simulation of the Unicode collation algorithm and how it handles
  # spaces.
  private def database_sort_key(key)
    key.gsub(/\s+/,'').unicode_sort_key
  end
end

Does your use case allow for simply delegating the sorting to Postgres, rather than trying to recreate it in Ruby?

Part of the difficultly here is that there isn't a single correct sorting method, but any variable elements can result in fairly large discrepancies in the final sort order e.g. see the section on variable weighting.

For example, a gem like twitter-cldr-rb has a fairly robust implementation of the UCA, and is backed by a comprehensive test suite - but against the non-ignorable test cases, which is different from the Postgres implementation (Postgres appears to use the shift-trimmed variant).

The sheer number of test cases means that you have no guarantee that one working solution will match the Postgres sort order in all cases. E.g. will it handle en/em dash correctly, or even emojis? You could fork and modify the twitter-cldr-rb gem, but I suspect this won't be a small undertaking!

If you need to handle values that don't exist in the database, you can ask Postgres to sort them in a lightweight fashion by using a VALUES list:

sql = "SELECT * FROM (VALUES ('de luge'),('de Luge'),('de-luge'),('de-Luge'),('de-luge'),('de-Luge'),('death'),('deluge'),('deLuge'),('demark')) AS t(term) ORDER BY term ASC"
ActiveRecord::Base.connection.execute(sql).values.flatten

It will obviously result in a round-trip to Postgres, but should be very speedy nonetheless.

Sorting in Ruby using the Unicode collation algorithm

Tags:

Ruby

Postgresql

Unicode

Collation

Related

Recent Posts