Sorting the characters in a UTF-16 string in Java

We can't treat a char as a Unicode character, because a Java char is only a 16-bit UTF-16 code unit, not a full code point.

In the early days of Java, Unicode code points always fit in 16 bits (a fixed size of exactly one char). However, the Unicode specification later changed to allow supplementary characters. That means Unicode characters are now variable-width in UTF-16 and can take more than one char. Unfortunately, it was too late to change Java's char implementation without breaking a ton of production code.

So the best way to manipulate Unicode characters is by using code points directly, e.g., using String.codePointAt(index) or the String.codePoints() stream on JDK 1.8 and above.
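To make the difference concrete, here is a minimal sketch that contrasts sorting by char with sorting by code point. The class and method names (CharVersusCodePoint, sortByChar, sortByCodePoint) are made up for illustration, and it assumes Java 8 or later for codePoints().

```java
import java.util.Arrays;

final class CharVersusCodePoint {
    // Shown for contrast only: sorts UTF-16 code units, so the two halves
    // of a surrogate pair can end up separated from each other.
    static String sortByChar(String s) {
        char[] units = s.toCharArray();
        Arrays.sort(units);
        return new String(units);
    }

    // Sorts whole code points, so supplementary characters stay intact.
    static String sortByCodePoint(String s) {
        int[] codePoints = s.codePoints().sorted().toArray();
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // Two supplementary characters, U+1F613 followed by U+1F601.
        String input = new String(new int[] {0x1F613, 0x1F601}, 0, 2);
        System.out.println("chars: " + input.length());
        System.out.println("code points: " + input.codePointCount(0, input.length()));
        System.out.println("by char: " + sortByChar(input));
        System.out.println("by code point: " + sortByCodePoint(input));
    }
}
```

With this input, the char-based sort leaves unpaired surrogates in the result, while the code-point sort returns two intact characters.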

Additional sources:

  • The Unicode 1.0 Standard, Chapter 2 (pg. 10 and 22)
  • Supplementary Characters in the Java Platform (Sun/Oracle)

I looked around for a bit and couldn't find any clean ways to sort an array by groupings of two elements without the use of a library.

Luckily, the codePoints of the String are what you used to create the String itself in this example, so you can simply sort those and create a new String with the result.

public static void main(String[] args) {
    int[] utfCodes = {128531, 128557, 128513};
    String emojis = new String(utfCodes, 0, utfCodes.length);
    System.out.println("Initial String: " + emojis);

    // Sort by code point so that surrogate pairs are never split.
    int[] codePoints = emojis.codePoints().sorted().toArray();
    System.out.println("Sorted String: " + new String(codePoints, 0, codePoints.length));
}

Initial String: 😓😭😁

Sorted String: 😁😓😭

I switched the order of the characters in your example because they were already sorted.


If you are using Java 8 or later, then this is a simple way to sort the characters in a string while respecting (not breaking) multi-char codepoints:

int[] codepoints = someString.codePoints().sorted().toArray();
String sorted = new String(codepoints, 0, codepoints.length);

Prior to Java 8, I think you either need to use a loop to iterate the code points in the original string, or use a 3rd-party library method.
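For the loop approach, something along these lines should work on Java 5 through 7. This is only a sketch: the class and method names (LegacyCodePointSort, sortCodePoints) are hypothetical, and it relies on String.codePointAt, String.codePointCount and Character.charCount, all of which have been available since Java 5.

```java
import java.util.Arrays;

final class LegacyCodePointSort {
    // Sorts the code points of s without using streams.
    static String sortCodePoints(String s) {
        int[] codePoints = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            codePoints[n++] = cp;
            // Advance by 1 char for BMP characters, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        Arrays.sort(codePoints);
        return new String(codePoints, 0, codePoints.length);
    }
}
```

Calling LegacyCodePointSort.sortCodePoints(someString) should give the same result as the stream version above.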


Fortunately, sorting the codepoints in a String is uncommon enough that the clunkiness and relative inefficiency of the solutions above are rarely a concern.

(When was the last time you tested for anagrams of emojis?)