How to remove surrogate characters in Java?

Why not simply:

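StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
    char c = query.charAt(i);
    // Keep only chars that are not half of a surrogate pair
    if (!Character.isHighSurrogate(c) && !Character.isLowSurrogate(c)) {
        sb.append(c);
    }
}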

You should probably replace them with "?" instead of erasing them outright.
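A minimal sketch of that replacement idea, assuming you want one "?" per character rather than one per code unit (a naive per-char swap would emit "??" for every pair). The class and method names here are hypothetical, and `Character.isSurrogate` needs Java 7 or later:

```java
// Hypothetical helper; the names are illustrative and not from the question.
final class SurrogateReplacer {
    static String replaceSurrogates(String query) {
        StringBuilder sb = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < query.length()
                    && Character.isLowSurrogate(query.charAt(i + 1))) {
                sb.append('?'); // one placeholder for the whole pair
                i++;            // skip the low surrogate as well
            } else if (Character.isSurrogate(c)) {
                sb.append('?'); // unpaired surrogate
            } else {
                sb.append(c);   // ordinary BMP char, kept unchanged
            }
        }
        return sb.toString();
    }
}
```

Whether an unpaired surrogate should also become "?" is a design choice; dropping it or throwing an exception would be equally defensible.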


Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of Unicode characters. In Unicode terminology, they are stored as code units but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue unpaired surrogates, in which case you have other problems).
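To make the code unit / code point distinction concrete, here is a small sketch (the class name is made up for illustration) comparing `length()` with `codePointCount` on a string containing one character outside the BMP:

```java
// Hypothetical demo class; not part of the original question.
final class CodeUnitDemo {
    // Number of UTF-16 code units, which is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is what the string models.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it is stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(codeUnits(s));  // 4
        System.out.println(codePoints(s)); // 3
    }
}
```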

Rather, what you want to do is remove any characters which would require surrogates when encoded as UTF-16. That means any character which lies beyond the Basic Multilingual Plane (BMP), i.e. above U+FFFF. You can do that with a simple regular expression:

return query.replaceAll("[^\u0000-\uffff]", "");
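If raw `\u` escapes in the source feel fragile, the same idea can be sketched in two other ways: a regex that names the supplementary range with `\x{...}` code point escapes, and an explicit loop over code points. The class and method names below are hypothetical, and both variants assume Java 7 or later:

```java
// Hypothetical helper; the names are illustrative only.
final class SupplementaryFilter {
    // Regex variant: match code points U+10000..U+10FFFF and delete them.
    static String stripWithRegex(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    // Loop variant: walk the string by code point and keep only BMP characters.
    static String stripWithLoop(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isBmpCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return sb.toString();
    }
}
```

Note that an unpaired surrogate is itself a code point in the BMP range, so both variants leave it in place, just as the original expression does.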

Here are a couple of things:

  • Character.isSurrogate(char c):

    A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

  • Checking for pairs seems pointless; why not just remove all surrogates?

  • x == false is equivalent to !x

  • StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).

I suggest this:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        // On Java 7+, this can be written as !Character.isSurrogate(c)
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(c);
        }
    }
    return sb.toString();
}

Breaking down the if statement

You asked about this statement:

if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
    sb.append(c);
}

One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

static boolean isSurrogate(char c) {
    return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}

static boolean isNotSurrogate(char c) {
    return !isSurrogate(c);
}

...

if (isNotSurrogate(c)) {
    sb.append(c);
}