Why is Java String.length inconsistent across platforms with unicode characters?

You have to be careful about specifying the encodings:

  • when you compile the Java file, the compiler uses some encoding to read the source file. My guess is that this already broke your original String literal at compilation time. This can be fixed by using an escape sequence (e.g. "\uD83D\uDE42"), which is pure ASCII and immune to the source file encoding.
  • after you use the escape sequence, the String.length values are the same. The bytes inside the String are also the same, but what you are printing out does not show that.
  • the bytes printed are different because you called getBytes(), which again uses the environment- or platform-specific encoding. So that call was also broken (replacing unencodable smileys with a question mark). You need to call getBytes("UTF-8") to be platform-independent, as shown in the sketch after this list.
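
Here is a minimal sketch of the platform-independent version; the class and variable names are just illustrative:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingDemo {
        public static void main(String[] args) {
            // U+1F642 written as a surrogate-pair escape, so the source file
            // encoding can no longer corrupt the literal
            String smiley = "\uD83D\uDE42";

            System.out.println(smiley.length());  // 2 (UTF-16 char units)

            // Platform-dependent: uses the default charset and may substitute
            // 0x3F ('?') for the unencodable character
            System.out.println(Arrays.toString(smiley.getBytes()));

            // Platform-independent: always yields the UTF-8 bytes F0 9F 99 82
            System.out.println(Arrays.toString(smiley.getBytes(StandardCharsets.UTF_8)));
        }
    }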

So to answer the specific questions posed:

Same byte length, different String length. Why?

Because the string literal is being encoded by the Java compiler, and the compiler's default source encoding differs between platforms. Reading the source with the wrong encoding can turn one Unicode character into a different number of char units, which results in a different string length. Passing the -encoding command-line option with the same value on every platform will make the compilers read the source consistently.
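
For example, assuming the source file is actually saved as UTF-8 (the file name here is hypothetical):

    javac -encoding UTF-8 Main.java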

Why "\uD83D\uDE42" ends up being encoded as 0x3F on the Windows machine is beyond me...

It's not encoded as 0x3F in the string. 0x3F is the ASCII code of the question mark (?). Java substitutes it when it is asked to output characters that cannot be represented in the target encoding, via System.out.println or getBytes(), which was the case here: you encoded literal UTF-16 representations in a string with a different encoding and then tried to print it to the console and call getBytes() on it.
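
You can reproduce that substitution directly. A small sketch, assuming your JVM exposes Cp1252 under the name "windows-1252" (standard JDKs do):

    import java.nio.charset.Charset;

    public class ReplacementDemo {
        public static void main(String[] args) {
            String smiley = "\uD83D\uDE42";  // U+1F642, not representable in Cp1252

            // getBytes(Charset) replaces unmappable characters with the
            // charset's default replacement byte, which is '?' (0x3F) for Cp1252
            byte[] bytes = smiley.getBytes(Charset.forName("windows-1252"));
            for (byte b : bytes) {
                System.out.printf("%02X%n", b);  // prints 3F
            }
        }
    }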

But then that means string literals are encoded differently on different platforms?

By default, yes.

Also... where is the byte sequence C3B0C5B8E284A2E2809A coming from to represent the smiley in Windows?

This is quite convoluted. The "🙂" character (Unicode code point U+1F642) gets mangled in three steps:

  • The character is stored in the Java source file with UTF-8 encoding as the byte sequence F0 9F 99 82.
  • The Java compiler then reads the source file using the platform default encoding, Cp1252 (Windows-1252), so it treats these four UTF-8 bytes as four separate Cp1252 characters, translating each byte to Unicode and producing the 4-character string U+00F0 U+0178 U+2122 U+201A.
  • The getBytes("utf-8") call then encodes this 4-character string as UTF-8. Since every character of the string is above hex 7F, each character becomes 2 or more UTF-8 bytes (C3 B0, C5 B8, E2 84 A2, E2 80 9A); hence the resulting sequence being this long.

The value of this string is not significant; it's just the result of using an incorrect encoding.
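
The whole chain can be reproduced in a few lines; a sketch, again assuming Cp1252 is available as "windows-1252":

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            // The UTF-8 bytes of U+1F642 exactly as stored in the source file
            byte[] sourceBytes = {(byte) 0xF0, (byte) 0x9F, (byte) 0x99, (byte) 0x82};

            // The compiler misreads them as Cp1252, one character per byte
            String misread = new String(sourceBytes, Charset.forName("windows-1252"));
            System.out.println(misread.length());  // 4 (U+00F0 U+0178 U+2122 U+201A)

            // getBytes("utf-8") then re-encodes those 4 characters
            byte[] reEncoded = misread.getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : reEncoded) {
                hex.append(String.format("%02X", b));
            }
            System.out.println(hex);  // C3B0C5B8E284A2E2809A
        }
    }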


You didn't take into account that getBytes() returns the bytes in the platform's default encoding, which is different on Windows and CentOS.
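
A quick way to see which default each machine is using (a minimal sketch):

    import java.nio.charset.Charset;

    public class DefaultCharsetDemo {
        public static void main(String[] args) {
            // Typically windows-1252 on Windows and UTF-8 on most Linux distributions
            System.out.println(Charset.defaultCharset());
            System.out.println(System.getProperty("file.encoding"));
        }
    }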

See also How to Find the Default Charset/Encoding in Java? and the API documentation on String.getBytes().