How to convert a String to Bytearray

I suppose C# and Java produce equal byte arrays. If you have non-ASCII characters, it's not enough to add an additional 0. My example contains a few special characters:

var str = "Hell ö € Ω ð„ž";
var bytes = [];
var charCode;

for (var i = 0; i < str.length; ++i)
{
    charCode = str.charCodeAt(i);
    bytes.push((charCode & 0xFF00) >> 8);
    bytes.push(charCode & 0xFF);
}

alert(bytes.join(' '));
// 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30

I don't know if C# places BOM (Byte Order Marks), but if using UTF-16, Java String.getBytes adds following bytes: 254 255.

String s = "Hell ö € Ω ";
// now add a character outside the BMP (Basic Multilingual Plane)
// we take the violin-symbol (U+1D11E) MUSICAL SYMBOL G CLEF
s += new String(Character.toChars(0x1D11E));
// surrogate codepoints are: d834, dd1e, so one could also write "\ud834\udd1e"

byte[] bytes = s.getBytes("UTF-16");
for (byte aByte : bytes) {
    System.out.print((0xFF & aByte) + " ");
}
// 254 255 0 72 0 101 0 108 0 108 0 32 0 246 0 32 32 172 0 32 3 169 0 32 216 52 221 30

Edit:

Added a special character (U+1D11E) MUSICAL SYMBOL G CLEF (outside BPM, so taking not only 2 bytes in UTF-16, but 4.

Current JavaScript versions use "UCS-2" internally, so this symbol takes the space of 2 normal characters.

I'm not sure but when using charCodeAt it seems we get exactly the surrogate codepoints also used in UTF-16, so non-BPM characters are handled correctly.

This problem is absolutely non-trivial. It might depend on the used JavaScript versions and engines. So if you want reliable solutions, you should have a look at:

  • https://github.com/koichik/node-codepoint/
  • http://mathiasbynens.be/notes/javascript-escapes
  • Mozilla Developer Network: charCodeAt
  • BigEndian vs. LittleEndian

Update 2018 - The easiest way in 2018 should be TextEncoder

let utf8Encode = new TextEncoder();
utf8Encode.encode("abc")
// Uint8Array [ 97, 98, 99 ]

Caveats - The returned element is a Uint8Array, and not all browsers support it.


In C# running this

UnicodeEncoding encoding = new UnicodeEncoding();
byte[] bytes = encoding.GetBytes("Hello");

Will create an array with

72,0,101,0,108,0,108,0,111,0

byte array

For a character which the code is greater than 255 it will look like this

byte array

If you want a very similar behavior in JavaScript you can do this (v2 is a bit more robust solution, while the original version will only work for 0x00 ~ 0xff)

var str = "Hello竜";
var bytes = []; // char codes
var bytesv2 = []; // char codes

for (var i = 0; i < str.length; ++i) {
  var code = str.charCodeAt(i);
  
  bytes = bytes.concat([code]);
  
  bytesv2 = bytesv2.concat([code & 0xff, code / 256 >>> 0]);
}

// 72, 101, 108, 108, 111, 31452
console.log('bytes', bytes.join(', '));

// 72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 220, 122
console.log('bytesv2', bytesv2.join(', '));

If you are looking for a solution that works in node.js, you can use this:

var myBuffer = [];
var str = 'Stack Overflow';
var buffer = new Buffer(str, 'utf16le');
for (var i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
}

console.log(myBuffer);

Tags:

Javascript