C#: Convert Japanese text encoded in Shift-JIS and stored as ASCII into UTF-8

some strings stored in the database as ASCII

It isn't ASCII; almost none of the characters in ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð are ASCII. Encoding.ASCII.GetBytes(text) replaces every character it cannot encode with '?', which is why you got all those question marks.
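To make the question-mark behaviour concrete, here is a minimal sketch. The class and method names (AsciiDemo, Encode) are made up for illustration; only Encoding.ASCII.GetBytes comes from the answer above.

```csharp
using System;
using System.Text;

static class AsciiDemo
{
    // Encodes text as 7-bit ASCII; characters outside ASCII are replaced with '?' (0x3F).
    public static byte[] Encode(string text) => Encoding.ASCII.GetBytes(text);

    static void Main()
    {
        // 'a' is ASCII, U+306E (hiragana "no") is not.
        byte[] bytes = Encode("a\u306E");
        Console.WriteLine(BitConverter.ToString(bytes)); // 61-3F
    }
}
```

The original character cannot be recovered from 0x3F, so any conversion that goes through Encoding.ASCII loses the Japanese text for good.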

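The core issue is that the bytes in the database column were read with the wrong encoding. They were decoded as code page 1252, although they are really code page 932 (Shift-JIS). Reversing that step recovers the text: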

var badstringFromDatabase = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
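// Re-encode with the code page that was wrongly used, to get the original bytes back.
var hopefullyRecovered = Encoding.GetEncoding(1252).GetBytes(badstringFromDatabase);
// Decode those bytes with the code page they were actually written in.
var oughtToBeJapanese = Encoding.GetEncoding(932).GetString(hopefullyRecovered);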

Which produces "チャネルパートナーの選択"

This is not going to be completely reliable: code page 1252 leaves a few byte values unassigned (0x81, 0x8D, 0x8F, 0x90 and 0x9D), and code page 932 uses them. If one of those bytes is dropped or substituted during the wrong decoding, you end up with a garbled string from which you can no longer recover the original byte values. You'll need to focus on getting the data provider to use the correct encoding.
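If you have to live with the workaround for a while, one option is to make it fail loudly rather than return garbage. The sketch below is only an illustration: MojibakeRecovery and TryRecover are hypothetical names, and the RegisterProvider call assumes .NET Core or .NET 5+, where code pages 1252 and 932 are not available until the code pages provider is registered. It will not catch every damaged string, only those containing characters that code page 1252 cannot encode or byte sequences that code page 932 rejects.

```csharp
using System;
using System.Text;

static class MojibakeRecovery
{
    // Hypothetical helper: returns false when the round trip is provably lossy.
    public static bool TryRecover(string garbled, out string recovered)
    {
        // Assumption: running on .NET Core / .NET 5+, where this registration is required.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make the encodings throw instead of substituting '?'.
        var strict1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var strict932 = Encoding.GetEncoding(
            932, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        try
        {
            byte[] originalBytes = strict1252.GetBytes(garbled);
            recovered = strict932.GetString(originalBytes);
            return true;
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both derive from ArgumentException.
            recovered = string.Empty;
            return false;
        }
    }
}
```

A false result tells you the stored string has already lost information, which is another sign that the fix belongs in the data provider's encoding settings.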


As per the other answer, I'm pretty sure you're using the ANSI/Default encoding, not ASCII.

The following examples seem to get you what you're after.

var japaneseEncoding = Encoding.GetEncoding(932);

// From file bytes
var fileBytes = File.ReadAllBytes(@"C:\temp\test.html");
var japaneseTextFromFile = japaneseEncoding.GetString(fileBytes);
japaneseTextFromFile.Dump();

// From string bytes
var textString = "ƒ`ƒƒƒlƒ‹ƒp[ƒgƒi[‚Ì‘I‘ð";
var textBytes = Encoding.Default.GetBytes(textString);
var japaneseTextFromString = japaneseEncoding.GetString(textBytes);
japaneseTextFromString.Dump();

Interestingly, I think I need to read up on Encoding.Convert, as it did not produce the behaviour I expected. The GetString methods seem to work only if I pass in bytes obtained with Encoding.Default; if I convert to the Japanese encoding beforehand, they do not work as expected.
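A possible explanation, offered as a sketch rather than a diagnosis: Encoding.Convert does not reinterpret bytes. It decodes them with the source encoding and re-encodes the resulting characters with the destination encoding, so its output has to be read with the destination encoding. That makes it the right tool for the final step in the title, turning genuine Shift-JIS bytes into UTF-8, but not for undoing a wrong decoding. The names ConvertDemo and ShiftJisToUtf8 are made up for the example, and the RegisterProvider call assumes .NET Core or .NET 5+.

```csharp
using System;
using System.Text;

static class ConvertDemo
{
    // Transcodes Shift-JIS (code page 932) bytes into UTF-8 bytes.
    public static byte[] ShiftJisToUtf8(byte[] shiftJisBytes)
    {
        // Assumption: needed on .NET Core / .NET 5+ to make code page 932 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding shiftJis = Encoding.GetEncoding(932);
        return Encoding.Convert(shiftJis, Encoding.UTF8, shiftJisBytes);
    }

    static void Main()
    {
        byte[] sjis = { 0x82, 0xCC };            // U+306E in code page 932
        byte[] utf8 = ShiftJisToUtf8(sjis);
        Console.WriteLine(BitConverter.ToString(utf8)); // E3-81-AE
    }
}
```

Passing the converted bytes to Encoding.GetEncoding(932).GetString would decode UTF-8 data as Shift-JIS and garble it again; Encoding.UTF8.GetString is the matching decoder.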

Tags:

C#

Encoding