How to check the charset of string in Java?

I recommend Apache.tika CharsetDetector, very friendly and strong.

CharsetDetector detector = new CharsetDetector();
detector.setText(yourStr.getBytes());
detector.detect();  // <- return the result, you can check by .getName() method

Further, you can convert any encoded string to your desired one, take utf-8 as example:

detector.getString(yourStr.getBytes(), "utf-8");

Strings in java, AFAIK, do not retain their original encoding - they are always stored internally in some Unicode form. You want to detect the charset of the original stream/bytes - this is why I think your String.toBytes() call is too late.

Ideally if you could get the input stream you are reading from, you can run it through something like this: http://code.google.com/p/juniversalchardet/

There are plenty of other charset detectors out there as well


I had the same problem. Tika is too large and juniversalchardet do not detect ISO-8859-1. So, I did myself and now is working well in production:

public String convert(String value, String fromEncoding, String toEncoding) {
  return new String(value.getBytes(fromEncoding), toEncoding);
}

public String charset(String value, String charsets[]) {
  String probe = StandardCharsets.UTF_8.name();
  for(String c : charsets) {
    Charset charset = Charset.forName(c);
    if(charset != null) {
      if(value.equals(convert(convert(value, charset.name(), probe), probe, charset.name()))) {
        return c;
      }
    }
  }
  return StandardCharsets.UTF_8.name();
}

Full description here: Detect the charset in Java strings.