What is XML BOM and how do I detect it?

For an ANSI XML file the BOM should actually be removed. If you use UTF-8 you don't really need it; it is only needed for UTF-16 and UTF-32.

The Byte-Order-Mark (or BOM), is a special marker added at the very beginning of an Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. The BOM is mandatory for UTF-16 and UTF-32, but it is optional for UTF-8.

(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)

As for how to detect this in Java:

See the answers to the question "Java : How to determine the correct charset encoding of a stream". If you want to detect the BOM yourself (at your own risk), see for example the code in "Java Tip: How to read a file and automatically specify the correct encoding".

Basically, just read the first few bytes yourself and check whether they match a known BOM.
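A minimal sketch of that approach, assuming the standard BOM byte sequences for UTF-8, UTF-16 and UTF-32. The class name `BomSniffer` and its `detect` methods are made up for illustration and are not taken from either of the linked pages.

```java
import java.io.IOException;
import java.io.InputStream;

class BomSniffer {

    /** Returns true if the first len bytes of data start with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int len, int... prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the encoding implied by a leading BOM, or null if no BOM was found. */
    static String detect(byte[] head, int len) {
        // The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
        // because FF FE 00 00 also starts with the UTF-16LE mark FF FE.
        if (startsWith(head, len, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, len, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, len, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, len, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, len, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Reads up to four bytes from the stream and checks them; the bytes read are consumed. */
    static String detect(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int len = 0;
        while (len < head.length) {
            int n = in.read(head, len, head.length - len);
            if (n < 0) {
                break; // end of stream before four bytes were available
            }
            len += n;
        }
        return detect(head, len);
    }
}
```

Since the stream variant consumes the bytes it inspects, the caller would need to reopen the file or wrap the stream (for example in a `PushbackInputStream`) before handing it to a parser. A `null` result only means no BOM was found, not that the encoding is unknown.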


The byte order mark is likely to be one of these byte sequences:

     UTF-8 BOM: ef bb bf 
  UTF-16BE BOM: fe ff 
  UTF-16LE BOM: ff fe 
  UTF-32BE BOM: 00 00 fe ff 
  UTF-32LE BOM: ff fe 00 00 

These are the variously encoded forms of the Unicode codepoint U+FEFF. This can be expressed as a Java char literal using '\uFEFF' (Java char values are UTF-16 code units). Most non-Unicode encodings have no representation for U+FEFF, so the BOM codepoint cannot be encoded in them. (More on encoding the BOM using Java here.)
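To see where those byte sequences come from, here is a small sketch that encodes U+FEFF with a given charset and prints the result as hex. The class and method names (`BomBytes`, `bomHex`, `canEncodeBom`) are invented for this example. Only charsets from `StandardCharsets` are used, because UTF-32 support is not guaranteed on every Java platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomBytes {

    /** Encodes U+FEFF in the given charset and returns the bytes as lowercase hex. */
    static String bomHex(Charset charset) {
        byte[] bytes = "\uFEFF".getBytes(charset);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    /** Reports whether the charset has any representation for U+FEFF. */
    static boolean canEncodeBom(Charset charset) {
        return charset.newEncoder().canEncode('\uFEFF');
    }

    public static void main(String[] args) {
        System.out.println(bomHex(StandardCharsets.UTF_8));          // ef bb bf
        System.out.println(bomHex(StandardCharsets.UTF_16BE));       // fe ff
        System.out.println(bomHex(StandardCharsets.UTF_16LE));       // ff fe
        System.out.println(canEncodeBom(StandardCharsets.ISO_8859_1)); // false
    }
}
```

Note that `StandardCharsets.UTF_16` (without BE or LE) is deliberately left out: its encoder writes a BOM of its own, so the output would contain the mark twice.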

In XML a BOM is optional for UTF-8, although the XML specification requires one for UTF-16 entities (see also the Unicode BOM FAQ). Detecting the encoding of an XML document is relatively straightforward if the encoding is specified in the declaration. Always make sure that the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) matches the encoding actually used to write the document. If you are strict about this, parsers should be able to interpret your documents correctly. (XML spec on encoding detection.)
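One way to keep the declaration and the actual bytes in sync is to derive both from the same `Charset` object. The following is only a sketch under that assumption; `XmlDeclaration` and `toXmlBytes` are hypothetical names, and the body is expected to be well-formed XML already.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class XmlDeclaration {

    /**
     * Prepends an XML declaration naming the charset and encodes the whole
     * document with that same charset, so the two cannot drift apart.
     * Caveat: String.getBytes silently replaces characters the charset
     * cannot represent, so this does not protect against data loss.
     */
    static byte[] toXmlBytes(String bodyXml, Charset charset) {
        String doc = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n" + bodyXml;
        return doc.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] bytes = toXmlBytes("<greeting>caf\u00e9</greeting>", StandardCharsets.ISO_8859_1);
        // Decoding with the declared charset gives back the original text.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```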

I advocate encoding as Unicode wherever possible (see also the 10 Commandments of Unicode). That said, XML can represent any Unicode character via numeric character references (e.g. 'A' could be written as &#x0041;), so a Unicode encoding is not strictly required to avoid data loss.
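As an illustration of the character-reference route, the sketch below rewrites every non-ASCII code point as a hexadecimal reference, so the result could be stored in a plain ASCII file. `XmlCharRefs.escapeNonAscii` is a name made up for this example. It does not escape `&`, `<` or `>`, and such references are only valid in character data and attribute values, not in element or attribute names.

```java
import java.util.Locale;

class XmlCharRefs {

    /** Replaces each code point above U+007F with a reference of the form &#xHHHH; */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Work on code points so that surrogate pairs become a single reference.
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("caf\u00e9")); // caf&#xE9;
    }
}
```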


Do not insert a BOM in a UTF-8 file: if two such files are merged, you end up with a BOM in the middle, which might break an application or cause an XML parser to throw an exception.
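If you have to merge files that may already carry a UTF-8 BOM, one defensive option is to strip the mark from each piece before concatenating. This is a sketch for plain byte-level concatenation of UTF-8 fragments only; `BomStripper` and its methods are hypothetical names.

```java
import java.util.Arrays;

class BomStripper {

    /** Returns true if the data begins with the UTF-8 BOM (EF BB BF). */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns the data without a leading UTF-8 BOM; data without one is returned as-is. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Concatenates two UTF-8 fragments so that no BOM ends up in the result. */
    static byte[] concat(byte[] first, byte[] second) {
        byte[] a = stripUtf8Bom(first);
        byte[] b = stripUtf8Bom(second);
        byte[] merged = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, merged, a.length, b.length);
        return merged;
    }
}
```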

Tags:

Java

Xml