Stripping Invalid XML characters in Java

I have a similar problem when parsing the content of an Australian export tariffs file into an XML document. I cannot use the solutions suggested here, such as:
- Using an external tool (a jar) invoked from the command line.
- Asking Australian Customs to clean up the source file.

The only way I can see to solve this at the moment is to iterate through the entire content of the source file, character by character, and check that each character does not fall in the ASCII range 0x00 to 0x1F inclusive. It can be done, but I was wondering if there is a better way using the methods of Java's String type.

EDIT: I found a solution that may be useful to others: use the Java method String#replaceAll to replace or remove any undesirable characters in the XML document.

Example code (I removed some necessary statements to avoid clutter):

BufferedReader reader = null;
...
String line = reader.readLine().replaceAll("[\\x00-\\x1F]", "");

In this example I remove (i.e. replace with an empty string) the non-printable characters in the range 0x00 to 0x1F inclusive. You can change the second argument of replaceAll() to substitute whatever string your application requires.
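One caveat: the range 0x00 to 0x1F also covers tab (0x09), line feed (0x0A) and carriage return (0x0D), which XML 1.0 permits. If those should survive, a narrower character class may be preferable. A minimal sketch, assuming a hypothetical helper class named ControlCharStripper that is not part of the original code:

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class ControlCharStripper {
    // Control characters in 0x00-0x1F, excluding tab (0x09),
    // line feed (0x0A) and carriage return (0x0D).
    private static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    public static String strip(String line) {
        return CONTROL_CHARS.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // The 0x01 character is removed, the tab is kept.
        System.out.println(strip("tariff\u0001 code\t0101"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every line, which String#replaceAll does on each call.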


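I used Xalan's org.apache.xml.utils.XMLChar class:

import org.apache.xml.utils.XMLChar;

public static String stripInvalidXmlCharacters(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    // Walk by code point rather than by char: each half of a surrogate pair
    // is invalid on its own, so a char-by-char test would drop every
    // character above U+FFFF.
    for (int i = 0; i < input.length(); ) {
        int cp = input.codePointAt(i);
        if (XMLChar.isValid(cp)) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
    }

    return sb.toString();
}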

I use the following regexp, which seems to work as expected on JDK 6:

Pattern INVALID_XML_CHARS = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");
...
INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");

In JDK 7 and later, the last range, which lies outside the BMP, can be written as \x{10000}-\x{10FFFF} instead of the \uD800\uDC00-\uDBFF\uDFFF notation, which is harder to read.
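For reference, a sketch of the same character class using the \x{h...h} code point escape that java.util.regex.Pattern supports from Java 7 onward. The class name XmlCharFilter and the method name strip are hypothetical, chosen only for this example:

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; requires JDK 7 or later.
public class XmlCharFilter {
    // Matches anything outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    public static String strip(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }
}
```

Here every escape is passed to the regex engine as text, so the source file contains no raw surrogate characters.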


I haven't used this personally, but Atlassian made a command-line XML cleaner that may suit your needs (it was made mainly for JIRA, but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml

Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed.
