xml.parsers.expat.ExpatError: not well-formed (invalid token)

Python 3

One Liner

data: dict = xmltodict.parse(ElementTree.tostring(ElementTree.parse(path).getroot()))

Helper for .json and .xml

I wrote a small helper function to load .json and .xml files from a given path. I thought it might come in handy for some people here:

import json
from xml.etree import ElementTree

import xmltodict

def load_json(path: str) -> dict:
    if path.endswith(".json"):
        print(f"> Loading JSON from '{path}'")
        with open(path, mode="r") as open_file:
            content = open_file.read()

        return json.loads(content)
    elif path.endswith(".xml"):
        print(f"> Loading XML as JSON from '{path}'")
        xml = ElementTree.tostring(ElementTree.parse(path).getroot())
        return xmltodict.parse(xml, attr_prefix="@", cdata_key="#text", dict_constructor=dict)

    print(f"> Loading failed for '{path}'")
    return {}

Notes

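  • If you want to get rid of the @ and #text markers in the JSON output, pass attr_prefix="" and cdata_key="". Attributes are then stored under their bare names, and the text of an element that also has attributes ends up under an empty-string key.

  • Depending on the xmltodict version, xmltodict.parse() may return an OrderedDict by default; passing dict_constructor=dict makes it return a plain dict.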

Usage

path = "my_data.xml"
data = load_json(path)
print(json.dumps(data, indent=2))

# OUTPUT
#
# > Loading XML as JSON from 'my_data.xml' 
# {
#   "mydocument": {
#     "@has": "an attribute",
#     "and": {
#       "many": [
#         "elements",
#         "more elements"
#       ]
#     },
#     "plus": {
#       "@a": "complex",
#       "#text": "element as well"
#     }
#   }
# }

Sources

  • ElementTree.tostring()
  • ElementTree.parse()
  • xmltodict
  • json.dumps()

I think you forgot to specify the encoding. Try serializing the parsed XML to a string with an explicit encoding first, and pass that string to xmltodict:

import xml.etree.ElementTree as ET
import xmltodict
import json


tree = ET.parse('your_data.xml')
xml_data = tree.getroot()
# Set the encoding here to whichever one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')

data_dict = dict(xmltodict.parse(xmlstr))
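
If an explicit encoding does not help, it can be useful to find out where the parser actually gives up. The following is only a minimal sketch using the standard library's expat bindings; find_xml_error is a hypothetical helper name, not part of xmltodict or the original code.

```python
import xml.parsers.expat


def find_xml_error(data: bytes):
    """Return (line, column, message) for the first parse error, or None.

    Hypothetical helper for illustration: it feeds the raw bytes to expat,
    the same parser that raises ExpatError in the traceback above.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # True marks this as the final (and only) chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        # lineno is 1-based, offset is the column within that line.
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None


# Example usage (the file name is an assumption):
# with open("your_data.xml", "rb") as xml_file:
#     print(find_xml_error(xml_file.read()))
```

An error reported at line 1, column 0 usually means the problem sits at the very start of the file rather than in the markup itself.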

In my case, the file had been saved with a byte order mark (BOM), which is the default in Notepad++.

I re-saved the file as plain UTF-8 without the BOM.
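
If re-saving the file is not an option, the BOM can also be dropped in code before parsing. This is a rough sketch under the assumption that the file is UTF-8; read_xml_without_bom is a made-up helper name, and the returned bytes could then be handed to xmltodict.parse() or ElementTree.fromstring().

```python
import codecs
from xml.etree import ElementTree


def read_xml_without_bom(path: str) -> bytes:
    """Read a file as bytes and strip a leading UTF-8 BOM if there is one.

    Hypothetical helper for illustration; it assumes the file is UTF-8.
    """
    with open(path, "rb") as xml_file:
        raw = xml_file.read()

    # codecs.BOM_UTF8 is the three-byte sequence b"\xef\xbb\xbf".
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw


# Example usage (the file name is an assumption):
# root = ElementTree.fromstring(read_xml_without_bom("my_data.xml"))
```

Working on bytes rather than decoded text leaves any encoding declaration in the XML prolog for the parser to interpret.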