Algorithmic complexity of XML parsers/validators

Rob Walker is right: the problem isn't specified in enough detail. Considering just parsers (and ignoring the question of whether they perform validation), there are two main flavors: tree-based—think DOM—and streaming/event-based—think SAX (push) and StAX (pull). Speaking in huge generalities, the tree-based approaches consume more memory and are slower (because you need to finish parsing the whole document), while the streaming/event-based approaches consume less memory and are faster. Tree-based parsers are generally considered easier to use, although StAX has been heralded as a huge improvement (in ease-of-use) over SAX.
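To get a concrete feel for the difference, here is a minimal Java sketch contrasting the two styles on the same input (the file name "input.xml" is just a placeholder): DOM materialises the full tree in memory before you can touch any of it, while StAX hands you events one at a time.

    import java.io.File;
    import java.io.FileInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.w3c.dom.Document;

    public class TreeVsStream {
        public static void main(String[] args) throws Exception {
            // Tree-based (DOM): the whole document is built in memory
            // before any of it can be used.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("input.xml"));      // placeholder file
            System.out.println("root = " + doc.getDocumentElement().getTagName());

            // Streaming (StAX, pull): events are consumed one at a time,
            // so memory stays roughly constant regardless of document size.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("input.xml"));
            int elements = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    elements++;
                }
            }
            reader.close();
            System.out.println("elements = " + elements);
        }
    }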


I think there are too many variables involved to come up with a simple complexity metric unless you make a lot of assumptions.

A simple SAX-style parser should be linear in time with respect to document size and roughly constant in memory.
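As an illustration, a bare-bones SAX handler like the one below (Java, with a placeholder input.xml) does constant work per event and never builds a tree, which is what keeps memory flat while time grows linearly with document size.

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class CountElements {
        public static void main(String[] args) throws Exception {
            final int[] count = {0};
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    count[0]++;   // constant work per event, no tree is built
                }
            };
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File("input.xml"), handler);   // placeholder file
            System.out.println(count[0] + " elements");
        }
    }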

Something like XPath would be impossible to describe in terms of just the input document since the complexity of the XPath expression plays a huge role.
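To illustrate, the sketch below (the element names catalog, item and price are made up) evaluates two expressions against the same document: a direct child path only touches nodes along that path, while a descendant-axis query with a predicate may have to examine every node, so the expression itself, not just the document, drives the cost.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathCost {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("input.xml"));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // A direct child path visits only the nodes along that path.
            NodeList cheap = (NodeList) xpath.evaluate(
                    "/catalog/item", doc, XPathConstants.NODESET);

            // A descendant axis with a predicate may have to look at every
            // node in the tree, so the expression changes the complexity.
            NodeList expensive = (NodeList) xpath.evaluate(
                    "//item[price > 10]", doc, XPathConstants.NODESET);

            System.out.println(cheap.getLength() + " / " + expensive.getLength());
        }
    }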

Likewise for schema validation, a large but simple schema may well be linear, whereas a smaller schema that has a much more complex structure would show worse runtime performance.
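A typical validation setup in Java looks roughly like this (the file names are placeholders); the per-element cost is paid inside validate() and depends on how involved the schema's content models are.

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    public class ValidateAgainstXsd {
        public static void main(String[] args) throws Exception {
            // Compile the schema once; its structural complexity (content
            // models, wildcards, identity constraints) drives the cost of
            // checking each element during validation.
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new File("schema.xsd"));   // placeholder files

            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new File("input.xml")));
            System.out.println("valid");
        }
    }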

As with most performance questions the only way to get accurate answers is to measure it and see what happens!
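A rough measurement harness doesn't need to be fancy; something along these lines (again with a placeholder input file) already tells you how parse time scales as you feed it larger documents.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class MeasureParse {
        public static void main(String[] args) throws Exception {
            File input = new File("input.xml");   // placeholder
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Warm up so JIT compilation doesn't distort the numbers.
            for (int i = 0; i < 5; i++) builder.parse(input);

            int runs = 20;
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) builder.parse(input);
            long perRun = (System.nanoTime() - start) / runs;
            System.out.println("avg parse time: " + perRun / 1_000_000 + " ms");
        }
    }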


If I were faced with that problem and couldn't find anything on Google, I would probably try to work it out myself.

Some back-of-the-envelope work to get a feel for where it is going, though that would require having some idea of how an XML parser is implemented. For non-algorithmic (empirical) benchmarks, take a look here:

  • http://www.xml.com/pub/a/Benchmark/exec.html
  • http://www.devx.com/xml/Article/16922
  • http://xerces.apache.org/xerces2-j/faq-performance.html