Reading a Huge XML File Using StAX and XPath

If performance matters and/or the document is large (both seem to be the case here), the key difference between an event parser (such as SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document before evaluating the XPath expression. [It's interesting to note that Java document object model implementations such as the W3C DOM or Axiom themselves use an event processor (such as SAX or StAX) to build the in-memory representation, so if you can get by with only the event processor you save both the memory and the time it takes to build a DOM.]

As I mentioned, the XPath implementation in the JDK operates on a W3C DOM Document. You can see this in the JDK source by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where the parser must first parse the source before the expression can be evaluated:

  Document document = getParser().parse( source );

After this your 10GB of XML will be represented in memory (plus whatever overhead), which is probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a strong justification for XPath (except perhaps programming elegance). The same goes for the XProc suggestion: this would also build a DOM.

If you truly need a DOM, you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its object model over StAX, so it's fast, and it uses Jaxen for its XPath implementation. Jaxen still requires some kind of in-memory tree (W3C DOM, DOM4J, or JDOM). The same is true of XPath implementations in general, so if you don't truly need XPath, I'd recommend sticking with just the event parser.
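
To make the point concrete, here is a minimal sketch of the JDK XPath API evaluating an expression directly against an InputSource. The sample markup (a transactions root with txn children carrying a user attribute) and the class and method names are my own assumptions, since the original markup isn't reproduced here. The call looks like it streams, but the whole input is parsed into a DOM before any match is returned, which is fine for a few kilobytes and a problem at 10GB.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomXPathDemo {
    // Hypothetical sample; the element and attribute names are assumptions.
    static final String SAMPLE =
        "<transactions>"
      + "<txn user=\"alice\"><amount>10</amount></txn>"
      + "<txn user=\"bob\"><amount>20</amount></txn>"
      + "<txn user=\"alice\"><amount>30</amount></txn>"
      + "</transactions>";

    /** Counts the nodes selected by expr; the entire input is turned into a DOM first. */
    static int countMatches(String xml, String expr) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // evaluate(String, InputSource, QName) parses the full source
            // into a Document and only then runs the expression against it.
            NodeList hits = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // prints 2
        System.out.println(countMatches(SAMPLE, "/transactions/txn[@user='alice']"));
    }
}
```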

SAX is the older streaming API; StAX is newer and a great deal faster. Using either the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element name (to capture your <txn> elements). This produces small bursts of events (element, attribute, text) that can be checked for your matching user values. On a suitable match you could either pull the necessary information straight from the events or pipe the bounded events into a mini-DOM if you find the result easier to navigate. But that sounds like overkill if the markup is simple.

This would likely be the simplest and fastest approach, and it avoids the memory overhead of building a DOM. If you pass the names of the element and attribute to the filter (so that your matching is configurable), you can make it relatively generic.
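
A rough sketch of such a filter using the JDK's javax.xml.stream is below. It assumes markup along the lines of <txn user="alice">10.00</txn>, that is, a txn element with a user attribute and text-only content; those names, the sample data, and the TxnFilter class are hypothetical, so adjust them to your real document. The element name, attribute name, and attribute value are all passed in, which is what keeps it reasonably generic.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TxnFilter {
    // Hypothetical sample; the element and attribute names are assumptions.
    static final String SAMPLE =
        "<transactions>"
      + "<txn user=\"alice\">10.00</txn>"
      + "<txn user=\"bob\">20.00</txn>"
      + "<txn user=\"alice\">30.00</txn>"
      + "</transactions>";

    /**
     * Streams the input once and returns the text of every elementName element
     * whose attrName attribute equals attrValue. Assumes matching elements
     * contain only text (getElementText throws if it meets a child element).
     */
    static List<String> matchingText(Reader in, String elementName,
                                     String attrName, String attrValue) {
        List<String> out = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTD processing is not needed here, so switch it off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    // Only start tags with the right name and attribute value pass the filter.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())
                            && attrValue.equals(reader.getAttributeValue(null, attrName))) {
                        // Reads up to the matching end tag; everything else is skipped.
                        out.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML stream", e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [10.00, 30.00]
        System.out.println(matchingText(new StringReader(SAMPLE), "txn", "user", "alice"));
    }
}
```

For a real 10GB file you would hand it a buffered file reader instead of a StringReader, and probably process each match as it arrives rather than collecting everything into a list.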


It's definitely a use case for XProc with a streaming, parallel-processing implementation such as QuiXProc (http://code.google.com/p/quixproc).

In this situation, you would use:

  <p:for-each>
    <p:iteration-source select="//transactions/txn"/>
    <!-- your processing on a small file -->
  </p:for-each>

You can even wrap the results of each transformation with a single line of XProc:

  <p:wrap-sequence wrapper="transactions"/>

Hope this helps


We regularly parse complex XML files of 1GB and more with a SAX parser that does exactly what you described: it extracts partial DOM trees that can be conveniently queried using XPath.

I blogged about it here. It uses a SAX parser rather than StAX, but it may be worth a look.
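
This is not the code from the blog post, but the partial-DOM idea can be sketched roughly as follows. Note that I've used StAX here rather than SAX to match the question. Each txn subtree is copied into its own small W3C DOM and an XPath expression is evaluated relative to that fragment, so only one fragment is in memory at a time. The sample markup and all class and method names are assumptions, and namespaces are ignored to keep it short.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class PartialDomXPath {
    // Hypothetical sample; the element and attribute names are assumptions.
    static final String SAMPLE =
        "<transactions>"
      + "<txn user=\"alice\"><amount>10</amount></txn>"
      + "<txn user=\"bob\"><amount>20</amount></txn>"
      + "<txn user=\"alice\"><amount>30</amount></txn>"
      + "</transactions>";

    /** Evaluates expr relative to each fragmentName element; keeps non-empty string results. */
    static List<String> evaluatePerFragment(Reader in, String fragmentName, String expr) {
        List<String> results = new ArrayList<>();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            XPath xpath = XPathFactory.newInstance().newXPath();
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && fragmentName.equals(reader.getLocalName())) {
                        // A fresh, small Document per fragment; it is garbage once evaluated.
                        Element fragment = readFragment(reader, builder.newDocument());
                        String value = xpath.evaluate(expr, fragment);
                        if (!value.isEmpty()) {
                            results.add(value);
                        }
                    }
                }
            } finally {
                reader.close();
            }
        } catch (ParserConfigurationException | XMLStreamException | XPathExpressionException e) {
            throw new IllegalStateException("Could not process XML", e);
        }
        return results;
    }

    /** Copies the element at the current START_ELEMENT into doc, consuming through its end tag. */
    private static Element readFragment(XMLStreamReader reader, Document doc)
            throws XMLStreamException {
        Element root = copyStartElement(reader, doc);
        doc.appendChild(root);
        Node current = root;
        while (current != doc) { // the root's parent is doc, so this ends at the root's end tag
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    Element child = copyStartElement(reader, doc);
                    current.appendChild(child);
                    current = child;
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    current.appendChild(doc.createTextNode(reader.getText()));
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    current = current.getParentNode();
                    break;
                default:
                    break; // comments, processing instructions, etc. are dropped
            }
        }
        return root;
    }

    /** Creates an element with the reader's current name and attributes (namespaces ignored). */
    private static Element copyStartElement(XMLStreamReader reader, Document doc) {
        Element element = doc.createElement(reader.getLocalName());
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            element.setAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        return element;
    }
}
```

With the sample above, evaluatePerFragment(new StringReader(SAMPLE), "txn", "self::txn[@user='alice']/amount") should yield the amounts of the two alice transactions. Expressions can only see the current fragment, so anything that needs to look outside a single txn won't work with this approach.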


StAX and XPath are very different things. StAX lets you parse a streaming XML document in the forward direction only, whereas XPath expressions can navigate the document in both directions. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that.

Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?
