Determining if at leaf node with SAX parser

Let's start with some basic definitions:

An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag.

(from here). The nice consequence: XML files have a very regular, simple structure. For example, a leaf node is simply a node that doesn't have any children.

Now: the endElement() method is invoked whenever a SAX parser encounters the closing tag of a node. Assuming that your XML is well-formed, that also means that the parser gave you a corresponding startElement() call before!

In other words: all the information you need to determine whether you are "ending" a leaf node is available to you:

  • you were told which elements are "started"
  • you are told which elements end

Take this example:

<outer>
  <inner/>
</outer>

This will lead to such a sequence of events/callbacks:

  • event: start element outer
  • event: start element inner
  • event: end element inner
  • event: end element outer
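To see that sequence for yourself, here is a minimal sketch that simply records every callback. The class name EventTrace and the helper trace() are made up for illustration; the parser setup uses the standard JAXP SAXParserFactory with its default (non-namespace-aware) settings, so qName carries the element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the SAX callbacks in the order they arrive.
class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("start element " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end element " + qName);
    }

    // Parses the given XML string and returns the recorded events.
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        trace("<outer>\n  <inner/>\n</outer>").forEach(System.out::println);
    }
}
```

Note that the empty-element tag <inner/> still produces both a start and an end event.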

So, "obviously", when your parser remembers the history of events, determining which of inner or outer is a leaf node is straightforward!

Thus, the answer is: no, you don't need a DOM parser. In the end, the DOM is constructed from the very same information anyway! If the DOM parser can deduce the "scope" of objects, so can your SAX parser.

But just for the record: you still need to carefully implement your data structures that keep track of "started", "open" and "ended" tags, for example to correctly determine that this one:

<outer> <inner> <inner/> </inner> </outer>

represents two non-leaf nodes (outer and the first inner), and one leaf node (the inner inner).
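One possible way to do that bookkeeping, as a sketch rather than the only option: keep a stack with one entry per open element, where each entry records whether a child element has been seen yet. The names LeafStackHandler, hasChild and classify() are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one stack entry per currently open element.
class LeafStackHandler extends DefaultHandler {
    // true = the open element has already seen at least one child element
    private final Deque<Boolean> hasChild = new ArrayDeque<>();
    final List<String> leaves = new ArrayList<>();
    final List<String> nonLeaves = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (!hasChild.isEmpty()) {
            // The enclosing element now has a child: replace its entry with true.
            hasChild.pop();
            hasChild.push(Boolean.TRUE);
        }
        hasChild.push(Boolean.FALSE); // the new element has no children yet
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        boolean sawChild = hasChild.pop();
        (sawChild ? nonLeaves : leaves).add(qName);
    }

    // Parses the XML string and returns the handler with both lists filled in.
    static LeafStackHandler classify(String xml) {
        LeafStackHandler handler = new LeafStackHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        LeafStackHandler h = classify("<outer> <inner> <inner/> </inner> </outer>");
        System.out.println("leaves: " + h.leaves + ", non-leaves: " + h.nonLeaves);
    }
}
```

Because each open element has its own entry, the two inner elements are kept apart even though they share a name. Elements are recorded in the order their end tags arrive.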


From an implementation standpoint, you can do this with a single boolean flag that tracks whether the element just entered is a potential leaf node. The flag is set to true whenever you enter an element, and only the first endElement after a startElement has the leaf node logic applied to it; after that the flag is cleared.

The flag is set again every time startElement is called.

If multiple leaf nodes are at the same level, each one sets the flag in its own startElement, so consecutive siblings are all detected as leaves.

The reasoning behind this becomes clear if we imagine the XML as a stack. Each startElement is a push onto the stack. The first pop after a push is a leaf node. Subsequent pops are not leaves, but this is reset as soon as another push is performed.

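// True from the most recent startElement until the next endElement.
private boolean isLeafNode = false;

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) {
    // Entering an element: it is a leaf until a child element shows up.
    isLeafNode = true;
}

@Override
public void endElement(String uri, String localName, String qName) {
    if (isLeafNode) {
        // do leaf node logic
    }

    // The enclosing element has at least one child, so it cannot be a leaf.
    isLeafNode = false;
}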

So, for the following XML, the leaf nodes are the elements whose content is marked Leaf.

<foo>
    <bar>Leaf</bar>
    <baz>
        <bop>Leaf</bop>
        <beep>Leaf</beep>
        <blip>
            <moo>Leaf</moo>
        </blip>
    </baz>
</foo>
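To check this against the sample, here is a self-contained sketch that wires the flag into a handler and collects the names of the leaf elements. The class name LeafFlagDemo and the helper leafNames() are made up for this example, and "collect the name" stands in for whatever your leaf node logic is.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical driver for the single-flag approach.
class LeafFlagDemo extends DefaultHandler {
    private boolean isLeafNode = false;
    private final List<String> leaves = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        isLeafNode = true; // a "push": the next "pop" closes a leaf
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isLeafNode) {
            leaves.add(qName); // leaf node logic: here, just remember the name
        }
        isLeafNode = false; // later pops without a push in between are not leaves
    }

    // Parses the XML string and returns the leaf element names in document order.
    static List<String> leafNames(String xml) {
        LeafFlagDemo handler = new LeafFlagDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.leaves;
    }

    public static void main(String[] args) {
        String xml = "<foo><bar>Leaf</bar><baz><bop>Leaf</bop><beep>Leaf</beep>"
                + "<blip><moo>Leaf</moo></blip></baz></foo>";
        System.out.println(leafNames(xml)); // prints [bar, bop, beep, moo]
    }
}
```

Text content and whitespace between tags arrive through characters(), which this handler does not override, so they have no effect on the flag.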