XML Pull Parsing and Enums: like chocolate and peanut butter
May 31st, 2007 |
.
There comes a time in every developer’s life when they need to write code that processes some XML. Lately, we’ve seen the proliferation of APIs that make XML processing easier, like JAXB (Java API for XML Binding). However, when speed and scale are required, chances are you’re going to need to roll your own processor. Before I continue, let me clear up some terminology, when I say “processor”, I mean the code of yours that’s wrapped around a SAX (tutorial), DOM (tutorial), or an XPP (tutorial) parser, not the guts of the parser itself.
At the end of the day, that’s the interesting part of what you’re doing – the grammar of your data model rather than the minutiae of start and end tags. Building a processor is the interface between the data interchange format and the internal data model of your application.
Click through for a tour of XML parsers and a look at a novel technique for encoding processors that use pull parsers (as usual, we’ve included a WebStart demo, as well as a jar file containing the compiled example along with all of its source code).
XML Parsers
Each of the three most common types of XML parsers has its own strengths and weaknesses. DOM, in particular, does not scale well. The DOM technique of reading the entire document into memory and creating first-class language objects for every explicit and implied element makes it slow and resource intensive. Allegorically, this is balanced against the ease of programming against the DOM API, although I personally find the DOM API to be a bit cumbersome. The upshot is that DOM is not well suited for large documents (because of its memory requirements) or applications that need low-latency and overall throughput (since you a) don’t see the DOM until the whole thing is done being built and b) objects are expensive to construct).
SAX Parsing scales very well, but it uses a callback pattern which can be awkward to code against for large grammars. This problem has been addressed in various ways. Gianluigi Colaiacomo of IBM wrote this article detailing a method that ends up looking a lot like pull parsing for SAX: Simplify document handler programs with the SAX parser.
In the case of XML Pull Parsing (excellent tech brief on XPP here), it’s optimized for parsing tasks that require all elements in a document to be processed and leads to an event stream style of programming that is much more intuitive and easy-to-follow than SAX. In the past few years, the work on pull parsing has come together into a standard called StAX, or Streaming API for XML.
StAX has more or less replaced XmlPull as the standard for XML streaming, but the XPP parsers still exist and are very, very fast. XPP became our parser of choice when looking at things like real-time serialization of data and high-speed import of object graphs from XML for our integration interfaces.
Life On Earth
First, let’s set the stage: rather than going into the detail and complexity of the Palantir XML formats, I’ve designed a somewhat simpler example to make this all easier to digest. Note that the LifeOnEarth schema document, source code, and example LifeOnEarth instance document are included in the JAR file.
In this example, what will we be encoding? Organisms! That’s right, biological taxonomy. The biological taxonomy is the set of official Latin names for all the living things in the world. It’s basically a straight hierarchy, with one notable exception: Phylum vs. Division. Plant biologists use the term division to describe the same broad morphological grouping that their zoological counterparts call a phyla.
This is great for our example: after all, who wants to have a straight hierarchy? We need something a little out of the ordinary to make things interesting.
The XML schema: lifeOnEarth
The idea here is that we’ll build an XML Schema document describing an instance document format for encoding information about the biological taxonomy. We make extensive use of complexType elements in the schema to afford us a sort of code reuse. The Class->Family->Order->Genus->Species hierarchy logically repeated underneath Phylum and Division in the taxonomy model. However, so as to be as DRY as possible, we use complex type declarations instead of defining complex types in-line when declaring the elements . It’s a lot like using declared types instead of anonymous inner classes in Java.
Using the complexType elements, we define the names, attributes, and allowed members for every step of the hierarchy. We then define the document itself by declaring a single top level element, lifeOnEarth:
<xsd:element name="lifeOnEarth" type="tns:lifeOnEarth"></xsd:element>
You can see that the element lifeOnEarth is of complexType lifeOnEarth; the same name in different namespaces. The type lifeOnEarth is defined thusly:
<xsd:complextype name="lifeOnEarth">
<xsd:sequence>
<xsd:element name="domain" type="tns:domain" maxoccurs="unbounded"></xsd:element>
</xsd:sequence>
</xsd:complextype>
An element of type lifeOnEarth contains one or more elements (sequence, in XSD parlance) of type domain. An element of type domain contains a sequence of elements of type kingdom. So on and so forth, all the way down to species (with the double hierarchy for the phylum/division duality).
Here’s the schema document:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.palantirtech.com/schema/examples/lifeOnEarth"
xmlns:tns="http://www.palantirtech.com/schema/examples/lifeOnEarth"
elementFormDefault="qualified">
<xsd:annotation>
<xsd:documentation>This schema is based on alpha taxonomy, he science of describing, categorizing and naming organisms. For a good overview, see http://en.wikipedia.org/wiki/Taxon</xsd:documentation>
</xsd:annotation>
<xsd:complexType name="lifeOnEarth">
<xsd:sequence>
<xsd:element name="domain" type="tns:domain" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="species">
<xsd:attribute name="commonName" type="xsd:string"/>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="genus">
<xsd:sequence minOccurs="0">
<xsd:element name="species" type="tns:species" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="family">
<xsd:sequence minOccurs="0">
<xsd:element name="genus" type="tns:genus" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="order">
<xsd:sequence minOccurs="0">
<xsd:element name="family" type="tns:family" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="class">
<xsd:sequence minOccurs="0">
<xsd:element name="order" maxOccurs="unbounded" type="tns:order"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="phylum">
<xsd:sequence minOccurs="0">
<xsd:element name="class" type="tns:class" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="division">
<xsd:sequence minOccurs="0">
<xsd:element name="class" type="tns:class" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="kingdom">
<xsd:choice maxOccurs="unbounded" minOccurs="0">
<xsd:element name="phylum" type="tns:phylum"></xsd:element>
<xsd:element name="division" type="tns:division"></xsd:element>
</xsd:choice>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:complexType name="domain">
<xsd:sequence minOccurs="0">
<xsd:element name="kingdom" type="tns:kingdom" maxOccurs="unbounded"></xsd:element>
</xsd:sequence>
<xsd:attribute name="name" type="xsd:string"/>
</xsd:complexType>
<xsd:element name="lifeOnEarth" type="tns:lifeOnEarth"></xsd:element>
</xsd:schema>
The upside of making the schema document comes from being able to leverage existing validation mechanisms. Using a validator, any document can be verified to be a well-form instance of the schema. This allows the processor to be able to assume valid documents, greatly reducing the error handling burden on the developer when writing the processor. Additionally, if you’re writing both sides of the XML transaction, you can validate the output of the renderer to perform early error detection. And finally, the validator is the perfect thing for using in unit testing related to your XML processing. Sun included a simple validation example in the Javadocs for the javax.xml.validation package. For a more in-depth look at using validation, see LifeOnEarthSchema.java, part of this example that shows how create a shared schema object (using the singleton pattern) from an XSD file loaded from the classpath.
Encoding the XML schema’s FSA (peanut butter)

So it turns out that processing XML, just like any parsing task, can be represented by Finite State Automata. Finite state automatas (FSAs) can be encoded by a list of valid states and the valid transitions between states. Since XML is nicely hierarchical, encoding the information is fairly straightforward. A tag name maps to a state in the FSA and its valid transitions are to any of its parents or any of its children.
With this in mind, I sat down to design our XML formats for integration, realizing that I needed a way to represent all the states the parser could be in as it parsed the document. That context would drive how the parser events were handled and the routing of parsed data to the proper places.
So I started thinking: “I’ll bet I can do this really cleanly with enums.” Here’s what I came up with:
We encode all the states of the parser into a Java Enum. The Enum constructor takes two arguments: one or more ParserStates that are valid parent states and the text that corresponds to the text representation of the XML tag in the document. Since most states only have one parent, we overload the constructor for convenience and readability to only take a single state instead of an array.
After the constructors, we define an instance method that will check that a passed state is a valid transition from this state and some static members and initialization to allow us to map strings passed by the XML parser into the Enum objects.
Here’s the Enum’s definition:
private enum ParserState {
GROUND(new ParserState[]{},""),
LIFE_ON_EARTH(GROUND,"lifeOnEarth"),
DOMAIN(LIFE_ON_EARTH,"domain"),
KINGDOM(DOMAIN,"kingdom"),
PHYLUM(KINGDOM,"phylum"),
DIVISION(KINGDOM,"division"),
CLASS(new ParserState[]{DIVISION,PHYLUM},"class"),
ORDER(CLASS,"order"),
FAMILY(ORDER,"family"),
GENUS(FAMILY,"genus"),
SPECIES(GENUS,"species");
/**
* Array of parent states to this one.
*/
ParserState parents[] = new ParserState[]{};
/**
* Tag name for this state.
*/
String tagName = null;
/**
* Constructor for ParserState with a single parent state.
* @param parent
* @param tagName
*/
ParserState(ParserState parent,String tagName){
this.parents = new ParserState[]{parent};
this.tagName = tagName;
}
/**
* Constructor for ParserState with a multiple parent states.
* @param parents
* @param tagName
*/
ParserState(ParserState[] parents,String tagName){
this.parents = parents;
this.tagName = tagName;
}
/**
* Checks whether it is valid to transition to the
* specified state from this state.
* @param newState
* @return
*/
public boolean checkValid(ParserState toState){
if(this.equals(toState))
return true;
for(int i = 0; i < toState.parents.length ; i++){
if(toState.parents[i].equals(this)){
return true;
}
}
for(int i = 0; i < this.parents.length; i++){
if(this.parents[i].equals(toState)){
return true;
}
}
return false;
}
public String getTagName() {
return tagName;
}
// End enum instance methods and variables
// Static methods, variable, and intializations
static Map<string,ParserState> tagLookup = new HashMap<string,ParserState>();
/*
* This code executes after the enums have been constructed.
*
* Because of order of execution when initializing an enum,
* you can't call static functions in an enum constructor.
* (They are constructed before static initialization).
*
* Instead, we use a static initializer to populate the lookup
* hashmap after all the enums are constructed.
*/
static {
for(ParserState state : ParserState.values()){
registerState(state);
}
}
/**
* Maps a tag name to a ParserState
* @param tagName
* @return the ParserState for that tag.
*/
public static ParserState lookupStateForTag(String tagName){
return tagLookup.get(tagName);
}
private static void registerState(ParserState state){
tagLookup.put(state.tagName, state);
}
}
XML Pull Parsing (chocolate)
The idea with XML Pull Parsing is that the document can be represented as a stream of events describing the document content. You pull the events off the stream as you process them (hence the name). These events come in five flavors: START_DOCUMENT, END_DOCUMENT, START_TAG, END_TAG, TEXT.
The tags names are pretty easy to understand intuitively, except for TEXT, which is the event for free text inside of an element.
An example document like this:
<foo> bar <baz>jones</baz> bleargh</foo>
… would produce events in the following order: START_DOCUMENT, START_TAG, TEXT, START_TAG, TEXT, END_TAG, TEXT, END_TAG, END_DOCUMENT.
Pretty much every XML pull parser-based processor has a loop that looks like this at their core:
protected void processDocument() throws XmlPullParserException, IOException, PalantirException {
// pull first event
int eventType = parser.getEventType();
do { // core loop
switch(eventType){
case XmlPullParser.START_DOCUMENT:
log.debug("Start document");
break;
case XmlPullParser.START_TAG:
processStartElement();
break;
case XmlPullParser.END_TAG:
processEndElement();
break;
case XmlPullParser.TEXT:
processText();
break;
// never called, here for completeness
case XmlPullParser.END_DOCUMENT:
log.debug("End document");
break;
}
eventType = parser.next();
} while (eventType != XmlPullParser.END_DOCUMENT);
}
So the next logical thing to take a look at is the processStartElement() method. It turns out that here’s where the real peanut-butter & chocolate synergy comes into play. In this case, we call getName() on the parser object and it will tell us the name of the tag just started.
We want to:
- Check that it’s a valid state transition.
- Change the state of the parser.
- Dispatch to the appropriate method to deal with this kind of tag.
Here’s the method (note that stateStack is of type Stack<ParserState>):
public void processStartElement() throws PalantirException, XmlPullParserException {
String tagName = parser.getName();
// check for state transition
ParserState newState = ParserState.lookupStateForTag(tagName);
if(newState != null){
// we know it's a valid tag
if(parserState.checkValid(newState)){
// change FSA state
stateStack.push(parserState);
parserState = newState;
}
else{
// invalid transition
String message = "Got illegal state transition in parser, suspect malformed XML.\n" +
"At line " + parser.getLineNumber() + ", col " + parser.getColumnNumber() + ".\n" +
"Invalid state transition: " + parserState.toString() + " -> " + newState.toString();
log.error(message);
throw new PalantirException(message);
}
}
else{
// unknown tag: ignore
}
// dispatch to handling method
dispatch();
}
You can see how we leverage the encoding in the ParserState Enum to make this a very simple method. Since we’re encoding tag-to-state mappings, we can easily lookup the appropriate state for this tag (go back and look at the static initializer if I lost you there). Once we know the proposed new state, we can check whether or not the state transition is a valid one, throwing an exception on an invalid state (which implies a malformed document). Finally, we call dispatch(), which is nothing but a big switch statement using all the possible states from the Enum:
private void dispatch() throws XmlPullParserException, PalantirException {
// logging statements removed for readability
switch (parserState) {
case DOMAIN:
processDomain();
break;
case KINGDOM:
processKingdom();
break;
case PHYLUM:
processPhylum();
break;
case DIVISION:
processDivision();
break;
case CLASS:
processClass();
break;
case FAMILY:
processFamily();
break;
case ORDER:
processOrder();
break;
case GENUS:
processGenus();
break;
case SPECIES:
processSpecies();
break;
default:
break;
}
}
The methods for doing the processing for TEXT and END_TAG methods are much simpler but also leverage the dispatch() method:
public void processEndElement() throws XmlPullParserException, PalantirException {
dispatch();
// already know to be valid, since transition
// validity is bi-directional
parserState = stateStack.pop();
}
public void processText() throws XmlPullParserException, PalantirException {
dispatch();
}
What you end up with is a very straightforward, readable, and performant dispatching framework in your processor. The code practically writes itself! All that’s left to do is to define each of the tag processing methods, which will have to understand how to process the attributes (START_TAG), any free text (TEXT), and finally close up any data structures which the tag data is being recorded to. They will all loosely follow this form:
void processTag() throws XmlPullParserException {
switch (parser.getEventType()) {
case XmlPullParser.START_TAG:
// handle attributes and tag existence here
break;
case XmlPullParser.END_TAG:
// handle close-up here
break;
case XmlPullParser.TEXT:
// handle free text here
break;
}
}
This ends up being really nice since all the logic about each tag is encapsulated in a single method, facilitating debugging, modifications, and having other people read and actually understand your code (not the easiest thing to do with SAX Parser callbacks!).
Putting it all together
So the demo that shows this off parses the instance document that’s filled with entries like this:
<tns:domain name='Bacteria'>
<tns:kingdom name='Monera'>
<tns:phylum name='Proteobacteria'>
<tns:class name='Proteobacteria'>
<tns:order name='Enterobacteriales'>
<tns:family name='Enterobacteriaceae'>
<tns:genus name='Escherichia'>
<tns:species name='E. coli' commonName="E. coli"/>
</tns:genus>
</tns:family>
</tns:order>
</tns:class>
</tns:phylum>
</tns:kingdom>
</tns:domain>
It then spits out strings describing each species to our Swing-based console, like this:
Lifeform 'E. coli': Bacteria : Monera : Proteobacteria : Proteobacteria : Enterobacteriales : Enterobacteriaceae : Escherichia : E. coli
Check it out and let me know about any questions or improvements or errors in the comments.







lets have some new content guys! it doesnt have to be a huge post like this one.
and keep up the good work
June 28th, 2007 at 5:22 am
You want want to look at vtd-xml, the latest and most advanced parsing/indexing/XPath engine that offers a lot of cool features
vtd-xml
November 28th, 2009 at 5:54 pm
That does look really interesting! At the moment, we’re happy with performance of our pull parser, but the ability to use XPath over large documents could be really useful — we had to build our own validator using the pull parser to run some checks on our documents that aren’t handled by normal XML Schema validation.
November 29th, 2009 at 3:56 pm
Great article. I’m thinking through some ways of dealing with poorly written XML handling from 11 years ago and this is thought provoking. Th XML is so bad they even wrote their own custom parsers!
In your code, why not just have the enum do it’s own processing?
switch (parserState) {
06. case DOMAIN:
07. processDomain();
Instead, just throw away the switch state and create a process method and do this:
parserState.process()
An Enum in Java can have an abstract method definition requiring all declared Enums to implement it. If you have an “abstract void process()” method declared, each of your Enums would have to implement a method for handling that call.
So if they all have process method, then each Enum executes its own processing code. It is similar to a Command class executing whatever it’s logic is; the caller need not be aware of the Command doing the executing. It is only aware of the type.
The switch statement isn’t needed and loses a bit of the power of it.
May 21st, 2010 at 4:17 pm