XML Parsing - The new way

Qudus · October 20, 2010, 7:10pm

Dear Community,

Many of you will certainly have to deal with XML parsing here and there. Basically, there are three ways of parsing XML in Java: the DOM approach, a SAX parser, and bytecode-manipulation approaches like JIBX and such.

I definitely don’t like the JIBX way, so it’s out for me. JDOM is nice for some smaller XML files, but it remains a quick’n’dirty solution for me, since it is extremely memory-consuming: it loads everything into memory and puts it into lists, etc., even the parts that I don’t need. And then accessing a child element is not even done in O(1), but in O(n), since not only the names but also all the namespaces have to be compared. XML namespaces are the most useless thing in the XML world anyway, though there will be opposing opinions.

I like the SAX parser approach. But there are two disadvantages.

1) Initializing the parser takes a lot of lines of code.
2) In the startElement() methods, etc., I have to know where I am in the XML hierarchy to decide what to do with a certain element.

I have written some code that drastically simplifies the whole process. Have a look at the code here.

How does it work? Let’s take a look.

Disadvantage 1 is addressed by providing a SimpleXMLParser class that selects a certain SAXParser (part of the JRE) and initializes it. Of course, this restricts you to a single parser implementation. But hey, why do we need more if one works just fine?

Now for disadvantage 2.

Let’s say, your XML looks like this (omitting the header).
###################################

<root>
    <pats>
        <dogs>
            <dog name="..." />
        </dogs>
        <cats>
            <cat name="Muschi" />
            <cat name="Pussy" />
        </cats>
    </pats>
</root>

###################################

So to parse only the dogs out of this data, you have to write an XML handler that checks in the startElement() method whether the current element is a “dog” element AND it is parented by a “dogs” element AND this is in a “pats” element AND this is in a “root” element, which actually IS a root element. OK, these checks have to be done in any case. But we can reduce the number and cost of these checks, and we can reduce the knowledge needed by the parser that only wants to get the dogs from the XML.
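For comparison, here is a rough sketch of what those checks look like with plain SAX from the JRE, without any helper classes. It is not the code from the library discussed here; the class name PlainSaxDogs, the method findDogs() and the dog names in main() are made up for illustration. The handler has to keep its own stack of open elements just to know where it is.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, plain JRE SAX only.
class PlainSaxDogs
{
    /** Collects the name attribute of every root/pats/dogs/dog element. */
    static List<String> findDogs( String xml )
    {
        final List<String> names = new ArrayList<String>();
        // Currently open ancestor elements, innermost first.
        final Deque<String> path = new ArrayDeque<String>();

        DefaultHandler handler = new DefaultHandler()
        {
            @Override
            public void startElement( String uri, String localName, String qName, Attributes attributes )
            {
                if ( qName.equals( "dog" ) && isAt( "dogs", "pats", "root" ) )
                {
                    names.add( attributes.getValue( "name" ) );
                }
                path.push( qName );
            }

            @Override
            public void endElement( String uri, String localName, String qName )
            {
                path.pop();
            }

            // True if the open ancestors are exactly the given names, innermost first.
            private boolean isAt( String... expected )
            {
                if ( path.size() != expected.length )
                {
                    return false;
                }
                Iterator<String> it = path.iterator();
                for ( String name : expected )
                {
                    if ( !name.equals( it.next() ) )
                    {
                        return false;
                    }
                }
                return true;
            }
        };

        try
        {
            // The default factory is not namespace-aware, so qName is the plain element name.
            SAXParserFactory.newInstance().newSAXParser().parse( new InputSource( new StringReader( xml ) ), handler );
        }
        catch ( Exception e )
        {
            throw new IllegalArgumentException( "Could not parse XML", e );
        }
        return names;
    }

    public static void main( String[] args )
    {
        // Dog names are invented for this example.
        String xml = "<root><pats><dogs><dog name=\"Rex\" /><dog name=\"Bello\" /></dogs>"
                   + "<cats><cat name=\"Muschi\" /></cats></pats></root>";
        System.out.println( findDogs( xml ) );
    }
}
```

The path bookkeeping and the full-path comparison on every dog element are the parts a path-aware handler with delegates is meant to take off your hands.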

So you would implement a SimpleXMLHandlerDelegate. The onElementStarted() method would look like this:
###################################
@Override
protected void onElementStarted( XMLPath path, String name, Object object, Attributes attributes ) throws SAXException
{
    // Notice that we’re querying for level 0 here!
    // The name check could even be skipped if you have designed the XML yourself
    // and know for sure that only dog elements are in here.
    if ( ( path.getLevel() == 0 ) && name.equals( "dog" ) )
    {
        System.out.println( "Found a dog called \"" + attributes.getValue( "name" ) + "\"." );
    }
}
###################################

This is everything the dogs parser needs to do and know.

Now we need a parent handler that navigates to the dogs and then delegates to our dogs handler. This would be a SimpleXMLHandler implementation with the onElementStarted() method as follows.
###################################
@Override
protected void onElementStarted( XMLPath path, String name, Object object, Attributes attributes ) throws SAXException
{
    if ( path.isAt( false, "root", "pats" ) && name.equals( "dogs" ) )
    {
        delegate( dogsHandler );
    }
}
###################################

Isn’t this simple? We could also tune the code a little bit to get rid of some String comparisons. This needs a little more code, but it’s worth it. All you have to do is override the getPathObject() method in our root handler as follows.
###################################
private static enum RootElements
{
    root;
}

private static enum Level1Elements
{
    pats;
}

private static enum Level2Elements
{
    dogs,
    cats,
    ;
}

// Maps an element name to an enum constant, depending on the current path.
// Unknown element names get a fresh placeholder object.
@Override
protected Object getPathObject( XMLPath path, String element )
{
    if ( path.getLevel() == 0 )
    {
        try
        {
            return ( RootElements.valueOf( element ) );
        }
        catch ( Throwable t )
        {
            return ( new Object() );
        }
    }
    else if ( path.isAtByObjects( false, RootElements.root ) )
    {
        try
        {
            return ( Level1Elements.valueOf( element ) );
        }
        catch ( Throwable t )
        {
            return ( new Object() );
        }
    }
    else if ( path.isAtByObjects( false, RootElements.root, Level1Elements.pats ) )
    {
        try
        {
            return ( Level2Elements.valueOf( element ) );
        }
        catch ( Throwable t )
        {
            return ( new Object() );
        }
    }

    // Anything deeper or elsewhere in the tree is not mapped to an enum.
    return ( new Object() );
}

@Override
protected void onElementStarted( XMLPath path, String name, Object object, Attributes attributes ) throws SAXException
{
    if ( object == Level2Elements.dogs ) // Simplified and cheaper test
    {
        delegate( dogsHandler );
    }
}
###################################

There’s also a SimpleXMLWriter that encapsulates an inverse SAX parser and lets you add elements and data very easily, by simply calling the writeElement() method.

On a side note, there’s also a very powerful ini file parser and writer in JAGaToo. If you’re interested, have a look here.

What do you think? Please add comments and criticism.

Marvin

Orangy_Tang · October 20, 2010, 9:22pm

Unless I’m reading you wrong, you have to hard-code the depth of the elements at which you expect them?

Personally I think part of the power of xml is having xml fragments with common handling appear at various points (and depths) within an xml tree. How does your api deal with this?

It also seems that this hardcoding would add an extra maintenance burden and make things more fragile. It’s a neat idea though, for certain kinds of xml it would probably simplify things quite a bit.

Qudus · October 20, 2010, 9:33pm

Yes, of course. It’s the same with a DOM approach, and JIBX should be even more hard-coded.

Of course, with JDOM you can navigate to a certain subtree, get the element and then delegate further processing to some code that doesn’t know about the element’s parents.
Now, with my solution you navigate to the known subtree and delegate to the next handler, which doesn’t need to know and cannot know anything about where the parent handler was. That’s the overall point here ;).

Through the delegate handlers, as described in the initial posting.

Well, you can always scan the whole XML data even through my API and identify an element only by its name or maybe one parent. It’s up to you. The point of my API is that it provides the current XML path out of the box, which you would have to code yourself when using plain SAX. And it provides mechanisms to tune the performance (element objects, see above), and especially the delegate handlers.

Marvin

i30817 · October 20, 2010, 11:53pm

You do know about StAX, don’t you?

Instead of a callback handler, you control the iteration. The result:
much simpler code when the data you want to combine is spread over many subtags.

It’s not in memory either.
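A minimal sketch of the pull style, using the javax.xml.stream API that ships with the JRE. The class name StaxDogs, the method findDogs() and the dog names in main() are made up for illustration, and for brevity it only checks that a dog sits somewhere inside a dogs element, not the full root/pats/dogs path.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, plain JRE StAX only.
class StaxDogs
{
    /** Collects the name attribute of every dog element found inside a dogs element. */
    static List<String> findDogs( String xml )
    {
        List<String> names = new ArrayList<String>();
        try
        {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader( new StringReader( xml ) );
            boolean insideDogs = false;

            // We pull events one at a time; nothing calls back into our code.
            while ( reader.hasNext() )
            {
                int event = reader.next();
                if ( event == XMLStreamConstants.START_ELEMENT )
                {
                    String name = reader.getLocalName();
                    if ( name.equals( "dogs" ) )
                    {
                        insideDogs = true;
                    }
                    else if ( insideDogs && name.equals( "dog" ) )
                    {
                        // A null namespace URI means the namespace is not checked.
                        names.add( reader.getAttributeValue( null, "name" ) );
                    }
                }
                else if ( event == XMLStreamConstants.END_ELEMENT && reader.getLocalName().equals( "dogs" ) )
                {
                    insideDogs = false;
                }
            }
            reader.close();
        }
        catch ( XMLStreamException e )
        {
            throw new IllegalArgumentException( "Could not parse XML", e );
        }
        return names;
    }

    public static void main( String[] args )
    {
        // Dog names are invented for this example.
        String xml = "<root><pats><dogs><dog name=\"Rex\" /><dog name=\"Bello\" /></dogs>"
                   + "<cats><cat name=\"Muschi\" /></cats></pats></root>";
        System.out.println( findDogs( xml ) );
    }
}
```

Because the loop is ordinary code, the state it needs (here a single boolean) lives in local variables instead of handler fields.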

JL235 · October 25, 2010, 7:52am

Maybe I’m missing something, but to be honest I see your solution needing pages of code for navigating a 14 line XML file.

For getting all ‘dog’ nodes within ‘dogs’ I’d much rather write something like:

final List<String> myDogs = new ArrayList<String>();
XML parser = new XML( myXMLFile );

parser.map( "* dogs dog", new XMLMatcher() {
    public void onMatch( XML node ) {
        myDogs.add( node.getAttribute("name") );
    }
} );

Note that is just pseudo code. The string describes what node I am after (the “* dogs dog”), the XMLMatcher holds the code for what I want to do and your library is left to parse the XML in any way it wants.

Mr_Gol · October 25, 2010, 8:56am

There must be at least 20 ways of parsing XML

Has anyone ever used XPath? I know it’s supposed to be a query language for XML, but I’ve never seen it used anywhere in production, despite it being supported by nearly all programming languages’ standard libraries.

markus.borbely · October 25, 2010, 11:36am

Yes, if you have to find that one node, list, attribute, etc., it’s for you. Instead of parsing the XML and storing all data in your own structure, you can just query the DOM for that tiny bit of information you need.
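As a rough illustration with the javax.xml.xpath API from the JRE: the class name XPathDogs, the method findDogs() and the dog names in main() are made up, and the element names follow the example XML from the first post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, plain JRE XPath only.
class XPathDogs
{
    /** Returns the name attribute of every /root/pats/dogs/dog element. */
    static List<String> findDogs( String xml )
    {
        try
        {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The whole path check is a single expression; the document is parsed into a DOM internally.
            NodeList nodes = (NodeList) xpath.evaluate( "/root/pats/dogs/dog/@name",
                                                        new InputSource( new StringReader( xml ) ),
                                                        XPathConstants.NODESET );
            List<String> names = new ArrayList<String>();
            for ( int i = 0; i < nodes.getLength(); i++ )
            {
                names.add( nodes.item( i ).getNodeValue() );
            }
            return names;
        }
        catch ( XPathExpressionException e )
        {
            throw new IllegalArgumentException( "Could not evaluate XPath on the given XML", e );
        }
    }

    public static void main( String[] args )
    {
        // Dog names are invented for this example.
        String xml = "<root><pats><dogs><dog name=\"Rex\" /><dog name=\"Bello\" /></dogs>"
                   + "<cats><cat name=\"Muschi\" /></cats></pats></root>";
        System.out.println( findDogs( xml ) );
    }
}
```

The trade-off is the one already discussed in this thread: the query is short, but the whole document is held in memory while it runs.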

deepthought · November 4, 2010, 10:41pm

personally i would recommend XOM. it works great!

Nate · November 5, 2010, 3:45am

+1 for XOM (if forced to use XML).

cylab · November 5, 2010, 11:04am

You don’t see many XML-processing applications, do you?

I have used XPath a lot, and since XPath is the main query language in XSL, so does everyone else who works with XSL.

i30817 · November 6, 2010, 1:09pm

I’m using Apache Digester now.

It is much nicer than doing it manually, though there are some gotchas:

1) If you have some tag structure that is a substring of another tag structure and you put in two listeners, one for each, say:
book/author
and
book/author/pseudonym

the first callback will be called with an empty string even if you’re reading the second type at the time.
2) Something trippy involving call order in a specific type of callback (relating to the stack design).
3) Stupid function names. I mean bad, though this is a fault of Apache generally, for some reason, I’ve found.

The library allows JavaBeans binding.