(((This whole section is draft)))
posted: September 14, 2010

Using Java to Parse XML with SAX

I'm sure everyone knows the main reasons to use SAX to parse XML (speed, lower memory use, support for large documents, etc.). Let me add a couple of items to the mix. If you are developing a library to be reused, then by using SAX you don't introduce any new dependencies to a project that uses your library, because everything needed to parse with SAX is built into the standard Java libraries. Also, if you are developing for mobile platforms, like Android or BlackBerry, you'll be able to conserve resources better.

Generally, the best mindset to take when parsing with SAX is to think of your handler as a little state machine: your place in the document determines what actions to take. It helps that a DefaultHandler is completely event driven. One thing to keep in mind: because the handler carries state from one event to the next, it won't be thread safe. So, if you have parsing happening in multiple threads, be sure that each thread has its own handler.
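To make the state machine idea concrete, here is a minimal sketch. The class name ItemTitleCounter and its helper method are made up for illustration and are not part of the demo project. It counts title tags, but only while the parser is inside an item, and it builds a fresh handler for every parse so no state is shared between threads.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: counts title elements, but only those inside an item. */
class ItemTitleCounter extends DefaultHandler {
  // The "state" of the little state machine: are we inside an item right now?
  private boolean insideItem = false;
  private int titleCount = 0;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    if ("item".equals(qName)) {
      insideItem = true;
    } else if ("title".equals(qName) && insideItem) {
      titleCount++;
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("item".equals(qName)) {
      insideItem = false;
    }
  }

  int getTitleCount() {
    return titleCount;
  }

  /** Uses a new parser and a new handler per call, so concurrent callers never share state. */
  static int countItemTitles(String xml) {
    try {
      ItemTitleCounter handler = new ItemTitleCounter();
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler.getTitleCount();
    } catch (Exception e) {
      // Checked exceptions are wrapped only to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }
}
```

The channel-level title is skipped because the handler is not in the "inside item" state when it shows up.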

For a demo, let's start with a very simple RSS parser. It will generate a list of beans that look like the following:


public class RssItem {
  private String pubDate;
  private String title;
  private String description;
  private String link;
  private String author;

  /* getters and setters follow */
}

Note: depending on the application, I'd generally recommend creating immutable instances of data classes instead of the common getter/setter type bean.
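For comparison, an immutable version could look something like the sketch below. The class name ImmutableRssItem and the shape of its Builder are assumptions made for illustration; the fields are the same five as in RssItem.

```java
/** Hypothetical immutable variant of RssItem; the name and Builder API are assumptions. */
final class ImmutableRssItem {
  private final String pubDate;
  private final String title;
  private final String description;
  private final String link;
  private final String author;

  // Only the Builder can create instances, and every field is final.
  private ImmutableRssItem(Builder b) {
    this.pubDate = b.pubDate;
    this.title = b.title;
    this.description = b.description;
    this.link = b.link;
    this.author = b.author;
  }

  String getPubDate() { return pubDate; }
  String getTitle() { return title; }
  String getDescription() { return description; }
  String getLink() { return link; }
  String getAuthor() { return author; }

  /** Mutable helper to fill in while parsing; build() freezes the values. */
  static final class Builder {
    private String pubDate;
    private String title;
    private String description;
    private String link;
    private String author;

    Builder pubDate(String value) { this.pubDate = value; return this; }
    Builder title(String value) { this.title = value; return this; }
    Builder description(String value) { this.description = value; return this; }
    Builder link(String value) { this.link = value; return this; }
    Builder author(String value) { this.author = value; return this; }

    ImmutableRssItem build() { return new ImmutableRssItem(this); }
  }
}
```

A handler would keep a Builder as its working object and call build() when an item closes. The rest of this page sticks with the plain bean to keep the listings short.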

Ok, time to really get started.

Although I'm not a strict follower of TDD, I like to use it when creating parsers. It helps ensure existing features don't get broken while working on new ones.

To get started, I'll use the rss feed from my site and a little test. Here's the project layout so far:


`-- sax-demo-one
    |-- pom.xml
    `-- src
        |-- main
        |   `-- java
        |       `-- com
        |           `-- willcode4beer
        |               `-- demo
        |                   `-- sax
        |                       `-- RssItem.java
        `-- test
            |-- java
            |   `-- com
            |       `-- willcode4beer
            |           `-- demo
            |               `-- sax
            |                   `-- RssHandlerTest.java
            `-- resources
                `-- com
                    `-- willcode4beer
                        `-- demo
                            `-- sax
                                `-- samplerss.xml

The Unit Test

First, let's have the setup perform the quintessential parsing operation:


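import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.junit.Before;

public class RssHandlerTest {

  private final SAXParserFactory factory = SAXParserFactory.newInstance();
  private RssHandler handler;

  // Parse the sample feed before each test so every test sees a populated handler.
  @Before
  public void loadUpTheData() throws Exception {
    handler = new RssHandler();
    // Resolved relative to this class's package under src/test/resources.
    InputStream in = this.getClass().getResourceAsStream("samplerss.xml");
    SAXParser parser = factory.newSAXParser();
    parser.parse(in, handler);
  }

  /* test methods follow */
}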

Immediately, the test will fail to compile because the RssHandler class doesn't exist. So, create it (note: in Eclipse, you could press Ctrl+1 and take the option to create the class). It should extend org.xml.sax.helpers.DefaultHandler:


package com.willcode4beer.demo.sax;
import org.xml.sax.helpers.DefaultHandler;

public class RssHandler extends DefaultHandler {
}

I would like the RssHandler to return a List of RssItem objects when it's finished parsing. So, I'll start with what I want (in the test) and use the IDE's quick fix to fill it in.


  @Test
  public void validateInterface() {
    List<RssItem> items = handler.getItems();
    assertNotNull(items);
  }

Back to the RssHandler. I'll create the collection, and have the getItems() method return it.


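import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

public class RssHandler extends DefaultHandler {

  private List<RssItem> items = new LinkedList<RssItem>();

  // Return a read-only view so callers can't modify the handler's list.
  public List<RssItem> getItems() {
    return Collections.unmodifiableList(items);
  }
}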

Time to Start Really Parsing

Ok, the test passes. Woohoo. But we aren't really doing anything yet. I'm going to skip ahead a little. Instead of starting with the obvious chain of if-tests on the tag names (TDD style), I'm going to jump into one of my common techniques for dealing with tags.

Inside the RssHandler class, I'll create an interface (yea, because Java doesn't support closures).


  private interface TagWorker {
    void handleTag(String data);
  }

Next, I'll create a map from tag names to the actions to perform, and override the endElement method to use it.


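  // The item being populated; swapped for a fresh one each time an item closes.
  private RssItem currentItem = new RssItem();

  // Maps a tag name to the action to run when that tag's closing element is seen.
  private final Map<String,TagWorker> tagWorkers = new HashMap<String,TagWorker> (){{
    put("item",new TagWorker() {
      @Override public void handleTag(String data) {
        items.add(currentItem);
        currentItem = new RssItem();
      }
    });
  }};

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
    // Look up the worker for this tag; tags without a worker are ignored.
    TagWorker worker = tagWorkers.get(qName);
    if (worker != null) {
      worker.handleTag(null);
    }
  }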

This basically adds our current working RssItem object to the list, and creates a new working RssItem, every time the closing "item" tag is found.
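As an aside, if you are on Java 8 or later (an assumption; this page was written before lambdas existed), the same map-based dispatch can be written without the anonymous classes. The sketch below is a stand-alone illustration, not the demo project's code: LambdaDispatchHandler and countItems are made-up names, and it only counts items instead of building RssItem objects.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: map-based tag dispatch with lambdas as the actions. */
class LambdaDispatchHandler extends DefaultHandler {
  // Maps a tag name to the action to run when its closing tag is seen.
  private final Map<String, Consumer<String>> tagWorkers = new HashMap<>();
  private int itemCount = 0;

  LambdaDispatchHandler() {
    tagWorkers.put("item", data -> itemCount++);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    Consumer<String> worker = tagWorkers.get(qName);
    if (worker != null) {
      worker.accept(null); // no character data is collected in this sketch
    }
  }

  int getItemCount() {
    return itemCount;
  }

  /** Parses the given XML string with a fresh handler and returns the item count. */
  static int countItems(String xml) {
    try {
      LambdaDispatchHandler handler = new LambdaDispatchHandler();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler.getItemCount();
    } catch (Exception e) {
      // Checked exceptions are wrapped only to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }
}
```

The dispatch logic in endElement is unchanged; only the way the actions are declared gets shorter.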

So, we can follow this up with a test to verify the number of "item" elements in the RSS feed.


  @Test
  public void validateItemCount() {
    List<RssItem> items = handler.getItems();
    assertThat(items.size(),is(32));
  }

Collecting the text from the tags is just a matter of adding TagWorker entries to the map, one for each tag whose text we want to capture.

In the handler, we'll add a java.lang.StringBuilder to collect the data, override the characters() method, and modify the endElement method.


  private StringBuilder charData = new StringBuilder();

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
    TagWorker worker = tagWorkers.get(qName);
    if (worker != null) {
      worker.handleTag(charData.toString().trim());
    }
    charData.delete(0, charData.length());
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    charData.append(ch, start, length);
  }
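Putting the pieces together, a handler with a few text-collecting workers might look like the following stand-alone sketch. SimpleItem is a hypothetical two-field stand-in for RssItem, and TextCollectingHandler and its parse helper are made-up names, so treat this as an illustration of the technique and not as the code in the download.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for RssItem, trimmed to two fields. */
class SimpleItem {
  String title;
  String link;
}

class TextCollectingHandler extends DefaultHandler {
  private interface TagWorker {
    void handleTag(String data);
  }

  private final List<SimpleItem> items = new ArrayList<SimpleItem>();
  private final Map<String, TagWorker> tagWorkers = new HashMap<String, TagWorker>();
  private final StringBuilder charData = new StringBuilder();
  private SimpleItem currentItem = new SimpleItem();

  TextCollectingHandler() {
    // One worker per tag whose text we want to keep.
    tagWorkers.put("title", new TagWorker() {
      @Override public void handleTag(String data) { currentItem.title = data; }
    });
    tagWorkers.put("link", new TagWorker() {
      @Override public void handleTag(String data) { currentItem.link = data; }
    });
    // The closing item tag stores the finished item and starts a new one.
    tagWorkers.put("item", new TagWorker() {
      @Override public void handleTag(String data) {
        items.add(currentItem);
        currentItem = new SimpleItem();
      }
    });
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    TagWorker worker = tagWorkers.get(qName);
    if (worker != null) {
      worker.handleTag(charData.toString().trim());
    }
    charData.setLength(0); // reset the buffer for the next element
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // May be called several times for one text node, so always append.
    charData.append(ch, start, length);
  }

  List<SimpleItem> getItems() {
    return items;
  }

  /** Parses the given XML string with a fresh handler. */
  static List<SimpleItem> parse(String xml) {
    try {
      TextCollectingHandler handler = new TextCollectingHandler();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler.getItems();
    } catch (Exception e) {
      // Checked exceptions are wrapped only to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }
}
```

One thing to watch: the channel's own title and link tags fire the same workers before the first item opens. In this sketch the first item's values simply overwrite them, but a feed whose first item is missing one of those tags would inherit the channel's value.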

Dealing With More Complex Documents

When dealing with deeply nested XML documents, especially when tag names are reused, consider creating several small handlers before you start adding lots of flag variables. Then, in the primary handler, use a stack to track the currently active handler: as you parse deeper, push handlers onto the stack; as you come back up, pop them off. The primary handler will basically delegate its event methods to whatever handler happens to be sitting on top of the stack.
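Here is one way the stack idea could be sketched. Everything in it is made up for illustration: the book/author/name document, the NameCollector and StackedHandler classes, and the choice to remember which tag pushed each handler so it can be popped on the matching closing tag.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Small handler: collects the text of every name tag it is shown. */
class NameCollector extends DefaultHandler {
  final List<String> names = new ArrayList<>();
  private final StringBuilder text = new StringBuilder();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    text.setLength(0);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("name".equals(qName)) {
      names.add(text.toString().trim());
    }
  }
}

/** Primary handler: delegates every event to the handler on top of the stack. */
class StackedHandler extends DefaultHandler {
  private final Deque<DefaultHandler> handlers = new ArrayDeque<>();
  private final Deque<String> owners = new ArrayDeque<>(); // tag that pushed each handler
  final NameCollector books = new NameCollector();
  final NameCollector authors = new NameCollector();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException {
    // Going deeper: push the handler responsible for this part of the document.
    if ("book".equals(qName)) {
      handlers.push(books);
      owners.push(qName);
    } else if ("author".equals(qName)) {
      handlers.push(authors);
      owners.push(qName);
    }
    if (!handlers.isEmpty()) {
      handlers.peek().startElement(uri, localName, qName, atts);
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    if (!handlers.isEmpty()) {
      handlers.peek().characters(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
    if (!handlers.isEmpty()) {
      handlers.peek().endElement(uri, localName, qName);
    }
    // Coming back up: pop when the tag that pushed the top handler closes.
    if (qName.equals(owners.peek())) {
      handlers.pop();
      owners.pop();
    }
  }

  /** Parses the given XML string with a fresh handler. */
  static StackedHandler parse(String xml) {
    try {
      StackedHandler handler = new StackedHandler();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler;
    } catch (Exception e) {
      throw new IllegalStateException(e); // wrapped only to keep the sketch short
    }
  }
}
```

Both the book and the author have a name tag, yet neither small handler needs a flag to tell them apart. Which handler is on top of the stack already answers that.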

Resources

I know this was pretty quick. I have the source available for download if you'd like to give it a try.
sax-demo-one.zip

I made an update to the code in the source zip. Instead of using the simple value object described in this page, it now uses a builder that returns an immutable object. If anything isn't clear, feel free to email with any questions.

As usual, feel free to send feedback to: feedback@willcode4beer.com

Author

by: Paul E Davis

