(((This whole section is draft)))

Java Developer's Guide to Parsing

This little series about parsing grew out of the programming exercises that I put together at work. This one in particular came from Dave Thomas' Code Kata on Data Munging.

When I assembled the exercise, I threw out a handful of hints, most of them suggestions about different ways to parse the data. The data munging in the exercise is pretty simple, but parsing text is one of those things all developers hate (except maybe Perl programmers). Fortunately, the current Java APIs give a lot of options for parsing.

In the old days, we would have just used a StringTokenizer. The StringTokenizer is so well known and widely used that I'm not even going to cover it here. If you've never used it, the entry in the Java Almanac on parsing strings covers it pretty well.

Issues

There are a few issues to keep in mind when developing a parser. The importance of any one depends upon your situation and the application being developed. Just make sure to consider each of them, even if you ultimately decide to ignore some.

I18N

The most commonly overlooked issue with parsing is internationalization (I18N). Checks like if( val >= 'a' && val <= 'z'){... miss even things close to home like ñ and accented characters, let alone other languages with completely different character sets.
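To make the problem concrete, here is a minimal sketch comparing the ASCII range check with the Unicode-aware test in java.lang.Character. The class and method names are made up for illustration, and the sample characters are written as Unicode escapes so the source file encoding does not matter.

```java
class LetterCheck {
    // The naive ASCII-only range check.
    static boolean isLowerAscii(char val) {
        return val >= 'a' && val <= 'z';
    }

    // Unicode-aware alternative from the core API.
    static boolean isLowerUnicode(char val) {
        return Character.isLowerCase(val);
    }

    public static void main(String[] args) {
        char[] samples = {'a', '\u00f1', '\u00e9'}; // a, n with tilde, e with acute
        for (char c : samples) {
            System.out.println(c + " ascii=" + isLowerAscii(c)
                    + " unicode=" + isLowerUnicode(c));
        }
    }
}
```

The range check rejects the two accented letters, while Character.isLowerCase accepts all three.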

Reversibility

Reversibility means that the same classes used for parsing can also write data out in the same format. In the core API, reversibility is addressed by the Format implementation classes. Reversibility is also addressed in most other libraries for parsing XML, CSV, and TLD. If you want to both read and write a particular format, don't forget about reversibility.
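As a small illustration of reversibility with a core Format class, the sketch below uses one NumberFormat both to parse and to format. The class name RoundTrip and the choice of Locale.US are assumptions made for the example, not something the exercise prescribes.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class RoundTrip {
    // Parse text such as "1,234.5" using US number conventions.
    static Number read(String text) {
        try {
            return NumberFormat.getNumberInstance(Locale.US).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    // Write the value back out with the same kind of Format.
    static String write(Number value) {
        return NumberFormat.getNumberInstance(Locale.US).format(value.doubleValue());
    }

    public static void main(String[] args) {
        Number n = read("1,234.5");
        System.out.println(n + " -> " + write(n));
    }
}
```

A fresh NumberFormat is created on each call because the Format classes are not thread-safe; real code might cache one per thread instead.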

The Techniques

Each of the techniques is listed in the menu in the upper left of the page.

XML

There are very few good reasons to implement your own XML parser. Many commercial and open source ones are available. They are mature, perform well, and are well tested. I'll cover a few basic techniques here.
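As a taste of what using an existing parser looks like, here is a minimal sketch with the DOM parser that ships with the JDK. The element name "city" and the class name are hypothetical, chosen only for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlCities {
    // Collect the text of every <city> element in the document.
    static List<String> cityNames(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("city");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                cityNames("<cities><city>Oslo</city><city>Lima</city></cities>"));
    }
}
```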

CSV

Good ol' comma separated values. Let's just parse them using the Jakarta Commons CSV library. (commons.apache.org/sandbox/csv/)

Tab Delimited Data

An oldie, but still in widespread use. ...

Regular Expressions

A favorite of system admins and Perl programmers. Infinitely flexible, notoriously hard to maintain. However, if you know them, you can get an amazing amount done in very little time.
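To give a flavor of the approach, the sketch below pulls three numeric columns out of a whitespace-aligned row and returns the difference between the second and third. The column layout (day, maximum, minimum, with an optional trailing asterisk) is an assumption loosely modeled on a weather data file, and the class name is made up.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeatherLine {
    // Assumed layout: day, max, min; values may carry a trailing '*'.
    private static final Pattern ROW =
            Pattern.compile("^\\s*(\\d+)\\s+(\\d+)\\*?\\s+(\\d+)\\*?");

    // Returns max - min, or empty when the line is not a data row.
    static OptionalInt spread(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        int max = Integer.parseInt(m.group(2));
        int min = Integer.parseInt(m.group(3));
        return OptionalInt.of(max - min);
    }

    public static void main(String[] args) {
        System.out.println(spread("   9  86    32*   59"));
        System.out.println(spread("  Dy MxT   MnT   AvT"));
    }
}
```

Header and blank lines simply fail to match, so the caller can skip them without any special casing.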

Message Format

A reversible method for parsing and generating formatted text. The built-in MessageFormat can be very finicky when parsing. When implementing your own Format class, you can make it behave as you desire. Though tedious to write, if you are providing a library for others to use, they will appreciate it.
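Here is a minimal sketch of the built-in MessageFormat used in both directions. The pattern and the class name are hypothetical; the placeholders are left untyped, so parsed values come back as Strings.

```java
import java.text.MessageFormat;
import java.text.ParseException;

class ScoreLine {
    // Hypothetical pattern; untyped placeholders parse back as Strings.
    private static final String PATTERN = "{0} scored {1} goals";

    // Text -> fields.
    static Object[] read(String text) {
        try {
            return new MessageFormat(PATTERN).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Does not match pattern: " + text, e);
        }
    }

    // Fields -> text, using the same pattern.
    static String write(Object... fields) {
        return new MessageFormat(PATTERN).format(fields);
    }

    public static void main(String[] args) {
        Object[] fields = read("Arsenal scored 79 goals");
        System.out.println(fields[0] + " / " + fields[1]);
        System.out.println(write(fields));
    }
}
```

The finicky part shows up quickly: the input has to match the literal text of the pattern exactly, including spacing, or parse throws a ParseException.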

Scanner

The Scanner class is very useful for reading just what you want, a piece at a time. The built-in type conversion methods are extremely handy.
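As a quick sketch of the piece-at-a-time style, the method below walks through a line, adds up every token that converts to an int, and skips everything else. The class name and the sample row are made up for illustration.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerSum {
    // Sum every token that Scanner can convert to an int; skip the rest.
    static int sumInts(String text) {
        int sum = 0;
        try (Scanner sc = new Scanner(text).useLocale(Locale.ROOT)) {
            while (sc.hasNext()) {
                if (sc.hasNextInt()) {
                    sum += sc.nextInt();
                } else {
                    sc.next(); // not a number, move on
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("Arsenal 79 - 36"));
    }
}
```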

CharBuffer

High performance, but you will have to do all of the heavy lifting.
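To show what the heavy lifting looks like, here is a minimal hand-rolled whitespace tokenizer over a CharBuffer. It is only a sketch under the assumption that tokens are separated by whitespace; the class name is hypothetical, and no performance claims are made for this particular version.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;

class CharBufferTokens {
    // Walk the buffer one char at a time and split on whitespace by hand.
    static List<String> tokens(CharSequence text) {
        CharBuffer buf = CharBuffer.wrap(text);
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        while (buf.hasRemaining()) {
            char c = buf.get();
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0); // start the next token
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString()); // trailing token with no whitespace after it
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("  9  86 32*"));
    }
}
```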

Java NIO

There is a whole lot more to NIO than just the CharBuffer. The other classes are great for parsing through binary data.
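For binary data, ByteBuffer is the usual starting point. The sketch below reads a made-up fixed-layout record (a 4-byte id followed by two 2-byte values, big-endian); the layout and names are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BinaryRecord {
    // Assumed layout: int id, short max, short min, all big-endian.
    static String describe(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
        int id = buf.getInt();      // bytes 0-3
        short max = buf.getShort(); // bytes 4-5
        short min = buf.getShort(); // bytes 6-7
        return id + ":" + max + ":" + min;
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 0, 9, 0, 86, 0, 32};
        System.out.println(describe(raw));
    }
}
```

Switching to ByteOrder.LITTLE_ENDIAN is a one-word change, which is most of the appeal over shifting and masking bytes by hand.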

Script Engines

Is the Ruby guy laughing at the work it takes to parse through 80 character wide financial flat files? Why not call a little JRuby script from your code to do the work? Let me show how scripting languages are becoming first-class citizens in the Java world.