My first project at Barcap involves automating a manual process: downloading a bunch of files that contain similar information in different formats and converting them to an internal CSV format. The internal version will probably be put into a database at some point, but CSV is good enough for version 1.

Attempt 1: the class hierarchy

Downloading files is easy, so the real fun happens when trying to find an elegant way to read them in their many forms. My first API was based on a class hierarchy of what I called ‘deserialisers’, essentially just customised CSV file readers. The main interface looked like this:

public interface Deserialiser<T extends AbstractRecord> {
    public List<T> deserialise(File input);
}

That provides all the flexibility required to read the different formats. I added a mechanism to skip extraneous header rows and implemented everything else by iterating over the lines in the file and parsing each one individually.

public abstract class CSVDeserialiser<T extends AbstractRecord>
        implements Deserialiser<T> {
    public List<T> deserialise(File input) {
        final List<T> records = new ArrayList<T>();
        // Header-row skipping elided for brevity.
        try (BufferedReader reader = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(parseLine(line));
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return records;
    }

    // Override to parse different formats.
    protected abstract T parseLine(final String line);
}
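To illustrate how a concrete format slots in, a subclass only has to implement parseLine. The record type and column layout here are made up for the example (PriceRecord standing in for some subclass of AbstractRecord), not one of the real feeds:

// Hypothetical format: two comma-separated columns, an identifier and a price.
public class PriceDeserialiser extends CSVDeserialiser<PriceRecord> {
    @Override
    protected PriceRecord parseLine(final String line) {
        final String[] fields = line.split(",");
        return new PriceRecord(fields[0], Double.parseDouble(fields[1]));
    }
}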

There are a number of problems with this first attempt, though. Rigid class hierarchies quickly become very difficult to maintain, a problem I noticed around the time I had to read five different formats, one of which mixed two fundamentally different kinds of data that needed to be separated.

Another problem is memory usage. While the garbage collector does an admirable job of cleaning up after shoddy code for the most part, this API requires reading the whole file and storing all of its records in one object. If the files grow in the future, this may cause problems on some machines.

Attempt 2: a more abstract data type

It was at this point that I remembered a couple of third-party libraries that might have useful classes. At the time I was looking for a convenient string splitter (org.apache.commons.lang and com.google.common.base both have one, each useful in different scenarios). While digging around in the Guava documentation, I found Table<R, C, V>, a data structure easiest to describe as a two-dimensional mapped table (think spreadsheet). So I decided to have a go at building a table instead of a collection of records.

The table code was a lot easier to write and refactor without relying on complex type hierarchies, but it still suffers from the memory limitations of the so-called deserialiser. A doubly hash-mapped table only makes the memory situation worse, by needing to maintain hashes of all the objects, of which there could be many thousands. Something more robust was needed.
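For concreteness, the table-building code looked roughly like this. It's a sketch rather than the real thing: the buildTable helper and its headers parameter (the column names taken from the header row) are just for illustration.

import java.util.List;

import com.google.common.base.Splitter;
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

// Build a Table keyed by row number and column name from pre-read lines.
static Table<Integer, String, String> buildTable(List<String> lines,
        List<String> headers) {
    Table<Integer, String, String> table = HashBasedTable.create();
    int row = 0;
    for (String line : lines) {
        int col = 0;
        for (String field : Splitter.on(',').trimResults().split(line)) {
            table.put(row, headers.get(col++), field);
        }
        row++;
    }
    return table;
}

Cell access then reads like a spreadsheet lookup: table.get(3, "price").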

Attempt 3: custom iterators

While thinking about the memory problem, I decided the most sensible way to read a file line by line is to deal with the data as it streams in, either by writing a wrapper around a standard input reader or by writing a custom iterator. Since the Iterator API is less ugly for a client, that's the approach I took.

public abstract class AbstractRecordIterator<T extends AbstractRecord>
        implements Iterator<T>, Closeable {

    protected final BufferedReader reader;

    public AbstractRecordIterator(File input) throws IOException {
        reader = new BufferedReader(new FileReader(input));
    }

    public abstract T next();

    public final void remove() {
        throw new UnsupportedOperationException("remove is not supported");
    }

    public void close() throws IOException {
        reader.close();
    }

    // Safety net only: finalizers are not guaranteed to run.
    protected void finalize() throws Throwable {
        try {
            close();
        } finally {
            super.finalize();
        }
    }
}
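A concrete subclass then reads one line ahead so hasNext() can answer accurately. Again, PriceRecord and its two-column format are just the hypothetical example from earlier:

public class PriceRecordIterator extends AbstractRecordIterator<PriceRecord> {
    private String nextLine;

    public PriceRecordIterator(File input) throws IOException {
        super(input);
        nextLine = reader.readLine(); // prime the look-ahead
    }

    public boolean hasNext() {
        return nextLine != null;
    }

    public PriceRecord next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        final String[] fields = nextLine.split(",");
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return new PriceRecord(fields[0], Double.parseDouble(fields[1]));
    }
}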

Now we've solved the memory problem and the code is still quite simple to work with and maintain. There are caveats on how the iterator can be used safely, as we don't want to leave any file handles open when the JVM exits, but there's at least a clean API to work with. The finalizer is only a safety net, as finalizers may never be called under some circumstances, so clients should still close the iterator explicitly. It also helps that we can now easily wrap it in an Iterable<T> to enable for-each loops.
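The wrapper itself is a one-liner; the important part is that the for-each syntax hides the iterator, so the client still has to close it explicitly. Here is a sketch using the hypothetical PriceRecordIterator from above:

final PriceRecordIterator it = new PriceRecordIterator(input);
try {
    // Adapt the iterator for the for-each syntax.
    Iterable<PriceRecord> records = new Iterable<PriceRecord>() {
        public Iterator<PriceRecord> iterator() {
            return it;
        }
    };
    for (PriceRecord record : records) {
        System.out.println(record); // or whatever processing is needed
    }
} finally {
    it.close();
}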

While I'm the first to admit that this design is far from perfect, it ticks all the boxes for simplicity and ease of use without exhibiting scalability problems on the files I'm testing with. It is also really easy to add support for more input formats, as nearly all of them use some sort of CSV layout.