Ian Gilham

jpdfstat - A Tool for Working With PDF Documents

11 February 2011

A while back, I was working on a PDF tool called Documenity. The idea for the tool grew out of some hacking I did in C# using a proprietary API while working at Tek Translation International. We would frequently receive huge batches of PDF files for analysis, and would need to provide a localisation cost estimate based on the number of words, pages and images in the files.

Since PDF documents are frequently constructed so poorly that counting images simply doesn’t provide useful data, we came up with a heuristic: if a page had at least one image and fewer words than a certain maximum, then it was a ‘graphical page’ that could be treated separately from textual pages for billing purposes. This approach can be applied before or after running the documents through an Optical Character Recognition (OCR) engine, and comparing the two results gives good visibility into the quality of the heuristic.
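To make the heuristic concrete, here is a minimal sketch in Java. It is not the original Tek code or the jpdfstat implementation: the class name `GraphicalPageHeuristic`, the `PageStats` holder and the `maxWords` parameter are all made up for illustration, and it assumes the per-page word and image counts have already been extracted by some PDF library.

```java
import java.util.List;

/** Illustrative sketch of the 'graphical page' heuristic; all names are hypothetical. */
public final class GraphicalPageHeuristic {

    /** Per-page counts, assumed to have been extracted elsewhere. */
    public static final class PageStats {
        final int wordCount;
        final int imageCount;

        public PageStats(int wordCount, int imageCount) {
            this.wordCount = wordCount;
            this.imageCount = imageCount;
        }
    }

    private GraphicalPageHeuristic() {
        // static helpers only
    }

    /** A page is 'graphical' if it has at least one image and fewer than maxWords words. */
    public static boolean isGraphical(PageStats page, int maxWords) {
        return page.imageCount >= 1 && page.wordCount < maxWords;
    }

    /** Counts the pages in a document that the heuristic classifies as graphical. */
    public static int countGraphicalPages(List<PageStats> pages, int maxWords) {
        int count = 0;
        for (PageStats page : pages) {
            if (isGraphical(page, maxWords)) {
                count++;
            }
        }
        return count;
    }
}
```

Running the same count over a document's stats before and after OCR, with the same `maxWords`, is one way to see how many pages change category once the text hidden in images becomes countable.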

The tool essentially took a big pile of PDF files and output a CSV file with all the stat counts. I later provided some convenience functions for extracting text for analysis with a Translation Memory (TM) and for extracting images for localisation and editing. Image extraction proved to be a problem not worth solving in many cases, due to the way documents were constructed. I suspect that converting a so-called ‘graphical page’ into a single image would be a more useful approach but never got that far.
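For a rough idea of what that CSV output step can look like, here is a small Java sketch. The column names, the `DocumentStats` holder and the `StatsCsv` class are assumptions for illustration, not the format the original tool or jpdfstat actually uses. The only real subtlety is quoting file names that contain commas or quotes.

```java
import java.util.List;

/** Illustrative sketch of writing per-document stats as CSV; all names are hypothetical. */
public final class StatsCsv {

    /** Per-document counts, assumed to have been gathered elsewhere. */
    public static final class DocumentStats {
        final String fileName;
        final int pages;
        final int words;
        final int images;

        public DocumentStats(String fileName, int pages, int words, int images) {
            this.fileName = fileName;
            this.pages = pages;
            this.words = words;
            this.images = images;
        }
    }

    private StatsCsv() {
        // static helpers only
    }

    /** Quotes a field if it contains a comma, quote or line break, doubling any quotes. */
    static String escape(String field) {
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Renders a header row followed by one row per document. */
    public static String toCsv(List<DocumentStats> docs) {
        StringBuilder sb = new StringBuilder("file,pages,words,images\n");
        for (DocumentStats d : docs) {
            sb.append(escape(d.fileName)).append(',')
              .append(d.pages).append(',')
              .append(d.words).append(',')
              .append(d.images).append('\n');
        }
        return sb.toString();
    }
}
```

Returning the CSV as a string keeps the sketch easy to test; a real batch run over a big pile of files would more likely stream rows to a writer as each document is processed.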

My work for Tek was proprietary, so I did not take any of the code with me. I did, however, have a go at starting from scratch and building at least some of the simpler features for my Documenity project. I had a few talks with another localisation company about using it in their own automated analysis and quoting platform, but it never really went anywhere.

Since then, I’ve been at university, so I hadn’t given the project much thought. I picked up the idea again briefly in the summer of 2010 to play with various PDF document libraries, but stopped when I got a job. I looked at it again last week and noticed how much my code sucked, so I set about reworking the design and finally building a useful tool.

My new PDF tool is called jpdfstat (it’s horrible, I know) and is written in Java using the Apache PDFBox library, which I have found to be pretty good. My code is still only half implemented and barely tested, but I think the overall design is approaching usability now, so I’ve created an open source project on Bitbucket to share it.

What’s next? I want to build a small library of documents for regression testing and build out the unit test suite before thinking too much about the library implementation. I also intend to write a simple command line application to expose some of the more useful functionality. The original tool was created to solve a problem, so my main focus will be to cover all the same bases again before I start thinking about what else might be useful.