Thanks. I am trying to see if this kind of thing is of general interest. The parser is written in Python and runs on Google App Engine (GAE). Unfortunately, GAE ships a very outdated version of lxml, which is limiting. If there is enough interest, I'll move to AWS EC2, deploy the latest libraries, and improve the code :)