Also, what are the limitations of abiword for doc/docx files?
Where do you get your doc files?
Are they just the ones submitted to your site, or is there a pastebin or similar repo of doc files?
I am curious for more details about why Tika wasn't good enough. Please explain.
Will take a look at pandoc. Thanks for suggesting.
How could you further parse the extracted text to pull out structured data? For example: names of candidates, addresses, previous employers, job titles held.
A simple 30-line flex/yacc combo will handle this effectively for a percentage of cases in the high nineties.
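To make the rule-based idea concrete, here is a rough Python sketch using a regular expression rather than flex/yacc. It is not the commenter's grammar: the line shape it matches ("<job title> at <employer>, <start year>-<end year>") and the names JOB_LINE and extract_jobs are assumptions made up for illustration, and real resumes will need many more rules.

```python
import re

# Hypothetical line shape: "<job title> at <employer>, <start year>-<end year>".
# Real resumes vary widely; this single rule is only an illustration.
JOB_LINE = re.compile(
    r"^(?P<title>[^,]+?)\s+at\s+(?P<employer>[^,]+),\s*"
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4}|present)$",
    re.IGNORECASE,
)


def extract_jobs(text):
    """Return one dict per line that matches the assumed job-history shape."""
    jobs = []
    for line in text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            jobs.append(match.groupdict())
    return jobs
```

Lines that do not fit the pattern are simply skipped, so rules can be added one at a time as new resume layouts turn up.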
I use wvHtml for doc->html, wvPDF for doc->pdf, but antiword for doc->txt. To convert .docx, .xls, .xlsx, and WordPerfect files to HTML, I use OpenOffice, by way of jodconverter. For ODF files, I use OdfConverter. Conversion of Excel files to .csv files uses xls2csv. For PowerPoint files, I use ppthtml to convert to html, and catppt to convert to text. For Lotus 1-2-3 files (I added this after downloading some historical telecom data from the FCC!), I use ssconvert.
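The tool choices above amount to a lookup from (source extension, target format) to a converter. A minimal Python sketch of that lookup follows. It only picks a tool name and does not run anything, since the exact command lines are not given here. The table name TOOLS, the function pick_tool, and the extensions chosen for WordPerfect, ODF and Lotus 1-2-3 files are assumptions.

```python
from pathlib import Path

# (source extension, target format) -> tool named in the comment above.
# A target of "*" means the comment does not say which output format is produced.
TOOLS = {
    (".doc", "html"): "wvHtml",
    (".doc", "pdf"): "wvPDF",
    (".doc", "txt"): "antiword",
    (".docx", "html"): "jodconverter",  # drives OpenOffice
    (".xls", "html"): "jodconverter",
    (".xlsx", "html"): "jodconverter",
    (".wpd", "html"): "jodconverter",   # assumed WordPerfect extension
    (".xls", "csv"): "xls2csv",
    (".ppt", "html"): "ppthtml",
    (".ppt", "txt"): "catppt",
    (".odt", "*"): "OdfConverter",      # assumed ODF extension
    (".wk1", "*"): "ssconvert",         # assumed Lotus 1-2-3 extension
}


def pick_tool(filename, target):
    """Return the converter name for this file and target, or None if unknown."""
    ext = Path(filename).suffix.lower()
    return TOOLS.get((ext, target)) or TOOLS.get((ext, "*"))
```

For example, pick_tool("resume.DOC", "txt") returns "antiword", and an unrecognised extension returns None so the caller can decide how to fail.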
For any conversion that results in an HTML file (e.g. doc or pdf to html), I bundle all the images into a single file using the data: URL scheme. To do this, I wrote a utility called pagecan: http://afiler.com/pagecan/
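For anyone curious what inlining images with data: URLs looks like, here is a small Python sketch. It is not how pagecan works internally (that is not described here). It assumes images are referenced by double-quoted src attributes on img tags pointing at local files, and the names IMG_SRC and inline_images are made up for illustration.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of <img> tags written with double quotes.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local image references in html with base64 data: URLs."""

    def to_data_url(match):
        src = match.group(2)
        path = Path(base_dir) / src
        # Leave remote, already-inlined, or missing images untouched.
        if src.startswith(("data:", "http:", "https:")) or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(to_data_url, html)
```

A regex is enough for converter output with predictable markup. For arbitrary HTML, a real parser would be the safer choice.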
Could we see some code or a demo?