HTML preview for doc, docx, pdf & rtf

(recruiterbox.com)

44 points · _raghu · 15y ago

15 comments · 9 top-level

sushi · 15y ago · 2 in thread

UX Suggestion: Please hyperlink the Blog text beside the Recruiterbox logo. It's underlined, so users expect it to be a link.

p4bl0 · 15y ago

Also, a <title> tag would be useful :-).

But apart from this, I know I'll face this very problem soon (well, for a relatively fluctuant value of "soon"), so thank you very much for sharing this, _raghu!

grease · 15y ago

Thanks, yeah will do this :)

dpapathanasiou · 15y ago · 2 in thread

How would you compare abiword for doc/docx conversion versus antiword (http://www.winfield.demon.nl/)?

Also, what are the limitations of abiword for doc/docx files?

_raghu (OP) · 15y ago

Haven't tried antiword. As of now I find abiword pretty stable for both doc and docx. I need more data, but I found a few cases where it just hung while converting. There is no specific pattern to when the program hangs. For now I am logging such cases and timing out the conversion after 3 seconds.
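A minimal sketch of the log-and-time-out approach described above, not the poster's actual code. It assumes the converter is launched as a child process; `convert_with_timeout`, `OK_CMD`, and `HANG_CMD` are hypothetical names, and the two stand-in commands exist only so the sketch runs without abiword installed.

```python
import logging
import subprocess
import sys

log = logging.getLogger("converter")

def convert_with_timeout(cmd, timeout_s=3.0):
    """Run a converter command; return True on success, False on failure or hang."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        log.warning("conversion timed out after %.1fs: %s", timeout_s, cmd)
        return False
    except OSError as exc:
        # For example, the converter binary is not installed.
        log.warning("could not start converter %s: %s", cmd, exc)
        return False
    if result.returncode != 0:
        log.warning("conversion failed (exit %d): %s", result.returncode, cmd)
        return False
    return True

# Stand-ins for a real converter so the sketch runs anywhere.
OK_CMD = [sys.executable, "-c", "pass"]
HANG_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]

# Assumed real-world call; verify the flags against your AbiWord version:
# convert_with_timeout(["abiword", "--to=html", "resume.doc"])
```

Logging the failing command keeps a record of the documents that hang, which is the data needed to look for a pattern later.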

dpapathanasiou · 15y ago

Thanks.

Where do you get your doc files?

Are they just the ones submitted to your site, or is there a pastebin or similar repo of doc files?

bravura · 15y ago · 1 in thread

You should also consider 'pandoc', written in Haskell, for converting between markup formats: http://johnmacfarlane.net/pandoc/

I am curious for more details about why Tika wasn't good enough. Please explain.

_raghu (OP) · 15y ago

Tika is very good at converting documents to plain text. Very reliable too. The problem for us was that most resumes have a lot of formatting in them. For example, candidates use tables to structure data. When such a resume is converted to plain text using Tika, it looks jumbled.

Will take a look at pandoc. Thanks for suggesting.
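To illustrate the point above about tables turning into jumbled text, here is a small sketch. It does not use Tika; it simply strips markup from a made-up HTML table with Python's standard `html.parser`, which is an assumption chosen to show the effect. `TextOnly`, `flatten`, and `RESUME_TABLE` are hypothetical names.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects text nodes and ignores all tags, so row/column structure is lost."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def flatten(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up resume fragment: one header row and one data row.
RESUME_TABLE = (
    "<table>"
    "<tr><th>Employer</th><th>Title</th><th>Years</th></tr>"
    "<tr><td>Acme</td><td>Engineer</td><td>2005-2008</td></tr>"
    "</table>"
)

print(flatten(RESUME_TABLE))  # Employer Title Years Acme Engineer 2005-2008
```

Once the cells are run together like this, nothing marks which value belongs to which column, which is why an HTML preview that keeps the table is easier to read.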

jamesshamenski · 15y ago · 1 in thread

Million Dollar Question:

How could you additionally parse the information to extract structured data? For example: names of candidates, addresses, previous employers, job titles held.

earle · 15y ago

That's been done across online job boards since 1996, when we launched HotJobs. Resumes, although varying aesthetically, contain a pretty rigid structure that lends itself well to localized extraction. This allows easy term extraction for searching across a very large data set quickly.

A simple 30-line flex/yacc combo will work effectively on a high-ninety percentage of resumes.
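A rough sketch of the kind of localized extraction described above. It swaps flex/yacc for Python regular expressions, so it is not the commenter's implementation. The patterns, the list of section headings, and the names `extract_fields` and `SAMPLE` are all assumptions for illustration; real resumes need a much broader set of rules.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Loose phone pattern: digits with spaces, dots, dashes, or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")
# Hypothetical heading list; a heading must sit alone on its line.
SECTION_RE = re.compile(
    r"^(experience|education|skills)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_fields(text):
    """Return the first email, first phone number, and text under known headings."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    sections = {}
    headings = list(SECTION_RE.finditer(text))
    for i, match in enumerate(headings):
        # A section runs from its heading to the next heading (or end of text).
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "sections": sections,
    }

SAMPLE = """Jane Doe
jane.doe@example.com
+1 (555) 010-9999

Experience
Engineer, Acme Corp, 2005-2008

Education
BSc Computer Science
"""
```

Calling `extract_fields(SAMPLE)` returns the email, the phone number, and a dictionary with the `experience` and `education` text, which could then be indexed for search.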

afiler · 15y ago

Prompted by downloading a .doc file from Qwest only to find out that inside was a monospaced text file, I set up a small, nearly UI-free site for doing document conversions. http://doc.mar.cx/<url> gives an HTML or other sensible rendering of a URL (e.g. http://doc.mar.cx/http://www.itu.int/dms_pub/itu-t/oth/02/02... ) and http://doc.mar.cx/<extension>/<url> attempts to convert the URL into the format with the given extension (e.g. http://doc.mar.cx/txt/http://www.itu.int/dms_pub/itu-t/oth/0... ).

I use wvHtml for doc->html, wvPDF for doc->pdf, but antiword for doc->txt. To convert .docx, .xls, .xlsx, and WordPerfect files to HTML, I use OpenOffice, by way of jodconverter. For ODF files, I use OdfConverter. Conversion of Excel files to .csv files uses xls2csv. For PowerPoint files, I use ppthtml to convert to html, and catppt to convert to text. For Lotus 1-2-3 files (I added this after downloading some historical telecom data from the FCC!), I use ssconvert.
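The tool choices listed above can be sketched as a lookup table keyed by source extension and target format. This is only an illustration of the dispatch idea, not the site's code: `CONVERTERS` and `pick_converter` are hypothetical names, the `.wpd` extension for WordPerfect is an assumption, and the ODF and Lotus 1-2-3 entries are left out because the comment does not state their target formats. No command-line flags are shown, since none are given.

```python
import os

OPENOFFICE = "OpenOffice via jodconverter"

# (source extension, target format) -> tool named in the comment above.
CONVERTERS = {
    ("doc", "html"): "wvHtml",
    ("doc", "pdf"): "wvPDF",
    ("doc", "txt"): "antiword",
    ("docx", "html"): OPENOFFICE,
    ("xls", "html"): OPENOFFICE,
    ("xlsx", "html"): OPENOFFICE,
    ("wpd", "html"): OPENOFFICE,  # assumed WordPerfect extension
    ("xls", "csv"): "xls2csv",
    ("ppt", "html"): "ppthtml",
    ("ppt", "txt"): "catppt",
}

def pick_converter(filename, target):
    """Return the tool name for this file and target format, or None if unknown."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return CONVERTERS.get((ext, target))

print(pick_converter("resume.DOC", "txt"))  # antiword
```

Keeping the mapping in one table makes it easy to swap a tool for a single format pair without touching the rest of the pipeline.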

For any conversion that results in an HTML file (e.g. doc or pdf to html), I bundle all the images into a single file using the data: URL scheme. To do this, I wrote a utility called pagecan: http://afiler.com/pagecan/
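A minimal sketch of the data: URL bundling idea, assuming images are referenced through double-quoted `src` attributes on `<img>` tags. It is not how pagecan works internally, which the comment does not describe. `to_data_uri`, `inline_images`, and the `load` callback are hypothetical names, and a regex is used only to keep the sketch short; a real HTML parser would be more robust.

```python
import base64
import mimetypes
import re

def to_data_uri(data, mime):
    """Encode raw bytes as a base64 data: URL with the given MIME type."""
    return "data:{};base64,{}".format(mime, base64.b64encode(data).decode("ascii"))

# Matches <img ... src="..."> and captures the prefix, the URL, and the closing quote.
IMG_SRC_RE = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)

def inline_images(html, load):
    """Replace each image src with a data: URL.

    `load` is a caller-supplied function that maps a src string to image bytes,
    for example by reading a file next to the HTML document.
    """
    def replace(match):
        src = match.group(2)
        if src.startswith("data:"):
            return match.group(0)  # already inlined
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        return match.group(1) + to_data_uri(load(src), mime) + match.group(3)

    return IMG_SRC_RE.sub(replace, html)
```

With a loader such as `lambda src: open(src, "rb").read()`, the output is a single self-contained HTML file, at the cost of roughly a third more bytes per image from the base64 encoding.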

kalmi10 · 15y ago

Based on the title I expected some html5 magic for converting binary files into html in the browser.

tucosan · 15y ago

How about trying out calibre (http://calibre-ebook.com)? It can do all kinds of conversions from a number of formats, it is quite reliable, and it can be run headless.

Jakob · 15y ago

Please add a candidate delete function. I sent an email for one candidate with multiple attachments, and Recruiterbox created multiple candidates by mistake.

nopal · 15y ago

There's really not much here.

Could we see some code or a demo?
