Show HN: Extract text from any pdf in the browser

(textractor.app)

12 points · blaydator · 6y ago · 9 comments

7 comments · 1 top-level

blaydator (OP) · 6y ago · 6 in thread

Hi Hackers,

I often get PDFs that I want to extract text from and paste somewhere else. Not all PDFs are well constructed, and a lot of them are scans. Unfortunately, Mac's Preview and other classic PDF viewers cannot extract text from those.

So I built a minimalist website to extract text from any PDF, scanned ones included. It uses OCR to extract the text, and the user can highlight specific areas of the document to extract from. The extraction runs locally in the browser thanks to the awesome Tesseract.js library.
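As an illustration of how an area selection could feed the OCR step, here is a minimal sketch, not the site's actual code. It assumes the page is shown as a scaled-down preview and that the selection must be mapped back to pixels of the full-resolution page image before recognition. The `toPageRect` helper and its parameters are hypothetical; the returned shape matches the `rectangle` option (`left`, `top`, `width`, `height`) that Tesseract.js accepts in `recognize`.

```typescript
// Rectangle in pixels, using the same field names as the Tesseract.js
// `rectangle` recognize option.
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Hypothetical helper: convert a selection drawn on a scaled preview into
// pixel coordinates of the full-resolution page image, clamped to the page.
function toPageRect(
  selection: Rect,
  displayWidth: number, // width of the on-screen preview, in CSS pixels
  pageWidth: number,    // width of the rendered page image, in pixels
  pageHeight: number    // height of the rendered page image, in pixels
): Rect {
  // Assumes the preview keeps the page's aspect ratio, so one factor suffices.
  const scale = pageWidth / displayWidth;
  const left = Math.max(0, Math.round(selection.left * scale));
  const top = Math.max(0, Math.round(selection.top * scale));
  const right = Math.min(pageWidth, Math.round((selection.left + selection.width) * scale));
  const bottom = Math.min(pageHeight, Math.round((selection.top + selection.height) * scale));
  return {
    left,
    top,
    width: Math.max(0, right - left),
    height: Math.max(0, bottom - top),
  };
}
```

The resulting rectangle could then be passed along the lines of `worker.recognize(pageImage, { rectangle })`, so that only the highlighted area is recognized.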

I would love to have your feedback before adding more features (zoom setting, improve areas selections, png/jpeg support, mobile support, offline support, ...).

saradhi · 6y ago

We do a lot of this. Honestly, it's better than I expected after seeing the title (there are many similar posts). Slick interface. Don't worry about mobile support: mobile traffic, not even 3%, does not come for the extraction service. Those visitors may come for info, nothing more than that.

blaydator (OP) · 6y ago

Thanks! Yes, I hope the title doesn't feel too clickbaity, but I haven't found another way to describe it simply.

nicolasmahe · 6y ago

Really nice and easy-to-use UI. The text extraction works well, and the area selection is great.

I would definitely suggest adding image support.

Also, I noticed that the option to keep line breaks sometimes inserts two line breaks instead of one.

Good job :)
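One possible way to address the doubled line breaks, sketched under the assumption that they come from blank lines and mixed line endings in the raw OCR output (the thread does not state the actual cause). The `collapseLineBreaks` function is hypothetical and not part of Tesseract.js.

```typescript
// Hypothetical post-processing step: normalize line endings and collapse
// runs of consecutive line breaks so that each break appears only once.
function collapseLineBreaks(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")    // normalize Windows and old Mac line endings
    .replace(/[ \t]+\n/g, "\n") // drop trailing spaces or tabs before a break
    .replace(/\n{2,}/g, "\n");  // collapse repeated breaks into a single one
}
```

Note that this also flattens intentional paragraph gaps, so a real implementation might prefer to cap runs at two breaks instead of one.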

blaydator (OP) · 6y ago

Thanks! Image support is on its way :)

robineisenberg · 6y ago

Works really well!

Just tried it in French with some accents and I was impressed to see perfect OCR.

Are you using Tesseract? Are you planning on open-sourcing this? Could you tell us a bit more about the stack behind this?

In any case great project, well done!

blaydator (OP) · 6y ago

Thanks! Indeed, I'm using Tesseract.js [1], which wraps a WASM port of the Tesseract OCR engine (v4). The stack is pretty simple: Vue.js, Vuetify [2] (absolutely fond of it) and Tesseract.js, hosted on Netlify (with Netlify Analytics [3], so no trackers).

I will probably open source it :)

[1] https://github.com/naptha/tesseract.js
[2] https://vuetifyjs.com/en/
[3] https://www.netlify.com/products/analytics/
