I often get PDFs that I want to extract text from and paste somewhere else. Not all PDFs are well constructed, and a lot of them are scanned. Unfortunately, Mac's Preview and other classic PDF viewers cannot extract text from those.
So I built a minimalist website to extract text from any PDF, scanned ones included. It uses OCR to extract the text, and the user can highlight specific areas of the document to extract from. The extraction is done locally in the browser thanks to the awesome Tesseract.js library.
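For anyone curious how the "highlight an area" step might work, here is a minimal sketch, not the site's actual code. It assumes the page is rendered to an image and shown scaled down, so a selection drawn in on-screen pixels has to be mapped to image pixels before OCR. The names `Rect`, `Size` and `toImageRect` are hypothetical. As far as I know, Tesseract.js accepts such a region through the `rectangle` option of `worker.recognize`, but check its docs before relying on that.

```typescript
// Hypothetical helper types; none of these names come from the site itself.
interface Rect { left: number; top: number; width: number; height: number; }
interface Size { width: number; height: number; }

// Keep a value inside the [min, max] range.
function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Map a selection drawn on the displayed page (CSS pixels) to pixel
// coordinates of the rendered page image that would be handed to OCR.
function toImageRect(selection: Rect, displayed: Size, image: Size): Rect {
  const scaleX = image.width / displayed.width;
  const scaleY = image.height / displayed.height;

  // Scale both corners, then clamp them to the image bounds so a selection
  // dragged past the page edge still yields a valid region.
  const left = clamp(Math.round(selection.left * scaleX), 0, image.width);
  const top = clamp(Math.round(selection.top * scaleY), 0, image.height);
  const right = clamp(
    Math.round((selection.left + selection.width) * scaleX), 0, image.width);
  const bottom = clamp(
    Math.round((selection.top + selection.height) * scaleY), 0, image.height);

  return { left, top, width: right - left, height: bottom - top };
}
```

The resulting rectangle is what you would pass along with the page image to the OCR call, so only the highlighted part gets recognized.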
I would love to have your feedback before adding more features (zoom setting, improved area selection, PNG/JPEG support, mobile support, offline support, ...).
I would definitely suggest adding image support.
Also, I noticed that the option to keep line breaks sometimes inserts two line breaks instead of one.
Good job :)
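On the doubled line breaks: I don't know what causes them in the site's code, so this is only a guess at a workaround. A minimal sketch, assuming the extra breaks come from mixed line endings or blank lines in the OCR output, would be to post-process the text. The function name `collapseLineBreaks` is made up for illustration.

```typescript
// Hypothetical post-processing step, not taken from the site's code.
function collapseLineBreaks(text: string): string {
  return text
    // Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    .replace(/\r\n?/g, "\n")
    // Collapse a line break followed by one or more empty or
    // whitespace-only lines into a single line break.
    .replace(/\n(?:[ \t]*\n)+/g, "\n");
}
```

Note that this also removes intentional blank lines between paragraphs, so it would only make sense when exactly one break per line is wanted.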
Just tried it in French with some accents, and I was impressed to see perfect OCR.
Are you using tesseract? Are you planning on open-sourcing this? Could you tell us a bit more about the stack behind this?
In any case great project, well done!
I will probably open source it :)
[1] https://github.com/naptha/tesseract.js
[2] https://vuetifyjs.com/en/
[3] https://www.netlify.com/products/analytics/