
0 points · themanmaran · 1y ago · 0 comments

> Only if the PDFs are unknown or were created by way of a cellphone camera, multifunction office device, etc should you need to reach for OCR.

It's always safer to run OCR on every file. Sometimes you'll have a "clean" PDF that contains a screenshot of an Excel table, or a scanned image that has already been OCR'd by a lower-quality tool (like the built-in Adobe OCR). If you rely on the existing text layer in those cases, you're going to get pretty unpredictable results.

It's way easier (and more standardized) to run OCR on every file, rather than trying to guess at the contents based on the metadata.
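
A rough sketch of the difference, in Python. Everything here is hypothetical: `Page`, `route_by_text_layer`, `ocr_everything`, and the `fake_ocr` stand-in are made up for illustration. A real pipeline would rasterize each page and call an actual OCR engine instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    # Hypothetical model of one PDF page.
    text_layer: str    # text embedded in the PDF (may be partial or low quality)
    visible_text: str  # what a human actually sees on the rendered page


def route_by_text_layer(pages: List[Page], ocr: Callable[[Page], str]) -> List[str]:
    """Heuristic path: trust the embedded text when present, OCR only otherwise."""
    return [p.text_layer if p.text_layer.strip() else ocr(p) for p in pages]


def ocr_everything(pages: List[Page], ocr: Callable[[Page], str]) -> List[str]:
    """Uniform path: ignore the embedded text and OCR every page."""
    return [ocr(p) for p in pages]


def fake_ocr(page: Page) -> str:
    # Stand-in for a real OCR engine; assumes it reads the rendered page perfectly.
    return page.visible_text


pages = [
    # "Clean" PDF whose table is a pasted screenshot: the text layer misses it.
    Page(text_layer="Q3 results", visible_text="Q3 results\nRevenue 120\nCosts 80"),
    # Scan carrying a poor pre-existing OCR layer.
    Page(text_layer="lnvoice t0tal 1OO", visible_text="Invoice total 100"),
]

heuristic = route_by_text_layer(pages, fake_ocr)
uniform = ocr_everything(pages, fake_ocr)
```

With these inputs the heuristic path returns the stale text layers, while the uniform path returns whatever the OCR engine sees. Note that the idealized `fake_ocr` flatters the uniform path; a real engine makes its own errors.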


bob1029 · 1y ago

It's not guessing if the form is known and you can read the information directly.

This is a common scenario at many banks. You can expect nearly perfect metadata for anything pushed into their document storage system within the last decade.
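
A minimal sketch of that known-form case, in Python. The registry `KNOWN_FORMS`, the form ID `ACCT-OPEN-v2`, the field names, and the `extract` helper are all assumptions for illustration, not any bank's actual system. The idea is only that a recognized form with complete fields is read directly, and anything else falls back to OCR.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry: form ID -> field names that form is known to carry.
KNOWN_FORMS: Dict[str, List[str]] = {
    "ACCT-OPEN-v2": ["customer_name", "account_number"],
}


def extract(
    form_id: Optional[str],
    fields: Dict[str, str],
    ocr: Callable[[], Dict[str, str]],
) -> Dict[str, str]:
    """Read fields directly for a known, complete form; otherwise run OCR."""
    expected = KNOWN_FORMS.get(form_id or "")
    if expected and all(fields.get(name) for name in expected):
        return {name: fields[name] for name in expected}
    # Unknown form, or a known form with missing fields: fall back to OCR.
    return ocr()


def fallback_ocr() -> Dict[str, str]:
    # Stand-in for a real OCR step.
    return {"source": "ocr"}


known = extract(
    "ACCT-OPEN-v2",
    {"customer_name": "A. Person", "account_number": "12345"},
    fallback_ocr,
)
unknown = extract(None, {}, fallback_ocr)
```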

