undefined | Better HN

0 pointsthrowaway449610mo ago0 comments

I have explained the details in other comments, have a look. But you can start by looking at pdftotext from Poppler, it is ready to go for 60-70% of cases with -layout flag, with bbox-layout you get even more details.

0 comments

3 comments · 1 top-level

dotancohen10mo ago· 2 in thread

Thank you. Even with box layout one can not even know that there is a coherent word or phrase to extract, without visually inspecting the PDF beforehand. I've been there, fighting with it right in the CLI and finding that there is no way to even progress to a script.

The advantage of the OCR method is that it effectively performs that visual inspection. That's why it is preferable for PDFs of disparate origin.

throwaway4496OP10mo ago

What kind of semantics can you infer from the text of OCRing a bitmap that you can't infer from the text generated directly from the PDF? Is it the lack of OCR mistakes? The hallucinations? Or something else?

dotancohen10mo ago

In the cases that I've seen, the PDF software does not generate text strings. It generates individual letters. It is up to any application to try to figure out how those individual letters relate to one another.

1 more reply

j / k navigate · click thread line to collapse