0 points throwaway4496 9mo ago 0 comments

Hard no.

LLMs aren't going to magically do more than what your PDF rendering engine does; rasterizing it and OCR'ing doesn't change anything. I am amazed at how many people actually think it is a sane idea.

protomikron 9mo ago

I think there is some kind of misunderstanding. Sure, if you get somewhat structured, machine-generated PDFs, parsing them might be feasible.

But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at pos x,y with size height,width.

So, as parent says, you have to OCR/AI that photo anyway, and it seems that's also a feasible approach for "real" PDFs.

throwaway4496 OP 9mo ago

Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch along the road all the way? We could drive a tank, that would solve it."

Macha 9mo ago

My experience is that “text is actually images or paths” is closer to the 40% case than the 1% case.

So you could build an approach that works for the 60% case, is more complex to build, and produces inferior results, but then you still need to also build the OCR pipeline for the other 40%. And if you’re building the OCR pipeline anyway and it produces better results, why would you not use it 100%?
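
For concreteness, here is a rough sketch of what the hybrid routing being debated could look like. Everything in it is hypothetical: `Page`, `needs_ocr`, `route`, and the 20-character threshold are made up for illustration and do not come from any real PDF library. It only shows the decision step, assuming the engine can report extractable text runs, images, and paths per page.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical, simplified view of what a PDF engine reports for a page."""
    text_runs: List[str] = field(default_factory=list)  # extractable text, if any
    image_count: int = 0  # embedded raster images (e.g. a scanned page)
    path_count: int = 0   # vector paths (possibly glyphs converted to outlines)


def needs_ocr(page: Page, min_chars: int = 20) -> bool:
    """Assumed heuristic: if almost no text can be extracted, the visible
    text is probably an image or paths, so the page must go through OCR."""
    extracted = sum(len(run.strip()) for run in page.text_runs)
    return extracted < min_chars


def route(pages: List[Page]) -> Dict[str, List[int]]:
    """Split page indices between the two pipelines of the hybrid approach."""
    plan: Dict[str, List[int]] = {"text_layer": [], "ocr": []}
    for index, page in enumerate(pages):
        plan["ocr" if needs_ocr(page) else "text_layer"].append(index)
    return plan


pages = [
    Page(text_runs=["Invoice 2024-001", "Total due: 120.00 EUR"]),  # machine-generated
    Page(image_count=1),     # scanned page: just one big image
    Page(path_count=5000),   # text drawn as paths, nothing extractable
]
plan = route(pages)
print(plan)
```

Note that the `ocr` branch has to exist either way, which is the point above: the router and the text-layer branch are extra code on top of a pipeline you already need.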
