
0 points · piterrro · 1d ago · 0 comments

Can someone explain how this is different from feeding the VLM one page at a time?


1 comments · 1 top-level

I did not know this until today: vision models don't hunt for the white space between letters the way traditional OCR does. They just chop the whole image into a fixed grid of equal squares (patches) and treat each square like a word in a sentence.
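To make the "grid of squares" idea concrete, here is a rough sketch in plain Python. The function name `patchify`, the toy 4x6 "image", and the patch size of 2 are all made up for illustration. As far as I understand, real models use larger patches and then map each flattened patch to an embedding vector, which this sketch leaves out.

```python
def patchify(image, patch):
    """Split a 2D grid of pixels into a row-major sequence of flattened patches.

    `image` is a list of rows and each row is a list of pixel values.
    Both sides must be exact multiples of `patch` (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")

    tokens = []
    # Walk the grid top to bottom, left to right, one square at a time.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch square into a single "token".
            tokens.append([
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return tokens


# Toy 4x6 image whose pixel value encodes its position: value = y * 6 + x.
image = [[y * 6 + x for x in range(6)] for y in range(4)]
tokens = patchify(image, patch=2)

print(len(tokens))  # 6 patches: a 2x3 grid of 2x2 squares
print(tokens[0])    # [0, 1, 6, 7]
print(tokens[1])    # [2, 3, 8, 9]
```

Note that the cut points depend only on the image size and the patch size, never on where the letters are, so a character can easily end up split across two patches.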

This will help if you want to dig deeper - https://vectree.io/c/how-ocr-works-traditional-pipelines-vs-...
