I did not know this until today... transformer-based vision models (ViT-style) don't hunt for the whitespace between letters the way old OCR does. They just chop the whole image into a fixed grid of equal squares (patches) and treat each square like a word in a sentence, i.e. one token in the input sequence.
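To make the patch idea concrete, here is a rough sketch in plain Python. The function name `patchify`, the `patch` parameter, and the toy 4x4 image are all made up for illustration; real models work on tensors and also project each flattened patch to an embedding and add position information, which this skips.

```python
def patchify(image, patch):
    """Split a 2D image (list of equal-length rows) into patch x patch squares.

    Patches come back in row-major order (left to right, top to bottom),
    each flattened into one list, like one "word" in the sequence.
    `patchify` and `patch` are hypothetical names, not from any library.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the square whose top-left corner is (top, left).
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


# A toy 4x4 grayscale "image" whose pixel values are just 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

Note that the grid ignores what is in the image: a letter that straddles a patch boundary simply gets split across two tokens, and it is left to the model to piece it back together.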
This will help if you want to dig deeper - https://vectree.io/c/how-ocr-works-traditional-pipelines-vs-...