I tried using a VLM to recognize handwritten text in genealogical sources, and it made up names and dates that sort of fit the vibe of the document when it couldn’t read the text! They sounded right for the ethnicity and time period but were entirely fake. There’s no way to ground the model using the source text when the model is your OCR.
Confidence intervals are a red herring, and they're only as good as the code interpreting them. If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?
If so you'd be passing every single document to a human review, and might as well not run the OCR. But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.
If you try to pitch hallucinations to these fields, they'll just choose 100% manual instead. It's a non-starter.
That's not true. LLMs and OCR have very different failure modes. With LLMs, there is unbounded potential for hallucination, and the entire document is at risk. For example: if something in the lower right-hand corner of the page takes the model to a sparsely sampled part of the latent space, it can end up deciding that it makes sense to rewrite the document title! Or anything else. LLMs also have a pernicious habit of "helpfully" completing partial sentences that appear at the beginning or end of a page of text.
With OCR, errors are localized and have a greater chance of being detected when read.
I think for a lot of cases, the best solution is to fine-tune a model like LayoutLM, which can classify the actual text tokens in a document (whether obtained from OCR or a native text layer) using visual and spatial information. Then, there are no hallucinations and you can use uncertainty information from both the OCR (if used) and the text classification. But it does mean that you have to do the work of annotating data and training a model, rather than prompt engineering...
In VLM/LLM-powered methods, missing or misread data will be hallucinated, and you can't know whether something scanned correctly or not. I personally scan and OCR tons of personal documents, and I prefer "gibberish" to "hallucinations" because gibberish is easier to catch.
We had this problem before [0], on some Xerox scanners and copiers. Results will be disastrous. It's not a question of if, but when.
I personally tried Gemini and OpenAI's models for OCR, and no, I won't continue using them further.
[0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...
> If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?
No, of course not. You have a human review the words/segments with low confidence.
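A minimal sketch of that routing, assuming the OCR engine returns a per-word confidence between 0 and 1. The `Word` type, the `split_for_review` helper, and the 0.90 threshold are illustrative names and values, not any particular engine's API:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed to be 0.0-1.0, as reported by the OCR engine


def split_for_review(words, threshold=0.90):
    """Split words into (accepted, needs_review) lists of (index, word) pairs.

    The index is kept so a reviewer can locate each flagged word in the page.
    """
    accepted, needs_review = [], []
    for i, word in enumerate(words):
        if word.confidence >= threshold:
            accepted.append((i, word))
        else:
            needs_review.append((i, word))
    return accepted, needs_review


# Made-up example data: only the low-confidence word goes to a human.
words = [Word("Johann", 0.95), Word("Schmidt", 0.72), Word("1843", 0.91)]
accepted, needs_review = split_for_review(words)
```

The point is that the document as a whole is never rejected; only the flagged segments are queued for a person, and the threshold controls how much review work that generates.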
For example, LlamaParse (https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...) uses LLMs for PDF text extraction but faces hallucination problems. See this issue for more details: https://github.com/run-llama/llama_parse/issues/420.
For those interested, try LLMWhisperer (https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs, eliminates hallucination issues, and preserves the input document layout for better context.
Examples of extracting complex layout:
The website you linked says it uses LLMs?
Fine-tuning does help, though.
It is an absolute miracle.
It is transmuting a picture into JSON.
I never thought this would be possible in my lifetime.
But that is different from what your interlocutor is discussing.
(There are slightly more errors if I ask it to add numbers, but that isn't OCR and is a bit more of a reach, although it is very good at this too.)
Many hallucinations can be avoided by telling it to use null if there is no number present.
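A rough sketch of the receiving side, assuming the model was asked to reply with JSON and to use null for any number it cannot find. The field name `total` and the helper `parse_amount` are hypothetical:

```python
import json


def parse_amount(raw_response, field="total"):
    """Read one numeric field from a model's JSON reply.

    Returns None when the model reported null (or omitted the field),
    and raises if the value is present but is not a number.
    """
    data = json.loads(raw_response)
    value = data.get(field)
    if value is None:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"expected a number or null for {field!r}, got {value!r}")
    return float(value)


missing = parse_amount('{"total": null}')
present = parse_amount('{"total": 12.5}')
```

Treating null as a first-class answer gives the model a legitimate way out, and the caller can then send the None cases to a person instead of trusting an invented number.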
Results for VLM accuracy & precision are not good. https://arxiv.org/html/2406.04470v1#S4
are we talking tesseract or something?
The output could be in Markdown, which is easily turned into a PDF. You would have to break up the input PDF into pages to avoid running out of output window.
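Roughly what the page-by-page loop could look like. `transcribe_page` stands in for whatever model call you use and is passed in as a plain function; it is not a real API, and splitting the PDF into page images is assumed to have happened already:

```python
def transcribe_document(pages, transcribe_page):
    """Transcribe pages one at a time and join the Markdown results.

    Each page gets its own request, so no single response has to hold
    the whole document within the model's output window.
    """
    parts = []
    for page in pages:
        markdown = transcribe_page(page)
        parts.append(markdown.strip())
    return "\n\n".join(parts)


# Stand-in for the model call, for illustration only.
def fake_transcribe(page):
    return f"# {page}\n"


result = transcribe_document(["page-1", "page-2"], fake_transcribe)
```

One thing to watch with this approach is that sentences and tables spanning a page break are transcribed in two separate requests, so the joined output may need a cleanup pass at the seams.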
Currently I'm using tesseract - it works, it's fast, but it also makes mistakes; it would also be great if it could discern tabular data and put it in ASCII or Markdown tables. I've tried docling, but it feels like a bit of an overkill. It seems to be slower - remember, I need to be able to grab the text from the screenshot very quickly. I have only tried default settings; maybe tweaking it would improve things.
Can anyone share some thoughts on this? Thanks!
That said, VLMs are extremely powerful visual learners with LLM-like reasoning capabilities making them more versatile than OCR for practically all imaging domains.
In a matter of a few years, I think we'll essentially see models that are more cost-performant via distillation, quantization and the multitude of tricks you can do to reduce the inference overhead.
Are they? I'm seeing figures around 80 watts at rest, and 150 when exercising. The brain itself only uses about 20 watts [1]. That's 1/35 of a single H100's power consumption (700 watts - which doesn't even take into account the energy required to cool the data center, the humans who build and maintain it, ...).
[1]: https://www.humanbrainproject.eu/en/follow-hbp/news/2023/09/...
Maybe this necessary step can be improved and altered with a VLM. There is also the preprocessing where the image gets its perspective corrected. Not sure how well a VLM performs here.
As you said, I think combining these techniques will be the most efficient way forward.
Though if it accidentally "traces" one of the few exceptions, then you've potentially committed a crime, and the big difficulty in typeface detection you mention increases those odds. That said, there are so few exceptions that even if the model couldn't properly identify a font, it might be able to identify whether a font is likely to have a design patent.
I do think getting an AI to create a high quality vector font from a potentially low-res raster graphic is going to be quite challenging though. Raster to vector tools I've tried in the past left a bit to be desired.
1. https://www.copyright.gov/comp3/chap900/ch900-visual-art.pdf
> As a general rule, typeface, typefont, lettering, calligraphy, and typographic ornamentation are not registrable. 37 C.F.R. § 202.1(a), (e). These elements are mere variations of uncopyrightable letters or words, which in turn are the building blocks of expression. See id. The Office typically refuses claims based on individual alphabetic or numbering characters, sets or fonts of related characters, fanciful lettering and calligraphy, or other forms of typeface. This is true regardless of how novel and creative the shape and form of the typeface characters may be.
> There are some very limited cases where the Office may register some types of typeface, typefont, lettering, or calligraphy, such as the following:
> • Pictorial or graphic elements that are incorporated into uncopyrightable characters or used to represent an entire letter or number may be registrable. Examples include original pictorial art that forms the entire body or shape of the typeface characters, such as a representation of an oak tree, a rose, or a giraffe that is depicted in the shape of a particular letter.
> • Typeface ornamentation that is separable from the typeface characters is almost always an add-on to the beginning and/or ending of the characters. To the extent that such flourishes, swirls, vector ornaments, scrollwork, borders and frames, wreaths, and the like represent works of pictorial or graphic authorship in either their individual designs or patterned repetitions, they may be protected by copyright. However, the mere use of text effects (including chalk, popup papercraft, neon, beer glass, spooky-fog, and weathered-and-worn), while potentially separable, is de minimis and not sufficient to support a registration.
> The Office may register a computer program that creates or uses certain typeface or typefont designs, but the registration covers only the source code that generates these designs, not the typeface, typefont, lettering, or calligraphy itself. For a general discussion of computer programs that generate typeface designs, see Chapter 700, Section 723.
I'm building a "never fill out paperwork again" app, if anyone is interested, would be happy to chat!
Also, VLMs are end-to-end trainable, unlike OCR+LLM solutions (that are trained separately), so it’s clear that these approaches scale much better for domain-specific use cases or verticals.
If you need something like this, it's definitely good enough that you should consider kicking the tires.
It gives you an idea of where today's models fail (Gemini Flash, OpenAI gpt4o+mini, open-source ones like Llama 3.2 Vision, Qwen VL 2.5 etc).
Vision model did the trick so well it's not even funny to discuss anything further.
"This is a picture of Apple product box. Find and return only the serial number of the product as found on a label. Return 'none' if no serial number can be found".
VLM highlights:
- Handwriting. Being contextually aware helps here. i.e. they read the document like a human would, interpreting the whole word/sentence instead of character by character
- Charts/Infographics. VLMs can actually interpret charts or flow diagrams into a text format. Including things like color coded lines.
Traditional OCR highlights:
- Standardized documents (e.g. US tax forms that they've been trained on)
- Dense text. Imagine textbooks and multi-column research papers. This is the easiest OCR use case, but VLMs really struggle as the number of output tokens increases.
- Bounding boxes. There still isn't really a model that gives super precise bounding boxes. Supposedly Gemini and Qwen were trained for it, but they don't perform as well as traditional models.
There's still a ton of room for improvement, but especially with models like Gemini the accuracy/cost is really competitive.
As you mentioned there are a few caveats to VLMs that folks are typically unaware of (not at all exhaustive, but the ones you highlighted):
1. Long-form text (dense): Token limits of 4/8K mean that dense pages may go over limits of the LLM outputs. This requires some careful work to make them work as seamlessly as OCR.
2. Visual grounding, a.k.a. bounding boxes, is definitely one of those things that VLMs aren't natively good at (partly because the cross-entropy losses used aren't really geared for bounding box regression). We're definitely making some strides here [1] to improve that, so you're going to get an experience that is almost as good as native bounding box regression (all within the same VLM).
[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...
You may also be interested in olmOCR, the OCR tool Allen AI just released [1][2]. They say "convert a million PDF pages for only $190 USD".
[1] https://github.com/allenai/olmocr [2] https://arxiv.org/abs/2502.18443
Income Expenses 200 100
On one document, and
Income Expenses 20 0100
On others.
There's no shortage of products that tried to solve this problem from scratch (or by piggybacking on other projects) and called it a day without worrying about the huge problem that is quality and parseability.
The most robust players just give you the coordinates of a glyph and you are on your own: Textract, PDFBox.
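To make the "you are on your own" part concrete, here is a small sketch of going from glyph boxes back to a table row. The tuple layout `(char, x_left, x_right)`, the helper names, and all coordinates are made up for illustration and are not the output format of Textract or PDFBox:

```python
def group_glyphs(glyphs, max_gap):
    """Merge glyphs into tokens, splitting wherever the horizontal gap exceeds max_gap.

    glyphs: list of (char, x_left, x_right) for a single text line.
    Returns a list of (text, x_left, x_right) tokens in left-to-right order.
    """
    tokens = []
    for char, left, right in sorted(glyphs, key=lambda g: g[1]):
        if tokens and left - tokens[-1][2] <= max_gap:
            text, start, _ = tokens[-1]
            tokens[-1] = (text + char, start, right)
        else:
            tokens.append((char, left, right))
    return tokens


def assign_to_columns(tokens, header_centers):
    """Assign each token to the header whose x-center is nearest to the token's center."""
    row = {}
    for text, left, right in tokens:
        center = (left + right) / 2
        name = min(header_centers, key=lambda h: abs(header_centers[h] - center))
        row[name] = text
    return row


# Hypothetical glyph boxes for the value line "200 100".
glyphs = [
    ("2", 100, 108), ("0", 110, 118), ("0", 120, 128),
    ("1", 300, 308), ("0", 310, 318), ("0", 320, 328),
]
tokens = group_glyphs(glyphs, max_gap=5)
row = assign_to_columns(tokens, {"Income": 114, "Expenses": 314})
```

The fragile part is `max_gap`: a value tuned for one document's font and scan resolution can split or merge digits on another, which is how the same row ends up read as "20 0100".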
There's a client who had a startup idea that involved analyzing pdfs, I used textract, but it was too cumbersome and unreliable.
Maybe I can reach out to see if he wants to give it another go with this!
At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.
If it's not intended for digital documents, where are the screenshots with fold marks, slipping lines, lighting gradients, thumbs, etc.?
Question: What tools/libs are people using to accurately detect square/rectangle objects in images?
I've used VNDetectRectanglesRequest [1] in Swift, but it's not as accurate as I'd like it to be, even with preprocessing.
[1]: https://developer.apple.com/documentation/vision/vndetectrec...
- recognizing/recreating exact font used
- helping align/rotate source
Not to hallucinate gibberish when source lacks enough data.
[1]: https://www.runpulse.com/blog/why-llms-suck-at-ocr [2]: https://news.ycombinator.com/item?id=42966958#42977527