ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").
Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.
So are human beings. Meaning we've been working around this issue since forever; we're not suddenly caught up in a new thing here.
The correct (or at least humanly-expected) process would be to identify the presence of a mangled word, determine what its missing characters could have been, and if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken") use that.
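For what it's worth, a rough sketch of that process. The candidate words and their context scores are made up for illustration (a real system would get them from a dictionary plus a language model or co-occurrence statistics), and `complete_fragment` and `margin` are hypothetical names. The point is only that a rewrite happens when one candidate clearly beats the runner-up, and the fragment is left alone otherwise.

```python
def complete_fragment(fragment, context_scores, margin=0.2):
    """Return (text, changed) for a possibly truncated word.

    context_scores maps candidate full words to a hypothetical 0..1
    plausibility score in the surrounding context.
    """
    # Candidates are known words that contain the fragment at either end.
    candidates = sorted(
        (score, word)
        for word, score in context_scores.items()
        if word != fragment
        and (word.endswith(fragment) or word.startswith(fragment))
    )
    if not candidates:
        return fragment, False

    best_score, best_word = candidates[-1]
    runner_up = candidates[-2][0] if len(candidates) > 1 else 0.0

    # Only rewrite when there is a clear contextual winner.
    if best_score - runner_up >= margin:
        return best_word, True
    return fragment, False
```

With grocery-list scores such as `{"flour": 0.9, "sour": 0.3, "hour": 0.1}`, the fragment "our" becomes "flour"; with two near-tied candidates it stays untouched and can be flagged for a human.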
However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.
[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p
Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".
Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.
If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.
You then conflate the usefulness of some guessing skill with logic and statistics.
How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.
How often matters: putting a number on the statistical outcome of an event is very important. If your claim is that LLMs are wrong 50% of the time then you need to update your priors based on some actual experience.
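As a small illustration of "updating your priors", here is a Beta-Bernoulli update. It assumes a uniform Beta(1, 1) prior (mean error rate of 50%) and treats each transcription as an independent trial; both are simplifying assumptions, and `posterior_error_rate` is a hypothetical helper, not anything from the thread.

```python
def posterior_error_rate(errors, trials, prior_errors=1.0, prior_successes=1.0):
    """Posterior mean error rate under a Beta prior.

    The default Beta(1, 1) prior is uniform, so with no data the
    estimate is 0.5, i.e. "wrong 50% of the time".
    """
    if not 0 <= errors <= trials:
        raise ValueError("need 0 <= errors <= trials")
    alpha = prior_errors + errors                # pseudo-count of errors
    beta = prior_successes + (trials - errors)   # pseudo-count of successes
    return alpha / (alpha + beta)
```

With no observations the estimate stays at 0.5; after, say, 2 errors in 98 trials it drops to 3/100, which is the sense in which actual experience should move the number.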
To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.
See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414
Odd timing, too, given the Flash 2.0 release and its performance on this problem.
I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.
(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)
> We have hundreds of examples like this queued up, so let us know if you want some more!
Link to it then, let people verify.
I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.
[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...
99.8%+ on first pass
Even with the best OCR and high-resolution scans, you might not get this due to:
- the quality of the original paper documents, and
- the language
I have non-English documents for which I'd love to have 99% accuracy!
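The accuracy figures in this thread don't say which metric they use. One common choice is character-level accuracy, 1 minus the edit distance divided by the reference length. A minimal sketch under that assumption (the function names are hypothetical):

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference, hypothesis):
    """1 - CER, clamped at 0 for hypotheses far longer than the reference."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Note that a metric like this needs ground truth, and it averages over characters: a 99%-accurate page can still have the one wrong digit that matters.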
I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.
PI tno Name Time 3.5 km 18 C (cont.)
MEN B (39) 3(34) 4(52) 5(53) 6(54) 7(55)
8(40) 9(57)
12(60) 13(61) 14(62) 15(63) 16(47)
17(48) 18(100)
1(51) 2(33)
10(58) 11(59)
The first column is offset vertically, which mixes up information and is wrong.
I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:
PI tno Name Time
MEN B (39) 3.5 km 18 C (cont.)
1 (51) 2 (33) 3 (34) 4 (52) 5 (53) 6 (54) 7 (55) 8 (40) 9 (57)
10 (58) 11 (59) 12 (60) 13 (61) 14 (62) 15 (63) 16 (47) 17 (48) 18 (100)
Finish
13 425 Peter Hodkinson 11:40 0:48 +0: 06 (21) 1:29 +0: 13 (28) 1:58 +0: 13 (24) 2:44 +0: 18 (23) 3:38 +0: 20 (19) 4:28 +0: 22 (18) 5:05 +0: 23 (17) 5:36 +0: 26 (17) 6:19 +0: 29 (19)
Great Britain 0:48 +0: 06 (21) 0:41 +0: 09 (30) 0:29 +0: 01 (4) 0:46 +0: 07 (22) 0:54 +0: 02 (5) 0:50 +0: 03 (7) 0:37 +0: 02 (10) 0:31 +0: 03 (11) 0:43 +0: 05 (20)
6:47 +0: 28 (17) 7:02 +0: 29 (17) 8:21 +0: 38 (16) 8:41 +0: 39 (16) 9:00 +0: 41 (16) 9:13 +0: 42 (16) 9:43 +0: 42 (16) 10:36 +0: 43 (14) 11:32 +0: 41 (13)
0:28 +0: 02 (8) 0:15 +0: 01 (4) 1:19 +0: 11 (16) 0:20 +0: 03 (15) 0:19 +0: 02 (4) 0:13 +0: 02 (11) 0:30 +0: 01 (2) 0:53 +0: 01 (3) 0:56 0:00 (1)
11:40 +0: 40 (13)
0:08 +0: 00 (8)
(edit: line wrap messes it all up... still I think my version is better ;-)
Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.
Ingesting PDFs and why Gemini 2.0 changes everything
> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.
The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.
It still fails on this today (the "bdbdffdf" part). Not allowed to share a chat with a picture it seems, my prompt was to upload the file below and "Image to text please.". Just the free 4o model, maybe the paid stuff is better.
Is there any small model that would do this effectively, with pure text extraction (without going for any kind of formatting or whatnot)?
It is a 2b vision model that runs anywhere and can object detect, point, query, and more.
A fun thread to read for the current hype cycle.
You can tell who is working in the field by the fact they don't use VLMs for OCR and who isn't because they think it's a solved problem.
A question to the authors.
Do you have resources to train any VLMs from scratch? They aren't quite the beasts the SOTA LLMs are, and I think they can be made a lot more useful with:
1). Better training data.
2). Larger vision parts of the model.
In short: 2d attention is not something that anyone's doing at scale - that I know of - and is a no brainer for understanding images.
We never had the budget to do it but I do have some notes somewhere on a 2d context free grammar to generate arbitrarily nested rows/columns and a css styling that got applied to the xhtml output of the grammar. It dynamically generated as much high quality synthetic data as you wanted - but the IBM and similar data sets were plenty big enough for what we could do even on specialist models.
It depends on what you're doing really. I thought that we'd done pretty well, then someone on HN reached out with a table that spanned 50 pages and I just gave up.
Feel free to drop an email if you'd like a quick chat. I find the state of table models particularly abysmal for how important they are.
One note - there was a callout at the end to "stay tuned" for a follow-up post about the actual solution. I may have missed it, but I don't see any way to actually sign up to the blog or newsletter or anything. That's a shame - I'd love to follow this topic and product (and potentially have a few real-world use cases for it).
The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.
fully agree on the last point, the vit architecture will need some working on for this — microsoft’s been doing some excellent research on this lately
> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.
This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.
Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.
> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.
Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But in broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and often times requires surfacing specific text from the image. How else would it differentiate powerpoint slides? Math problems in images? Etc.
> Fixed patch sizes may split individual characters
This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.
> Position embeddings lose fine-grained spatial relationships
This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.
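To make the arithmetic concrete: given a patch's index and a pixel's offset inside that patch, the absolute pixel position follows directly. This is only a sketch showing that the information is present; a trained model would have to learn the equivalent implicitly. The 16-pixel patch size and row-major patch order are assumptions, and `pixel_position` is a hypothetical helper.

```python
def pixel_position(patch_index, offset_row, offset_col, image_width, patch_size=16):
    """Absolute (row, col) of a pixel from its patch index and in-patch offset.

    Assumes square patches laid out in row-major order.
    """
    if image_width % patch_size:
        raise ValueError("image width must be a multiple of the patch size")
    if not (0 <= offset_row < patch_size and 0 <= offset_col < patch_size):
        raise ValueError("offsets must lie inside the patch")

    patches_per_row = image_width // patch_size
    patch_row, patch_col = divmod(patch_index, patches_per_row)
    return (patch_row * patch_size + offset_row,
            patch_col * patch_size + offset_col)
```

For a 384-pixel-wide image, patch 25 sits at patch row 1, column 1, so its pixel at offset (3, 5) is at absolute position (19, 21).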
> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.
You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.
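A toy sketch of what such a score looks like: cosine similarity between an image embedding and candidate text embeddings, turned into relative confidences with a temperature-scaled softmax. The two-dimensional embeddings and the temperature of 0.07 are made up for illustration, and the function names are hypothetical; real embeddings come out of the trained encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def caption_confidences(image_embedding, caption_embeddings, temperature=0.07):
    """Softmax over similarities: relative confidence per candidate caption."""
    sims = {caption: cosine_similarity(image_embedding, emb)
            for caption, emb in caption_embeddings.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    weights = {caption: math.exp((sim - peak) / temperature)
               for caption, sim in sims.items()}
    total = sum(weights.values())
    return {caption: weight / total for caption, weight in weights.items()}
```

These are relative scores among the candidates offered, so they say which caption fits best, not whether any of them is actually right.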
OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.
Oh and there's Florence, which is a VLM trained on bounding boxes.
> Favor common words over exact transcription
Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.
> "Correct" perceived errors in the source document
Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.
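A sketch of what "this is my exact transcription, I corrected it to this" could look like as a data structure, with one threshold knob. All names here (`Transcription`, `apply_correction`, `threshold`) are hypothetical, and the confidence value is assumed to come from whatever engine proposed the correction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcription:
    raw: str           # exactly what was read off the page
    final: str         # what gets handed downstream
    confidence: float  # assumed 0..1 score for the suggested correction
    corrected: bool    # True only if final differs from raw


def apply_correction(raw, suggestion, confidence, threshold=0.9):
    """Accept a suggested correction only above the threshold; always keep raw."""
    use_suggestion = suggestion != raw and confidence >= threshold
    return Transcription(
        raw=raw,
        final=suggestion if use_suggestion else raw,
        confidence=confidence,
        corrected=use_suggestion,
    )
```

Because `raw` is always kept, a reviewer can audit every correction after the fact, which is the part that blind rewriting loses.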
> Merge or reorder information based on learned patterns
LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.
> Produce different outputs for the same input due to sampling
You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.
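A minimal sketch of both ideas with toy numbers: greedy decoding takes the argmax at every step (so the same logits always give the same text), and the softmax probability of the chosen token serves as a per-token confidence. The vocabulary and logits are invented for illustration; whether a hosted model is bit-for-bit reproducible in practice is a separate question.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_logits, vocabulary):
    """Pick the most likely token at each step; return (token, probability) pairs."""
    decoded = []
    for logits in step_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        decoded.append((vocabulary[best], probs[best]))
    return decoded
```

A low probability on a chosen token (for example a near coin flip between "m" and "rn") is exactly the kind of position worth routing to a human.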
And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.
If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.
It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, production LLMs only really put probability fields around tokens that are legitimately valid for the task at hand (bounded by the LLM's intelligence, of course).
> Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.
> Consider the sequence "rn" versus "m". To a human reader scanning quickly, or an LLM processing image patches, these can appear nearly identical. The model, trained on vast amounts of natural language, will tend toward the statistically more common "m" when uncertain.
Again, LLMs don't just regurgitate the most "common" stuff. They are context specific. Besides, it's the vision module that would be making the differentiation here between rn and m. A vision module that is likely neither better nor worse than the vision modules traditional OCR systems are using. (Of course, the LLM may process the vision module's output and notice that perhaps it mis-transcribed "rn" vs "m" and "correct" it. But correct it based on _context_ not on some simplistic statistical model as suggested.)
> There’s a great paper from July 2024 (millennia ago in the world of AI) titled “Vision language models are blind” that emphasizes shockingly poor performance on visual tasks a 5 year old could do
Absolutely. I work in this field, and these vision models are not at the same level as their language counterparts. Due in large part to a lack of good data, good training processes, and good benchmarks. The Cambrian-1 paper is quite insightful here, as it studies the vision benchmarks themselves (https://arxiv.org/abs/2406.16860). The TLDR is that most of the vision benchmarks are actually just text benchmarks, and performance barely degrades when the model is blinded. I've found the same to be true of almost all publicly available training datasets for vision models, which is likely why these models don't learn good, robust visual understandings.
That doesn't really speak to the fundamental capabilities of the vision models. It speaks to the lack of training them well. So, if a model is explicitly trained to do OCR using lots of high quality ground truth data (which is easy to get and generate), then their performance can, and does, excel.
---
Now, all of that said, I also don't agree with the prior post this post is in response to. I work with VLMs a lot as part of my research, and I can assure you that they are nowhere near human level on OCR. They can exceed human performance in very specific tasks at the moment, but that's about it.
Are they better than other OCR offerings? As of this moment, I would tend to trust someone who does OCR for a living, so if Pulse says VLMs aren't as good as their solution, I would probably trust that over someone else saying VLMs work for their specific application. And VLMs _absolutely_ come with a myriad of caveats. They aren't as reliable as a more mechanical OCR system. Expect something like GPT4o to completely glitch 1 in every 10,000 queries. And expect them to be "weird". GPT4o will tend to not fully follow instructions maybe 1 in 100 times, so you might get your document back in the wrong format, or have "Sure, I can help with that!" at the start of your document, etc. Gemini tends to have better instruction following, but I don't have a good assessment of its reliability yet.
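To put those rough rates in perspective at volume, a back-of-the-envelope sketch. It assumes failures are independent across pages, which is a simplification, and takes the 1-in-10,000 and 1-in-100 figures above as ballpark guesses, not measurements. The function names are hypothetical.

```python
def expected_failures(pages, failure_rate):
    """Expected number of failed pages at a given per-page failure rate."""
    return pages * failure_rate


def probability_of_any_failure(pages, failure_rate):
    """Chance that at least one page in the batch fails (independence assumed)."""
    return 1.0 - (1.0 - failure_rate) ** pages
```

Over a million pages, a 1-in-10,000 glitch rate works out to about 100 expected glitches and a 1-in-100 formatting-miss rate to about 10,000, so at that scale some downstream check is doing real work.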
If I, personally, had a small project that needed OCR, I'd use Tesseract if it's just PDFs or something like that with printed text. If it's something with weird fonts, fancy stuff, handwriting, math formulas, etc. I might give Gemini a try. If it's mission critical, pay an expert to do it, whether that's in-house or paying a service explicitly built for the purpose.
---
NOTE: One thing that got glossed over in the article is that VLMs are not trained on the "embeddings" of the vision model, per se. CLIP processes the images as N number of tokens across L number of layers. At the end, you have N embeddings. For traditional CLIP, the last (or first) embedding is used as the result. Modern CLIPs average the embeddings together. Tomato, tomato.
VLMs are not trained on that single embedding from CLIP. The "head" gets stripped off, and the VLMs get trained on all N processed tokens from CLIP. So they have access to much more information. The vision models also get finetuned during the training of the VLM, and, importantly, CLIP architectures use skip connections throughout. So there is a direct path for the LLM to access pretty much anything from the vision model that it needs, and optimize for any information it needs.
The size of the embedded information given to the LLM, then, is about the same as the number of pixels in the source image. For example, it might be something like a 384x384x3 image (442,368 dimensions) getting baked down into something like a 150,000-dimensional vector. So it's really not a fundamentally lossy process at that point.
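The arithmetic can be checked in a few lines. The patch size of 32 and embedding width of 1024 below are assumptions picked only to land near the "something like 150,000" figure; actual models vary, and the function names are hypothetical.

```python
def image_dimensions(height, width, channels=3):
    """Number of raw values in the input image."""
    return height * width * channels


def token_dimensions(height, width, patch_size, embed_dim):
    """Total values across all patch tokens handed to the LLM."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = (height // patch_size) * (width // patch_size)
    return tokens * embed_dim
```

Under those assumptions a 384x384x3 image has 442,368 input values and yields 144 tokens of 1024 floats, i.e. 147,456 values, which happens to equal the 384x384 pixel count.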
My goal was to run an OCR model locally and extract text from scanned PDFs.
Many models could not even be run. Those that did run, thanks to Ollama, provided a very poor experience, e.g. llava-llama3, phi3.5 vision, etc.
What worked really well, though still not up to the mark, was Surya [0].
It works perfectly on screenshots from true text PDFs, but not from scanned PDFs. Also has much better performance for English than Indian languages.
And the 5 square variation as well.
So perhaps it is just a question of how much compute you are willing to throw at it
This article is essentially correct.
https://arxiv.org/abs/2311.06242
https://huggingface.co/blog/finetune-florence2
https://blog.roboflow.com/florence-2-ocr/
https://www.assemblyai.com/blog/florence-2-how-it-works-how-...
I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.
In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.
Nice post and response to the previous one.
It’s important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR. They’re not mutually exclusive. VLMs are adaptable with prompting, e.g. "please pull out the entries related to CapEx and summarize the contributions". Layout parsers and OCR are often used for indexing and document automation. Each will have its own place in an enterprise stack.
Except for a very special kind of bug:
https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
>Xerox scanners/photocopiers randomly alter numbers in scanned documents
> Fixed patch sizes may split individual characters
> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.
The author suggests that the standard ViT architecture is poorly suited for OCR because patches do not respect character boundaries and that the positional embeddings only embed the locations of patches, which are 16x16 pixels.
My mental model is that a token is a memory slot where computation results can be stored or retrieved from. There is no reason why the layout of these memory slots must mimic the layout of the document, except at the very first layer, because then we don't have to think too hard about how to encode the document.
I fed my system the first image in the post [0] and got the text below in return.
I will be looking for beta testers next week... Email if interested!
VEH YR MAKE MODEL IDENTIFICATION TYPE SYM ST TER USE CLASS ALARM
2 02 HOND CIVIC EX 1HGEM22952L086006 PP 18 IL 37 L 887120
LOSS PAYEE THAT APPLIES: 2
3.02 HYUN SONATA / GL KMHWF25S72A671544 PP 16 IL 37 P 887120
H NO. COVERAGE DESCRIPTION LIABILITY LIMIT (S) DEDUCTIBLE PREMIUM
2 Preferred Extra Auto
Bodily Injury $ 250,000 / $ 500,000 $ 92.00
Property Damage $ 100,000 $ 43.00
Medical Payments $ 5,000 $ 13.00
Uninsured Motorist $ 250,000 / $ 500,000 $ 62.00
Undinsured Motor.-BI $ 250,000 / $ 500,000 INCL
Collision $ 500 $ 141.00
Other than Collision $ 250 $ 92.00
TOTAL FOR UNIT 2 $ 443.00
3- Preferred Extra Auto
Bodily Injury $ 250,000 / $ 500,000 $ 92.00
Property Damage $ 100,000 $ 43.00
Medical Payments $ 5,000 $ 13.00
Uninsured Motorist $ 250,000 / $ 500,000 $ 62.00
Undinsured Motor. BI $ 250,000 / $ 500,000 INCL
Collision $ 500 $ 136.00
Other than Collision $ 250 $ 90.00
TOTAL FOR UNIT 3 $ 436.00
DRIVER INFORMATION
DR VEH SEX MAR BIRTH G / S PRIN DVR LIC NO. NAME PTS
[0] https://i.imgur.com/sLWQoFG.jpeg
Disclaimer: I'm the founder and CEO.
I cofounded DocuPanda to handle this issue specifically. Call me biased, but I do believe it's the best solution out there.
This is a confusing way to describe attention and gets a bit off topic; the attention mechanism is not really what’s causing any of the issues in the article.
and even today it can never be done right
because SOME resumes are just so f** up.