Why LLMs still have problems with OCR

(runpulse.com)

218 points · ritvikpandey21 · 1y ago · 145 comments

Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we researched thoroughly. Here’s our take: ingestion is a multistep pipeline, and maintaining confidence from LLM nondeterministic outputs over millions of pages is a problem.

michaelbuckbee1y ago· 19 in thread

I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

llm_trw1y ago

This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.

I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

osigurdson1y ago

It does seem that companies are able to get reliability in narrow problem domains via prompts, evals and fine tuning.

ritvikpandey21OP1y ago

we completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed source frontier models. im excited to see where that field progresses

dyauspitr1y ago

You can literally ask it to mark characters that are not clear instead of inferring them.

paulsutter1y ago

Unit tests work remarkably well

gcanyon1y ago

> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

So are human beings. Meaning we've been working around this issue since forever, we're not suddenly caught up in a new thing here.

Terr_1y ago

I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]

The correct (or at least humanly-expected) process would be to identify the presence of a mangled word, determine what its missing suffixes could have been, and if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken") use that.
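As a rough illustration of that process for the grocery-list case (leading characters cut off), here is a minimal sketch. The vocabulary, the function name `complete_truncated`, and the `max_missing` limit are all hypothetical; a real system would also weigh context, which this sketch does not attempt. The point is only that a completion is accepted when exactly one candidate fits, and flagged otherwise.

```python
from typing import Optional, Set

# Hypothetical vocabulary; a real system would use a much larger lexicon.
GROCERY_WORDS = {"flour", "sugar", "butter", "milk", "eggs", "rice", "ice"}


def complete_truncated(fragment: str, vocabulary: Set[str],
                       max_missing: int = 2) -> Optional[str]:
    """Return the single vocabulary word that ends with `fragment` and has
    at most `max_missing` extra leading characters; None if zero or several
    candidates fit (i.e. the case should be flagged, not guessed)."""
    candidates = [
        word for word in vocabulary
        if word.endswith(fragment)
        and 0 <= len(word) - len(fragment) <= max_missing
    ]
    return candidates[0] if len(candidates) == 1 else None


print(complete_truncated("our", GROCERY_WORDS))  # unambiguous candidate
print(complete_truncated("ice", GROCERY_WORDS))  # "ice" or "rice": ambiguous
```

Returning `None` for the ambiguous case is the design choice that matters here: the caller can surface the uncertainty instead of silently picking a winner.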

However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.

[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p

TeMPOraL1y ago

> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.

Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".

Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.

williamcotton1y ago

On your epistemology, if you correctly guess the outcome of a random event then the statement, even if contingent on an event that did not yet occur, is still true. The same goes for every incorrect guess.

If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.

You then conflate the usefulness of some guessing skill with logic and statistics.

How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.

How often, applying a scalar value to the statistical outcome of an event, is very important. If your claim is that LLMs are wrong 50% of the time then you need to update your priors based on some actual experience.

kaonwarb1y ago

To consider: do we overestimate what we know about how we humans reach an answer? (Humans are very capable of intuitively reading scrambled text, for example, as long as the beginning and ending of each word remains correct.)

afro881y ago

The correct way to handle it is to ask the user if it's not clear, like a real assistant would

nodamage1y ago

I once did something similar with a recipe from a cookbook where the recipe started at the bottom of one page and continued onto the next page. It correctly identified the first few ingredients present in the photo of the first page but then proceeded to hallucinate another half-dozen or so ingredients in order to generate a complete recipe.

ritvikpandey21OP1y ago

yup, this is a pretty common occurrence in using LLMs for data extraction. For personal use (trying to load a receipt) it’s great that the LLM filled in info. For production systems which need high quality, near 100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc. These hallucinations occur quite frequently, and we haven’t found a way to minimize this through prompt eng.

llm_trw1y ago

It's not possible with current gen models.

To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.

See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414

amelius1y ago

Maybe ask it to return the bounding box of every glyph.

jmartin26831y ago

My experiences have been the same… that is to say nothing like what is reported here. This is more pitch than info.

Odd timing, too, given the Flash 2.0 release and its performance on this problem.

Buttons8401y ago

Failed as a machine, succeeded as an intelligence. Intelligences don't make good machines though.

davidhs1y ago

I recently took a picture of the ingredients list on my multivitamin and fed it into ChatGPT o1-pro (at least that was the config.) and it made up ingredients and messed up the quantities.

daveguy1y ago

Sounds like a xerox.

jll291y ago· 9 in thread

In case any scientist actually working on adaptive OCR is reading this, I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality, compute demands are not an issue, I've got an army of A100s available.

I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collab. suggestions welcome.

zeograd1y ago

I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)

moffkalast1y ago

When I was doing OCR for some screenshots last year I managed to get it done with tesseract, but just barely. When looking for alternatives later on I found something called Surya on github which people claim does a lot better and looks quite promising. I've had it bookmarked for testing forever but I haven't gotten around to actually doing it. Maybe worth a try I guess?

ianhawes1y ago

Surya is on par with cloud vision offerings.

ritvikpandey21OP1y ago

would love to give this a shot with pulse! feel free to reach out to me at ritvik [at] trypulse [dot] ai, and i’d be very curious to give these a run! in general, i’m happy to give some general advice on algos/models to fine-tune for this task

sumedh1y ago

Are you targeting business or consumers?

I cannot find the pricing page.

patcon1y ago

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

ahoka1y ago

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

pbhjpbhj1y ago

If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?

mdbmdb1y ago

I would love to get access to that archive!

jeswin1y ago· 6 in thread

If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post earlier this week, it should provide samples which fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."

> We have hundreds of examples like this queued up, so let us know if you want some more!

Link to it then, let people verify.

I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.

[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...

sgc1y ago

99%+ is terrible in the OCR world. 99.8%+ on first pass, and 99.99%+ (1/10k characters error) at the end of the process - which includes human reviewers in the loop - is ok, but the goal is higher fidelity than that. If we are throwing billions at the problem, I would expect at least another 9 on that.
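To make figures like these concrete, a minimal sketch of how character-level accuracy is commonly computed: edit distance between a reference transcription and the OCR output, divided by the reference length. The function names are illustrative, and this is one common definition rather than the specific metric the commenter's process uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = levenshtein(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# The classic "rn" vs "m" confusion: two edits in a six-character word.
print(char_accuracy("modern", "modem"))
```

For scale, assuming a page of roughly 3,000 characters (an assumed figure), 99% accuracy leaves about 30 wrong characters per page, while 99.99% leaves about 0.3.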

rahimnathwani1y ago

> 99.8%+ on first pass

Even with the best OCR, and high resolution scans, you might not get this due to:

- the quality of the original paper documents, and

- the language

I have non-English documents for which I'd love to have 99% accuracy!

davedx1y ago

Ha hi Jeswin! I was itching to reply to this post too, I wonder why…

jeswin1y ago

Dave! Our sample sizes were large enough, and tables complex enough to opine on this.

I suppose Gemini or Claude could fail with scans or handwritten pages. But that's a smaller (and different) set of use cases than just OCR. Most PDFs (in healthcare, financial services, insurance) are digital.

bambax1y ago

Using that image and the following prompt on Gemini 2.0 Flash "please do ocr of the attached file and output ascii following the layout of the original as faithfully as possible" outputs something that isn't bad but not perfect:

  PI tno Name             Time            3.5 km   18 C   (cont.)
  MEN B (39)                                                  3(34)         4(52)         5(53)         6(54)         7(55)         
  8(40)         9(57)
                                                               12(60)        13(61)        14(62)        15(63)        16(47)        
  17(48)       18(100)
                                                  1(51)         2(33)
                                                  10(58)        11(59)

The first column is offset vertically which mixes up information and is wrong.

I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:

  PI      tno Name                       Time
  
  MEN   B (39)                                                         3.5 km       18 C          (cont.)
                                                   1 (51)                  2 (33)                 3 (34)                  4 (52)                  5   (53)                  6 (54)                  7 (55)                  8 (40)                 9 (57)
                                                  10 (58)                 11 (59)                12 (60)                 13 (61)                 14 (62)                 15 (63)                16 (47)                 17 (48)                18 (100)
                                                  Finish
  
  13     425  Peter  Hodkinson          11:40       0:48   +0: 06 (21)      1:29  +0: 13 (28)      1:58   +0: 13 (24)      2:44   +0: 18 (23)      3:38   +0: 20 (19)     4:28    +0: 22 (18)     5:05   +0: 23 (17)      5:36   +0: 26 (17)      6:19   +0: 29 (19)
              Great  Britain                        0:48   +0: 06 (21)      0:41  +0: 09 (30)      0:29   +0: 01 (4)       0:46   +0: 07 (22)      0:54   +0: 02 (5)      0:50    +0: 03 (7)      0:37   +0: 02 (10)      0:31   +0: 03 (11)      0:43   +0: 05 (20)
                                                    6:47   +0: 28 (17)     7:02   +0: 29 (17)      8:21   +0: 38 (16)      8:41   +0: 39 (16)      9:00   +0: 41 (16)     9:13    +0: 42 (16)     9:43   +0: 42 (16)     10:36   +0: 43 (14)     11:32   +0: 41 (13)
                                                    0:28   +0: 02 (8)      0:15   +0: 01 (4)       1:19   +0: 11 (16)      0:20   +0: 03 (15)      0:19   +0: 02 (4)      0:13   +0: 02 (11)      0:30   +0: 01 (2)       0:53   +0: 01 (3)       0:56    0:00  (1)
                                                   11:40   +0: 40 (13)
                                                    0:08   +0: 00 (8)

(edit: line wrap messes it all up... still I think my version is better ;-)

jeswin1y ago

I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.

Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.

password43211y ago· 4 in thread

As opposed to the discussion 2 days ago with 400+ comments:

Ingesting PDFs and why Gemini 2.0 changes everything

https://news.ycombinator.com/item?id=42952605

h0l0cube1y ago

FTA:

> This week, there was a viral blog about Gemini 2.0 being used for complex PDF parsing, leading many to the same hypothesis we had nearly a year ago at this point. Data ingestion is a multistep pipeline, and maintaining confidence from these nondeterministic outputs over millions of pages is a problem.

password43211y ago

Yes and per the poster's opening comment:

https://news.ycombinator.com/item?id=42966958#42966959

_ea1k1y ago

That's what I thought too, but apparently the title is pure, absolute, rage-inducing clickbait.

The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.

dang1y ago

I assume you mean the title of the current thread? I've attempted to make it less baity now.

osigurdson1y ago· 4 in thread

ChatGPT is also still hilariously bad at drawing diagrams - universally producing a silly cartoon with misspelled words. The rate of improvement over the past two years is effectively zero.

catlifeonmars1y ago

Why would you use ChatGPT to draw diagrams? It’s a generative language model. Just because you can doesn’t mean it’s the best tool for the job.

osigurdson1y ago

Why not include some suggested tools in your comment? "you are an idiot for using ChatGPT" (paraphrased) isn't very helpful.

Logge1y ago

That's DALL-E 3, which is not an LLM

osigurdson1y ago

Good point. I probably knew that at one time but now leverage it via chatgpt so forgot. Does anyone know if there is an AI wall with text to image?

codingwagie1y ago· 4 in thread

Really? I have been using 4o, and it's flawless at OCR

ritvikpandey21OP1y ago

give it a shot with a few of the examples in the blog! or better yet, find some financial statements from Goldman/morgan Stanley and run it through the model.

sumedh1y ago

Check the output again, there will be small mistakes if your text is large enough.

phatfish1y ago

I used it once, was given a screenshot that contained a SHA1 hash and needed it in text. Maybe this is a case where ChatGPT can do a small task quickly for me and save me squinting?

It still fails on this today (the "bdbdffdf" part). Not allowed to share a chat with a picture it seems, my prompt was to upload the file below and "Image to text please.". Just the free 4o model, maybe the paid stuff is better.

https://postimg.cc/m1jNPL0j

8n4vidtmkvmk1y ago

Amusingly it tried to write a Python script to OCR it first, decided there were errors and tried to correct it.... it did correct some stuff and nearly got it, but I was able to spot an error 3/4 through with my eyeballs after a couple minutes.

https://i.imgur.com/UuO3JxM.png

thorum1y ago· 3 in thread

This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts along with a JSON description of what the correct OCR should be, mix in some human annotated datasets, then do RL against a verifier that insists on 100% accuracy.
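A minimal sketch of the two pieces described above: a generator that produces a synthetic table together with its ground-truth JSON, and a verifier that pays out only on an exact match. Everything here is hypothetical scaffolding. `render_as_text` is a plain-text stand-in for a real page renderer that would produce an image, and the RL training loop itself is out of scope.

```python
import json
import random


def make_synthetic_table(rng: random.Random, rows: int = 3, cols: int = 3) -> dict:
    """Generate a random numeric table; this dict doubles as the ground truth."""
    return {
        "columns": [f"col_{c}" for c in range(cols)],
        "rows": [[f"{rng.uniform(0, 10_000):.2f}" for _ in range(cols)]
                 for _ in range(rows)],
    }


def render_as_text(table: dict) -> str:
    """Stand-in for a real renderer, which would rasterize the page."""
    lines = ["  ".join(table["columns"])]
    lines += ["  ".join(row) for row in table["rows"]]
    return "\n".join(lines)


def strict_reward(ground_truth: dict, model_output: str) -> float:
    """Reward 1.0 only if the output parses as JSON and matches exactly."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if predicted == ground_truth else 0.0


rng = random.Random(0)
table = make_synthetic_table(rng)
print(render_as_text(table))
print(strict_reward(table, json.dumps(table)))  # exact match
```

Values are kept as strings so that a dropped trailing zero or a changed digit counts as a mismatch rather than being hidden by float comparison.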

singularity20011y ago

I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning

barrenko1y ago

not an expert - yes, what would usually just be called training, with LLMs here is called RL. You do end up writing a sort of a reward function, so I guess it is RL.

hodapp1y ago

You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.

kyriakos1y ago· 3 in thread

I find that LLMs can read text off product label photos I can't even read myself.

AlphaAndOmega01y ago

If you don't know what the text says, do you have access to some other form of ground truth? Because otherwise you don't know if they're reading illegible labels correctly!

kyriakos1y ago

I can know what the text says cause I have the actual product available :) but you are right if the llm can't read it will fill in the gap with hallucinations probably

ritvikpandey21OP1y ago

yes they usually can! we delved into the mathematics behind this a bit in the blog, but tldr the LLMs are making educated guesses based on the embedding similarities - which can be detrimental for ocr systems.

julienchastang1y ago· 3 in thread

I've had limited but good experience (with both English and French text) with Tesseract, then getting ChatGPT to fix problems with clever prompting (e.g., pretend you are an expert OCR corrector, blah blah, blah).

ritvikpandey21OP1y ago

for most (text-dense) documents without much layout differences, these small prompt eng tricks work pretty well! scaling this to complex layouts and 1000+ page docs, we found the models don’t stick to their instructions. perhaps there’s some work to be done with 1M+ context length models so they don’t lose layout memory.

pbhjpbhj1y ago

Do any models use some sort of context pruning to keep the [most] relevant parts of the context?

What single documents are you processing that are 1000+ pages?

mulmboy1y ago

Is processing one page at a time not feasible? I'm always chunking things as small as possible for LLMs

lazyeye1y ago· 3 in thread

Is this just a training issue? They just need to train a model specifically for OCR?

ritvikpandey21OP1y ago

we don’t think so - we’ve fine tuned most of the SOTA language models available today on table datasets, documents with complex layouts, and while they do perform better, seems like they’re still prone to the same hallucinations. these frontier models have pretty much already been trained on most of the internet at this point, and tons of publicly available documents.

sumedh1y ago

How does your solution compare to AWS textract?

anon3738391y ago

They probably do this already. But the problem is more fundamental: there are simply no process guarantees or guardrails inside a generative model to constrain the failure modes.

apt-get1y ago· 2 in thread

Question to anyone with experience in this domain: I have CSAM spam problems on a forum I host, with bots putting link shortener URLs embedded in images rather than the post body. Traditional OCR software deals poorly with them due to font modifications and intentional text edge modifications, and I'm obviously not gonna use a SaaS/closed source model to upload a bunch of may-be-may-not-be-CSAM pictures, so looking for a way to do this locally, with cheapish inference if possible (I don't mind spending a minute of compute to get the result out for one image, but need to do it on the CPU).

Is there any small model that would do this effectively, with pure text extraction (without going for any kind of formatting or whatnot)?

parsakhaz1y ago

Yup, Moondream is great for this use case! You can use locally with the quickstart: https://docs.moondream.ai/

It is a 2b vision model that runs anywhere and can object detect, point, query, and more.

sramam1y ago

Have you looked at https://moondream.ai/?

llm_trw1y ago· 2 in thread

This is a response to: https://news.ycombinator.com/item?id=42952605

A fun thread to read for the current hype cycle.

You can tell who is working in the field by the fact they don't use VLMs for OCR and who isn't because they think it's a solved problem.

A question to the authors.

Do you have resources to train any VLMs from scratch? They aren't quite the beasts the SOTA LLMs are and I think they can be made a lot more useful with:

1). Better training data.

2). Larger vision parts of the model.

In short: 2d attention is not something that anyone's doing at scale - that I know of - and is a no brainer for understanding images.

ritvikpandey21OP1y ago

appreciate the feedback. completely agree, tuning and/or training a VLM will definitely produce better ocr extractions. however, it’s notoriously hard to accumulate a really good ground truth labeled dataset of pdf/excel/pptx. there are some resources online especially for tables, with IBM’s labeled table dataset for example. however, we’d guess the same hallucination issues will persist on complex layouts

llm_trw1y ago

You can generate the data synthetically.

We never had the budget to do it but I do have some notes somewhere on a 2d context free grammar to generate arbitrarily nested rows/columns and a css styling that got applied to the xhtml output of the grammar. It dynamically generated as much high quality synthetic data as you wanted - but the IBM and similar data sets were plenty big enough for what we could do even on specialist models.

It depends on what you're doing really. I thought that we'd done pretty well, then someone on HN reached out with a table that spanned 50 pages and I just gave up.

Feel free to drop an email if you'd like a quick chat. I find the state of table models particularly abysmal for how important they are.
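For readers curious what a grammar-driven table generator might look like, here is a minimal sketch. It is not the commenter's grammar, which is not shown in the thread; the productions (a table is one or more rows, a row is one or more cells, a cell is either a number or a nested table), the nesting probability, and all function names are assumptions. CSS styling is omitted.

```python
import random
import xml.etree.ElementTree as ET


def gen_cell(rng: random.Random, depth: int) -> ET.Element:
    """Cell -> Number | Table (nesting allowed only while depth > 0)."""
    td = ET.Element("td")
    if depth > 0 and rng.random() < 0.3:
        td.append(gen_table(rng, depth - 1))
    else:
        td.text = str(rng.randint(0, 9999))
    return td


def gen_table(rng: random.Random, depth: int) -> ET.Element:
    """Table -> Row+, Row -> Cell+ (same column count for every row)."""
    table = ET.Element("table")
    n_cols = rng.randint(1, 3)
    for _ in range(rng.randint(1, 3)):
        tr = ET.SubElement(table, "tr")
        for _ in range(n_cols):
            tr.append(gen_cell(rng, depth))
    return table


def table_to_xhtml(table: ET.Element) -> str:
    """Serialize the element tree to a well-formed XHTML fragment."""
    return ET.tostring(table, encoding="unicode")


print(table_to_xhtml(gen_table(random.Random(1), depth=2)))
```

Because the element tree is built before serialization, the same structure can serve as the ground-truth label for whatever image is rendered from the markup.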

edanm1y ago· 2 in thread

I'd just like to say this is a fantastic "marketing" blog post. Great explanation of an interesting problem, that this company theoretically helps solve. Very well done!

One note - there was a callout at the end to "stay tuned" for a follow-up post about the actual solution. I may have missed it, but I don't see any way to actually sign up to the blog or newsletter or anything. That's a shame - I'd love to follow this topic and product (and potentially have a few real-world use cases for it).

ritvikpandey21OP1y ago

thanks for the kind words! we do have a mailing list for current users, I can add you to that

edanm1y ago

Thank you! Saw you contacted me, that's great. I'm planning to study your product more during the week to see if it's a fit for something we're building. :)

levocardia1y ago· 2 in thread

LLMs do not struggle at all with raw text: they never lose decimal places or drop digits when transcribing a table from raw text. So the problem is not the internal representation. I do this all the time and all major LLMs work eminently well at it.

The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.

ritvikpandey21OP1y ago

on raw text, LLMs usually do not struggle. however, when you start processing low-fidelity images (receipt scans with stains, documents with marks all over them, bent corners/areas, rotated docs) these transcription issues become extremely noticeable. to your point about table extraction, i disagree - we’ve had many examples on complex nested tables where the model hallucinated digits, especially from documents with weird aspect ratios.

fully agree on the last point, the vit architecture will need some working on for this — microsoft’s been doing some excellent research on this lately

croes1y ago

If you have raw text you don’t need OCR.

fpgaminer1y ago· 2 in thread

A lot of problems jump out to me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM based OCR, rather than try to talk about how VLMs are somehow fundamentally flawed or something.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But in broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and often times requires surfacing specific text from the image. How else would it differentiate powerpoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.
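The arithmetic behind that claim can be sketched as follows. Given a patch's index and an offset inside the patch, the absolute pixel coordinate follows directly. This assumes non-overlapping square patches in row-major order, and the function name is illustrative; it shows that the information is recoverable, not how a trained model actually uses it.

```python
from typing import Tuple


def absolute_pixel(patch_index: int, local_row: int, local_col: int,
                   patch_size: int, image_width: int) -> Tuple[int, int]:
    """Map (patch index, offset within patch) to absolute (row, col),
    assuming non-overlapping square patches laid out in row-major order."""
    patches_per_row = image_width // patch_size
    patch_row, patch_col = divmod(patch_index, patches_per_row)
    return (patch_row * patch_size + local_row,
            patch_col * patch_size + local_col)


# A 224-pixel-wide image with 16x16 patches has 14 patches per row, so
# patch 15 is the second patch of the second row.
print(absolute_pixel(15, local_row=3, local_col=5,
                     patch_size=16, image_width=224))  # (19, 21)
```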

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.

OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.

Oh and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.
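A toy sketch of both options at a single decoding step: greedy selection (always take the highest-probability token) versus sampling, with the softmax probability doubling as a per-token confidence. The tokens and logits are made-up values for illustration, and the sketch ignores numerical nondeterminism that can still occur in real serving stacks.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(tokens, logits):
    """Sampling off: always return the most probable token and its probability."""
    probs = softmax(logits)
    best = max(range(len(tokens)), key=lambda i: probs[i])
    return tokens[best], probs[best]


def sample_pick(tokens, logits, rng):
    """Sampling on: draw a token in proportion to its probability."""
    probs = softmax(logits)
    idx = rng.choices(range(len(tokens)), weights=probs, k=1)[0]
    return tokens[idx], probs[idx]


# Hypothetical logits for an ambiguous glyph pair.
tokens, logits = ["rn", "m"], [1.2, 1.0]
print(greedy_pick(tokens, logits))                    # same result every call
print(sample_pick(tokens, logits, random.Random(7)))  # depends on the RNG
```

A probability near 0.5, as with these logits, is exactly the kind of signal a pipeline could use to route a character to human review.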

And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, production LLMs only really put probability fields around tokens that are legitimately valid for the task at hand (bounded by the LLM's intelligence, of course).

> Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong. Consider the sequence "rn" versus "m". To a human reader scanning quickly, or an LLM processing image patches, these can appear nearly identical. The model, trained on vast amounts of natural language, will tend toward the statistically more common "m" when uncertain.

Again, LLMs don't just regurgitate the most "common" stuff. They are context specific. Besides, it's the vision module that would be making the differentiation here between rn and m. A vision module that is likely neither better nor worse than the vision modules traditional OCR systems are using. (Of course, the LLM may process the vision module's output and notice that perhaps it mis-transcribed "rn" vs "m" and "correct" it. But correct it based on _context_ not on some simplistic statistical model as suggested.)

> There’s a great paper from July 2024 (millennia ago in the world of AI) titled “Vision language models are blind” that emphasizes shockingly poor performance on visual tasks a 5 year old could do

Absolutely. I work in this field, and these vision models are not at the same level as their language counterparts. Due in large part to a lack of good data, good training processes, and good benchmarks. The Cambrian-1 paper is quite insightful here, as it studies the vision benchmarks themselves (https://arxiv.org/abs/2406.16860). The TLDR is that most of the vision benchmarks are actually just text benchmarks, and performance barely degrades when the model is blinded. I've found the same to be true of almost all publicly available training datasets for vision models, which is likely why these models don't learn good, robust visual understandings.

That doesn't really speak to the fundamental capabilities of the vision models. It speaks to the lack of training them well. So, if a model is explicitly trained to do OCR using lots of high quality ground truth data (which is easy to get and generate), then their performance can, and does, excel.

---

Now, all of that said, I also don't agree with the prior post this post is in response to. I work with VLMs a lot as part of my research, and I can assure you that they are nowhere near human level on OCR. They can exceed human performance in very specific tasks at the moment, but that's about it.

Are they better than other OCR offerings? As of this moment, I would tend to trust someone who does OCR for a living, so if Pulse says VLMs aren't as good as their solution, I would probably trust that over someone else saying VLMs work for their specific application. And VLMs _absolutely_ come with a myriad of caveats. They aren't as reliable as a more mechanical OCR system. Expect something like GPT4o to completely glitch 1 in every 10,000 queries. And expect them to be "weird". GPT4o will tend to not fully follow instructions maybe 1 in 100 times, so you might get your document back in the wrong format, or have "Sure, I can help with that!" at the start of your document, etc. Gemini tends to have better instruction following, but I don't have a good assessment of its reliability yet.

If I, personally, had a small project that needed OCR, I'd use Tesseract if it's just PDFs or something like that with printed text. If it's something with weird fonts, fancy stuff, handwriting, math formulas, etc. I might give Gemini a try. If it's mission critical, pay an expert to do it, whether that's in-house or paying a service explicitly built for the purpose.

---

NOTE: One thing that got glossed over in the article is that VLMs are not trained on the "embeddings" of the vision model, per se. CLIP processes the images as N number of tokens across L number of layers. At the end, you have N embeddings. For traditional CLIP, the last (or first) embedding is used as the result. Modern CLIPs average the embeddings together. Tomato, tomato.

VLMs are not trained on that single embedding from CLIP. The "head" gets stripped off, and the VLMs get trained on all N processed tokens from CLIP. So they have access to much more information. The vision models also get finetuned during the training of the VLM, and, importantly, CLIP architectures use skip connections throughout. So there is a direct path for the LLM to access pretty much anything from the vision model that it needs, and optimize for any information it needs.

The size of the embedded information given to the LLM, then, is roughly the same as the number of pixel values in the source image. For example, it might be something like a 384x384x3 image (442,368 dimensions) getting baked down into something like a 150,000-dimensional vector. So it's really not a fundamentally lossy process at that point.
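To make the arithmetic above concrete, here is a minimal sketch of the counting. The patch size (16) and the two embedding widths (768 and 256) are assumptions chosen for illustration, not numbers from the comment or from any specific VLM; `vision_budget` is a hypothetical helper name.

```python
def vision_budget(side: int, patch: int, width: int, channels: int = 3):
    """Compare raw pixel values with the patch-token output of a ViT-style encoder.

    All parameters are illustrative assumptions, not taken from a real model card.
    """
    if side % patch:
        raise ValueError("image side must be a multiple of the patch size")
    pixel_dims = side * side * channels   # values in the raw image
    num_tokens = (side // patch) ** 2     # one token per non-overlapping patch
    token_dims = num_tokens * width       # values handed to the language model
    return pixel_dims, num_tokens, token_dims


print(vision_budget(384, 16, 768))  # (442368, 576, 442368)
print(vision_budget(384, 16, 256))  # (442368, 576, 147456)
```

With a width of 768 the token grid carries exactly as many values as the raw pixels (16 x 16 x 3 = 768 per patch); the narrower, hypothetical width of 256 lands near the ~150,000 figure quoted above. Either way, the LLM sees hundreds of tokens rather than a single pooled vector.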

myth_drannon1y ago

Which VLMs have you found to be superior? A fine-tuned TrOCR is very good in my experience.

martingoodson1y ago

Written by someone who knows what they are talking about.

2-3-7-43-18071y ago· 2 in thread

i dont understand. what have llms to do with ocr?

esafak1y ago

Some like gpt-4o are multi-modal.

2-3-7-43-18071y ago

the llm isn't multimodal. an llm can only process textual tokens. what should those tokens be for pictures. the llm gets fed a textual representation of what was optically recognized by another process. that's my understanding.

1 more reply

8338550bff961y ago· 2 in thread

February 6, 2024... okay grandpa

jackliuhahaha1y ago

In the article, it references a paper from July 2024, weird...

sidmanchkanti211y ago

fixed the year, good catch

faebi1y ago· 1 in thread

Shouldn't it be easy to generate a lot of OCR data? Generate HTML, randomize, generate image, apply noise and let it train on it.

kevincox1y ago

Yes, but if you aren't careful you will end up with a model carefully tuned for the ways that you add noise, not for all types of noise from the real world. But stuff like this can be very useful for some base training, especially if you add many real-world examples afterwards.
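The generate, render, add-noise loop discussed in this exchange can be sketched in a few lines. This is a toy under stated assumptions: the 3x5 digit font, the salt-and-pepper noise model, and the names `render`, `add_noise`, and `make_sample` are all made up for illustration. A real pipeline would render HTML or PDFs and use far richer degradations, which is exactly the caveat raised above about overfitting to your own noise.

```python
import random

# Hypothetical 3x5 bitmap font for digits; each string is one row, '#' = ink.
FONT = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "2": ["###", "..#", "###", "#..", "###"],
    "3": ["###", "..#", "###", "..#", "###"],
    "4": ["#.#", "#.#", "###", "..#", "..#"],
    "5": ["###", "#..", "###", "..#", "###"],
    "6": ["###", "#..", "###", "#.#", "###"],
    "7": ["###", "..#", "..#", "..#", "..#"],
    "8": ["###", "#.#", "###", "#.#", "###"],
    "9": ["###", "#.#", "###", "..#", "###"],
}


def render(text):
    """Render digits to a 5-row binary image, one blank column between glyphs."""
    rows = [[] for _ in range(5)]
    for i, ch in enumerate(text):
        for r in range(5):
            if i:
                rows[r].append(0)  # spacing column
            rows[r].extend(1 if c == "#" else 0 for c in FONT[ch][r])
    return rows


def add_noise(image, flip_prob, rng):
    """Salt-and-pepper noise: flip each pixel independently with flip_prob."""
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in image]


def make_sample(rng, length=6, flip_prob=0.05):
    """Return (noisy_image, ground_truth_label) for one synthetic example."""
    label = "".join(rng.choice("0123456789") for _ in range(length))
    return add_noise(render(label), flip_prob, rng), label


rng = random.Random(0)  # seeded so the dataset is reproducible
image, label = make_sample(rng)
for row in image:
    print("".join("#" if px else "." for px in row))
print("label:", label)
```

Because the label is generated before rendering, the ground truth is exact by construction; the open question is only how well the noise model matches real scans.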

wkat42421y ago· 1 in thread

I noticed llama 3.2 8b has big problems reading white on black text. Black on white goes way better. But I think it makes sense. They don't look at text like a dedicated OCR algorithm. I see the article elaborates on that very well.

ritvikpandey21OP1y ago

thanks for the feedback!

__rito__1y ago· 1 in thread

I was just trying a bunch of models for OCR. I only have 4 GB of VRAM in my personal machine.

My goal was to run an OCR model locally and extract text from scanned PDFs.

Many models could not even be run. Those that did run, thanks to Ollama, provided a very poor experience: llava-llama3, phi3.5 vision, etc.

What worked really well, but is still not up to the mark: Surya [0].

It works perfectly on screenshots from true text PDFs, but not from scanned PDFs. Also has much better performance for English than Indian languages.

[0]: https://github.com/VikParuchuri/surya

ritvikpandey21OP1y ago

yup, the models you tried out require a lot of work to be able to run efficiently. additionally for actual decent ocr it’ll require very high quality document datasets (of PDF/excel/pptx). scanned documents are especially hard and cause a lot of issues for LLMs, which start making up info a lot of the time.

practice91y ago· 1 in thread

I tried the square example from the paper mentioned with o1-pro and it had no problem counting 4 nested squares…

And the 5 square variation as well.

So perhaps it is just a question of how much compute you are willing to throw at it

ritvikpandey21OP1y ago

yea, seems like o1-pro was able to solve a few of those variations in the paper we referenced. take a look at the rest of the examples and let me know! the paper’s (somewhat) old at this point, but if you try out more complex square variations with overlapping edges and varying line thicknesses, the same issues arise. although I generally agree a mind-boggling amount of compute will increase accuracy for sure.

m3kw91y ago· 1 in thread

You don’t really feed images to LLMs, rather to a vision model within the multi modal llm

ritvikpandey21OP1y ago

yup, important clarification! the language portion of the model also works with the extraction however, and is prone to the hallucinations

martingoodson1y ago· 1 in thread

I've worked in data extraction from documents for a decade and have developed algorithms in the space. I've developed a product using LLMs for this purpose too.

This article is essentially correct.

sidmanchkanti211y ago

thanks, glad to hear it.

coder5431y ago

I'm somewhat surprised neither this article nor the previous one mention anything about the Florence-2 model series. I had thought that Florence-2 was not just surprisingly capable for this kind of work, but also easily fine-tunable for a particular kind of document, when you expect to process a lot of instances of that document and want to further optimize accuracy. It's extremely small (0.23B and 0.77B parameters), so it's easy to run, easy to fine-tune, and probably unlikely to overthink things.

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-...

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.

mehulashah1y ago

(CEO of Aryn here: https://aryn.ai)

Nice post and response to the previous one.

It’s important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR. They’re not mutually exclusive. VLMs are adaptable with prompting, e.g., "please pull out the entries related to CapEx and summarize the contributions." Layout parsers and OCR are often used for indexing and document automation. Each will have its own place in an enterprise stack.

snthd1y ago

>Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.

Except for a very special kind of bug:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

>Xerox scanners/photocopiers randomly alter numbers in scanned documents

markisus1y ago

I found this part questionable.

> Fixed patch sizes may split individual characters

> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

The author suggests that the standard ViT architecture is poorly suited for OCR because patches do not respect character boundaries and that the positional embeddings only embed the locations of patches, which are 16x16 pixels.

My mental model is that a token is a memory slot where computation results can be stored or retrieved from. There is no reason why we should want the layout of these memory slots to mimic the layout of the document, except at the very first layer, because then we don't have to think too hard about how to encode the document.
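For reference, the article's concern that fixed patches may split characters is easy to see with a little index arithmetic. The sketch below assumes 16x16 non-overlapping patches as mentioned above; the bounding boxes and the helper name `patches_touched` are made up for illustration.

```python
def patches_touched(box, patch=16):
    """Return (row, col) indices of the patch-grid cells a glyph's box overlaps.

    box = (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    cols = range(x0 // patch, (x1 - 1) // patch + 1)
    rows = range(y0 // patch, (y1 - 1) // patch + 1)
    return [(r, c) for r in rows for c in cols]


print(patches_touched((2, 2, 12, 14)))   # [(0, 0)]
print(patches_touched((10, 4, 22, 20)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The second glyph is spread over four patches at the input layer. Whether that matters is the point in dispute here: attention in later layers is free to recombine information from those four tokens, so a split at the first layer does not by itself imply the character is lost.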

bambax1y ago

I'm making a simple service that outputs layout-following ASCII from images, PDFs of images or text PDFs. I too think the risk of hallucination is in many cases too great.

I fed my system the first image in the post [0] and got the text below in return.

I will be looking for beta testers next week... Email if interested!

  VEH  YR  MAKE  MODEL        IDENTIFICATION   TYPE SYM  ST  TER USE  CLASS ALARM
     2 02  HOND  CIVIC EX     1HGEM22952L086006 PP    18  IL  37  L  887120
     LOSS PAYEE THAT APPLIES:     2
     3.02  HYUN  SONATA / GL  KMHWF25S72A671544 PP   16   IL  37  P  887120

  H    NO. COVERAGE DESCRIPTION   LIABILITY LIMIT (S) DEDUCTIBLE          PREMIUM
      2   Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor.-BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 141.00
           Other than Collision                            $ 250           $ 92.00
                                                     TOTAL FOR UNIT   2   $ 443.00
     3-  Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor. BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 136.00
           Other than Collision                            $ 250           $ 90.00
                                                     TOTAL FOR UNIT   3   $ 436.00
  
  DRIVER  INFORMATION
  DR VEH  SEX MAR   BIRTH  G / S PRIN     DVR LIC NO.    NAME                  PTS

[0] https://i.imgur.com/sLWQoFG.jpeg
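Whatever tool produces a transcription like the one above, a cheap guard against misread or hallucinated figures is an arithmetic cross-check: the line-item premiums for each unit should add up to the stated total. A minimal sketch, assuming the layout shown above (a trailing `$ NN.NN` amount per line and a `TOTAL FOR UNIT n` line); the regexes and the function name `check_unit_totals` are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import re
from decimal import Decimal

AMOUNT = re.compile(r"\$\s*([\d,]+\.\d{2})\s*$")  # trailing amount such as "$ 92.00"
TOTAL = re.compile(r"TOTAL FOR UNIT\s+(\d+)")


def check_unit_totals(text):
    """Return {unit: (sum_of_line_items, stated_total)} for each TOTAL line."""
    results, running = {}, Decimal("0")
    for line in text.splitlines():
        amount = AMOUNT.search(line)
        if not amount:
            continue  # headers, "INCL" rows, and other lines without a premium
        value = Decimal(amount.group(1).replace(",", ""))
        total = TOTAL.search(line)
        if total:
            results[total.group(1)] = (running, value)
            running = Decimal("0")  # start accumulating the next unit
        else:
            running += value
    return results


SAMPLE = """\
Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
Property Damage        $ 100,000                                $ 43.00
Medical Payments       $ 5,000                                  $ 13.00
Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
Undinsured Motor.-BI   $ 250,000 / $ 500,000                      INCL
Collision                                       $ 500          $ 141.00
Other than Collision                            $ 250           $ 92.00
                                          TOTAL FOR UNIT   2   $ 443.00
"""

for unit, (computed, stated) in check_unit_totals(SAMPLE).items():
    status = "ok" if computed == stated else "MISMATCH"
    print(f"unit {unit}: items sum to {computed}, stated {stated} -> {status}")
```

A check like this cannot prove a transcription is right, but a mismatch is a strong signal that some digit was misread, which is the kind of silent error the thread is worried about.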

bryzaguy1y ago

Wasn’t seeing what OCR stands for, I believe it’s Optical Character Recognition.

pilooch1y ago

It's good and useful to see empirical analyses like this. I use open & custom VLMs a lot. The point of VLMs is that OCR is not needed anymore: it's intrinsic to the model. For instance, at work we've developed a family of vision-based RAG systems, and its performance is twice that of a text-based one. The point I'd like to make here is that OCR is an intermediate step that is not explicitly needed anymore, in many cases. My hunch is that pure OCR will go away.

nicodjimenez1y ago

Check out mathpix.com: we have a hybrid approach to OCR that features accurate layout understanding (with accurate bounding boxes) plus accurate OCR outputs.

Disclaimer: I'm the founder and CEO.

uri_merhav1y ago

There's lots of hidden gotchas to this. Uploading a screenshot and asking an LLM to transcribe one page is generally ok. Give it a table that spans pages, or a 60 page doc, and you're in dire straits.

I cofounded DocuPanda to handle this issue specifically. Call me biased, but I do believe it's the best solution out there.

Zufriedenheit1y ago

Is there an OCR arena out there, similar to lmarena? Would be very useful but couldn't find one yet.

WhitneyLand1y ago

>>When an LLM processes a document image, it first embeds it into a high-dimensional vector space through the attention mechanism…

This is a confusing way to describe attention and gets a bit off topic, the attention mechanism is not really what’s causing any of the issues in the article.

gieksosz1y ago

I just tried the rectangle test on 4o and it answered correctly.

jmartin26831y ago

We use Claude 3.5 sonnet to OCR and structure tabular data from PDFs and it’s virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).

jrochkind11y ago

LLMs seem to be really good at audio speech to text though. One would naively think these are similar problems, but apparently not?

mycall1y ago

Ripcord demo'd their stack to me yesterday and the use of LLMs works great for OCR, so it is indeed possible.

akkad331y ago

I use ChatGPT to convert tables in PNGs and PDFs to pandas data frames, and it works very well.

iwangulenko1y ago

Resume parsing has been a problem for decades,

and even today it can never be done right

because SOME resumes are just so f** up.

callamdelaney1y ago

To be fair, they would say that due to the fact they are selling a competing thing.

salimmahboubi1y ago

To me, the question is why we keep using PDFs that never get printed?

jebarker1y ago

s/LLMs/VLMs/g

rhavaei1y ago

very nice blogpost.

ritvikpandey21OP1y ago


michaelbuckbee1y ago· 19 in thread

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

llm_trw1y ago

osigurdson1y ago

It does seem that companies are able to get reliability in narrow problem domains via prompts, evals and fine tuning.

2 more replies

ritvikpandey21OP1y ago

dyauspitr1y ago

You can literally ask it to mark characters that are not clear instead of inferring them.

paulsutter1y ago

Unit tests work remarkably well

1 more reply

gcanyon1y ago

> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.

So are human beings. Meaning we've been working around this issue since forever, we're not suddenly caught up in a new thing here.

3 more replies

Terr_1y ago

I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]

TeMPOraL1y ago

> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.

Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.

2 more replies

williamcotton1y ago

If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.

You then conflate the usefulness of some guessing skill with logic and statistics.

kaonwarb1y ago

afro881y ago

The correct way to handle it is to ask the user if it's not clear, like a real assistant would

nodamage1y ago

ritvikpandey21OP1y ago

llm_trw1y ago

It's not possible with current gen models.

To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.

See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414

amelius1y ago

Maybe ask it to return the bounding box of every glyph.

1 more reply

jmartin26831y ago

My experiences have been the same… that is to say nothing like what is reported here. This is more pitch than info.

Odd timing, too given flash 2.0 release and its performance on this problem.

Buttons8401y ago

Failed as a machine, succeeded as an intelligence. Intelligences don't make good machines though.

davidhs1y ago

I recently took a picture of the ingredients list on my multivitamin and fed it into ChatGPT o1-pro (at least that was the config.) and it made up ingredients and messed up the quantities.

daveguy1y ago

Sounds like a xerox.

jll291y ago· 9 in thread

zeograd1y ago

I tried https://github.com/PaddlePaddle/PaddleOCR for my own use case (scanline images of parcel labels) and it beat Tesseract by an order of magnitude.

(Tesseract managed to get 3 fields out of a damaged label, while PaddleOCR found 35, some of them barely readable even for a human taking time to decipher them)

moffkalast1y ago

ianhawes1y ago

Surya is on par with cloud vision offerings.

ritvikpandey21OP1y ago

sumedh1y ago

Are you targeting business or consumers?

I cannot find the pricing page.

1 more reply

patcon1y ago

Pls contact archive.org about adopting this digital archive once it exists (they also have a bad habit of accepting physical donations, if you are nearby)

ahoka1y ago

I’m very far from an expert, but had good luck with EasyOCR when fiddling with such things.

pbhjpbhj1y ago

If it's a large enough corpus I imagine it's worth fine tuning to the specific fonts/language used?

mdbmdb1y ago

I would love to get access to that archive!

jeswin1y ago· 6 in thread

> We have hundreds of examples like this queued up, so let us know if you want some more!

Link to it then, let people verify.

I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. Gpt-4o is far less accurate.

[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/...

sgc1y ago

rahimnathwani1y ago

  99.8%+ on first pass

Even with the best OCR, and high resolution scans, you might not get this due to:

- the quality of the original paper documents, and

- the language

I have non-English documents for which I'd love to have 99% accuracy!
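Accuracy figures like the ones traded in this subthread depend heavily on how they are measured. One common choice is character-level accuracy derived from edit distance (equivalently, one minus the character error rate). The sketch below is a plain Levenshtein implementation; the function names and example strings are illustrative, and nothing here says which metric any commenter actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """1 - CER, clamped at 0; truth is the ground-truth transcription."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - edit_distance(truth, ocr) / len(truth))


print(edit_distance("modern", "modem"))  # 2
print(round(char_accuracy("modern", "modem"), 3))
```

Note that 99% character accuracy still means roughly one wrong character per hundred, which on a dense financial table can be several wrong numbers per page.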

1 more reply

davedx1y ago

Ha hi Jeswin! I was itching to reply to this post too, I wonder why…

jeswin1y ago

Dave! Our sample sizes were large enough, and tables complex enough to opine on this.

bambax1y ago

  PI tno Name             Time            3.5 km   18 C   (cont.)
  MEN B (39)                                                  3(34)         4(52)         5(53)         6(54)         7(55)         
  8(40)         9(57)
                                                               12(60)        13(61)        14(62)        15(63)        16(47)        
  17(48)       18(100)
                                                  1(51)         2(33)
                                                  10(58)        11(59)

The first column is offset vertically which mixes up information and is wrong.

I'm building a traditional OCR pipeline (for which I'm looking for beta testers! ;-) and this is what it outputs:

  PI      tno Name                       Time
  
  MEN   B (39)                                                         3.5 km       18 C          (cont.)
                                                   1 (51)                  2 (33)                 3 (34)                  4 (52)                  5   (53)                  6 (54)                  7 (55)                  8 (40)                 9 (57)
                                                  10 (58)                 11 (59)                12 (60)                 13 (61)                 14 (62)                 15 (63)                16 (47)                 17 (48)                18 (100)
                                                  Finish
  
  13     425  Peter  Hodkinson          11:40       0:48   +0: 06 (21)      1:29  +0: 13 (28)      1:58   +0: 13 (24)      2:44   +0: 18 (23)      3:38   +0: 20 (19)     4:28    +0: 22 (18)     5:05   +0: 23 (17)      5:36   +0: 26 (17)      6:19   +0: 29 (19)
              Great  Britain                        0:48   +0: 06 (21)      0:41  +0: 09 (30)      0:29   +0: 01 (4)       0:46   +0: 07 (22)      0:54   +0: 02 (5)      0:50    +0: 03 (7)      0:37   +0: 02 (10)      0:31   +0: 03 (11)      0:43   +0: 05 (20)
                                                    6:47   +0: 28 (17)     7:02   +0: 29 (17)      8:21   +0: 38 (16)      8:41   +0: 39 (16)      9:00   +0: 41 (16)     9:13    +0: 42 (16)     9:43   +0: 42 (16)     10:36   +0: 43 (14)     11:32   +0: 41 (13)
                                                    0:28   +0: 02 (8)      0:15   +0: 01 (4)       1:19   +0: 11 (16)      0:20   +0: 03 (15)      0:19   +0: 02 (4)      0:13   +0: 02 (11)      0:30   +0: 01 (2)       0:53   +0: 01 (3)       0:56    0:00  (1)
                                                   11:40   +0: 40 (13)
                                                    0:08   +0: 00 (8)

(edit: line wrap messes it all up... still I think my version is better ;-)

jeswin1y ago

I usually say something like: ".. output it as hierarchical json". For better accuracy, we can run the output through another model.

Again, that image is fuzzy. If the argument is that these generic models don't work well with scans or handwritten content, I can perhaps agree with that. But that's a much smaller subset of PDFs.

password43211y ago· 4 in thread

As opposed to the discussion 2 days ago with 400+ comments:

Ingesting PDFs and why Gemini 2.0 changes everything

https://news.ycombinator.com/item?id=42952605

h0l0cube1y ago

FTA:

password43211y ago

Yes and per the poster's opening comment:

https://news.ycombinator.com/item?id=42966958#42966959

1 more reply

_ea1k1y ago

That's what I thought too, but apparently the title is pure, absolute, rage-inducing clickbait.

The actual conclusion is that they make classes of errors that traditional OCR programs either don't make, or make in different ways.

dang1y ago

I assume you mean the title of the current thread? I've attempted to make it less baity now.

1 more reply

osigurdson1y ago· 4 in thread

ChatGPT is also still hilariously bad at drawing diagrams - universally producing a silly cartoon with misspelled words. The rate of improvement over the past two years is effectively zero.

catlifeonmars1y ago

Why would you use ChatGPT to draw diagrams? It’s a generative language model. Just because you can doesn’t mean it’s the best tool for the job.

osigurdson1y ago

Why not include some suggested tools in your comment? "you are an idiot for using ChatGPT" (paraphrased) isn't very helpful.

Logge1y ago

That's DALL-E 3, which is not an LLM

osigurdson1y ago

Good point. I probably knew that at one time but now leverage it via chatgpt so forgot. Does anyone know if there is an AI wall with text to image?

1 more reply

codingwagie1y ago· 4 in thread

Really? I have been using 4o, and it's flawless at OCR

ritvikpandey21OP1y ago

give it a shot with a few of the examples in the blog! or better yet, find some financial statements from Goldman/Morgan Stanley and run them through the model.

sumedh1y ago

Check the output again, there will be small mistakes if your text is large enough.

phatfish1y ago

I used it once, was given a screenshot that contained a SHA1 hash and needed it in text. Maybe this is a case where ChatGPT can do a small task quickly for me and save me squinting?

https://postimg.cc/m1jNPL0j

8n4vidtmkvmk1y ago

https://i.imgur.com/UuO3JxM.png

thorum1y ago· 3 in thread

singularity20011y ago

I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning

barrenko1y ago

not an expert - yes, what would usually just be called training, with LLMs here is called RL. You do end up writing a sort of a reward function, so I guess it is RL.

hodapp1y ago

You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.

kyriakos1y ago· 3 in thread

I find that LLMs can read text off product label photos I can't even read myself.

AlphaAndOmega01y ago

If you don't know what the text says, do you have access to some other form of ground truth? Because otherwise you don't know if they're reading illegible labels correctly!

kyriakos1y ago

I know what the text says because I have the actual product available :) but you are right: if the LLM can't read it, it will probably fill in the gap with hallucinations

ritvikpandey21OP1y ago

julienchastang1y ago· 3 in thread

ritvikpandey21OP1y ago

pbhjpbhj1y ago

Do any models use some sort of context pruning to keep the [most] relevant parts of the context?

What single documents are you processing that are 1000+ pages?

mulmboy1y ago

Is processing one page at a time not feasible? I'm always chunking things as small as possible for LLMs

lazyeye1y ago· 3 in thread

Is this just a training issue? They just need to train a model specifically for OCR?

ritvikpandey21OP1y ago

sumedh1y ago

How does your solution compare to AWS textract?

anon3738391y ago

They probably do this already. But the problem is more fundamental: there are simply no process guarantees or guardrails inside a generative model to constrain the failure modes.

apt-get1y ago· 2 in thread

Is there any small model that would do this effectively, with pure text extraction (without going for any kind of formatting or whatnot)?

parsakhaz1y ago

Yup, Moondream is great for this use case! You can use locally with the quickstart: https://docs.moondream.ai/

It is a 2b vision model that runs anywhere and can object detect, point, query, and more.

sramam1y ago

Have you looked at https://moondream.ai/?

llm_trw1y ago· 2 in thread

This is a response to: https://news.ycombinator.com/item?id=42952605

A fun thread to read for the current hype cycle.

You can tell who is working in the field by the fact they don't use VLMs for OCR and who isn't because they think it's a solved problem.

A question to the authors.

Do you have resources to train any VLMs from scratch? They aren't quite the beasts the SOTA LLMs are, and I think they can be made a lot more useful with:

1). Better training data.

2). Larger vision parts of the model.

In short: 2d attention is not something that anyone's doing at scale - that I know of - and is a no brainer for understanding images.

ritvikpandey21OP1y ago

llm_trw1y ago

You can generate the data synthetically.

It depends on what you're doing really. I thought that we'd done pretty well, then someone on HN reached out with a table that spanned 50 pages and I just gave up.

Feel free to drop an email if you'd like a quick chat. I find the state of table models particularly abysmal for how important they are.

edanm1y ago· 2 in thread

I'd just like to say this is a fantastic "marketing" blog post. Great explanation of an interesting problem, that this company theoretically helps solve. Very well done!

ritvikpandey21OP1y ago

thanks for the kind words! we do have a mailing list for current users, I can add you to that

edanm1y ago

Thank you! Saw you contacted me, that's great. I'm planning to study your product more during the week to see if it's a fit for something we're building. :)

levocardia1y ago· 2 in thread

The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.

ritvikpandey21OP1y ago

fully agree on the last point, the vit architecture will need some working on for this — microsoft’s been doing some excellent research on this lately

croes1y ago

If you have raw text you don’t need OCR.

fpgaminer1y ago· 2 in thread

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

> Fixed patch sizes may split individual characters

This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.

Oh and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.
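The two options mentioned here, turning off sampling and exposing logits as confidence scores, can be sketched with a toy decoder. Everything below is an assumption for illustration: the three-entry vocabulary, the made-up logits, and the 0.6 threshold. Hosted APIs differ in whether and how they expose per-token probabilities.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_with_confidence(step_logits, vocab):
    """Greedy decoding: pick the argmax token at every step (no sampling).

    Returns (token, probability) pairs; the probability doubles as a
    per-token confidence score.
    """
    out = []
    for logits in step_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        out.append((vocab[best], probs[best]))
    return out


VOCAB = ["m", "rn", "n"]           # hypothetical candidates for an ambiguous glyph
STEP_LOGITS = [[4.0, 0.5, 0.1],    # clear case
               [1.1, 1.0, 0.2]]    # "m" vs "rn" is nearly a coin flip

decoded = greedy_with_confidence(STEP_LOGITS, VOCAB)
needs_review = [i for i, (_, p) in enumerate(decoded) if p < 0.6]
print(decoded)
print("positions to send for human review:", needs_review)
```

Greedy decoding returns the same tokens on every run for the same logits, and the second position is flagged because its winning probability is low. That gives a human-in-the-loop hook, though a confident probability is still no guarantee that the transcription is correct.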

---

myth_drannon1y ago

Which VLM models you found that are superior? A finetuned Trocr is very good based on my experience

martingoodson1y ago

Written by someone who knows what they are talking about.

2-3-7-43-18071y ago· 2 in thread

i dont understand. what have llms to do with ocr?

esafak1y ago

Some like gpt-4o are multi-modal.

2-3-7-43-18071y ago

1 more reply

8338550bff961y ago· 2 in thread

February 6, 2024... okay grandpa

jackliuhahaha1y ago

In the article, it references a paper from July 2024, weird...

sidmanchkanti211y ago

fixed the year, good catch

faebi1y ago· 1 in thread

Shouldn't it be easy to generate a lot of OCR data? Generate HTML, randomize, generate image, apply noise and let it train on it.

kevincox1y ago

wkat42421y ago· 1 in thread

ritvikpandey21OP1y ago

thanks for the feedback!

__rito__1y ago· 1 in thread

I was just trying a bunch of models for OCR. I only have 4 GB of VRAM in my personal machine.

My goal was to run an OCR model locally and extract text from scanned PDFs.

Many models could not even be run. Those that did run, thanks to Ollama, gave a very poor experience, such as llava-llama3 and phi3.5 vision.

What worked really well, though still not up to the mark, was Surya [0].

It works perfectly on screenshots from true text PDFs, but not on scanned PDFs. It also performs much better on English than on Indian languages.

[0]: https://github.com/VikParuchuri/surya

practice91y ago· 1 in thread

I tried the square example from the paper mentioned with o1-pro and it had no problem counting 4 nested squares…

And the 5 square variation as well.

So perhaps it is just a question of how much compute you are willing to throw at it

m3kw91y ago· 1 in thread

You don’t really feed images to LLMs, but rather to a vision model within the multimodal LLM.

ritvikpandey21OP1y ago

yup, important clarification! the language portion of the model also takes part in the extraction, however, and is prone to hallucinations

martingoodson1y ago· 1 in thread

I've worked in data extraction from documents for a decade and have developed algorithms in the space. I've developed a product using LLMs for this purpose too.

This article is essentially correct.

sidmanchkanti211y ago

thanks, glad to hear it.

coder5431y ago

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-...

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

mehulashah1y ago

(CEO of Aryn here: https://aryn.ai)

Nice post and response to the previous one.

snthd1y ago

>Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.

Except for a very special kind of bug:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

>Xerox scanners/photocopiers randomly alter numbers in scanned documents

markisus1y ago

I found this part questionable.

> Fixed patch sizes may split individual characters

> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

bambax1y ago

I'm making a simple service that outputs layout-following ASCII from images, PDFs of images or text PDFs. I too think the risk of hallucination is in many cases too great.

I fed my system the first image in the post [0] and got the text below in return.

I will be looking for beta testers next week... Email if interested!

  VEH  YR  MAKE  MODEL        IDENTIFICATION   TYPE SYM  ST  TER USE  CLASS ALARM
     2 02  HOND  CIVIC EX     1HGEM22952L086006 PP    18  IL  37  L  887120
     LOSS PAYEE THAT APPLIES:     2
     3.02  HYUN  SONATA / GL  KMHWF25S72A671544 PP   16   IL  37  P  887120

  H    NO. COVERAGE DESCRIPTION   LIABILITY LIMIT (S) DEDUCTIBLE          PREMIUM
      2   Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor.-BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 141.00
           Other than Collision                            $ 250           $ 92.00
                                                     TOTAL FOR UNIT   2   $ 443.00
     3-  Preferred Extra Auto
           Bodily Injury          $ 250,000 / $ 500,000                    $ 92.00
           Property Damage        $ 100,000                                $ 43.00
           Medical Payments       $ 5,000                                  $ 13.00
           Uninsured Motorist     $ 250,000 / $ 500,000                    $ 62.00
           Undinsured Motor. BI   $ 250,000 / $ 500,000                      INCL
           Collision                                       $ 500          $ 136.00
           Other than Collision                            $ 250           $ 90.00
                                                     TOTAL FOR UNIT   3   $ 436.00
  
  DRIVER  INFORMATION
  DR VEH  SEX MAR   BIRTH  G / S PRIN     DVR LIC NO.    NAME                  PTS

[0] https://i.imgur.com/sLWQoFG.jpeg
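
One cheap sanity check on extracted output like the table above is to confirm that the line items add up to the stated total for each unit. The sketch below uses the premium figures from that output. Treating `INCL` as contributing nothing is an assumption, and the function names are illustrative.

```python
from decimal import Decimal

def parse_amount(text):
    # "$ 1,234.00" -> Decimal("1234.00")
    return Decimal(text.replace("$", "").replace(",", "").strip())

def check_total(line_items, stated_total):
    # Assumption: "INCL" means the premium is included elsewhere (counts as 0).
    computed = sum(
        (parse_amount(x) for x in line_items if x.strip() != "INCL"),
        Decimal("0"),
    )
    return computed == parse_amount(stated_total), computed

# Premium column for unit 2 and unit 3, copied from the output above.
unit2 = ["$ 92.00", "$ 43.00", "$ 13.00", "$ 62.00", "INCL", "$ 141.00", "$ 92.00"]
unit3 = ["$ 92.00", "$ 43.00", "$ 13.00", "$ 62.00", "INCL", "$ 136.00", "$ 90.00"]

unit2_ok, unit2_sum = check_total(unit2, "$ 443.00")
unit3_ok, unit3_sum = check_total(unit3, "$ 436.00")
print(unit2_ok, unit2_sum)
print(unit3_ok, unit3_sum)
```

A check like this only catches errors that break the arithmetic, but it costs almost nothing to run.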

bryzaguy1y ago

I didn’t see what OCR stands for; I believe it’s Optical Character Recognition.

nicodjimenez1y ago

Check out mathpix.com. We have a hybrid approach to OCR that combines accurate layout understanding (with accurate bounding boxes) with accurate OCR output.

Disclaimer: I'm the founder and CEO.

uri_merhav1y ago

I cofounded DocuPanda to handle this issue specifically. Call me biased, but I do believe it's the best solution out there.

Zufriedenheit1y ago

Is there an OCR arena out there, similar to lmarena? Would be very useful but couldn't find one yet.

WhitneyLand1y ago

>>When an LLM processes a document image, it first embeds it into a high-dimensional vector space through the attention mechanism…

This is a confusing way to describe attention and gets a bit off topic. The attention mechanism is not really what’s causing any of the issues in the article.

gieksosz1y ago

I just tried the rectangle test on 4o and it answered correctly.

jmartin26831y ago

We use Claude 3.5 Sonnet to OCR and structure tabular data from PDFs and it’s virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).

jrochkind11y ago

LLMs seem to be really good at audio speech to text though. One would naively think these are similar problems, but apparently not?

mycall1y ago

Ripcord demo'd their stack to me yesterday and the use of LLMs works great for OCR, so it is indeed possible.

akkad331y ago

I use ChatGPT to convert tables in PNGs and PDFs to pandas DataFrames, and it works very well.

iwangulenko1y ago

Resume parsing has been a problem for decades,

and even today it can never be done right

because SOME resumes are just so f** up.

callamdelaney1y ago

To be fair, they would say that, since they are selling a competing product.

salimmahboubi1y ago

To me, the question is why we keep using PDFs that never get printed?

jebarker1y ago

s/LLMs/VLMs/g

rhavaei1y ago

very nice blogpost.
