Replace OCR with Vision Language Models

(github.com)

292 points · EarlyOom · 1y ago · 125 comments

rafram1y ago· 25 in thread

It’s an interesting idea, but still way too unreliable to use in production IMO. When a traditional OCR model can’t read the text, it’ll output gibberish with low confidence; when a VLM can’t read the text, it’ll output something confidently made up, and it has no way to report confidence. (You can ask it to, but the number will itself be made up.)

I tried using a VLM to recognize handwritten text in genealogical sources, and it made up names and dates that sort of fit the vibe of the document when it couldn’t read the text! They sounded right for the ethnicity and time period but were entirely fake. There’s no way to ground the model using the source text when the model is your OCR.

themanmaran1y ago

Thing is, the majority of OCR errors aren't character issues, but layout issues. Things like complex tables with cells being returned under the wrong header. And numbers in an income statement being one column off creates a pretty big risk.

Confidence intervals are a red herring. And only as good as the code interpreting them. If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?

If so you'd be passing every single document to a human review, and might as well not run the OCR. But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

tensor1y ago

Having experience in this area, audit, legal, confidence intervals are essential. No, you don't end up "passing every single document" to human review. That's made up nonsense. But confidence intervals can pretty easily flag poorly OCR'd documents, and then yes they are done by human review.

If you try to pitch hallucinations to these fields, they'll just choose 100% manual instead. It's a non-starter.

anon3738391y ago

> But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

That's not true. LLMs and OCR have very different failure modes. With LLMs, there is unbounded potential for hallucination, and the entire document is at risk. For example: if something in the lower right-hand corner of the page takes the model to a sparsely sampled part of the latent space, it can end up deciding that it makes sense to rewrite the document title! Or anything else. LLMs also have a pernicious habit of "helpfully" completing partial sentences that appear at the beginning or end of a page of text.

With OCR, errors are localized and have a greater chance of being detected when read.

I think for a lot of cases, the best solution is to fine-tune a model like LayoutLM, which can classify the actual text tokens in a document (whether obtained from OCR or a native text layer) using visual and spatial information. Then, there are no hallucinations and you can use uncertainty information from both the OCR (if used) and the text classification. But it does mean that you have to do the work of annotating data and training a model, rather than prompt engineering...

bayindirh1y ago

The problem is, regardless of the confidence number, you can scan and mark a document for grammatical errors.

In VLM/LLM powered methods, the missing/misread data will be hallucinated and you can't know whether something scanned correctly or not. I personally scan and OCR tons of personal documents, and I prefer "gibberish" to "hallucinations", because gibberish is easier to catch.

We had this problem before [0], on some Xerox scanners and copiers. Results will be disastrous. It's not a question of if, but when.

I personally tried Gemini and OpenAI's models for OCR, and no, I won't continue using them further.

[0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...

rafram1y ago

Then use an LLM to extract layout information. Don’t trust it to read the text.

> If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?

No, of course not. You have a human review the words/segments with low confidence.
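A minimal sketch of that triage, assuming the OCR engine hands back a confidence per word on a 0-1 scale. The `Word` type, the 0.90 threshold and the sample values are illustrative, not taken from any particular engine:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed to be on a 0.0-1.0 scale


def words_needing_review(words, threshold=0.90):
    """Return (index, word) pairs whose confidence falls below the threshold."""
    return [(i, w) for i, w in enumerate(words) if w.confidence < threshold]


# Illustrative data only: one word on the page is doubtful.
page = [Word("Total", 0.97), Word("income", 0.95), Word("1O0", 0.71), Word("USD", 0.93)]
flagged = words_needing_review(page)
print(flagged)
```

Only the flagged segments go to a person; the rest of the page passes through untouched. The right threshold would have to be tuned per engine and per language.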

sudoshred1y ago

That’s assuming that confidence intervals are even independently comparable. Anecdotally major OCR services with specific languages have average confidence intervals that are wildly divergent from similar services with different languages for the same relative quality of result. Acting as if confidence interval is in any way absolute or otherwise able to reliably and consistently indicate the relative quality of results is a mischaracterization at best. In the worst case CI is as good as an RNG. The value of the CI is in the ability to tune usage of the results based on observations of the users and characteristics of the request, sometimes it is meaningful but not always. In this case “good” code essentially hardcodes handling for all the idiosyncrasies of the common usage and the OCR service.

constantinum1y ago

The primary issue with LLMs is hallucination, which can lead to incorrect data and flawed business decisions.

For example, Llamaparse(https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...) uses LLMs for PDF text extraction but faces hallucination problems. See this issue for more details: https://github.com/run-llama/llama_parse/issues/420.

For those interested, try LLMWhisperer(https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs, eliminates hallucination issues, and preserves the input document layout for better context.

Examples of extracting complex layout:

https://imgur.com/a/YQMkLpA

https://imgur.com/a/NlZOrtX

https://imgur.com/a/htIm6cf

Hackbraten1y ago

> try LLMWhisperer(https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs

The website you linked says it uses LLMs?

ungerik1y ago

Those images look exactly like what you get from every OCR tool out there if you use the XY information.

EarlyOomOP1y ago

This is the main focus of VLM Run and typed extraction more generally. If you provide proper type constraints (e.g. with Pydantic) you can dramatically reduce the surface area for hallucination. Then there's actually fine-tuning on your dataset (we're working on this) to push accuracy beyond what you get from an unspecialized frontier model.

rafram1y ago

Re type constraints: Not really. If one of the fields in my JSON is `name` but the model can’t read the name on the page, it will very happily make one up. Type constraints are good for making sure that your data is parseable, but they don’t do anything to fix the undetectable inaccuracy problem.

Fine-tuning does help, though.

hashta1y ago

An effective way that usually increases accuracy is to use an ensemble of capable models that are trained independently (e.g., gemini, gpt-4o, qwen). If >x% of them have the same output, accept it, otherwise reject and manually review
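The voting rule could be sketched roughly like this, assuming each model's output has already been normalized to a comparable string. The function name and the 0.66 default are hypothetical choices for illustration:

```python
from collections import Counter


def ensemble_vote(outputs, min_agreement=0.66):
    """Accept the most common output if enough models agree, else return None.

    `outputs` holds one normalized string per model. A return value of None
    means "reject and send to manual review".
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count / len(outputs) > min_agreement else None
```

With three models, two matching answers clear the default threshold. Agreement is evidence, not proof: models trained on similar data can share the same mistake.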

refulgentis1y ago

That's not OCR.

It is an absolute miracle.

It is transmutating a picture into JSON.

I never thought this would be possible in my lifetime.

But that is different from what your interlocutor is discussing.

KoolKat231y ago

I've been using gemini 2 flash to extract financial data, within my sample which is perhaps small (probably 1000 entries so far), I've had one single error only so like a 99.9% success rate.

(There's slightly more errors if I ask it to add numbers but this isn't OCR and a bit more of a reach, although it is very good at this too regardless).

Many hallucinations can be avoided by telling it to use null if there is no number present.
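On the consuming side, the null convention might be handled along these lines. This is a sketch only: the field names in `REQUIRED_FIELDS` and the sample JSON are made up for illustration, and it assumes the prompt told the model to emit null when a value is absent:

```python
import json

REQUIRED_FIELDS = ("invoice_total", "tax", "due_date")  # hypothetical schema


def split_extraction(raw_json):
    """Separate fields the model filled in from fields it left null or omitted."""
    data = json.loads(raw_json)
    found = {k: data[k] for k in REQUIRED_FIELDS if data.get(k) is not None}
    missing = [k for k in REQUIRED_FIELDS if data.get(k) is None]
    return found, missing


# Illustrative model output: "tax" is null and "due_date" is absent.
found, missing = split_extraction('{"invoice_total": 1204.5, "tax": null}')
print(found, missing)
```

Fields in `missing` can be routed to manual entry instead of being silently filled.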

CarolineRommer1y ago

And by using two different systems (say Gem plus ChatGPT) you essentially reduce chances of hallucination to zero, no? You would need to be VERY unlucky to find two LLMs hallucinating the exact same response.

cratermoon1y ago

Agree wholeheartedly. Modern OCR is astonishingly good, and more importantly it's deterministically so. Its failure modes, when it's unable to read the text, are recognizably failures.

Results for VLM accuracy & precision are not good. https://arxiv.org/html/2406.04470v1#S4

VeejayRampay1y ago

which solutions would you classify as "modern OCR"

are we talking tesseract or something?

delichon1y ago

How about calculating confidence in terms of which output regions are stable across the same input on multiple tries. Expensive, but the hallucinations should have more variable output and be fuzzier than higher confidence regions in averages.
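One rough way to try this with plain-text transcripts, assuming you already have the output of several runs on the same image. The `word_stability` helper is hypothetical, and the alignment uses Python's standard `difflib` rather than anything model-specific:

```python
from difflib import SequenceMatcher


def word_stability(runs):
    """Score each word of the first run by how many other runs reproduce it.

    Returns a list of (word, score) pairs with score in [0, 1]. With a single
    run there is nothing to compare against, so every score is 0.
    """
    reference = runs[0].split()
    hits = [0] * len(reference)
    others = [r.split() for r in runs[1:]]
    for other in others:
        matcher = SequenceMatcher(a=reference, b=other, autojunk=False)
        # Every reference word inside a matching block was reproduced by this run.
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                hits[i] += 1
    total = max(len(others), 1)
    return [(w, h / total) for w, h in zip(reference, hits)]


# Illustrative transcripts: the runs disagree on the day of the month.
runs = [
    "born 12 March 1874 in Lviv",
    "born 12 March 1874 in Lviv",
    "born 17 March 1874 in Lviv",
]
scores = word_stability(runs)
print(scores)
```

Words with a low score are candidates for review. A stable word is not necessarily a correct one, since a model can repeat the same wrong reading every time.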

staticman21y ago

I think it would be pretty reliable in controlled circumstances. If I take a picture of a book with my cell phone- google Gemini pro is much better at recognizing the text than Samsung's built in OCR.

Grimblewald1y ago

I would think the same, but the cause for hesitation is that we only think this and cannot know it without thorough testing. Right now the scope of problems where things behave reliably and as expected, and the scope of problems where things get wacky, are unknown. The borders are known to some rather fuzzy extent at best, by people who work with these things as a full-time job. This means we are just blindly gambling on it. For important things, archiving, etc., where truth matters, I will continue using traditional OCR until we can better define the reliable use-case scope of LLM-based OCR. I am extremely enthusiastic about LLMs and the things they offer, but I am also a realist. LLMs are an infant technology, nowhere near the level of maturity that companies like OpenAI claim.

the84721y ago

Shouldn't confidence be available at the sampler level and also be conditional on the vision input, not just the next-token prediction?
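If the inference API exposes per-token log-probabilities (an assumption; not every hosted model does), a field-level score could be summarized roughly as below. The `span_confidence` helper is hypothetical, and the result reflects next-token probability, not calibrated reading accuracy:

```python
import math


def span_confidence(token_logprobs):
    """Summarize per-token log-probabilities for one extracted field.

    Returns (geometric-mean probability, probability of the weakest token).
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    min_prob = math.exp(min(token_logprobs))
    return mean_prob, min_prob
```

The weakest-token value is often the more useful of the two, because a single uncertain digit can hide inside an otherwise confident number.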

j_bum1y ago

This is naive, but can you ask the model to provide a confidence rating for sections of the document?

thatjoeoverthr1y ago

More broadly, it’s not trained to have any self awareness and this is a factor in other “hallucinations”. If you ask, for example, to describe the “marathon crater”, it doesn’t recognize that there’s no such thing in its corpus, but will instead start by writing an answer (“sure! The marathon crater is..”) and freestyle from there. Same if you ask it why it did something, or details about itself, etc. You should access one directly (not through an app like chatGPT) and build a careful suite of tests to learn more. Really fascinating.

UnlockedSecrets1y ago

You can ask, and it will be made up not grounded in reality

ttyprintk1y ago

It’s not naive; tesseract does this.

temp08261y ago· 5 in thread

I've been looking for a solution to translate a dictionary for me. It is a Shipibo-Conibo (indigenous Peruvian language) to Spanish dictionary. I'd like to translate the Spanish to English (and leave the Shipibo intact). Curious for any thoughts here. I have the dictionary as a PDF (already searchable, so I don't think it would need to be re-OCR'd... though that's possible too, it's not the clearest scan).

wrs1y ago

I wouldn’t be surprised to find that Claude/ChatGPT/etc. can just…do that. With the prompt you just gave.

The output could be in Markdown, which is easily turned into a PDF. You would have to break up the input PDF into pages to avoid running out of output window.

temp08261y ago

I didn't consider that would actually work and am giving it a try now...but by its own estimate it's going to take several days to finish (I'm not paying for plus or whatever).

rafram1y ago

I would!

zzleeper1y ago

By any chance, would it be possible to share the PDF? I haven't heard shipibo language in a long while, and am quite curious about it.

temp08261y ago

Here you go-

https://archive.org/details/shipibodiccionario

iLemming1y ago· 4 in thread

What's the fastest and most accurate CLI OCR tool? My use case is simple - I want to be able to grab a piece of screen (Flameshot is great for that), and OCR it. I need this for note-taking during pair-programming over Zoom.

Currently I'm using tesseract - it works, it's fast, but it also makes mistakes; it would be also great if it could discern tabular data and put them in ascii or markdown tables. I've tried docling, but it feels like a bit of an overkill. It seems to be slower - remember, I need to be able to grab the text from the screenshot very quickly. I have only tried default settings, maybe tweaking it would improve things.

Can anyone share some thoughts on this? Thanks!

acdha1y ago

Anything using the Apple Vision framework is fast and surprisingly accurate:

https://github.com/bytefer/macos-vision-ocr

cdolan1y ago

Cool to see, may use this locally for OCR in some cases. But I think the "handwriting" example is a little misleading. That's a font, not a scan of handwritten material.

wahnfrieden1y ago

This uses the old APIs that are less accurate than the new Swift-only LiveText ones

ANighRaisin1y ago

The AI OCR built into Snipping Tool in Windows is better than tesseract, albeit more inconvenient than something like PowerToys or Capture2Text, which use a quick shortcut.

LeoPanthera1y ago· 4 in thread

What's the characters-per-Wh of an LLM compared to traditional OCR?

fzysingularity1y ago

That's a tough one to answer right now, but to be perfectly honest, we're off by 2-3 orders of magnitude in terms of chars/W.

That said, VLMs are extremely powerful visual learners with LLM-like reasoning capabilities making them more versatile than OCR for practically all imaging domains.

In a matter of a few years, I think we'll essentially see models that are more cost-performant via distillation, quantization and the multitude of tricks you can do to reduce the inference overhead.

mlyle1y ago

A lot worse. But, higher quality OCR will reduce the amount of human post-processing needed, and, in turn will allow us to reduce the number of humans. Since humans are relatively expensive in energy use, this can be expected to save a lot of energy.

rafram1y ago

> Since humans are relatively expensive in energy use

Are they? I'm seeing figures around 80 watts at rest, and 150 when exercising. The brain itself only uses about 20 watts [1]. That's 1/35 of a single H100's power consumption (700 watts - which doesn't even take into account the energy required to cool the data center, the humans who build and maintain it, ...).

[1]: https://www.humanbrainproject.eu/en/follow-hbp/news/2023/09/...

ambicapter1y ago

People really only started talking about the cost of running things when LLMs came out. Most everything before that was too cheap to be a serious consideration.

orliesaurus1y ago· 3 in thread

I think OCR tools are good at what they say on the box, recognizing characters on a piece of paper etc. If I understand this right, the advantage of using a vision language model is the added logic that you can say things like: "Clearly this is a string, but does it look like a timestamp or something else?"

EarlyOomOP1y ago

VLMs are able to take context into account when filling in fields, following either a global or field specific prompt. This is great for e.g. unlabeled axes, checking a legend for units to be suffixed after a number, etc. Also, you catch lots of really simple errors with type hints (e.g. dates, addresses, country codes etc.).

raxxorraxor1y ago

This has always been part of the complete OCR package as far as I know. The raw result of an OCR constantly fails to differentiate 1 l I i | or other similar symbols/letters.

Maybe this necessary step can be improved and altered with a VLM. There is also the preprocessing where the image gets its perspective corrected. Not sure how well a VLM performs here.

As you said, I think combining these techniques will be the most efficient way forward.

vintermann1y ago

You can also use it for robustness. Looking at e.g. historical censuses, it's amazing how many ways people found to not follow the written instructions for filling them out. Often the information you want is still there, but woe to you if you look at the columns one by one and assume the information in them to be accurate and neatly within its bounding box.

BrannonKing1y ago· 3 in thread

What I want: take scan/photo of a document (including a full book), pass it to the language model, and then get out a Latex document that matches the original document exactly (minus the copier/camera glitches and angles). I feel like some kind of reinforcement learning model would be possible for this. It should be able to learn to generate Latex that reproduces the exact image, pixel for pixel (learning which pixels are just noise).

NoMoreNicksLeft1y ago

A big difficulty there is typeface detection, some of these were never digital fonts. But, even if it could detect them, you likely don't have those fonts on your computer to be able to put it back together as a digital typesetting for any but the most trivial fonts.

retrorangular1y ago

The tool could include all known open source fonts, and for the rest, maybe could have a model recreate missing fonts for non-patented fonts, as while font files (.ttf, .otf, .woff, etc.) are copyrighted, styles usually do not have design patents, so tracing and re-creating them is usually not an issue as far as I'm aware (not a lawyer.) [1]

Though if it accidentally "traces" one of the few exceptions, then you've potentially committed a crime, and the big difficulty in typeface detection you mention increases those odds. That said, there are so few exceptions that even if the model couldn't properly identify a font, it might be able to identify whether a font is likely to have a design patent.

I do think getting an AI to create a high quality vector font from a potentially low-res raster graphic is going to be quite challenging though. Raster to vector tools I've tried in the past left a bit to be desired.

1. https://www.copyright.gov/comp3/chap900/ch900-visual-art.pdf

> As a general rule, typeface, typefont, lettering, calligraphy, and typographic ornamentation are not registrable. 37 C.F.R. § 202.1(a), (e). These elements are mere variations of uncopyrightable letters or words, which in turn are the building blocks of expression. See id. The Office typically refuses claims based on individual alphabetic or numbering characters, sets or fonts of related characters, fanciful lettering and calligraphy, or other forms of typeface. This is true regardless of how novel and creative the shape and form of the typeface characters may be.

> There are some very limited cases where the Office may register some types of typeface, typefont, lettering, or calligraphy, such as the following:

> • Pictorial or graphic elements that are incorporated into uncopyrightable characters or used to represent an entire letter or number may be registrable. Examples include original pictorial art that forms the entire body or shape of the typeface characters, such as a representation of an oak tree, a rose, or a giraffe that is depicted in the shape of a particular letter.

> • Typeface ornamentation that is separable from the typeface characters is almost always an add-on to the beginning and/or ending of the characters. To the extent that such flourishes, swirls, vector ornaments, scrollwork, borders and frames, wreaths, and the like represent works of pictorial or graphic authorship in either their individual designs or patterned repetitions, they may be protected by copyright. However, the mere use of text effects (including chalk, popup papercraft, neon, beer glass, spooky-fog, and weathered-and-worn), while potentially separable, is de minimis and not sufficient to support a registration.

> The Office may register a computer program that creates or uses certain typeface or typefont designs, but the registration covers only the source code that generates these designs, not the typeface, typefont, lettering, or calligraphy itself. For a general discussion of computer programs that generate typeface designs, see Chapter 700, Section 723.

sva_1y ago

Did you try mathpix? Not sure about full pages, but it is pretty good at eqn

erulabs1y ago· 3 in thread

You sort of have to use both. OCR and LLM and then correlate the two results. They are bad at very different things, but a subsequent call to a 2nd LLM to pair together the results does improve quality significantly, plus you get both document understanding and context as well as bounding boxes, etc.
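A simple, non-LLM version of the correlation step might look like the sketch below: each value the VLM extracted is checked against the tokens a conventional OCR pass found on the same page. The helper names and the 0.8 similarity cutoff are assumptions, and this only handles single-token values:

```python
from difflib import SequenceMatcher


def best_ocr_match(value, ocr_tokens):
    """Return the OCR token most similar to `value` and its similarity ratio."""
    if not ocr_tokens:
        return None, 0.0
    scored = [(SequenceMatcher(None, value, tok).ratio(), tok) for tok in ocr_tokens]
    ratio, token = max(scored)
    return token, ratio


def ungrounded_fields(vlm_fields, ocr_tokens, min_ratio=0.8):
    """List VLM fields with no sufficiently similar token in the OCR output."""
    return [
        name
        for name, value in vlm_fields.items()
        if best_ocr_match(str(value), ocr_tokens)[1] < min_ratio
    ]


# Illustrative data: the customer name never appears in the OCR text.
ocr_tokens = ["Invoice", "INV-2041", "Total", "1,204.50"]
vlm_fields = {"invoice_id": "INV-2041", "total": "1,204.50", "customer": "Jonathan Smythe"}
print(ungrounded_fields(vlm_fields, ocr_tokens))
```

Fields that come back ungrounded are the ones most likely to be invented, and since OCR tokens carry bounding boxes, a matched field can inherit its location from the OCR side.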

I'm building a "never fill out paperwork again" app, if anyone is interested, would be happy to chat!

fzysingularity1y ago

We think VLMs would outperform most OCR+LLM solutions in due time. I get that there’s need for these hybrid solutions today, but we’re comparing 20+ year mature tech vs something that’s roughly 1.5 years old.

Also, VLMs are end-to-end trainable, unlike OCR+LLM solutions (that are trained separately), so it’s clear that these approaches scale much better for domain-specific use cases or verticals.

cpursley1y ago

Any tips on how to prompt that second pairing step? And what sort of things to ask the llm to extract in step 1?

K0balt1y ago

A VLM that invokes ocr tool use is a compelling idea that could result in pretty good results, I would expect.

gfiorav1y ago· 3 in thread

I wonder what the speed of this approach is vs. traditional OCR techniques. Also, curious if this could be used for text detection (find a bounding box containing text within an image).

vunderba1y ago

Was just coming here to say this, there does not yet exist a multimodal vision LLM approach that is capable of identifying bounding boxes of where the text occurs. I suppose you could manually cut the image up and send each part separately to the LLM but that feels like a kludge and it's still inexact.

EarlyOomOP1y ago

We can do bounding boxes too :) we just call it visual grounding https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...

chpatrick1y ago

qwen 2.5 vl was specifically trained to produce bounding boxes I believe.

intalentive1y ago· 3 in thread

What's the value-add here? The schemas?

fzysingularity1y ago

We've seen so many different schemas and ways of prompting the VLMs. We're just standardizing it here, and making it dead-simple to try it out across model providers.

vlmrunadmin0071y ago

Basically there is no model schema combination. If you go ahead and prompt an open source model with the schema, it doesn't produce the results in the expected format. The main contribution is how to make these models conform to your specific needs and in a structured format.

idiliv1y ago

Wait, but we're doing that already, and it works well (Qwen 2.5 VL)? If need be, you can always resort to structured generation to enforce schema conformity?

ekidd1y ago· 2 in thread

I've been experimenting with vlm-run (plus custom form definitions), and it works surprisingly well with Gemini 2.0 Flash. Costs, as I understand, are also quite low for Gemini. You'll have best results with simple to medium-complexity forms, roughly the same ones you could ask a human to process with less than 10 minutes of training.

If you need something like this, it's definitely good enough that you should consider kicking the tires.

fzysingularity1y ago

BTW Check out the Gemini qualitative results here in our hub: https://github.com/vlm-run/vlmrun-hub?tab=readme-ov-file#-qu....

It gives you an idea of where today's models fail (Gemini Flash, OpenAI gpt4o+mini, open-source ones like Llama 3.2 Vision, Qwen VL 2.5 etc).

fzysingularity1y ago

Very cool! If you have more examples / schemas you'd be interested in sharing, feel free to add to the `contrib` section.

beebaween1y ago· 2 in thread

What's the best way to run this if I prefer to use local GPUs?

fzysingularity1y ago

We’re adding this as we speak. Ollama support is already there, and here’s vLLM inference: https://github.com/vlm-run/vlmrun-hub/pull/120

EarlyOomOP1y ago

You can try out some of our schemas with Ollama if you want: https://github.com/vlm-run/vlmrun-hub (instructions in Readme)

egorfine1y ago· 2 in thread

I had a need to scan serial numbers from Apple's product boxes out of pictures taken by a clueless person on their phone. All OCR tools failed.

Vision model did the trick so well it's not even funny to discuss anything further.

"This is a picture of Apple product box. Find and return only the serial number of the product as found on a label. Return 'none' if no serial number can be found".

ptx1y ago

Did you check if all the numbers were correct?

egorfine1y ago

Of course. There was a little piece of code to query Apple for S/N data and it validated whether it was correct.

themanmaran1y ago· 1 in thread

We recently published an open source benchmark [1] specifically for evaluating VLM vs OCR. And generally the VLMs did much better than the traditional OCR models.

VLM highlights:

- Handwriting. Being contextually aware helps here. i.e. they read the document like a human would, interpreting the whole word/sentence instead of character by character

- Charts/Infographics. VLMs can actually interpret charts or flow diagrams into a text format. Including things like color coded lines.

Traditional OCR highlights:

- Standardized documents (e.g. US tax forms that they've been trained on)

- Dense text. Imagine textbooks and multi column research papers. This is the easiest OCR use case, but VLMs really struggle as the number of output tokens increases.

- Bounding boxes. There still isn't really a model that gives super precise bounding boxes. Supposedly Gemini and Qwen were trained for it, but they don't perform as well as traditional models.

There's still a ton of room for improvement, but especially with models like Gemini the accuracy/cost is really competitive.

[1] https://github.com/getomni-ai/benchmark

fzysingularity1y ago

Saw your benchmark, looks great. Will run our models against those benchmark and share some of our learnings.

As you mentioned there are a few caveats to VLMs that folks are typically unaware of (not at all exhaustive, but the ones you highlighted):

1. Long-form text (dense): Token limits of 4/8K mean that dense pages may go over limits of the LLM outputs. This requires some careful work to make them work as seamlessly as OCR.

2. Visual grounding a.k.a. bounding boxes are definitely one of those things that VLMs aren't natively good at (partly because the cross-entropy losses used aren't really geared for bounding box regression). We're definitely making some strides here [1] to improve that so you're going to get an experience that is almost as good as native bounding box regression (all within the same VLM).

[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...

fl0under1y ago· 1 in thread

Looks cool!

May also be interested in Allen AI's OCR tool olmOCR they just released too [1][2]. They say "convert a million PDF pages for only $190 USD".

[1] https://github.com/allenai/olmocr [2] https://arxiv.org/abs/2502.18443

TZubiri1y ago

The issue with that promise is that anyone can convert pdfs, the question is whether the conversions are correct or whether you have

Income Expenses 200 100

On one document, and

Income Expenses 20 0100

On others.

There's no shortage of products that tried to solve this problem from scratch (or by piggybacking on other projects) and called it a day without worrying about the huge problem that is quality and parseability.

The most robust players just give you the coordinates of a glyph and you are on your own: Textract, PDFBox.
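For the glyph-coordinates case, being "on your own" means writing something like the following. It is a deliberately naive sketch: the box format `(text, x_left, x_right)` and the sample coordinates are invented for illustration and do not reflect the output format of Textract or PDFBox:

```python
def assign_columns(header_boxes, value_boxes):
    """Map each value to the header whose horizontal center is nearest.

    Each box is (text, x_left, x_right) in page coordinates.
    """
    def center(box):
        return (box[1] + box[2]) / 2

    row = {}
    for value in value_boxes:
        nearest = min(header_boxes, key=lambda h: abs(center(h) - center(value)))
        row.setdefault(nearest[0], []).append(value[0])
    return row


# Illustrative coordinates: "200" sits under Income, "100" under Expenses.
headers = [("Income", 50, 110), ("Expenses", 200, 280)]
values = [("200", 65, 95), ("100", 225, 255)]
print(assign_columns(headers, values))
```

Real tables need more than this (multi-line cells, spanning headers, skewed scans), which is where the quality problem comes from.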

syntaxing1y ago· 1 in thread

Maybe I’m being greedy but is it possible to have a vLLM detect when a portion is an image? I want to convert some handwritten notes into markdown but some portions are diagrams. I want the vLLM to extract the diagrams to embed into the markdown output.

vlmrunadmin0071y ago

We have successfully tested the model with vLLM and plan to release it across multiple inference server frameworks, including vLLM and Ollama.

TZubiri1y ago· 1 in thread

Wow thanks!

There's a client who had a startup idea that involved analyzing pdfs, I used textract, but it was too cumbersome and unreliable.

Maybe I can reach out to see if he wants to give it another go with this!

fzysingularity1y ago

Let us know, I think >70% of OCR tasks today can be done with VLMs with a little bit of guidance ;). Ping us at contact "at" vlm.run

submeta1y ago· 1 in thread

Can I use this to convert flowcharts to yaml representations?

EarlyOomOP1y ago

We convert to a JSON schema, but it would be trivial to convert this to yaml. There are some minor differences in e.g. tokens required to output JSON vs yaml which is why we've opted for our strategy.

duckb1y ago· 1 in thread

Does this support table detection and extraction?

fzysingularity1y ago

Yes, it's experimental at the moment: https://docs.vlm.run/guides/doc-ai/guide-visual-grounding

tgtweak1y ago· 1 in thread

Not really interested until this can run locally without api keys :\

EarlyOomOP1y ago

You can! it works with Ollama https://github.com/vlm-run/vlmrun-hub

At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.

rendaw1y ago

Why do all these OCR services only show examples with flawless screenshots of digital documents? Are there that many people trying to OCR digital data? Why not just copy the HTML?

If it's not intended for digital documents, where are the screenshots with fold marks, slipping lines, lighting gradients, thumbs, etc etc.

serjester1y ago

Good to see more work being done here, but I don't understand why this is tied to someone's proprietary API. Swapping model providers and adding some basic logging is not remotely painful enough to justify onboarding yet another vendor. Especially one that's handling something as sensitive as LLM prompts.

leecarraher1y ago

maybe it was my prompt, but there seems to be far too much interpretation after the image embedding. In my examples it implicitly started to summarize parts of the text, unfortunately incorrectly. On an invoice with typed lettering it summarized that payments submitted would not post for 2-3 business days, when in reality the text said if you submitted after 2p on a friday, the payment would not post until the following monday. Which is significantly different. I'd be curious if you could ablate those layers in some way, because the one-shot structured text detection recognition was much better than vanilla ocr.

wantlotsofcurry1y ago

I'll definitely be trying this out on my current side project!

Question: What tools/libs are people using to accurately detect square/rectangle objects in images?

I've used VNDetectRectangle [1] in Swift but it's not as accurate as I'd like it to be, even with preprocessing.

[1]: https://developer.apple.com/documentation/vision/vndetectrec...

rasz1y ago

I'd rather see machine learning used to help OCR by

- recognizing/recreating exact font used

- helping align/rotate source

Not to hallucinate gibberish when source lacks enough data.

Inviz1y ago

Service doesn't inspire confidence. OpenAI-compatible API doesn't work (expects `content.str` in message to be a string - ???). Getting 500s on the non-OpenAI-compatible endpoint - seems like timeouts(?). When it did work it missed a lot, and hallucinated a lot too on custom documents/schemas.

cyp06331y ago

Existing solutions like Tesseract already can embed text into the image, but I'm wondering if there's a way to combine LLM with Tesseract, so that LLMs can help correcting results and finding unidentified text, and finally still embed text back to the image

rasguanabana1y ago

Wouldn’t VLM be susceptible to prompt injection?

Eisenstein1y ago

If you just want to play with using a vision model to do OCR, I made a little script that uses KoboldCpp to do it locally.

* https://github.com/jabberjabberjabber/LLMOCR

ritvikpandey211y ago

hey -- wrote a blog post about this exact phenomenon [1] (also posted on HN a couple of weeks back [2]). tl;dr: maintaining confidence in nondeterministic LLM outputs over millions of pages is a problem, especially in production environments like healthcare, finance, etc. we've noticed decently high hallucination rates, even in fine-tuned LLMs.

[1]: https://www.runpulse.com/blog/why-llms-suck-at-ocr [2]: https://news.ycombinator.com/item?id=42966958#42977527

htrp1y ago

VLMs can't replace OCR one to one. Most hosted multimodal models seem to have a classical OCR (Tesseract-based) step in their inference loop.

gunian1y ago

replaced it with real humans -> nano tech in their brain -> transmit to server getting almost 99% accuracy

skbjml1y ago

This is awesome!

mmusson1y ago

Lol. The resume includes expert in Mia Khalifa easter egg.

rafram1y ago· 25 in thread

themanmaran1y ago

If so you'd be passing every single document to a human review, and might as well not run the OCR. But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

tensor1y ago

If you try to pitch hallucinations to these fields, they'll just choose 100% manual instead. It's a non-starter.

anon3738391y ago

> But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

With OCR, errors are localized and have a greater chance of being detected when read.

bayindirh1y ago

The problem is, regardless of the confidence number, you can scan and mark document for grammatical errors.

We had this problem before [0], on some Xerox scanners and copiers. Results will be disastrous. It's not a question of if, but when.

I personally tried Gemini and OpenAI's models for OCR, and no, I won't continue using them further.

[0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...

rafram1y ago

Then use an LLM to extract layout information. Don’t trust it to read the text.

> If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?

No, of course not. You have a human review the words/segments with low confidence.
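A minimal sketch of that review step, assuming the OCR engine returns a confidence between 0 and 1 for each word. The `Word` type, the `split_for_review` name, and the sample values are illustrative, and the 0.90 cutoff is simply the figure from the quoted question:

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    confidence: float  # 0.0-1.0, as reported by the OCR engine (assumed)


def split_for_review(words: list[Word], threshold: float = 0.90):
    """Return (accepted, needs_review) instead of rejecting the whole document."""
    accepted = [w for w in words if w.confidence >= threshold]
    needs_review = [w for w in words if w.confidence < threshold]
    return accepted, needs_review


# Hypothetical OCR output for one line of an invoice.
words = [Word("Total", 0.97), Word("1,204.50", 0.72), Word("USD", 0.95)]
accepted, needs_review = split_for_review(words)
print([w.text for w in needs_review])
```

Only the low-confidence amount is queued for a human; the other two words pass through untouched.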

sudoshred1y ago

constantinum1y ago

The primary issue with LLMs is hallucination, which can lead to incorrect data and flawed business decisions.

For those interested, try LLMWhisperer(https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs, eliminates hallucination issues, and preserves the input document layout for better context.

Examples of extracting complex layout:

https://imgur.com/a/YQMkLpA

https://imgur.com/a/NlZOrtX

https://imgur.com/a/htIm6cf

Hackbraten1y ago

> try LLMWhisperer(https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs

The website you linked says it uses LLMs?

ungerik1y ago

Those images look exactly like what you get from every OCR tool out there if you use the XY information.

EarlyOomOP1y ago

rafram1y ago

Fine-tuning does help, though.

hashta1y ago

refulgentis1y ago

That's not OCR.

It is an absolute miracle.

It is transmutating a picture into JSON.

I never thought this would be possible in my lifetime.

But that is different from what your interlocutor is discussing.

KoolKat231y ago

I've been using Gemini 2 Flash to extract financial data. Within my sample, which is perhaps small (probably 1,000 entries so far), I've had only a single error, so roughly a 99.9% success rate.

(There are slightly more errors if I ask it to add numbers, but that isn't OCR and is a bit more of a reach, although it is very good at this too regardless.)

Many hallucinations can be avoided by telling it to use null if there is no number present.
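A sketch of what that can look like in practice, assuming the model is asked to reply with JSON. The prompt wording, the field names `revenue` and `net_income`, and the `parse_extraction` helper are hypothetical; the point is only that `null` is treated as a valid answer and anything else non-numeric is rejected:

```python
import json

# Hypothetical prompt; the field names are made up for illustration.
PROMPT = (
    "Extract the fields below from the document image and reply with JSON only. "
    "Use null for any field whose value is not visible; never guess.\n"
    '{"revenue": number | null, "net_income": number | null}'
)

REQUIRED_KEYS = {"revenue", "net_income"}


def parse_extraction(raw: str) -> dict:
    """Parse the model's reply, keeping null (None) as 'not present'."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"model omitted keys: {sorted(missing)}")
    for key in REQUIRED_KEYS:
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if value is not None and not is_number:
            raise ValueError(f"{key} must be a number or null, got {value!r}")
    return data


reply = '{"revenue": 1250000, "net_income": null}'  # hypothetical model output
result = parse_extraction(reply)
print(result)
```

A `None` in the result then means "the document did not show this number", which downstream code can handle differently from a value of zero.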

CarolineRommer1y ago

cratermoon1y ago

Agree wholeheartedly. Modern OCR is astonishingly good and, more importantly, deterministically so. Its failure modes, when it's unable to read the text, are recognizably failures.

Results for VLM accuracy & precision are not good. https://arxiv.org/html/2406.04470v1#S4

VeejayRampay1y ago

which solutions would you classify as "modern OCR"

are we talking tesseract or something?

delichon1y ago

staticman21y ago

Grimblewald1y ago

the84721y ago

Shouldn't confidence be available at the sampler level and also be conditional on the vision input, not just the next-token prediction?
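One way to read "confidence at the sampler level" is to aggregate per-token log-probabilities, which some inference APIs can return alongside the generated text. A minimal sketch, assuming you already have those log-probabilities as a list of floats; the function name and the sample values are made up. The results are decoder probabilities, not calibrated estimates that the transcription is correct:

```python
import math


def sequence_confidence(token_logprobs: list[float]) -> dict:
    """Summarize per-token log-probabilities for one extracted field."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return {
        # The weakest token often marks the character the model was unsure of.
        "min_token_prob": math.exp(min(token_logprobs)),
        # exp(mean log-prob) is the geometric mean of the token probabilities.
        "geo_mean_prob": math.exp(sum(token_logprobs) / len(token_logprobs)),
    }


# Hypothetical log-probabilities for the tokens of one extracted value.
conf = sequence_confidence([-0.01, -0.02, -1.6])
print(conf)
```

A field whose weakest token falls below some cutoff could then be routed to the same human review queue that low-confidence OCR words go to.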

j_bum1y ago

This is naive, but can you ask the model to provide a confidence rating for sections of the document?

thatjoeoverthr1y ago

UnlockedSecrets1y ago

You can ask, but the answer will be made up, not grounded in reality.

ttyprintk1y ago

It’s not naive; tesseract does this.

temp08261y ago· 5 in thread

wrs1y ago

I wouldn’t be surprised to find that Claude/ChatGPT/etc. can just…do that. With the prompt you just gave.

The output could be in Markdown, which is easily turned into a PDF. You would have to break up the input PDF into pages to avoid running out of output window.
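The page-by-page loop could be sketched roughly like this. `transcribe_page` stands in for whichever model call you use and is purely hypothetical, as is the choice to mark page boundaries with an HTML comment:

```python
from typing import Callable, Iterable


def transcribe_document(
    pages: Iterable[bytes],
    transcribe_page: Callable[[bytes], str],
) -> str:
    """Send one page per request so each reply stays within the output limit."""
    parts = []
    for number, page in enumerate(pages, start=1):
        markdown = transcribe_page(page)  # hypothetical call into your model
        parts.append(f"<!-- page {number} -->\n{markdown.strip()}")
    return "\n\n".join(parts)


def fake_transcribe(page: bytes) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return page.decode().upper()


document = transcribe_document([b"first page", b"second page"], fake_transcribe)
print(document)
```

Splitting the PDF into page images is left out here; any PDF toolkit that can rasterize or extract single pages would do.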

temp08261y ago

I didn't consider that would actually work and am giving it a try now...but by its own estimate it's going to take several days to finish (I'm not paying for plus or whatever).

rafram1y ago

I would!

zzleeper1y ago

By any chance, would it be possible to share the PDF? I haven't heard the Shipibo language in a long while, and am quite curious about it.

temp08261y ago

Here you go-

https://archive.org/details/shipibodiccionario

iLemming1y ago· 4 in thread

Can anyone share some thoughts on this? Thanks!

acdha1y ago

Anything using the Apple Vision framework is fast and surprisingly accurate:

https://github.com/bytefer/macos-vision-ocr

cdolan1y ago

Cool to see, may use this locally for OCR in some cases. But I think the "handwriting" example is a little misleading. That's a font, not a scan of handwritten material.

wahnfrieden1y ago

This uses the old APIs that are less accurate than the new Swift-only LiveText ones

ANighRaisin1y ago

The AI OCR built into the Snipping Tool in Windows is better than Tesseract, albeit more inconvenient than something like PowerToys or Capture2Text, which use a quick shortcut.

LeoPanthera1y ago· 4 in thread

What's the characters-per-Wh of an LLM compared to traditional OCR?

fzysingularity1y ago

That's a tough one to answer right now, but to be perfectly honest, we're off by 2-3 orders of magnitude in terms of chars/Wh.

That said, VLMs are extremely powerful visual learners with LLM-like reasoning capabilities making them more versatile than OCR for practically all imaging domains.

In a matter of a few years, I think we'll essentially see models that are more cost-performant via distillation, quantization and the multitude of tricks you can do to reduce the inference overhead.

mlyle1y ago

rafram1y ago

> Since humans are relatively expensive in energy use

[1]: https://www.humanbrainproject.eu/en/follow-hbp/news/2023/09/...

ambicapter1y ago

People really only started talking about the cost of running things when LLMs came out. Most everything before that was too cheap to be a serious consideration.

orliesaurus1y ago· 3 in thread

EarlyOomOP1y ago

raxxorraxor1y ago

This has always been part of the complete OCR package as far as I know. The raw result of an OCR constantly fails to differentiate 1 l I i | or other similar symbols/letters.

Maybe this necessary step can be improved and altered with a VLM. There is also the preprocessing where the image gets its perspective corrected. Not sure how well a VLM performs here.

As you said, I think combining these techniques will be the most efficient way forward.
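The classic post-processing for the `1 l I i |` confusion is a context rule: if a field is known to be numeric, map the look-alikes to the digit. A minimal sketch, with the mapping limited to the characters mentioned above and a hypothetical function name:

```python
# Look-alikes for the digit 1, taken from the list above (not exhaustive).
DIGIT_LOOKALIKES = str.maketrans({"l": "1", "I": "1", "i": "1", "|": "1"})


def normalize_numeric_field(raw: str) -> str:
    """Map look-alike characters to digits; apply only to known-numeric fields."""
    return raw.translate(DIGIT_LOOKALIKES)


print(normalize_numeric_field("l0|I"))  # -> 1011
```

Applying the same mapping to free text would corrupt ordinary words, which is why the rule has to be scoped to fields whose type is already known.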

vintermann1y ago

BrannonKing1y ago· 3 in thread

NoMoreNicksLeft1y ago

retrorangular1y ago

1. https://www.copyright.gov/comp3/chap900/ch900-visual-art.pdf

> There are some very limited cases where the Office may register some types of typeface, typefont, lettering, or calligraphy, such as the following:

sva_1y ago

Did you try mathpix? Not sure about full pages, but it is pretty good at eqn

erulabs1y ago· 3 in thread

I'm building a "never fill out paperwork again" app, if anyone is interested, would be happy to chat!

fzysingularity1y ago

Also, VLMs are end-to-end trainable, unlike OCR+LLM solutions (that are trained separately), so it’s clear that these approaches scale much better for domain-specific use cases or verticals.

cpursley1y ago

Any tips on how to prompt that second pairing step? And what sort of things to ask the llm to extract in step 1?

K0balt1y ago

A VLM that invokes ocr tool use is a compelling idea that could result in pretty good results, I would expect.

gfiorav1y ago· 3 in thread

I wonder what the speed of this approach is vs. traditional OCR techniques. Also, curious if this could be used for text detection (finding a bounding box containing text within an image).

vunderba1y ago

EarlyOomOP1y ago

We can do bounding boxes too :) we just call it visual grounding https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...

chpatrick1y ago

Qwen 2.5 VL was specifically trained to produce bounding boxes, I believe.

intalentive1y ago· 3 in thread

What's the value-add here? The schemas?

fzysingularity1y ago

We've seen so many different schemas and ways of prompting the VLMs. We're just standardizing it here, and making it dead-simple to try it out across model providers.

vlmrunadmin0071y ago

idiliv1y ago

Wait, but we're doing that already, and it works well (Qwen 2.5 VL)? If need be, you can always resort to structured generation to enforce schema conformity?

ekidd1y ago· 2 in thread

If you need something like this, it's definitely good enough that you should consider kicking the tires.

fzysingularity1y ago

BTW Check out the Gemini qualitative results here in our hub: https://github.com/vlm-run/vlmrun-hub?tab=readme-ov-file#-qu....

It gives you an idea of where today's models fail (Gemini Flash, OpenAI gpt4o+mini, open-source ones like Llama 3.2 Vision, Qwen VL 2.5 etc).

fzysingularity1y ago

Very cool! If you have more examples / schemas you'd be interested in sharing, feel free to add to the `contrib` section.

beebaween1y ago· 2 in thread

What's the best way to run this if I prefer to use local GPUs?

fzysingularity1y ago

We’re adding this as we speak. Ollama support is already there, and here’s vLLM inference: https://github.com/vlm-run/vlmrun-hub/pull/120

EarlyOomOP1y ago

You can try out some of our schemas with Ollama if you want: https://github.com/vlm-run/vlmrun-hub (instructions in Readme)

egorfine1y ago· 2 in thread

I had a need to scan serial numbers from Apple's product boxes out of pictures taken by a clueless person on their phone. All OCR tools failed.

Vision model did the trick so well it's not even funny to discuss anything further.

"This is a picture of Apple product box. Find and return only the serial number of the product as found on a label. Return 'none' if no serial number can be found".

ptx1y ago

Did you check if all the numbers were correct?

egorfine1y ago

Of course. There was a little piece of code to query Apple for S/N data and it validated whether it was correct.
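The general pattern, sketched under assumptions: treat the model's reading as a candidate and keep it only if an independent lookup confirms it. The `exists` callable is a placeholder for whatever query you have available, and the serial values below are invented:

```python
from typing import Callable, Optional


def accept_serial(candidate: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Keep a model-read serial number only if an independent lookup confirms it."""
    cleaned = candidate.strip()
    # The prompt told the model to answer 'none' when nothing is found.
    if not cleaned or cleaned.lower() == "none":
        return None
    return cleaned if exists(cleaned) else None


# Hypothetical stand-in for the real lookup service.
known_serials = {"ABC123XYZ"}


def fake_lookup(serial: str) -> bool:
    return serial in known_serials


print(accept_serial(" ABC123XYZ ", fake_lookup))
print(accept_serial("ABC123XY2", fake_lookup))
```

A misread character then fails the lookup instead of silently entering the dataset.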

themanmaran1y ago· 1 in thread

We recently published an open source benchmark [1] specifically for evaluating VLM vs OCR. And generally the VLMs did much better than the traditional OCR models.

VLM highlights:

- Handwriting. Being contextually aware helps here. i.e. they read the document like a human would, interpreting the whole word/sentence instead of character by character

- Charts/Infographics. VLMs can actually interpret charts or flow diagrams into a text format. Including things like color coded lines.

Traditional OCR highlights:

- Standardized documents (e.g. US tax forms that they've been trained on)

- Dense text. Imagine textbooks and multi-column research papers. This is the easiest OCR use case, but VLMs really struggle as the number of output tokens increases.

- Bounding boxes. There still isn't really a model that gives super precise bounding boxes. Supposedly Gemini and Qwen were trained for it, but they don't perform as well as traditional models.

There's still a ton of room for improvement, but especially with models like Gemini the accuracy/cost is really competitive.

[1] https://github.com/getomni-ai/benchmark

fzysingularity1y ago

Saw your benchmark, looks great. Will run our models against that benchmark and share some of our learnings.

As you mentioned there are a few caveats to VLMs that folks are typically unaware of (not at all exhaustive, but the ones you highlighted):

1. Long-form text (dense): Token limits of 4/8K mean that dense pages may go over limits of the LLM outputs. This requires some careful work to make them work as seamlessly as OCR.

[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...

fl0under1y ago· 1 in thread

Looks cool!

You may also be interested in Allen AI's OCR tool olmOCR, which they just released [1][2]. They say "convert a million PDF pages for only $190 USD".

[1] https://github.com/allenai/olmocr [2] https://arxiv.org/abs/2502.18443

TZubiri1y ago

The issue with that promise is that anyone can convert pdfs, the question is whether the conversions are correct or whether you have

Income  Expenses
200     100

On one document, and

Income  Expenses
20      0100

On others.

The most robust players just give you the coordinates of a glyph and you are on your own: Textract, PDFBox.
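Being "on your own" with coordinates usually means rebuilding the columns yourself. A minimal sketch, assuming each header and cell has already been reduced to a `(text, x_centre)` pair; that layout, the function name, and the numbers are hypothetical and not the actual output format of Textract or PDFBox:

```python
def assign_to_columns(headers, cells):
    """Place each cell under the header whose x-centre is nearest.

    headers and cells are (text, x_centre) pairs in whatever units the
    extraction tool reports.
    """
    table = {text: [] for text, _ in headers}
    for text, x in cells:
        nearest = min(headers, key=lambda header: abs(header[1] - x))
        table[nearest[0]].append(text)
    return table


# Hypothetical coordinates for the two-column example above.
headers = [("Income", 100.0), ("Expenses", 300.0)]
cells = [("200", 104.0), ("100", 297.5)]
table = assign_to_columns(headers, cells)
print(table)
```

With positions in hand, "200" lands under Income and "100" under Expenses regardless of how the characters happened to be grouped in the text stream.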

syntaxing1y ago· 1 in thread

vlmrunadmin0071y ago

We have successfully tested the model with vLLM and plan to release it across multiple inference server frameworks, including vLLM and Ollama.

TZubiri1y ago· 1 in thread

Wow thanks!

There's a client who had a startup idea that involved analyzing PDFs. I used Textract, but it was too cumbersome and unreliable.

Maybe I can reach out to see if he wants to give it another go with this!

fzysingularity1y ago

Let us know, I think >70% of OCR tasks today can be done with VLMs with a little bit of guidance ;). Ping us at contact "at" vlm.run

submeta1y ago· 1 in thread

Can I use this to convert flowcharts to yaml representations?

EarlyOomOP1y ago

duckb1y ago· 1 in thread

Does this support table detection and extraction?

fzysingularity1y ago

Yes, it's experimental at the moment: https://docs.vlm.run/guides/doc-ai/guide-visual-grounding

tgtweak1y ago· 1 in thread

Not really interested until this can run locally without api keys :\

EarlyOomOP1y ago

You can! it works with Ollama https://github.com/vlm-run/vlmrun-hub

At the end of the day it's just schemas. You can decide for yourself if it's worth upgrading to a larger, more expensive model.

rendaw1y ago

Why do all these OCR services only show examples with flawless screenshots of digital documents? Are there that many people trying to OCR digital data? Why not just copy the HTML?

If it's not intended for digital documents, where are the screenshots with fold marks, slipping lines, lighting gradients, thumbs, etc.?
