DeepSeek OCR (opens in new tab)

(github.com)

1003 pointspierre5mo ago244 comments

244 comments

The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

miki1232115mo ago

Text tokens are quantized and represent subword units, vision tokens only exist in the embedding space.

The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.

Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.

Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural- network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings.

This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.

You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.

There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.

rco87865mo ago

Great explanation, thanks. I was surprised to hear that models still only work with ~100k tokens, but after giving it some thought it makes sense. There's only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.

3 more replies

isaacfung5mo ago

Some models use vector quantized variational autoencoders to discretize images into sequences of discrete symbols from a fixed codebook.

https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...

krackers5mo ago

Thank you, this makes sense! As [1] puts it pithily

>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.

That seems to imply it's not necessarily something unique about images, just a byproduct of having better conversion from "raw input -> embeddings" [2]. Although there is a certain elegance of handling both images and text with the same method.

[1] https://twitter.com/c0mbinat0r/status/1980698103234891892 [2] https://twitter.com/Kangwook_Lee/status/1980709454522744902

1 more reply

lubesGordi5mo ago

So in terms of OCR, does the neural network 'map' the words into an embedding directly, or is it getting a bunch of words like "Hamlet's monologue" and mapping that to an embedding? Basically what I'm asking is if the neural network image encoder is essentially doing OCR 'internally' when it is coming up with the embedding (if that makes any sense).

ttul5mo ago

This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.

So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.

The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide it with an image representation instead of text tokens.

1 more reply

jph005mo ago

Actually there are VAEs which use a codebook approach to creating discrete tokens instead of float vectors. There has been some success in that direction in diffusion models for instance.

storus5mo ago

That's not really true, the latest autoregressive image models create a codebook of patches that are then encoded as tokens and image is assembled out of them.

HarHarVeryFunny5mo ago

I don't know if there is any common practice among multi-modal input "LLM"s as to how they encode image inputs - convert them into "vision tokens", but it's basically going to come down to splitting the image into a grid of regions and encoding those.

I'm not sure there's any information theoretic intuition to be had with DeepSeek's experiments - it seems to be more about what's the lowest resolution image resolution/grid you can get away with and still capture enough image detail to be able to accurately perform OCR on it.

It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.

hendersoon5mo ago

Exactly right, the OCR isn't the interesting part. 10x context compression is potentially huge. (With caveats, at only ~97% accuracy, so not appropriate for everything.)

runeblaze5mo ago

each text token is often subword unit, but in VLMs the visual tokens are in semantic space. Semantic space obviously compresses much more than subword slices.

disclaimer: not expert, on top of my head

ssivark5mo ago

Surely the appropriate ratio depends on the resolution of each character, relative to the size of the vision token patch? That is the only way the number of text tokens needed to describe the output of OCR can be independent of the resolution of the image (as it should).

looobay5mo ago

LLMs are compute heavy with quadratic scaling (in compute) per tokens. They are trying to compress text tokens into visual tokens with their VLM.

Maybe they would render texts to an image before tokenizing to reduce the compute cost.

krackers5mo ago

But naively wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits (or more) than the representation as textual token? You're changing representation sure, but that by itself doesn't give you any compute advantages unless there is some sparsity/compressability you can take advantage of in the domain you transform to right?

So I guess my question is where is the juice being squeezed from, why does the vision token representation end up being more efficient than text tokens.

6 more replies

simonw5mo ago

I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...

Here's a result I got https://github.com/simonw/research/blob/main/deepseek-ocr-nv... - against this image: https://static.simonwillison.net/static/2025/ft.jpeg

jjcm5mo ago

Looks like this did really solid, with the exception of the paragraph directly below the quote. It hallucinated some filler there and bridge it with the next column.

Thanks for running the test quickly!

djmips5mo ago

By my eye it just bridge. I didn't see any filler. It went from "Code is a language" - above the quote and then to "in a garden by name." which was the top of the next column but missing the chicken subject.

throwaway3141555mo ago

> by running Claude Code as root in a new Docker container

How do you get the "as root" part of that to work?

(sorry if it's explained in your article)

simonw5mo ago

Run it on a root account and do:

  IS_SANDBOX=1 claude --dangerously-skip-permissions

1 more reply

CaptainOfCoit5mo ago

It missed the initial "A" in the text which I sort of understand, seems not a lot of news articles were put in the dataset. But more interestingly, it missed the entire "Hallucination is a risk and...", the article "theme" next to the author name also the final email.

arkmm5mo ago

Wow, this deserves its own submission.

ellisd5mo ago

The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis.

https://annas-archive.org/blog/duxiu-exclusive.html

bluecoconut5mo ago

Previous paper from DeepSeek has mentioned Anna’s Archive.

> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions. https://arxiv.org/abs/2403.05525 DeepSeek-VL paper

_vqpz5mo ago

Why do they need to grant access for people to use copies of books they don’t own?

JohnLocke45mo ago

Not to rationalize it, but it appears that they're gatekeeping the dataset to get access to the OCR-scans from the people they choose to share it with. This is to improve their existing service by making the content of books (and not just their title/tags) searchable.

As per the blog post: >What does Anna’s Archive get out of it? Full-text search of the books for its users.

1 more reply

est5mo ago

> The books from Duxiu have long been pirated on the Chinese internet. Usually they are being sold for less than a dollar by resellers. They are typically distributed using the Chinese equivalent of Google Drive, which has often been hacked to allow for more storage space

Ownership laundering.

singularfutur5mo ago

Yes it means they will never release their dataset :(

throawayonthe5mo ago

hahaha also immediately thought of this, wonder when the ocr'd dataset would be getting released

dev1ycan5mo ago

Oh great so now Anna's archive will get taken down as well by another trash LLM provider abusing repositories that students and researchers use, META torrenting 70TB from library genesis wasn't enough

sigmoid105mo ago

Seems like they are doing fine:

https://open-slum.org

1 more reply

c0balt5mo ago

It appears this is an active offer from Anna's archive, so presumably they can handle the load and are able to satisfy the request safely.

breadislove5mo ago

For everyone wondering how good this and other benchmarks are:

- the OmniAI benchmark is bad

- Instead check OmniDocBench[1] out

- Mistral OCR is far far behind most Open Source OCR models and even further behind then Gemini

- End to End OCR is still extremely tricky

- composed pipelines work better (layout detection -> reading order -> OCR every element)

- complex table parsing is still extremely difficult

[1]: https://github.com/opendatalab/OmniDocBench

hakunin5mo ago

Wish someone benchmarked Apple Vision Framework against these others. It's built into most Apple devices, but people don't know you can actually harness it to do fast, good quality OCR for you (and go a few extra steps to produce searchable pdfs, which is my typical use case). I'm very curious where it would fall in the benchmarks.

wahnfrieden5mo ago

It is unusable trash for languages with any vertical writing such as Japanese. It simply doesn’t work.

1 more reply

graeme5mo ago

Interesting. How do you harness it for that purpose? I've found apple ocr to be very good.

2 more replies

CaptainOfCoit5mo ago

Yeah, if it was cross-platform maybe more people would be curious about it, but something that can only run on ~10% of the hardware people have doesn't make it very attractive to even begin to spend time on Apple-exclusive stuff.

2 more replies

cheema335mo ago

> the OmniAI benchmark is bad

According to Omni OCR benchmark, Omni OCR is the best OCR. I am sure you all will find no issues with these findings.

yoran5mo ago

How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?

ozgune5mo ago

OmniAI has a benchmark that companies LLMs to cloud OCR services.

https://getomni.ai/blog/ocr-benchmark (Feb 2025)

Please note that LLMs progressed at a rapid pace since Feb. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use-case.

cheema335mo ago

Omni OCR team says that according to their own benchmark, the best OCR is the Omni OCR. I am quite surprised.

CaptainOfCoit5mo ago

Magistral-Small-2509 is pretty neat as well for its size, has reasoning + multimodality, which helps in some cases where context isn't immediately clear, or there are few missing spots.

daemonologist5mo ago

My base expectation is that the proprietary OCR models will continue to win on real-world documents, and my guess is that this is because they have access to a lot of good private training data. These public models are trained on arxiv and e-books and stuff, which doesn't necessarily translate to typical business documents.

As mentioned though, the LLMs are usually better at avoiding character substitutions, but worse at consistency across the entire page. (Just like a non-OCR LLM, they can and will go completely off the rails.)

numpad05mo ago

Classical OCR still probably make undesirable su6stıtutìons in CJK from there being far too many of similar ones, even some absurd ones that are only distinguishable under microscope or by looking at binary representations. LLMs are better constrained to valid sequences of characters, and so they would be more accurate.

Or at least that kind of thing would motivate them to re-implement OCR with LLM.

fluoridation5mo ago

Huh... Would it work to have some kind of error checking model that corrected common OCR errors? That seems like it should be relatively easy.

1 more reply

junto5mo ago

Not sure how it compares but we did some trials with Azure AI Document Intelligence and were very surprised at how good it was. We had a document example which was a poor photograph of a document that had quite a skew, and it (too our surprise), also detected the customer’s human legible signature and extracted their name from that signature.

stopyellingatme5mo ago

Not sure about the others but we use Azure AI Document Intelligence and its working well for our resume parsing system. Took a good bit of tuning but we havent had to touch it for almost a year now.

make35mo ago

aren't all of these multimodal LLM approaches, just open vs closed ones

sandblast5mo ago

Not sure why you're being downvoted, I'm also curious.

dumpsterkid5mo ago

I haven't fired this up yet to try but I've been evaluating & working with quite a few different VLMs from the small granite, qwen etc models up to the larger VLMs available to see if we can fully replace traditional OCR in our system but I've been disappointed so far - our system takes documents from customers and supplies them back normalized documents (i.e rasterized multi-page bitmaps) marked up as they've requested - however in our use case we need accurate coordinates of data down to the letter/word level and from my experience the positional outputs from these VLMs are either wildly inconsistent, completely hallucinated, or so vague that it doesn't allow us to target anything with any kind of accuracy or granularity.

our solution so far has been to stick to using tesseract with good clean-up routines and then augmenting/fixing-up the output using the VLM OCR text where we don't have structured source document data available

it could be that we just have a very niche use-case and it doesn't matter to most people, I'm sure if you just want a text dump or restructured markdown/html representation of documents these VLMs work well but the number of articles & comments I've seen claiming that these models have 'solved' OCR just seems counter to our experiences

kamranjon5mo ago

Have you tried moondream yet[1]? The moondream 3 preview model[2], according to the blogpost[3] appears to outperform many frontier models on VLM tasks and does so with a relatively small footprint.

[1] https://moondream.ai/

[2] https://huggingface.co/moondream/moondream3-preview

[3] https://moondream.ai/blog/moondream-3-preview

sampton5mo ago

You can train a cnn to find bounding boxes of text first. Then run VLM on each box.

jmpeax5mo ago

Your customers don't have any handwritten text?

pietz5mo ago

My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

I'd be interested to hear where OCR still struggles today.

cahaya5mo ago

Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML: Tables with multiple headers and merged cells that get mixed up, multiple columns with tick boxes get mixed up, multi page tables that are not understood correctly. Also Llamaindex fails miserably on those things.

Curious to hear which OCR/ LLM excels with these specific issues? Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only parse this table correctly by first parsing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist

CaptainOfCoit5mo ago

> Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML:

But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "Can take letters in images and make into digital text" to "Can replicate anything seen on a screen", the problem-space gets too big.

For those images you have, I'd use something like Magistral + Structured Outputs instead, first pass figure out what's the right structure to parse into, second pass to actually fetch and structure the data.

3 more replies

pietz5mo ago

I threw in the first image/table into Gemini 2.5 Pro letting it choose the output format and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.

carschno5mo ago

Technically not OCR, but HTR (hand-written text/transcript recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.

mormegil5mo ago

This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to read, I turned to LLMs, hearing they are great in OCR. It’s been… terrible. The LLM will transcribe the record without problems, the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription – even though the transcribed text is far from perfect, many letters and words are recognized incorrectly, it helps me a lot with the more difficult records.

pietz5mo ago

We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.

1 more reply

sramam5mo ago

Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

I don't have a use case of 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots and it has worked really well.

1 more reply

raincole5mo ago

If you can accept that the machine just make up what it doesn't recognize instead of saying "I don't know," then yes it's solved.

(I'm not being snarky. It's acceptable in some cases.)

jakewins5mo ago

But this was very much the case with existing OCR software as well? I guess the LLMs will end up making up plausible looking text instead of text riddled with errors, which makes it much harder to catch the mistakes, in fairness

2 more replies

red75prime5mo ago

Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).

KoolKat235mo ago

These days it does just that, it'll say null or whatever if you give it the option. When it does make it up, it tends to be limitation of the image qualify ( max dpi).

Blotchy text and specific typeface make 6's look like 8's, even to the non-discerning eye, a human would think it's an 8, zoom in and see it's a 6.

Google's image quality on uploads is still streets ahead of openai for instance btw.

wahnfrieden5mo ago

Do any LLM OCRs give bounding boxes anyway? Per character and per block.

2 more replies

peter-m805mo ago

No way it's solved. try to make OCR over a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines and from time to time I try to OCR them whith the state of the art mechanisms. All of them requiere a lot of human intervention

constantinum5mo ago

I use LLMWhisperer[1] for OCR'ing old magazine ads. It preserves the layout and context. Example > https://postimg.cc/ts3vT7kG

https://pg.llmwhisperer.unstract.com/

pietz5mo ago

Could you provide an example that fails? I'm interested in this.

jmkni5mo ago

do you have an example of a particularly tricky one?

1 more reply

kbumsik5mo ago

> My impression is that OCR is basically solved at this point.

Not really in practice to me. Especially they still struggle with Table format detection.

coulix5mo ago

This.

Any complex parent table span cell relationship still has low accuracy.

Try the reverse, take a complex picture table and ask Chatgpt5, claude Opus 3.1, Gemini Pro 2.5 to produce a HTML table.

They will fail.

2 more replies

richardlblair5mo ago

I had mentioned this when the new QWEN model dropped - I have a stack of construction invoices that fail through both OCR and OpenAI.

It's a hard (and very interesting) problem space.

llm_nerd5mo ago

Complex documents is where OCR struggles mightily. If you have a simple document with paragraphs of text, sure OCR is pretty solved. If you have a complex layout with figures and graphs and supporting images and asides and captions and so on (basically any paper, or even trade documents), it absolutely falls apart.

And GP LLMs are heinous at OCR. If you are having success with FL, your documents must be incredibly simple.

There has been enormous advances in OCR over the past 6 months, so the SoTa is a moving, rapidly advancing target.

Gazoche5mo ago

There is no "solved" in computer vision, there is only "good enough" and what constitutes "good enough" depends on your problem domain.

Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well if your use case is, say, digitalizing old printed novels, then yeah it's probably good enough.

But what if your documents are personal records with millions of names, to insert in some administrative database? Now 1 out of 1000 persons will have their name misspelled. Ooops.

themanmaran5mo ago

> OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

Benchmark author here. No, just pivoted away from OCR API as a product! Still use our API internally but have been lazy about updating benchmarks.

Gemini is definitely the best model for OCR. But it has a really high rate of "recitation" errors. Where it will determine the output token is too close to its training data and cut it off. Something like 10% of the time from our testing. Also it has this hilarious hallucination when you have a blank page in the document mix and it just makes up new info.

OpenAI is OK. GPT5 wasn't any better than 4o or 4.1. Main issues were: dropping content like headers/footers, loses it's mind on sideways pages, and will frequently refuse to read things like ID documents, health care forms, or things it judges to have too much PII.

robotswantdata5mo ago

VLLMs suck at complex layouts and there is a high risk of hallucination. Never use alone for contracts or health data.

burpsnard5mo ago

I've only used tesseract, 'recreationally', but i tried generating images of random chars to see what resolution/contrast/noise was minimally recognisable; shocked at how bad it was. heavily relies on language models of character sequences, pretty useless On 'line noise'

cle5mo ago

That will not work with many of the world's most important documents because of information density. For example, dense tables or tables with lots of row/col spans, or complex forms with checkboxess, complex real-world formatting and features like strikethroughs, etc.

To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information density threshold of the model, given the task.

This is not a trivial problem at all. And sometimes there is no naive way to chunk documents so that every element can fit within the information density limit. A really simple example is a table that spans hundreds pages. Solving that generally is an open problem.

veidr5mo ago

Clearly-printed text to a sequence of characters is solved, for use cases that don't require 100% accuracy.

But not for semantic document structure — recognizing that the grammatically incomplete phrase in a larger font is a heading, recognizing subheadings and bullet lists, tables, etc.

Also not for handwritten text, text inside of images (signage and so forth), or damaged source material (old photocopies and scans created in the old days).

Those areas all seem to me where an LLM-based approach could narrow the gap between machine recognition and humans. You have to sort of reason about it from the context as a human to figure it out, too.

vintermann5mo ago

OCR of printed text may be one thing, but handwriting OCR (a.k.a HTR) is very, very far from solved. It's actually hard to find a practical task general historical HTR is good enough to do usefully, even for state of the art models.

KoolKat235mo ago

I agree, Gemini 2.5 models are excellent.

The fuss around old fashioned OCR seemed strange to me initially considering the above, but I selfishly forgot to consider addressing compute/offline requirements.

It would also be nice for there to be a good competitor.

constantinum5mo ago

Why PDF parsing is Hell[1]:

Fixed layout and lack of semantic structure in PDFs.

Non-linear text flow due to columns, sidebars, or images.

Position-based text without contextual or relational markers.

Absence of standard structure tags (like in HTML).

Scanned or image-based PDFs requiring OCR.

Preprocessing needs for scanned PDFs (noise, rotation, skew).

Extracting tables from unstructured or visually complex layouts.

Multi-column and fancy layouts breaking semantic text order.

Background images and watermarks interfering with text extraction.

Handwritten text recognition challenges.

[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

simlevesque5mo ago

> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

Can you explain more about your setup ? I have a quarter million pages I want to OCR.

blindriver5mo ago

I attempted OCR using all of the open source models available about 3 months ago, including Llama 4. These were pngs of text using a regular font. Most produced garbage except Llama 4, and even then it was only about 90% accurate. Using OpenAI or Gemini produced much better results but the open source models were really bad.

6gvONxR4sf7o5mo ago

OCR for printed documents is super robust, but handwriting, low res, and aligned recognition (not just image to "hello world" but also having "h is here in space e is here in space...) are all still well behind "basically solved."

Davidzheng5mo ago

I think it'll be good to have an end-to-end pdf to latex converter for old math papers. For commutative diagrams almost all models still struggle. especially very complicated commutative diagrams.

darkwater5mo ago

So, the mug with inspirational text says "Bountiful Potential"?

kelvinjps105mo ago

Google vision it's still better than Gemini at OCR, for example at getting bounding boxes.

sbinnee5mo ago

Maybe for English. Other languages are very much not solved.

baobun5mo ago

Chinese, especially handwritten.

rsp19845mo ago

Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why it's a big deal?

It's very hard to guess from the github and paper. For example, there is OCR in the title but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Somebody care to explain the link and provide some high-level context?

intalentive5mo ago

Suppose you have an image with 1000 words in it, and suppose for simplicity that every word is 1 token. Then the image is “worth” 1000 tokens.

But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.

Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that we were able to compress the output into a 10x smaller representation.

The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.

rsp19845mo ago

Excellent, thanks. So basically this is saying: "our pixels-to-token encoding is so efficient (information density in a set of "image tokens" is much higher as compared to a set of text tokens), why even bother representing text as text?"

Correct?

1 more reply

foofoo125mo ago

How does it compare to Tesseract? https://github.com/tesseract-ocr/tesseract

I use ocrmypdf (which uses Tesseract). Runs locally and is absolutely fantastic. https://ocrmypdf.readthedocs.io/en/latest/

utopiah5mo ago

Indeed, seems the default benchmark is LLM/VLM based alternatives as if they somehow "solved" the problem but IMHO even if it goes from (totally made up numbers) 80% with tesseract to 95% with this or Qwen or whatever but it takes 100x harddisk with containers or a CUDA stack, dedicated hardware, e.g. GPU with 16GB or VRAM, etc then it's such a trade of it should be considered.

modeless5mo ago

Hmm, at first I was thinking "why OCR?", but maybe the reason is to ingest more types of training data for LLM improvement, e.g. scanned academic papers? I imagine all the frontier labs have a solution for this due to the value of academic papers as a data source.

Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use case and it's not the main focus.

polytely5mo ago

If we get ocr working it makes it possible to store all human knowledge now stored in PDF's with way less resources

https://annas-archive.org/blog/critical-window.html

giardini5mo ago

This model (DeepSeek-OCR) aligns particularly well with what we know about written language and the human act of reading.

The Visual Word Form Area (VWFA) on the left side of the brain is where the visual representation of words is transformed to something more meaningful to the organism.

https://en.wikipedia.org/wiki/Visual_word_form_area

The DeepSeek-OCR encoding (rather than simple text encoding) appears analogous to what occurs in the VWFA.

This model may not only be more powerful than text-based LLMs but may open the curtain of ignorance that has stymied our understanding of how language works and ergo how we think, what intelligence is precisely, etc.

Kudos to the authors: Haoran Wei, Yaofeng Sun, Yukun Li. You may have tripped over the Rosetta Stone of intelligence itself! Bravo!

empressplay5mo ago

This could be great for extracting text from old magazines; traditional OCR gives you a bit of a mess you have to clean up, but this looks like it can properly identify columns and track the flow accurately (and extract images!) It appears it can convert magazine layouts to markdown too

h14h5mo ago

This feels like another one of those stairstep ML/AI advances that makes computers behave eerily more like humans.

We tend to think in images rather than plaintext, and here we are discovering it's more efficient for a computer to do so as well.

edtechdev5mo ago

I tried this out on huggingface, and it has the same issue as every other multimodal AI OCR option (including MinerU, olmOCR, Gemini, ChatGPT, ...). It ignores pictures, charts, and other visual elements in a document, even though the models are pretty good at describing images and charts by themselves. What this means is that you can't use these tools yet to create fully accessible alternatives to PDFs.

mediaman5mo ago

I have a lot of success asking models such as Gemini to OCR the text, and then to describe any images on the document, including charts. I have it format the sections with XML-ish tags. This also works for tables.

sherlockxu5mo ago

Seems there’s still some confusion around what DeepSeek-OCR really does. Learn about the model, Contexts Optical Compression, and its impact on LLMs here: https://www.bentoml.com/blog/deepseek-ocr-contexts-optical-c...

allanren5mo ago

It says the conversation can reduce size with large compression, which basic make the image blur but still contaim import information.

This is indeed amazing. It's actually how human try to understand and remember things. BY VISUAL! And when memory fade out, the image are getting blurred.

Not sure if those close source multimodal models are already using this method.

neves5mo ago

I see that the project uses conda for development. Is it still a good tool now that pip also install binaries?

modeless5mo ago

No. Everyone should be using uv instead.

schopra9095mo ago

It’s not clear to me what the bottleneck for OCR to “100%” work with LLMS is.

In my work we do a lot of stuff with image understanding and captioning (not OCR). There object identification and description works great, since all the models are using a CLIP like visual backbone. But it falls apart when you ask about nuances like left/right or counting (reasoning kind of improves the latter but it’s too expensive to matter IMO).

For our tasks, it’s clear that there’s more fundamental research that needs to be done on vision understanding to push past CLIP. That would really improve LLMs for our usecases.

Curious if there’s something similar going on for OCR in the vision encoder that’s fundamentally holding it back.

tidbeck5mo ago

How does this compare to https://huggingface.co/ibm-granite/granite-docling-258M in performance and how they work?

bugglebeetle5mo ago

The granite Dockling models are unfortunately quite far below SOTA. dots-ocr and PaddleOCR were best here.

bugglebeetle5mo ago

Looks great, but looking at the benchmark, can’t help but think about how crazy good dots-ocr is as a model. Too bad they’re not as open as the Deepseek team because its so crazy good and would love to know how it was trained.

rfoo5mo ago

If you look you'd notice that it's the same Haoran Wei behind DeepSeek-OCR and GOT-OCR2.0 :p

bugglebeetle5mo ago

Oh you’re right! Good catch!

bethekind5mo ago

Did we read the same graph? DeepSeek Gundam 200 dpi appeared to get similar perf as dots-ocr, but with less tokens needed. The x axis is inverted, descending with distance from the origin.

Thma5mo ago

Is this only at the level of visual compression? For example, are there any applications in terms of understanding (being able to represent the actual meaning it stands for) and reasoning? Technically, it seems to have no connection with current reinforcement learning and other techniques. The model is quite small, yet there appears to be no explanation regarding its understanding capabilities. If it is merely for compression, what impact will it have on the current large models?

piker5mo ago

This looks really cool for prototyping and playing around.

It seems to me though if one is building a modern application that needs to get image segmentation and/or text recognition right there are better APIs available than natural language? It seems like a lot of effort to make a production-scale CV application to weigh it down with all of an LLM’s shortcomings. Not a field I’m familiar with but I would assume that this doesn’t produce state of the art results—that would change the analysis.

CheeseFromLidl5mo ago

As a hobby photographer, I organise everything for speedy retrieval but this would be amazing to search my collection.

randomNumber75mo ago

Imagine you build an image segmentation model for a e.g. specific industrial application.

With this LLM approach you can at least create your training data from the raw images with natural language.

piker5mo ago

That does make sense

2big2fail_475mo ago

I find it interesting that there's all these independent AI-OCR Projects but still no commercial offering. Is it still too inaccurate, too complex or simply too expensive?

Annatar015mo ago

I dont know, but maybe existing commercial OCR is still on top, and also using ML. Recently tried a free trial for OCR/reading Sütterlin and it was a weird feeling being so outclassed in reading.

rsolva5mo ago

Mistral offers their OCR commercially through their API and in their Chat services, at least.

https://mistral.ai/news/mistral-ocr

simlevesque5mo ago

https://cloud.google.com/document-ai

prats2265mo ago

https://docstrange.nanonets.com/ as well, wrapper on top of 7B version of https://huggingface.co/nanonets/Nanonets-OCR2-3B

daemonologist5mo ago

There are commercial OCR offerings from the big cloud providers (plus, like, Adobe). In my experience they generally outperform anything open-weights, although there's been a lot of improvement in VLMs in the past year or two.

aleinin5mo ago

One that I’ve seen recently is https://reducto.ai It appears to be an OCR wrapper.

Eisenstein5mo ago

It is because the AI is not actually doing OCR. It is giving an interpretation of what the text in an image is by ingesting vision tokens and mapping them onto text tokens.

So you either have to be fine with a lot of uncertainty as to the accuracy of that interpretation or you have to wait for an LLM that can do it in a completely reproducible way every time.

CloseChoice5mo ago

It's deepseek so one can expect an open-source license but for anyone (like me) who wants to see that explicitly, since it's not obvious in the GitHub repo: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...

TLDR: It's MIT licensed

AndroTux5mo ago

> since it's not obvious in the GitHub repo

Literally says MIT license on the right sidebar and in the readme tab and in the file called LICENSE

maxloh5mo ago

Model weights are MIT too: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...

vladpowerman5mo ago

The compression framing is super interesting. It makes me wonder if there’s an equivalent notion for source code - like how much “information” or entropy a commit contains vs. boilerplate churn.

I’ve been exploring Git activity analysis recently and ran into similar trade-offs: how do you tokenize real-world code and avoid counting noise?

k_sze5mo ago

It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful ones.

daemonologist5mo ago

I think maybe to distinguish their dynamic resolution approach from the t-shirt sizes, which have a fixed input. (Although I don't know why "Gundam")

mrasong5mo ago

Kinda reminds me of PaddleOCR.

Would be awesome if DeepSeek OCR could be integrated into a mobile app someday. That’d make OCR way more convenient!

pzo5mo ago

iOS already have on device both text detector and document scanner in apple Vision API. Hard to say how good are they compared to LLM based solutions. Similarly google had MLKit with OCR working on devices also for many years.

prats2265mo ago

Top 3 models on huggingface are all OCR models. Most automation projects involve documents where you need a model finetuned to understand all elements inside documents and provide grounding and confidence scores etc which is why these subset of models are gaining popularity

x______________5mo ago

  >先天下之忧而忧

How is this an example of a prompt?

Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."

Can anyone shed some light on this saying or why it's in the article?

raincole5mo ago

It's a very famous (classical) Chinese phrase.

Both translations don't catch the meaning well though. It means: "worry before the rest of the world (notice that they have something to) worry." The next part is 後天下之樂而樂("be happy only after the rest of the world is happy.")

I don't know why it's a prompt example.

jdthedisciple5mo ago

Sibling comment has the second part as

后天下之乐而乐

which one is correct?

2 more replies

SequoiaHope5mo ago

Ask a language model - ChatGPT says it’s a line from a famous poem “Memorial to Yueyang Tower” which expresses the Confucian ideal of selfless concern for people and society.

fspeech5mo ago

Google is closer. This is from a famous essay expressing tbe author's desire to bear the burden for the world. Essay is 岳阳楼记 by 范仲淹 in year 1046 https://zh.wikisource.org/zh-hans/%E5%B2%B3%E9%99%BD%E6%A8%9...

gudzpoz5mo ago

This clause is usually used together with the next sentence in the original poem:

> 先天下之忧而忧，后天下之乐而乐

> (put the world's worries before yours, and put your happiness after the world's) > edit: this translation is wrong, and raincole has a definitely better translation

Since the model is a language model, they probably use this to demonstrate the model's language capabilities – the model should be able to complete the whole sentence pair. The paper also mentions this:

> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.

So I believe it is just a text-only demonstration.

jdthedisciple5mo ago

Sibling comment has the second part as

後天下之樂而樂

Which one is correct?

1 more reply

shepardrtc5mo ago

How does this fair with the Vidore benchmark?

https://huggingface.co/spaces/vidore/vidore-leaderboard

dcl5mo ago

Is there any 'small' OCR models around?

Say I only care about reading serial numbers from photos in a manufacturing process, not whole document parsing. Using a 3B param model to do this seems like a bit of overkill...

ammar_x5mo ago

Language support is not mentioned in the repo. But from the paper, it offers extensive multilingual support (nearly 100 languages) which is good, but I need to test it to see how it compares to Gemini and Mistral OCR.

zacmps5mo ago

I suspect the number of langauges it can do with reasonable accuracy is actually much smaller, probably <15.

hank20005mo ago

Have yall seen tensorlake? I’m curious how this compares to a model custom built for the problem. My guess is it can be as good. But can it be as efficient?

disclaimer: I do not work for tensorlake—but i know the folks behind it.

dlowe245mo ago

The only model that I found so far that extract table data with OCR is dots.ocr. Models that came after it have not done a good job. Interesting on testing this new model.

singularity20015mo ago

Instead of downloading a specific OCR model how would one fare just downloading the currently best multi-modal foundation model? And what would that be at less than 30 GB?

prats2265mo ago

Then you can just download finetuned version of same multi-modal foundation model that's trained on documents?

tinyhouse5mo ago

OCR is not a great name for these models. While they can do traditional OCR such as digitize and scanned PDF for example, they do so much more.

intalentive5mo ago

>they do so much more I'm not familiar. What else are they good for?

tinyhouse5mo ago

They can take something like an image of a graph and provide a description of it. From my understanding, these are multimodal models with reasoning capabilities.

loaderchips5mo ago

Great work guys, how about we replace the global encoder with a Mamba (state-space) vision backbone to eliminate the O(n²) attention bottleneck, enabling linear-complexity encoding of high-resolution documents. Pair this with a non-autoregressive (Non-AR) decoder—such as Mask-Predict or iterative refinement—that generates all output tokens in parallel instead of sequentially. Together, this creates a fully parallelizable vision-to-text pipeline, The combination addresses both major bottlenecks in DeepSeek-OCR.

loaderchips5mo ago

not sure why i m getting downvoted. Would love to have a technical discussion on the validity of my suggestions.

Karen6675mo ago

DeepSeek-OCR revolutionizes document processing by converting text into high-resolution images, achieving up to 20× compression while maintaining impressive accuracy. At a 10× compression ratio, it retains approximately 97% accuracy, and even at 20×, it maintains around 60% accuracy Tom's Hardware .

This approach reduces token usage significantly, making it particularly beneficial for industries like finance, healthcare, and legal sectors. For a comprehensive guide on implementing and utilizing DeepSeek-OCR, you can check https://deepseeksguides.com/deepseek-ocr-guide/

farseer5mo ago

How good is this compared to most commercial OCR software?

ozim5mo ago

Any vision model is better than commercial OCR software.

Etheryte5mo ago

I'm not really sure if that's an accurate summary of the state of the art, [0] is a better overview. In short, SOTA multi-modal LLMs are the best option for handwriting, nearly anything is good at printed text, for printed media, specialty models from hyperscalers are slightly better than multi-modal LLMs.

[0] https://research.aimultiple.com/ocr-accuracy/

1 more reply

dragonwriter5mo ago

Since “commercial OCR software” includes VLM-based commercial offerings, that's clearly not correct.

brightUiso5mo ago

Please a bit of education, what does it do?

joshstrange5mo ago

> [2025/x/x] We release DeepSeek-OCR, a model to investigate the role of vision encoders from an LLM-centric viewpoint.

So close but it should be 2025/X/XX as "X" = 10 in Roman Numerals /s

Jokes aside, this is really neat and I'm looking forward to getting this running. For most OCR-type stuff I just use AWS Textract since I need it so rarely and that service does a decent job. I really like how well this model seems to extract images/figures as well from the original document.

j / k navigate · click thread line to collapse

244 comments

krackers5mo ago

The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

(I guess you could say a picture token is worth 10 textual tokens...)

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

miki1232115mo ago

Text tokens are quantized and represent subword units, vision tokens only exist in the embedding space.

rco87865mo ago

3 more replies

isaacfung5mo ago

Some models use vector quantized variational autoencoders to discretize images into sequences of discrete symbols from a fixed codebook.

https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...

krackers5mo ago

Thank you, this makes sense! As [1] puts it pithily

>Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.

[1] https://twitter.com/c0mbinat0r/status/1980698103234891892 [2] https://twitter.com/Kangwook_Lee/status/1980709454522744902

1 more reply

lubesGordi5mo ago

ttul5mo ago

So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.

1 more reply

jph005mo ago

Actually there are VAEs which use a codebook approach to creating discrete tokens instead of float vectors. There has been some success in that direction in diffusion models for instance.

storus5mo ago

That's not really true, the latest autoregressive image models create a codebook of patches that are then encoded as tokens and image is assembled out of them.

HarHarVeryFunny5mo ago

It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.

hendersoon5mo ago

Exactly right, the OCR isn't the interesting part. 10x context compression is potentially huge. (With caveats, at only ~97% accuracy, so not appropriate for everything.)

runeblaze5mo ago

each text token is often subword unit, but in VLMs the visual tokens are in semantic space. Semantic space obviously compresses much more than subword slices.

disclaimer: not expert, on top of my head

ssivark5mo ago

looobay5mo ago

LLMs are compute heavy with quadratic scaling (in compute) per tokens. They are trying to compress text tokens into visual tokens with their VLM.

Maybe they would render texts to an image before tokenizing to reduce the compute cost.

krackers5mo ago

So I guess my question is where is the juice being squeezed from, why does the vision token representation end up being more efficient than text tokens.

6 more replies

simonw5mo ago

Here's a result I got https://github.com/simonw/research/blob/main/deepseek-ocr-nv... - against this image: https://static.simonwillison.net/static/2025/ft.jpeg

jjcm5mo ago

Looks like this did really solid, with the exception of the paragraph directly below the quote. It hallucinated some filler there and bridge it with the next column.

Thanks for running the test quickly!

djmips5mo ago

throwaway3141555mo ago

> by running Claude Code as root in a new Docker container

How do you get the "as root" part of that to work?

(sorry if it's explained in your article)

simonw5mo ago

Run it on a root account and do:

  IS_SANDBOX=1 claude --dangerously-skip-permissions

1 more reply

CaptainOfCoit5mo ago

arkmm5mo ago

Wow, this deserves its own submission.

ellisd5mo ago

https://annas-archive.org/blog/duxiu-exclusive.html

bluecoconut5mo ago

Previous paper from DeepSeek has mentioned Anna’s Archive.

_vqpz5mo ago

Why do they need to grant access for people to use copies of books they don’t own?

JohnLocke45mo ago

As per the blog post: >What does Anna’s Archive get out of it? Full-text search of the books for its users.

1 more reply

est5mo ago

Ownership laundering.

singularfutur5mo ago

Yes it means they will never release their dataset :(

throawayonthe5mo ago

hahaha also immediately thought of this, wonder when the ocr'd dataset would be getting released

dev1ycan5mo ago

sigmoid105mo ago

Seems like they are doing fine:

https://open-slum.org

1 more reply

c0balt5mo ago

It appears this is an active offer from Anna's archive, so presumably they can handle the load and are able to satisfy the request safely.

breadislove5mo ago

For everyone wondering how good this and other benchmarks are:

- the OmniAI benchmark is bad

- Instead check OmniDocBench[1] out

- Mistral OCR is far far behind most Open Source OCR models and even further behind then Gemini

- End to End OCR is still extremely tricky

- composed pipelines work better (layout detection -> reading order -> OCR every element)

- complex table parsing is still extremely difficult

[1]: https://github.com/opendatalab/OmniDocBench

hakunin5mo ago

wahnfrieden5mo ago

It is unusable trash for languages with any vertical writing such as Japanese. It simply doesn’t work.

1 more reply

graeme5mo ago

Interesting. How do you harness it for that purpose? I've found apple ocr to be very good.

2 more replies

CaptainOfCoit5mo ago

2 more replies

cheema335mo ago

> the OmniAI benchmark is bad

According to Omni OCR benchmark, Omni OCR is the best OCR. I am sure you all will find no issues with these findings.

yoran5mo ago

ozgune5mo ago

OmniAI has a benchmark that companies LLMs to cloud OCR services.

https://getomni.ai/blog/ocr-benchmark (Feb 2025)

Please note that LLMs progressed at a rapid pace since Feb. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use-case.

cheema335mo ago

Omni OCR team says that according to their own benchmark, the best OCR is the Omni OCR. I am quite surprised.

CaptainOfCoit5mo ago

Magistral-Small-2509 is pretty neat as well for its size, has reasoning + multimodality, which helps in some cases where context isn't immediately clear, or there are few missing spots.

daemonologist5mo ago

numpad05mo ago

Or at least that kind of thing would motivate them to re-implement OCR with LLM.

fluoridation5mo ago

Huh... Would it work to have some kind of error checking model that corrected common OCR errors? That seems like it should be relatively easy.

1 more reply

junto5mo ago

stopyellingatme5mo ago

Not sure about the others but we use Azure AI Document Intelligence and its working well for our resume parsing system. Took a good bit of tuning but we havent had to touch it for almost a year now.

make35mo ago

aren't all of these multimodal LLM approaches, just open vs closed ones

sandblast5mo ago

Not sure why you're being downvoted, I'm also curious.

dumpsterkid5mo ago

kamranjon5mo ago

Have you tried moondream yet[1]? The moondream 3 preview model[2], according to the blogpost[3] appears to outperform many frontier models on VLM tasks and does so with a relatively small footprint.

[1] https://moondream.ai/

[2] https://huggingface.co/moondream/moondream3-preview

[3] https://moondream.ai/blog/moondream-3-preview

sampton5mo ago

You can train a cnn to find bounding boxes of text first. Then run VLM on each box.

jmpeax5mo ago

Your customers don't have any handwritten text?

pietz5mo ago

My impression is that OCR is basically solved at this point.

The OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

I'd be interested to hear where OCR still struggles today.

cahaya5mo ago

Curious to hear which OCR/ LLM excels with these specific issues? Example complex table: https://cdn.aviation.bot/complex-tables.zip

CaptainOfCoit5mo ago

> Lot's of OCR/ LLM's (Even Gemini Pro 2.5) still struggle converting complex tables to markdown or HTML:

3 more replies

pietz5mo ago

carschno5mo ago

mormegil5mo ago

pietz5mo ago

1 more reply

sramam5mo ago

Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?

I don't have a use case of 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots and it has worked really well.

1 more reply

raincole5mo ago

If you can accept that the machine just make up what it doesn't recognize instead of saying "I don't know," then yes it's solved.

(I'm not being snarky. It's acceptable in some cases.)

jakewins5mo ago

2 more replies

red75prime5mo ago

Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK(ish).

KoolKat235mo ago

These days it does just that, it'll say null or whatever if you give it the option. When it does make it up, it tends to be limitation of the image qualify ( max dpi).

Blotchy text and specific typeface make 6's look like 8's, even to the non-discerning eye, a human would think it's an 8, zoom in and see it's a 6.

Google's image quality on uploads is still streets ahead of openai for instance btw.

wahnfrieden5mo ago

Do any LLM OCRs give bounding boxes anyway? Per character and per block.

2 more replies

peter-m805mo ago

constantinum5mo ago

I use LLMWhisperer[1] for OCR'ing old magazine ads. It preserves the layout and context. Example > https://postimg.cc/ts3vT7kG

https://pg.llmwhisperer.unstract.com/

pietz5mo ago

Could you provide an example that fails? I'm interested in this.

jmkni5mo ago

do you have an example of a particularly tricky one?

1 more reply

kbumsik5mo ago

> My impression is that OCR is basically solved at this point.

Not really in practice to me. Especially they still struggle with Table format detection.

coulix5mo ago

This.

Any complex parent table span cell relationship still has low accuracy.

Try the reverse, take a complex picture table and ask Chatgpt5, claude Opus 3.1, Gemini Pro 2.5 to produce a HTML table.

They will fail.

2 more replies

richardlblair5mo ago

I had mentioned this when the new QWEN model dropped - I have a stack of construction invoices that fail through both OCR and OpenAI.

It's a hard (and very interesting) problem space.

llm_nerd5mo ago

And GP LLMs are heinous at OCR. If you are having success with FL, your documents must be incredibly simple.

There has been enormous advances in OCR over the past 6 months, so the SoTa is a moving, rapidly advancing target.

Gazoche5mo ago

There is no "solved" in computer vision, there is only "good enough" and what constitutes "good enough" depends on your problem domain.

Take an OCR model with 99.9% character-wise accuracy. Sounds pretty good, right? Well if your use case is, say, digitalizing old printed novels, then yeah it's probably good enough.

But what if your documents are personal records with millions of names, to insert in some administrative database? Now 1 out of 1000 persons will have their name misspelled. Ooops.

themanmaran5mo ago

> OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.

Benchmark author here. No, just pivoted away from OCR API as a product! Still use our API internally but have been lazy about updating benchmarks.

robotswantdata5mo ago

VLLMs suck at complex layouts and there is a high risk of hallucination. Never use alone for contracts or health data.

burpsnard5mo ago

cle5mo ago

To solve this generally you need to chunk not by page, but by semantic chunks that don't exceed the information density threshold of the model, given the task.

veidr5mo ago

Clearly-printed text to a sequence of characters is solved, for use cases that don't require 100% accuracy.

But not for semantic document structure — recognizing that the grammatically incomplete phrase in a larger font is a heading, recognizing subheadings and bullet lists, tables, etc.

Also not for handwritten text, text inside of images (signage and so forth), or damaged source material (old photocopies and scans created in the old days).

vintermann5mo ago

KoolKat235mo ago

I agree, Gemini 2.5 models are excellent.

The fuss around old fashioned OCR seemed strange to me initially considering the above, but I selfishly forgot to consider addressing compute/offline requirements.

It would also be nice for there to be a good competitor.

constantinum5mo ago

Why PDF parsing is Hell[1]:

Fixed layout and lack of semantic structure in PDFs.

Non-linear text flow due to columns, sidebars, or images.

Position-based text without contextual or relational markers.

Absence of standard structure tags (like in HTML).

Scanned or image-based PDFs requiring OCR.

Preprocessing needs for scanned PDFs (noise, rotation, skew).

Extracting tables from unstructured or visually complex layouts.

Multi-column and fancy layouts breaking semantic text order.

Background images and watermarks interfering with text extraction.

Handwritten text recognition challenges.

[1] https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

simlevesque5mo ago

> That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.

Can you explain more about your setup ? I have a quarter million pages I want to OCR.

blindriver5mo ago

6gvONxR4sf7o5mo ago

Davidzheng5mo ago

I think it'll be good to have an end-to-end pdf to latex converter for old math papers. For commutative diagrams almost all models still struggle. especially very complicated commutative diagrams.

darkwater5mo ago

So, the mug with inspirational text says "Bountiful Potential"?

kelvinjps105mo ago

Google vision it's still better than Gemini at OCR, for example at getting bounding boxes.

sbinnee5mo ago

Maybe for English. Other languages are very much not solved.

baobun5mo ago

Chinese, especially handwritten.

rsp19845mo ago

Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why it's a big deal?

intalentive5mo ago

Suppose you have an image with 1000 words in it, and suppose for simplicity that every word is 1 token. Then the image is “worth” 1000 tokens.

Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that we were able to compress the output into a 10x smaller representation.

rsp19845mo ago

Correct?

1 more reply

foofoo125mo ago

How does it compare to Tesseract? https://github.com/tesseract-ocr/tesseract

I use ocrmypdf (which uses Tesseract). Runs locally and is absolutely fantastic. https://ocrmypdf.readthedocs.io/en/latest/

utopiah5mo ago

modeless5mo ago

polytely5mo ago

If we get ocr working it makes it possible to store all human knowledge now stored in PDF's with way less resources

https://annas-archive.org/blog/critical-window.html

giardini5mo ago

This model (DeepSeek-OCR) aligns particularly well with what we know about written language and the human act of reading.

The Visual Word Form Area (VWFA) on the left side of the brain is where the visual representation of words is transformed to something more meaningful to the organism.

https://en.wikipedia.org/wiki/Visual_word_form_area

The DeepSeek-OCR encoding (rather than simple text encoding) appears analogous to what occurs in the VWFA.

Kudos to the authors: Haoran Wei, Yaofeng Sun, Yukun Li. You may have tripped over the Rosetta Stone of intelligence itself! Bravo!

empressplay5mo ago

h14h5mo ago

This feels like another one of those stairstep ML/AI advances that makes computers behave eerily more like humans.

We tend to think in images rather than plaintext, and here we are discovering it's more efficient for a computer to do so as well.

edtechdev5mo ago

mediaman5mo ago

sherlockxu5mo ago

allanren5mo ago

It says the conversation can reduce size with large compression, which basic make the image blur but still contaim import information.

This is indeed amazing. It's actually how human try to understand and remember things. BY VISUAL! And when memory fade out, the image are getting blurred.

Not sure if those close source multimodal models are already using this method.

neves5mo ago

I see that the project uses conda for development. Is it still a good tool now that pip also install binaries?

modeless5mo ago

No. Everyone should be using uv instead.

schopra9095mo ago

It’s not clear to me what the bottleneck for OCR to “100%” work with LLMS is.

For our tasks, it’s clear that there’s more fundamental research that needs to be done on vision understanding to push past CLIP. That would really improve LLMs for our usecases.

Curious if there’s something similar going on for OCR in the vision encoder that’s fundamentally holding it back.

tidbeck5mo ago

How does this compare to https://huggingface.co/ibm-granite/granite-docling-258M in performance and how they work?

bugglebeetle5mo ago

The granite Dockling models are unfortunately quite far below SOTA. dots-ocr and PaddleOCR were best here.

bugglebeetle5mo ago

rfoo5mo ago

If you look you'd notice that it's the same Haoran Wei behind DeepSeek-OCR and GOT-OCR2.0 :p

bugglebeetle5mo ago

Oh you’re right! Good catch!

bethekind5mo ago

Did we read the same graph? DeepSeek Gundam 200 dpi appeared to get similar perf as dots-ocr, but with less tokens needed. The x axis is inverted, descending with distance from the origin.

Thma5mo ago

piker5mo ago

This looks really cool for prototyping and playing around.

CheeseFromLidl5mo ago

As a hobby photographer, I organise everything for speedy retrieval but this would be amazing to search my collection.

randomNumber75mo ago

Imagine you build an image segmentation model for a e.g. specific industrial application.

With this LLM approach you can at least create your training data from the raw images with natural language.

piker5mo ago

That does make sense

2big2fail_475mo ago

I find it interesting that there's all these independent AI-OCR Projects but still no commercial offering. Is it still too inaccurate, too complex or simply too expensive?

Annatar015mo ago

I dont know, but maybe existing commercial OCR is still on top, and also using ML. Recently tried a free trial for OCR/reading Sütterlin and it was a weird feeling being so outclassed in reading.

rsolva5mo ago

Mistral offers their OCR commercially through their API and in their Chat services, at least.

https://mistral.ai/news/mistral-ocr

simlevesque5mo ago

https://cloud.google.com/document-ai

prats2265mo ago

https://docstrange.nanonets.com/ as well, wrapper on top of 7B version of https://huggingface.co/nanonets/Nanonets-OCR2-3B

daemonologist5mo ago

aleinin5mo ago

One that I’ve seen recently is https://reducto.ai It appears to be an OCR wrapper.

Eisenstein5mo ago

It is because the AI is not actually doing OCR. It is giving an interpretation of what the text in an image is by ingesting vision tokens and mapping them onto text tokens.

So you either have to be fine with a lot of uncertainty as to the accuracy of that interpretation or you have to wait for an LLM that can do it in a completely reproducible way every time.

CloseChoice5mo ago

TLDR: It's MIT licensed

AndroTux5mo ago

> since it's not obvious in the GitHub repo

Literally says MIT license on the right sidebar and in the readme tab and in the file called LICENSE

maxloh5mo ago

Model weights are MIT too: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...

vladpowerman5mo ago

I’ve been exploring Git activity analysis recently and ran into similar trade-offs: how do you tokenize real-world code and avoid counting noise?

k_sze5mo ago

It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful ones.

daemonologist5mo ago

I think maybe to distinguish their dynamic resolution approach from the t-shirt sizes, which have a fixed input. (Although I don't know why "Gundam")

mrasong5mo ago

Kinda reminds me of PaddleOCR.

Would be awesome if DeepSeek OCR could be integrated into a mobile app someday. That’d make OCR way more convenient!

pzo5mo ago

prats2265mo ago

x______________5mo ago

  >先天下之忧而忧

How is this an example of a prompt?

Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."

Can anyone shed some light on this saying or why it's in the article?

raincole5mo ago

It's a very famous (classical) Chinese phrase.

I don't know why it's a prompt example.

jdthedisciple5mo ago

Sibling comment has the second part as

后天下之乐而乐

which one is correct?

2 more replies

SequoiaHope5mo ago

Ask a language model - ChatGPT says it’s a line from a famous poem “Memorial to Yueyang Tower” which expresses the Confucian ideal of selfless concern for people and society.

fspeech5mo ago

gudzpoz5mo ago

This clause is usually used together with the next sentence in the original poem:

> 先天下之忧而忧，后天下之乐而乐

> (put the world's worries before yours, and put your happiness after the world's) > edit: this translation is wrong, and raincole has a definitely better translation

> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.

So I believe it is just a text-only demonstration.

jdthedisciple5mo ago

Sibling comment has the second part as

後天下之樂而樂

Which one is correct?

1 more reply

shepardrtc5mo ago

How does this fair with the Vidore benchmark?

https://huggingface.co/spaces/vidore/vidore-leaderboard

dcl5mo ago

Is there any 'small' OCR models around?

Say I only care about reading serial numbers from photos in a manufacturing process, not whole document parsing. Using a 3B param model to do this seems like a bit of overkill...

ammar_x5mo ago

zacmps5mo ago

I suspect the number of langauges it can do with reasonable accuracy is actually much smaller, probably <15.

hank20005mo ago

Have yall seen tensorlake? I’m curious how this compares to a model custom built for the problem. My guess is it can be as good. But can it be as efficient?

disclaimer: I do not work for tensorlake—but i know the folks behind it.

dlowe245mo ago

The only model that I found so far that extract table data with OCR is dots.ocr. Models that came after it have not done a good job. Interesting on testing this new model.

singularity20015mo ago

Instead of downloading a specific OCR model how would one fare just downloading the currently best multi-modal foundation model? And what would that be at less than 30 GB?

prats2265mo ago

Then you can just download finetuned version of same multi-modal foundation model that's trained on documents?

tinyhouse5mo ago

OCR is not a great name for these models. While they can do traditional OCR such as digitize and scanned PDF for example, they do so much more.

intalentive5mo ago

>they do so much more I'm not familiar. What else are they good for?

tinyhouse5mo ago

They can take something like an image of a graph and provide a description of it. From my understanding, these are multimodal models with reasoning capabilities.

loaderchips5mo ago

not sure why i m getting downvoted. Would love to have a technical discussion on the validity of my suggestions.

Karen6675mo ago

farseer5mo ago

How good is this compared to most commercial OCR software?

ozim5mo ago

Any vision model is better than commercial OCR software.

Etheryte5mo ago

[0] https://research.aimultiple.com/ocr-accuracy/

1 more reply

dragonwriter5mo ago

Since “commercial OCR software” includes VLM-based commercial offerings, that's clearly not correct.