Unverified: What Practitioners Post About OCR, Agents, and Tables

(idp-software.com)

30 points · chelm · 2mo ago · 28 comments

18 comments · 7 top-level

bonsai_spool · 2mo ago · 6 in thread

Please write in your own words! I’m not inclined to read something if it consists of what you copied and pasted from Claude.

ikidd · 2mo ago

This reads less like LLM output than it does someone just transcribing their brief notes as they did their research. Lot of missing subject nouns, which is not something I'd expect to see from AI output.

bonsai_spool · 2mo ago

You can ask an LLM to write in a different voice—they don't all sound exactly the same, though this one is no different than other examples.

When I use an LLM, it tries to sound like me but there are still tendencies it falls back on, especially when the context window begins to expand.

The 'missing subject nouns' is probably the LLM's way of sounding like an authoritative source in a technical field since many programmers like to write that way.

obsidianbases1 · 2mo ago

Interesting complaint, because many might not share any of their ideas if it weren't for LLMs making it easy. Not everyone has the incentive to dedicate a day to producing writing worth publishing. But maybe they would if it took significantly less time.

Even considering HN's no-LLMs-for-comments rule, which I mostly agree with, I think we would all lose if the same rule were applied to publishing in general.

curtisf · 2mo ago

"I would rather read the prompt"

https://claytonwramsey.com/blog/prompt/

discussion: https://news.ycombinator.com/item?id=43888803

All of the output beyond the prompt contains, definitionally, essentially no useful information. Unless it's being used to translate from one human language to another, you're wasting your reader's time and energy in exchange for your own. If you have useful ideas, share them, and if you believe in the age of LLMs, be less afraid of them being unpolished and simply ask your readers to rely on their preferred tools to piece through it.

chelm (OP) · 2mo ago

Did you read the article?

bonsai_spool · 2mo ago

> Did you read the article?

How else do you think I would have come to write this comment? I got to the second major heading before realizing that there is little human input in this document.

I use LLMs, but I will never impose Claude's intellectual musings on another person as some sort of intellectual insight.

This is about the same as copying someone else's homework and then presenting the copied work as an example of deep brilliance. The copying isn't great, but the boasting is absurd. Who are we trying to con?

quinndupont · 2mo ago · 1 in thread

Very helpful analysis that confirms everything I’ve encountered. OCR remains a thorny issue. The author talks about professional workflows struggling with tables and such, but I’ve found it challenging to get clean copies of long documents (books). The hybrid workflow (layout then OCR) sounds promising.

chelm (OP) · 2mo ago

I found only a few that correct OCR output by using an LLM. I think it feels too risky.

Think of an LLM that corrects 898,00 to 888,00. It feels like the David Kriesel Xerox case. Still, it's an interesting way to think of the issue of optical character recognition.
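To make the 898,00 to 888,00 worry concrete, here is a minimal sketch of one possible guard: accept an LLM's correction only when it leaves every digit of the OCR reading untouched. The function names are hypothetical and nothing in the thread says anyone uses this exact check; it only illustrates the idea that spelling fixes and number changes can be treated differently.

```python
import re


def digits_of(text: str) -> str:
    """Return only the digit characters of text, in their original order."""
    return re.sub(r"\D", "", text)


def accept_correction(ocr_text: str, llm_text: str) -> str:
    """Keep the LLM's correction only if it changes no digits.

    Hypothetical guard: if the digit sequence differs, fall back to the
    raw OCR reading so a human can review it instead.
    """
    if digits_of(ocr_text) == digits_of(llm_text):
        return llm_text
    return ocr_text


# A spelling fix passes; a silently changed amount does not.
print(accept_correction("Totl: 898,00", "Total: 898,00"))
print(accept_correction("Totl: 898,00", "Total: 888,00"))
```

A guard like this cannot tell which reading is right; it only stops the LLM from being the one that changes a number.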

ChrisKnott · 2mo ago · 1 in thread

Is there a SOTA OCR model that prioritises failing in a debuggable way?

What I want is an output that records which sections of the image have contributed to each word/letter, preferably with per word confidence levels and user correctable identification information.

I should be able to build a UI to say: no, this section is red-on-green vertically aligned Cyrillic characters; try again.

chelm (OP) · 2mo ago

The relevant term is "bounding box", as you probably need the confidence level of a character or word, not just the image. I built such an interface. I think the effort is only worth it if you really have multi-millions of pages.

Niels has recently posted a lot about other OCR engines: https://www.linkedin.com/posts/niels-rogge-a3b7a3127_lots-of...
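As a rough illustration of the kind of output discussed here, the sketch below keeps a bounding box and a confidence value for every recognized word and lists the weakest words first, which is what a review UI would need. The `OcrWord` type, the field layout, and the 0.8 threshold are assumptions made for this example, not the format of any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    confidence: float  # assumed range 0.0 to 1.0, as reported by the engine
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def words_to_review(words: list[OcrWord], threshold: float = 0.8) -> list[OcrWord]:
    """Return words below the confidence threshold, least confident first."""
    flagged = [w for w in words if w.confidence < threshold]
    return sorted(flagged, key=lambda w: w.confidence)


page = [
    OcrWord("Invoice", 0.98, (10, 10, 90, 30)),
    OcrWord("898,00", 0.55, (200, 10, 260, 30)),
    OcrWord("EUR", 0.75, (270, 10, 300, 30)),
]
for word in words_to_review(page):
    # The box tells a UI which image region to highlight for correction.
    print(word.text, word.confidence, word.box)
```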

ikidd · 2mo ago · 1 in thread

Funnily enough, I was processing some handwritten tables into Excel with Sonnet. It did way better than I thought it would, I'd say like 95%.

I did have it put confidence indexes next to the output per line, and that was pretty useless: they were either really high or really low, and the confidence didn't match the mistakes at all.

chelm (OP) · 2mo ago

IMHO LLMs cannot provide statistically sound confidence measures, and they are terrible at pretending to be capable of doing so.

What worked: You use an OCR engine that provides character/word-level bounding boxes and let the LLM extract from that data. Then the LLM is capable of "calculating" a confidence for the extracted data.
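One way to read "calculating" a confidence, sketched here as plain code rather than as LLM behavior: score an extracted value by the weakest OCR word that supports it, and return nothing when a token never appeared in the OCR output at all. The `field_confidence` helper and the use of the minimum are assumptions for illustration; the comment does not describe the actual method.

```python
from typing import Optional


def field_confidence(value: str, ocr_words: list[tuple[str, float]]) -> Optional[float]:
    """Score an extracted value by its least confident supporting OCR word.

    ocr_words holds (text, confidence) pairs from the OCR engine.
    Returns None when any token of the value is missing from the OCR
    output, since the extraction then has no evidence behind it.
    """
    scores = []
    for token in value.split():
        matches = [conf for text, conf in ocr_words if text == token]
        if not matches:
            return None
        scores.append(max(matches))  # best-supported occurrence of this token
    return min(scores) if scores else None


ocr = [("Total", 0.99), ("898,00", 0.62), ("EUR", 0.95)]
print(field_confidence("898,00 EUR", ocr))
print(field_confidence("888,00 EUR", ocr))
```

The second call returns None because "888,00" was never read by the OCR engine, which is the kind of mismatch a self-reported LLM confidence would not reveal.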

bobajeff · 2mo ago · 1 in thread

It's very surprising to me that the state of the art tools for data entry and digitizing still require a lot of supervision. From the article it's not that surprising that handwritten documents are harder for old-school OCR or AI as that can be hard even for humans in some cases. But tables and different layouts seem like low hanging fruit for vision models.

chelm (OP) · 2mo ago

Speaking of "state of the art tools": the tools people describe might be 6 months or 20 years old. Surfaced opinions might rely on software that a company licensed 2 years ago. Sadly, we need to take this enterprise speed of adoption into account.

adam-badar · 2mo ago · 1 in thread

working with continuous OCR capture across 3 monitors using screenpipe. at 1.2fps you get usable text extraction but use 600mb-2gb ram.

biggest issue is OCR can't distinguish directionality - i.e. whether someone messages you or you type "let's cancel the meeting", the text is identical but the intent isn't

chelm (OP) · 2mo ago

You scrape your screen continuously and OCR it? Never heard of this use case.

jgalt212 · 2mo ago

> The Demo Works. Production Does Not.

Truer words have never been spoken. LLMs make mind-blowing demos, but real-world performance is much weaker (though still useful).

An example from yesterday:

I asked Google / Nano Banana to repaint my house with a few options. It gave a nice write-up on three themes and a nice rendering with 1/3 vertical slices of each theme in one image.

Then, I asked it to redraw the image entirely in one of the themes. It redrew the image 1/3 in the one theme I asked for and 2/3 in a theme I did not ask for. Further prompting did not fix it. At the end of the day, this was a useful exercise and I was able to get some sense of what color scheme would work better for my house, but the level of execution was miles away from the perfection portrayed in demos and by hypester / huckster bloggers and VCs.
