undefined | Better HN

0 pointsyigitkonur351y ago0 comments

You've got a point, but try testing it on a tricky example like the Apollo 17 document - you know, with those sideways tables and old-school writing. You'll see all three non-AI services totally bomb. Now, if you tweak it to batch = 1 instead of 10, you'll notice there's hardly any made-up stuff. When you dial down the temperature close to zero, it's super unlikely to see hallucinations with limited context. At worst, you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems. Let's face it, regular OCR already messes up so much that...

0 comments

hnlmorg1y ago

AWS Textract does use ML and I’ve personally used it to parse tables for automated invoice processing.

You wouldn’t get a markdown document automatically generated (or at least you couldn’t when I last used it a few years ago) but you did get an XML document

That XML document was actually better for our purposes because it gives you a confidence score and is properly structured, so floating frame, tables and columns would be properly structured in the output document. This reduces the risk of hallucinations.

It’s less of an out-of-the-box solution but that’s to be expected with AWS APIs.

rescbr1y ago

For a similar use case I’m using Azure Document AI - at least you can ask for markdown/html output directly from it instead of parsing the output structure from Textract.

And it’s cheaper too.

malcolmhere1y ago

You can get Markdown nowadays too, at least using this Python wrapper:

https://aws-samples.github.io/amazon-textract-textractor/not...

It's very consistent, though pricey.

Propelloni1y ago

> you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems

Unless it is. We have a few hundred PDF per month (mostly tables) where we need 100% accuracy. Currently we feed them into an OCR and have humans check the result. I do not win anything if I have to check the LLM output, too.

llm_trw1y ago

I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?

rescbr1y ago

At least for my use case, which is Layout processing (i.e. must output tables in some kind of table format), the OCR part (Azure Document AI or AWS Textract) dominates the cost factor.

Running OCR on a document is twice more expensive than processing the output on the most expensive GPT offering. Intuitively, this was kind of unexpected for me. Only when I did some calculations on Excel that I realized it.

If you’re able to halve the pricing for Layout output then you’re unblocking lots of use cases out there.

Propelloni1y ago

> I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?

I guess anything up to 5 ¢ per page would be acceptable. But I'm afraid my company wouldn't be a customer. We are in Germany and we deal with particularly protected private data, there is no chance that we would exfiltrate this data to a cloud service.

llm_trw1y ago

What's the total spend per quarter? For a margin that fat I'd be willing to jump through a lot of hoops if you're doing enough pages.

The models (currently) fit in 24gb vram sequentially with small enough batch sizes, so a local server with consumer grade gpus wouldn't be impossible.

1 more reply

Lerc1y ago

I guess it depends on the use case, but if it surpasses the error rate that exists in the source document then it would be difficult to argue against.

Specific things like evidentiary use would want 100% but that's at a level where any document processing would be suspect.

What is the the typical range for error rate in PDF generation in various fields? Even robust technical documents have the occasional typo.

llm_trw1y ago

I'm not using generative models to fill in details not present in the original document. If there's a typo there then there will be a typo in the transcript. If you want to fix that then you can run another model on top of it.

2 more replies

refulgentis1y ago

You and your delegated writing are lying about cost.

It is off by 2 orders of magnitude.

My guess is you're using the token counting algorithm for pre-4o with the costs for 4o and later.

That aside, I strongly suggest taking a week off from code-outside-work and use that time to reflect-as-work. The post and ensuing comments are a horror show. Don't take it too hard, it probably won't matter in the long run, no ones going to remember.

But you'd get a lot out of taking it harder than you did in the comments I've seen, including one this morning where you replied to me. It worries me that you don't seem to understand how sloppy this work is.

When I was 14, my math teacher gave me a 0 on a test because I just wrote the answers instead of showing work. That gave me a powerful appreciation for being precise, clear, and accurate.

The only positive outcome is that even though there was enough upvotes for a simple, sloppy, mispurposed GPT wrapper to end up on the front page for ~16 hours, near-universally, the comments seem to understand contextually there's a lot of problems with how this was shared.

authorfly1y ago

I will say I have had a look at your code here. I really do value your innovation here in gaining better accuracy, but I don't think it's is much more accurate for obscure PDF cases - Maybe it halves those obscure errors. I found it still hallucinated or failed to parse some text (e.g. that unusual languages, screenshots with tiny blurred JPEG text, images/shapes remain hallucination issues with your solution). BTW I noticed a small typo "Convert document as is be creative to use markdown effectively" in the prompt. For me changing this and adding text about returning "None" if the text is unreadable reduced hallucinations.

Would you contrast your accuracy with Textract? Because Textract is 10x cheaper than this at approx 1 cent per page (and 20x cheaper than Cloudconvert). What documents make more sense to use with your tool? Is it worth waiting till gpt-4o costs drop 10x with the same quality level (i.e. not gpt-4o-mini) to use this? In my use case it's better to drop than to hallucinate.

What do you think makes sense in relation to Textract?

fzysingularity1y ago

Re: obscure PDFs, I’d love to see a PDF dataset with a whole bunch of these from different domains.

I think in general it’s very hard to say if any approach is “good enough” until you see some serious degree of variability in the input domain.

llm_trw1y ago

That is not a tricky example. Those tables are as clear cut as clear cut can be.

j / k navigate · click thread line to collapse

0 comments

hnlmorg1y ago

AWS Textract does use ML and I’ve personally used it to parse tables for automated invoice processing.

You wouldn’t get a markdown document automatically generated (or at least you couldn’t when I last used it a few years ago) but you did get an XML document

It’s less of an out-of-the-box solution but that’s to be expected with AWS APIs.

rescbr1y ago

For a similar use case I’m using Azure Document AI - at least you can ask for markdown/html output directly from it instead of parsing the output structure from Textract.

And it’s cheaper too.

malcolmhere1y ago

You can get Markdown nowadays too, at least using this Python wrapper:

https://aws-samples.github.io/amazon-textract-textractor/not...

It's very consistent, though pricey.

Propelloni1y ago

> you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems

llm_trw1y ago

I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?

rescbr1y ago

At least for my use case, which is Layout processing (i.e. must output tables in some kind of table format), the OCR part (Azure Document AI or AWS Textract) dominates the cost factor.

If you’re able to halve the pricing for Layout output then you’re unblocking lots of use cases out there.

Propelloni1y ago

> I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?

llm_trw1y ago

What's the total spend per quarter? For a margin that fat I'd be willing to jump through a lot of hoops if you're doing enough pages.

The models (currently) fit in 24gb vram sequentially with small enough batch sizes, so a local server with consumer grade gpus wouldn't be impossible.

1 more reply

Lerc1y ago

I guess it depends on the use case, but if it surpasses the error rate that exists in the source document then it would be difficult to argue against.

Specific things like evidentiary use would want 100% but that's at a level where any document processing would be suspect.

What is the the typical range for error rate in PDF generation in various fields? Even robust technical documents have the occasional typo.

llm_trw1y ago

2 more replies

refulgentis1y ago

You and your delegated writing are lying about cost.

It is off by 2 orders of magnitude.

My guess is you're using the token counting algorithm for pre-4o with the costs for 4o and later.

When I was 14, my math teacher gave me a 0 on a test because I just wrote the answers instead of showing work. That gave me a powerful appreciation for being precise, clear, and accurate.

authorfly1y ago

What do you think makes sense in relation to Textract?

fzysingularity1y ago

Re: obscure PDFs, I’d love to see a PDF dataset with a whole bunch of these from different domains.

I think in general it’s very hard to say if any approach is “good enough” until you see some serious degree of variability in the input domain.

llm_trw1y ago

That is not a tricky example. Those tables are as clear cut as clear cut can be.

j / k navigate · click thread line to collapse