
0 points · Sobrino · 19d ago · 0 comments

I worked at an AI (well, ML) consultancy before the ChatGPT moment. I remember a project where we had to process a huge number of documents (country-wide, terabytes of scanned PDFs). We had to set up a pipeline that looked a bit like this:

Download PDF of scan -> Tesseract to get a text layer -> clean it up with a language-specific BERT model -> detect paragraphs of a certain type -> look them up against a database we built of scored similar paragraphs -> make recommendations.
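
A minimal sketch of how stages like these could be chained, assuming nothing about the real implementation. Every function name here is hypothetical: the OCR and cleanup steps are stand-in stubs (no Tesseract or BERT is called), paragraph detection is a plain keyword match, and the database lookup uses `difflib` string similarity from the Python standard library instead of whatever scoring the project used.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# A database entry is (paragraph text, score). The score's meaning is an assumption.
Entry = Tuple[str, float]


def ocr_stub(pdf_bytes: bytes) -> str:
    """Stand-in for the OCR step; a real pipeline would run Tesseract here."""
    return pdf_bytes.decode("utf-8", errors="ignore")


def cleanup_stub(text: str) -> str:
    """Stand-in for the BERT cleanup; this only normalizes whitespace per paragraph."""
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)


def detect_paragraphs(text: str, keyword: str) -> List[str]:
    """Stand-in classifier: keep paragraphs that mention a keyword."""
    return [p for p in text.split("\n\n") if keyword.lower() in p.lower()]


def lookup(paragraph: str, database: List[Entry], top_k: int = 1) -> List[Entry]:
    """Return the top_k database entries most similar to the paragraph."""
    ranked = sorted(
        database,
        key=lambda entry: SequenceMatcher(None, paragraph, entry[0]).ratio(),
        reverse=True,
    )
    return ranked[:top_k]


def run_pipeline(pdf_bytes: bytes, database: List[Entry], keyword: str):
    """Chain the stages: OCR -> cleanup -> detection -> lookup."""
    text = cleanup_stub(ocr_stub(pdf_bytes))
    return [(p, lookup(p, database)) for p in detect_paragraphs(text, keyword)]
```

Each stage is a plain function, so swapping a stub for a real OCR engine or model would not change how the stages are wired together.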

The documents were not standardized: many were historical, handwritten, or had scratched-out text with corrections.

We had student workers spending days labeling the data.

It took us months to get it all working with high accuracy. We were so proud.

Now you can do it all with a prompt and a ChatGPT call.


3 comments · 2 top-level

archagon · 19d ago · 1 in thread

I'm pretty sure that "a ChatGPT call" will happily add or fudge stuff in your scanned PDFs. That sounds like a massive liability.

Sobrino (OP) · 18d ago

It's surprisingly robust and the quality is pretty good with the right prompt and quality gating.
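
The thread does not say what the quality gating consists of. As one hypothetical example, a gate could reject model output that contains words not found in an independent OCR pass of the same page, which targets the "add or fudge stuff" concern directly. The function names and the 10% threshold below are assumptions made for illustration only.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercased word tokens; punctuation is ignored."""
    return re.findall(r"\w+", text.lower())


def unsupported_ratio(extracted: str, source: str) -> float:
    """Fraction of words in the model output that never occur in the source text."""
    source_vocab = set(tokens(source))
    words = tokens(extracted)
    if not words:
        return 0.0
    missing = [w for w in words if w not in source_vocab]
    return len(missing) / len(words)


def passes_gate(extracted: str, source: str, max_unsupported: float = 0.1) -> bool:
    """Accept the output only if few of its words are unsupported (threshold is arbitrary)."""
    return unsupported_ratio(extracted, source) <= max_unsupported
```

A check like this only catches invented words, not reordered or dropped ones, so it would be one gate among several rather than a complete safeguard.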

ok123456 · 19d ago

And now you can do all of that locally with qwen3.6:35b.
