Download pdf of scan -> Tessaract to get a text layer -> Clean it up with a language specific BERT model -> detect paragraphs of a certain type -> Look them up against a database we build with scored similar paragraps -> Do recommendations.
The documents were not standard and a lot of them were historical documents and handwritten or with scratched out text with corrections.
We had student workers spending days labeling the data.
It took us months to get it all working with a high accuracy. We were so proud.
Now you can do it all with a prompt and a ChatGPT call.