I think there is a misunderstanding here. Sure, if you get reasonably structured, machine-generated PDFs, parsing them might be feasible.
But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at position x,y with size width,height.
So, as the parent says, you have to run OCR/AI on that image anyway, and it seems that's also a feasible approach for "real" PDFs.
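
To make the point concrete, here is a rough sketch of the routing decision a pipeline ends up making. It is not tied to any real PDF library: `Page`, `ImageBlock`, `looks_scanned`, `route`, and the 0.8 coverage threshold are all made up for illustration, and I'm assuming some extraction step has already given you the text spans and image placements per page.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ImageBlock:
    # Hypothetical record: what a renderer reports for an embedded image.
    x: float
    y: float
    width: float
    height: float


@dataclass
class Page:
    # Hypothetical record for one already-extracted page.
    width: float
    height: float
    text_spans: list[str] = field(default_factory=list)
    images: list[ImageBlock] = field(default_factory=list)


def looks_scanned(page: Page, min_coverage: float = 0.8) -> bool:
    """Guess whether a page is just a scan: no text, one big image.

    The 0.8 threshold is an arbitrary assumption, not a standard value.
    """
    if any(span.strip() for span in page.text_spans):
        return False  # there is real text to parse
    page_area = page.width * page.height
    if page_area <= 0:
        return False
    largest = max((img.width * img.height for img in page.images), default=0.0)
    return largest / page_area >= min_coverage


def route(page: Page) -> str:
    """Return "ocr" for scan-like pages, "parse" for everything else."""
    return "ocr" if looks_scanned(page) else "parse"


if __name__ == "__main__":
    scanned = Page(612, 792, images=[ImageBlock(0, 0, 612, 792)])
    digital = Page(612, 792, text_spans=["Invoice 42"])
    print(route(scanned), route(digital))
```

Note that a scan with an embedded OCR text layer would be routed to "parse" by this heuristic, so it only shows the shape of the decision. Either way, the "ocr" branch has to exist, which is the point: once you have it, you can send the "real" PDFs down the same path.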