Ask HN: How to Structure Gnarly PDFs
I'm trying to compile a time series of publicly listed stocks stretching back to 2005. I'm doing this by parsing the semi-annual reports (N-CSR filings) from a mutual fund complex that includes a large index fund (VTI). The reports are HTML, and the format varies a lot from year to year. Each one renders to 500 PDF pages.
I initially tried passing the full PDF to the well-known parsing platforms, without much luck. I then manually located the holdings tables I'm interested in (50 of the 500 pages in each PDF) and ran the same platforms on just those pages, again without much luck.
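Since the filings start out as HTML, one option would be to skip the PDF render and read the tables straight from the markup. Below is a minimal sketch of that idea, assuming the holdings sit in ordinary `<table>` rows. The sample markup, the column names, and the `TableRowParser` / `extract_rows` helpers are made up for illustration and are not taken from an actual filing.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the markup layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self._row):  # drop spacer rows whose cells are all empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return all non-empty table rows found in html_text."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical sample; real filings will differ and change across years.
SAMPLE = """
<table>
  <tr><th>Security</th><th>Shares</th><th>Market Value ($000)</th></tr>
  <tr><td>Example Corp.</td><td>1,000</td><td>50</td></tr>
  <tr><td></td><td></td><td></td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

This only flattens rows into lists of strings. Telling holdings tables apart from the other tables in a filing, and handling layout changes between years, would still need per-format rules on top.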
Any advice from the community?