
mehrdadn · 6y ago

Can I ask how you parse PDFs? I'm curious both in terms of reading the PDF data (Python library?) and parsing it (regex?)... and do you have to deal with OCR as well?


2 comments · 1 top-level

haberman · 6y ago

I use "pdftotext -layout" and then parse that. Here is some more info from people who have tried this approach:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
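A rough sketch of what "run `pdftotext -layout`, then parse that" could look like in Python. It assumes the Poppler `pdftotext` binary is on the PATH, and it assumes the layout-preserved text is tabular enough that columns can be split on runs of two or more spaces. The function names (`pdf_to_layout_text`, `split_columns`, `parse_rows`) are made up for illustration and are not from the comment above.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` on a PDF and return the extracted text.

    Assumes Poppler's pdftotext is installed; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_columns(line):
    """Split one layout-preserved line into cells on runs of 2+ whitespace."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_rows(text):
    """Hypothetical parser: one list of cells per non-blank line."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]
```

Usage would be something like `parse_rows(pdf_to_layout_text("report.pdf"))`, where `report.pdf` is a placeholder path. Real documents usually need format-specific rules on top of this, since single spaces inside a cell survive but narrow column gaps may not. `pdftotext` only extracts text already embedded in the PDF, so scanned pages would need a separate OCR step first.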

mehrdadn (OP) · 6y ago

Thanks!
