Search PDFs with Transformers and Python Notebook

(colab.research.google.com)

132 points · alexcg1 · 3y ago · 51 comments


divan · 3y ago · 8 in thread

Can anyone recommend how to build the following solution?

- Full-text search on modern era PDFs (i.e. no need for OCR)

- Exact word search would suffice (fuzzy/contextual search actually is less desirable)

- Cross-platform frontend part that highlights and jumps to the found text within the document. Frontend should be embeddable (i.e. not a SaaS or just standalone UI)

- As lightweight as possible (i.e. no Java, Python or Ruby)

- Long-term oriented stack (i.e. minimum dependencies, ideally promise of compatibility)

I'm looking at Meilisearch or Bleve for indexing/backend, and Syncfusion Flutter PDF viewer for frontend, but it still needs a lot of gluing code and I would love to explore more options.

Google Pinpoint is pretty cool, and I use it a lot, but there is only a hosted Google version, plus it's too smart (still can't get it to do exact word search).

simonw · 3y ago

If you hadn't ruled out Python I'd be suggesting using Datasette + SQLite FTS - I've been building a whole bunch of different search engines on that (including ones for searching within OCRd PDF files) and the cost to host is trivial, since you just need to run a Python process somewhere with a binary SQLite database file. I usually use Vercel, Cloud Run or Fly for that.

One example of a search engine I've built like this is the one on the Datasette website: https://datasette.io/-/beta?q=fts - I wrote about how that works here: https://simonwillison.net/2020/Dec/19/dogsheep-beta/
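A minimal sketch of the SQLite FTS idea above, using only Python's built-in sqlite3 module rather than Datasette itself. It assumes the local SQLite build includes the FTS5 extension (common, but not guaranteed). The table name `pages`, its columns, and the helper names are made up for illustration.

```python
import sqlite3


def build_index(pages):
    """Index (doc_name, page_number, text) tuples in an in-memory FTS5 table."""
    conn = sqlite3.connect(":memory:")
    # doc and page are stored but not tokenized; only body is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE pages USING fts5(doc UNINDEXED, page UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO pages (doc, page, body) VALUES (?, ?, ?)", pages)
    conn.commit()
    return conn


def search(conn, word):
    """Return (doc, page) pairs whose text contains the exact token `word`."""
    # Wrap the term in double quotes so FTS5 treats it as a literal string,
    # not as query syntax.
    query = '"' + word.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE pages MATCH ? ORDER BY rank", (query,)
    )
    return [(doc, int(page)) for doc, page in rows]


if __name__ == "__main__":
    conn = build_index([
        ("a.pdf", 1, "The colour of the sky"),
        ("a.pdf", 2, "Color theory basics"),
        ("b.pdf", 1, "Colour wheels"),
    ])
    print(search(conn, "colour"))
```

The default tokenizer ignores case but does no stemming or spelling normalization, so "colour" and "color" stay separate, which is the exact-word behaviour asked for at the top of the thread.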

divan · 3y ago

Interesting, thanks! I'll take a look (datasette is amazing).

alexcg1 (OP) · 3y ago

- Modern PDFs - if you wanna extract text and images, then the PDFSegmenter used in my example will work. If tables too, might need some additional jiggery-pokery, but definitely doable. I know other ppl using the same framework (Jina) who've accomplished it.

- Exact word search - pretty simple. I've focused on more advanced stuff because color vs colour is same same but different. Also just because it's pretty easy since I'm just using pre-defined building blocks, not manually integrating stuff

- Cross platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways so it can't be too tough

- Lightweight? I mean how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing

- Long-term: The notebook I wrote has a few dependencies (each of which has its own), but compared to others they're relatively lightweight.

- Gluing code: I've been using pre-existing building blocks, and writing new Executors (i.e. building blocks) is relatively straightforward, and then scaling them up with shards, replicas, etc is just a parameter away.

I'm more into the search side than the PDF stuff. The PDF side I've had experience with through bitter suffering and torment. Not a fun format to work with (unless you're into sado-masochism)

[0] https://github.com/jina-ai/examples/tree/master/multires-lyr...

divan · 3y ago

Thanks for the elaborate answer.

Most of my use cases have to deal with 10-100 small PDF documents, some – 1000-2000, but I don't want the solution to choke on 10GB of huge PDFs (I was just uploading those to Google Pinpoint). So Go or Rust for backend should be a good fit.

By cross-platform frontend I meant web/ios/android/desktop. It's probably only Flutter, but I'm looking for other plugins than Syncfusion's one to try. I know that sounds like overkill for many people (a website with search would suffice), but I already have cross-platform apps that would benefit from this functionality, and web is a fallback there, not the main option.

capableweb · 3y ago

> - As lightweight as possible (i.e. no Java, Python or Ruby)

I don't have suggestions for you, but I do have a question regarding this point. Why wouldn't Java be considered lightweight? Java literally runs on your SIM card, which is a very bare-bones environment to run something on, I'd probably consider something like that pretty lightweight.

divan · 3y ago

Ha, I'm from that generation of developers who have the mental model of what is actually happening on the hardware level when you run the program. Doesn't necessarily mean I overoptimize or think about struct field offsets or cache branching, but I do have this in my mental model and just can't unlearn it.

When I think about how much stuff needs to be moved in cpu/memory/io bus just to launch a simple "Hello, World" in Java - I just cannot accept it. I do realize that for large programs that overhead is small, but still the JVM concept is something I want to avoid as much as possible. Plus the sheer scale of the Java SDK and the amount of legacy and complexity behind it exceeds my threshold of "avoiding complexity" by orders of magnitude. And the nail in the coffin of the "no java" stance is, of course, experience with desktop Java applications. Consistently the worst UX and performance I've seen in 25 years among desktop apps.

snowstormsun · 3y ago

pdfgrep with some formatting to add links that open the correct page?
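A rough sketch of that idea, under two assumptions that are not from the thread: that pdfgrep is run with `-H -n` so each match is printed as `file:page:text`, and that the viewer honours a `#page=N` fragment on the URL (many browser-based viewers do, but support is not universal). The function names are hypothetical.

```python
from pathlib import Path


def parse_match(line):
    """Split one 'file:page:text' line (assumed pdfgrep -H -n output format)."""
    # Filenames that themselves contain ':' would need smarter parsing.
    filename, page, text = line.split(":", 2)
    return filename, int(page), text.strip()


def page_link(pdf_path, page_number):
    """Build a file:// URI that asks the viewer to open a given page."""
    if page_number < 1:
        raise ValueError("PDF page numbers start at 1")
    return f"{Path(pdf_path).resolve().as_uri()}#page={page_number}"


if __name__ == "__main__":
    filename, page, text = parse_match("report.pdf:12: exact word here")
    print(page_link(filename, page), "->", text)
```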

alexcg1 (OP) · 3y ago

Getting the URI of original PDF would be straightforward enough - I could whack that into the code tomorrow with a few lines.

Opening up the correct page? I don't know of any standardized PDF reader that supports that kind of thing. And the format has such a history that even if it were supported (technically by Adobe - don't even get me started on what PDF readers support what formats), there's no guarantee the file itself would even have that cooked in.

shubham_saboo · 3y ago · 7 in thread

Wow, this is a really cool way to build full-fledged search, and in a notebook too!

Does it work end-to-end with PDF as a data structure, or do we have to use OCR and parse the text first to be able to search it? Really curious.

alexcg1 (OP) · 3y ago

The version in the notebook is just for simple text-based PDFs. I wrote some posts on our company blog[1] about the sheer agonies of dealing with PDF as a data format, so wanted to stick with as simple as possible for now.

That said, I'm planning future notebooks where you can perform text-to-image or image-to-image search, integrate OCR, scale it up, serve it, deploy it, etc.

[1] https://medium.com/jina-ai

shubham_saboo · 3y ago

Awesome, will be on the lookout for that!

rahimnathwani · 3y ago

Under the hood, it uses https://github.com/pdfminer/pdfminer.six which expects the text to be stored as text.

alexcg1 (OP) · 3y ago

You mean the PDFSegmenter Executor in the notebook?

spaetzleesser · 3y ago

"PDF as a data structure"

Don't. PDF is a terrible format for storing machine-readable data. You lose a ton of information when you create the PDF, which you then painstakingly have to get back later (if that's even possible)

alexcg1 (OP) · 3y ago

I may have misworded it (if I wrote those words - PDF rots the brain and my memory likewise).

Agreed on the rest. PDFs don't store machine-readable data. Often just pixelated scanned hot garbage dumpster fire text.

I hate PDFs but have to work with the satanforsaken things. Hence the notebook. It's my little way of trying to give my little PDF-bespoked-hellscape a tiny little glow-up.

alexcg1 (OP) · 3y ago

Incidentally Jina Hub [0] has a few OCR Executors [1][2] you could integrate into my notebook (though you'd have to do some rewiring to take images into account since it's a text-based notebook)

[0] https://hub.jina.ai/

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai/executor/78yp7etm

gapovaj742 · 3y ago · 4 in thread

okay but what if my PDF is non parseable? Not sure if Python's any good for that

nicodjimenez · 3y ago

Mathpix PDF search is fully visually powered and does not use underlying PDF metadata, even working on handwriting. It’s a great choice for researchers (especially in STEM) who want to build a searchable archive of PDFs.

simonw · 3y ago

Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.

I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/

ydant · 3y ago

Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.

I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.

I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.

[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...

[1] https://github.com/mbafford/textract-cli

alexcg1 (OP) · 3y ago

In that case I'd use:

1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)

2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images

3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text splitting over pages can be tricky to reassemble properly.

PDFs are a pain in the butt frankly.

[0] https://hub.jina.ai/executor/78yp7etm

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai
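To give a feel for the splicing step, here is a deliberately naive sketch of stitching per-page text back together. It is not what PDFSegmenter or any Jina Executor does; `join_pages` is a hypothetical helper, and its rules (undo a hyphen at a page break, otherwise guess from the last character) are illustrative assumptions only.

```python
def join_pages(pages):
    """Stitch a list of per-page strings into one text using simple heuristics."""
    merged = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue  # skip blank pages
        if not merged:
            merged = text
        elif merged.endswith("-") and text[0].islower():
            # Word hyphenated across the page break: drop the hyphen and join.
            merged = merged[:-1] + text
        elif merged[-1] in ".!?":
            # Previous page ended a sentence: treat the new page as a new block.
            merged += "\n\n" + text
        else:
            # Sentence continues across the page break.
            merged += " " + text
    return merged


if __name__ == "__main__":
    print(join_pages(["The format is hard to re-", "assemble across pages", "and it shows."]))
```

Real documents break these rules quickly (running headers, footnotes, genuinely hyphenated words), which is the tricky part described above.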

PaulHoule · 3y ago · 2 in thread

Does it really work better than a simple tfidf?

I worked on a neural search engine just when deep networks were taking off and we knew that it worked because we had test data that said certain documents were relevant for certain queries so we could compute precision and recall curves. My experience was that if the AUC metric is substantially improved customers really notice the difference.
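For anyone who wants to run this kind of check, a minimal sketch of precision and recall at a cutoff k, given a ranked result list and a set of documents judged relevant for the query. The function name and document ids are made up for illustration.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision, recall) over the top-k results of one query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]   # engine output, best first
    relevant = {"d1", "d2", "d9"}       # human relevance judgments
    for k in (1, 2, 4):
        print(k, precision_recall_at_k(ranked, relevant, k))
```

Sweeping k and plotting the pairs gives the precision/recall curve for one query; averaging over many queries is what makes two engines comparable.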

Very few search vendors do this kind of testing because it is expensive and because enterprise customers seem to care more that there are connectors to 800+ external systems than if the search results are any good.

The main trouble I see with pdf search is that text extracted from pdf files is full of junk punctuation including spaces so if you are trying a bag of words based search the words are corrupted. Seems to me you could build a neural model that works around the brokenness of PDF but that isn’t ‘download a model from spacy and pray’ but would be a big job that starts with getting 10 GB+ of PDF text.
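As a point of comparison, a bare-bones tf-idf baseline in plain Python. This is a sketch under stated assumptions: the tokenizer simply keeps runs of letters and digits (so stray punctuation is dropped, but words split by stray spaces stay broken), the idf uses a smoothed log form, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep only alphanumeric runs, discarding junk punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, docs):
    """Rank doc ids in `docs` (id -> text) by summed tf-idf of the query terms."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, doc_counts in counts.items():
        total = sum(doc_counts.values()) or 1
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for c in counts.values() if term in c)
            if df == 0:
                continue
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
            score += (doc_counts[term] / total) * idf
        scores[doc_id] = score
    # Only return documents that matched at least one query term.
    matched = [d for d in scores if scores[d] > 0]
    return sorted(matched, key=lambda d: scores[d], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "Hobbits ... live in holes. Hobbits!",
        "b": "Dragons hoard ** gold .",
        "c": "A hobbit met a dragon",
    }
    print(tfidf_rank("hobbits", docs))
```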

alexcg1 (OP) · 3y ago

I'll agree that there's quite a bit of junk punctuation in the extracted sentences (and sentence fragments), quite often from short footnotes in the Wiki articles. Getting "good" PDFs with open usage rights was a bit tricky, especially in a super simple PDF format. I ended up PDF-printing from Chrome.

Needless to say, working with PDFs makes me want to pull my hair out.

I also ended up writing the SpacySentencizer Executor instead of using a "vanilla" sentencizer. That led to consistent sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would be one sentence, not 5)
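To illustrate the problem (this is not the SpacySentencizer code, just a stdlib sketch with a hypothetical abbreviation list): splitting on every period shreds that example into five fragments, while a splitter that refuses to break after initials or known abbreviations keeps it whole.

```python
import re

# Hypothetical, far-from-complete abbreviation list for illustration.
ABBREVIATIONS = {"pg.", "p.", "no.", "dr.", "mr.", "mrs.", "vs.", "etc."}


def naive_split(text):
    """Split on every period, which is what breaks names and abbreviations."""
    return [part.strip() for part in text.split(".") if part.strip()]


def _ends_with_abbreviation(sentence):
    last = sentence.split()[-1].lower()
    # Known abbreviation, or dotted initials such as "j.r.r."
    return last in ABBREVIATIONS or re.fullmatch(r"(?:[a-z]\.)+", last) is not None


def split_sentences(text):
    """Split after . ! ? plus whitespace, then undo splits after abbreviations."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for piece in pieces:
        if sentences and _ends_with_abbreviation(sentences[-1]):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


if __name__ == "__main__":
    text = "J.R.R. Tolkien turned to pg. 3. He smiled."
    print(naive_split("J.R.R. Tolkien turned to pg. 3"))
    print(split_sentences(text))
```

The heuristic has its own failure modes (a sentence that really ends in a single letter, for example), which is one reason to lean on a trained segmenter instead.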

For testing, Jina allows you to swap out encoders with just a couple of lines of code, so trying different methods out should work just fine.

PaulHoule · 3y ago

I dunno, you can download a million or so PDFs from arxiv.org and even more from archive.org. They aren't hard to find.

There is something to say for roundtripping PDFs from source you control (you can accurately model the corruption produced by a particular system) but you will certainly see new and different phenomena if you try more.

I'd agree that spacy's sentence segmentation is better than many of the alternatives.

nicodjimenez · 3y ago · 2 in thread

Mathpix Snip also supports PDF search, including for handwritten content, and including math symbols in equations.

Disclaimer: I’m the founder.

alexcg1 (OP) · 3y ago

Oh, nifty! This is more a demo of a PDF search engine that you could (in parts 1 thru x of the series) deploy to an intranet (for internal knowledge search) or internet (for general search), rather than a collaborative tool.

For handwritten/math symbols, I'm sure it wouldn't be too hard to integrate something. The Jina Flow [0] concept makes integrating new Executors [1] pretty easy.

I LOVE the testimonials on the site btw!

[0] https://docs.jina.ai/fundamentals/flow/

[1] https://docs.jina.ai/fundamentals/executor/

ok_computer · 3y ago

Mathpix snip for pdf to Latex is excellent. Thank you for the free tier. It is helpful transcribing pdf math homework sets to use in the solution document without bugging the instructor for their source.

fzliu · 3y ago · 2 in thread

I just tried this on all the papers I downloaded over the past couple months - cool stuff.

How well would this work in a production setting, e.g. when searching over millions of PDFs on arxiv (soon to be tens of millions)? Follow-up: have you tried using a vector database such as Milvus as the key piece of underlying infrastructure to avoid having to implement deletes, failover, scaling, etc? https://zilliz.com/learn/what-is-vector-database

alexcg1 (OP) · 3y ago

In terms of matching embeddings and performing similarity search on text/images - folks are already using the framework (Jina) for that and getting decent results.
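For readers new to the idea, a toy sketch of what similarity search over embeddings boils down to: brute-force cosine similarity over a small in-memory index. This is not Jina's API or any vector database's; the vectors and ids are invented, and real systems use approximate indexes to avoid scanning everything.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, index, top_k=3):
    """Return the ids of the top_k vectors in `index` most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


if __name__ == "__main__":
    index = {"p1": [1.0, 0.0], "p2": [0.6, 0.8], "p3": [0.0, 1.0]}
    print(nearest([1.0, 0.1], index, top_k=2))
```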

In terms of processing the PDFs and extracting that data, idk. That depends on a lot of factors - e.g. do you need to OCR the PDFs or can you just extract text directly? Either way, it should be possible to write a module and then easily scale it up (Jina supports shards/replicas). Anyway, lemme know. I'm in talks with folks about this kind of shitshow...uh...use case now.

Jina supports multiple vector database backends, like Weaviate, Qdrant and others. For others (like Milvus), suggest you ask on the Slack [0] - responses tend to be fast.

[0] https://slack.jina.ai

redskyluan · 3y ago

We should probably try to implement a PDF search demo on top of Milvus.. LOL

CShorten · 3y ago · 2 in thread

Congratulations Alex, super cool!

alexcg1 (OP) · 3y ago

Thanks man!

alexcg1 (OP) · 3y ago

Nice to meet another person in the super-obvious-username club

alexcg1 (OP) · 3y ago · 2 in thread

Wow, this post really took off! If anyone wants to read some of my blog posts on building PDF search engines (and the pain, torment and anguish that it causes) read:

- https://medium.com/jina-ai/building-an-ai-powered-pdf-search...

- https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...

Malp · 3y ago

Great stuff, I went down the rabbit hole of building something similar for synthesizing flash cards + Q/A pairs from textbook PDFs about a year ago, and I would also emphasize that PDF search is a janky nightmare to get within the ballpark of usability :')

alexcg1 (OP) · 3y ago

I feel your pain my brother(?) [0] in suffering. That's why I started simple in the notebook. Even trying to go a little more complex just leads to exponential rabbit holes and footguns.

[0] based on typical HN demographics, no assumptions here

Stampo00 · 3y ago

Pardon me while I go add Optimus Prime to my corporate letterhead.
