- Full-text search on modern-era PDFs (i.e. no need for OCR)
- Exact word search would suffice (fuzzy/contextual search actually is less desirable)
- Cross-platform frontend part that highlights and jumps to the found text within the document. Frontend should be embeddable (i.e. not a SaaS or just standalone UI)
- As lightweight as possible (i.e. no Java, Python or Ruby)
- Long-term oriented stack (i.e. minimum dependencies, ideally promise of compatibility)
I'm looking at Meilisearch or Bleve for indexing/backend, and the Syncfusion Flutter PDF viewer for the frontend, but it still needs a lot of glue code and I would love to explore more options.
Google Pinpoint is pretty cool, and I use it a lot, but there is only a hosted Google version, plus it's too smart (I still can't get it to do exact word search).
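For a sense of how little the exact-word requirement needs, here is a minimal sketch of a page-aware inverted index. It assumes the text has already been extracted per page by some other tool, and that matching should be case-insensitive (my assumption, drop the `lower()` calls if not). The names `build_index` and `search` are made up for illustration and are not taken from Meilisearch or Bleve.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"\w+")


def build_index(docs):
    """Map each word to the (doc_id, page_no, char_offset) places it occurs.

    docs: {doc_id: [page_1_text, page_2_text, ...]} (already-extracted text).
    """
    index = defaultdict(list)
    for doc_id, pages in docs.items():
        for page_no, text in enumerate(pages, start=1):
            for match in WORD_RE.finditer(text):
                # Lowercasing is an assumption: exact word, any letter case.
                index[match.group().lower()].append((doc_id, page_no, match.start()))
    return index


def search(index, word):
    """Return every occurrence of exactly this word (no stemming, no fuzziness)."""
    return index.get(word.lower(), [])


docs = {"manual.pdf": ["The colour wheel.", "Color and colour differ."]}
index = build_index(docs)
colour_hits = search(index, "colour")
color_hits = search(index, "color")
prefix_hits = search(index, "colo")  # prefixes do not match
```

The page number and character offset are what a viewer would need in order to jump to a hit and highlight it. How a given PDF viewer consumes them is a separate question.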
One example of a search engine I've built like this is the one on the Datasette website: https://datasette.io/-/beta?q=fts - I wrote about how that works here: https://simonwillison.net/2020/Dec/19/dogsheep-beta/
- Exact word search - pretty simple. I've focused on more advanced stuff because color vs colour is same same but different. Also just because it's pretty easy since I'm just using pre-defined building blocks, not manually integrating stuff
- Cross platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways so it can't be too tough
- Lightweight? I mean how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing
- Long-term: The notebook I wrote has a few dependencies (each of which has its own), but compared to others they're relatively lightweight.
- Gluing code: I've been using pre-existing building blocks, and writing new Executors (i.e. building blocks) is relatively straightforward, and then scaling them up with shards, replicas, etc is just a parameter away.
I'm more into the search side than the PDF stuff. The PDF side I've had experience with through bitter suffering and torment. Not a fun format to work with (unless you're into sadomasochism).
[0] https://github.com/jina-ai/examples/tree/master/multires-lyr...
Most of my use cases deal with 10-100 small PDF documents, some with 1000-2000, but I don't want the solution to choke on 10GB of huge PDFs (I was just uploading those to Google Pinpoint). So Go or Rust for the backend should be a good fit.
By cross-platform frontend I meant web/iOS/Android/desktop. That probably leaves only Flutter, but I'm looking for plugins other than Syncfusion's to try. I know that sounds like overkill to many people (a website with search would suffice), but I already have cross-platform apps that would benefit from this functionality, and web is a fallback there, not the main option.
I don't have suggestions for you, but I do have a question regarding this point. Why wouldn't Java be considered lightweight? Java literally runs on your SIM card, which is a very bare-bones environment to run something on, I'd probably consider something like that pretty lightweight.
When I think about how much stuff needs to be moved across the CPU/memory/IO bus just to launch a simple "Hello, World" in Java, I just cannot accept it. I do realize that for large programs that overhead is small, but the JVM concept is still something I want to avoid as much as possible. Plus, the sheer scale of the Java SDK and the amount of legacy and complexity behind it exceed my threshold of "avoiding complexity" by orders of magnitude. And the nail in the coffin for my "no Java" stance is, of course, experience with desktop Java applications: consistently the worst UX and performance I've seen among desktop apps in 25 years.
Opening up the correct page? I don't know of any standardized PDF reader that supports that kind of thing. And the format has such a history that even if it were supported (technically by Adobe - don't even get me started on what PDF readers support what formats), there's no guarantee the file itself would even have that cooked in.
Does it work end-to-end with PDF as a data structure, or do we have to use OCR and parse the text first to be able to search it? Really curious.
That said, I'm planning future notebooks where you can perform text-to-image or image-to-image search, integrate OCR, scale it up, serve it, deploy it, etc.
Don't. PDF is a terrible format for storing machine-readable data. You lose a ton of information when you create the PDF, which you then painstakingly have to get back later (if that's even possible).
Agreed on the rest. PDFs don't store machine-readable data. Often just pixelated scanned hot garbage dumpster fire text.
I hate PDFs but have to work with the satanforesaken things. Hence the notebook. It's my little way of trying to give my little PDF-bespoked-hellscape a tiny little glow-up.
I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/
I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.
I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.
[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...
[1] https://github.com/mbafford/textract-cli
1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)
2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images
3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text split across pages can be tricky to reassemble properly. PDFs are a pain in the butt, frankly.
[0] https://hub.jina.ai/executor/78yp7etm
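To give a flavour of the splicing step, here is a rough sketch that joins per-page text and repairs a word hyphenated across a page break. It is a plain-Python heuristic, not the PDFSegmenter or any Jina Executor, and `join_pages` is a hypothetical name. It assumes a trailing hyphen always marks a broken word, which is wrong for genuinely hyphenated compounds like "well-known".

```python
def join_pages(pages):
    """Join per-page text into one string, re-gluing words split by a page break."""
    out = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue  # skip blank pages
        if not out:
            out = page
        elif out.endswith("-"):
            # Assumption: a trailing hyphen is a line-break hyphen, so drop it.
            out = out[:-1] + page
        else:
            out = out + " " + page
    return out


pages = ["The docu-", "ment continues here.", "", "Next page."]
joined = join_pages(pages)
```

Real documents add headers, footers, page numbers and multi-column layouts on top of this, which is where most of the pain comes from.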
I worked on a neural search engine just when deep networks were taking off and we knew that it worked because we had test data that said certain documents were relevant for certain queries so we could compute precision and recall curves. My experience was that if the AUC metric is substantially improved customers really notice the difference.
Very few search vendors do this kind of testing because it is expensive and because enterprise customers seem to care more that there are connectors to 800+ external systems than if the search results are any good.
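As a concrete picture of that kind of testing, here is a minimal sketch that turns one ranked result list plus a set of known-relevant documents into precision/recall points at each cutoff. The function name and the document IDs are made up for illustration. A real evaluation would average over many queries and could then summarize the curve with a single number such as AUC.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """Return (k, precision@k, recall@k) for every cutoff k in the ranking."""
    relevant = set(relevant_ids)
    points = []
    hits = 0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        precision = hits / k
        # Guard against a query with no relevant documents at all.
        recall = hits / len(relevant) if relevant else 0.0
        points.append((k, precision, recall))
    return points


# Hypothetical judgments: d1 and d2 are relevant for this query.
points = precision_recall_points(["d3", "d1", "d7", "d2"], {"d1", "d2"})
```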
The main trouble I see with PDF search is that text extracted from PDF files is full of junk punctuation, including spaces, so if you are trying a bag-of-words search the words are corrupted. Seems to me you could build a neural model that works around the brokenness of PDF, but that isn’t ‘download a model from spacy and pray’; it would be a big job that starts with getting 10 GB+ of PDF text.
Needless to say, working with PDFs makes me want to pull my hair out.
I also ended up writing the SpacySentencizer Executor instead of using a "vanilla" sentencizer. That led to consistent sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would be one sentence, not 5)
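To illustrate the "one sentence, not 5" point without pulling in spaCy, here is a small rule-based stand-in. It is not the SpacySentencizer Executor and not how spaCy works internally. The abbreviation list and both function names are assumptions for the example: a period only ends a sentence if the token before it is neither a single-letter initial nor a known abbreviation.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for the example.
ABBREVIATIONS = {"pg", "mr", "mrs", "dr", "etc", "vs"}


def naive_split(text):
    """Split on every period, the way a 'vanilla' splitter might."""
    return [part for part in text.split(".") if part.strip()]


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace, skipping initials and abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        last = words[-1] if words else ""
        tail = last.split(".")[-1]  # "J.R.R" -> "R", "pg" -> "pg"
        is_initial = len(tail) == 1 and tail.isalpha()
        if is_initial or tail.lower() in ABBREVIATIONS:
            continue  # not a sentence boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


example = "J.R.R. Tolkien turned to pg. 3"
naive = naive_split(example)
better = split_sentences(example)
```

Here `naive` has five pieces and `better` has one. The rules break easily (a sentence ending in "I." is treated as an initial), which is a fair argument for using a trained model instead.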
For testing, Jina allows you to swap out encoders with just a couple of lines of code, so trying different methods out should work just fine.
There is something to be said for roundtripping PDFs from a source you control (you can accurately model the corruption produced by a particular system), but you will certainly see new and different phenomena if you try more.
I'd agree that spacy's sentence segmentation is better than many of the alternatives.
Disclaimer: I’m the founder.
For handwritten/math symbols, I'm sure it wouldn't be too hard to integrate something. The Jina Flow [0] concept makes integrating new Executors [1] pretty easy.
I LOVE the testimonials on the site btw!
How well would this work in a production setting, e.g. when searching over millions of PDFs on arxiv (soon to be tens of millions)? Follow-up: have you tried using a vector database such as Milvus as the key piece of underlying infrastructure to avoid having to implement deletes, failover, scaling, etc? https://zilliz.com/learn/what-is-vector-database
In terms of processing the PDFs and extracting that data, idk. That depends on a lot of factors - e.g. do you need to OCR the PDFs or can you just extract the text directly? Either way, it should be possible to write a module and then easily scale it up (Jina supports shards/replicas). Anyway, lemme know. I'm in talks with folks about this kind of shitshow...uh...use case now.
Jina supports multiple vector database backends, like Weaviate, Qdrant and others. For others (like Milvus), I suggest you ask on the Slack [0] - responses tend to be fast.
- https://medium.com/jina-ai/building-an-ai-powered-pdf-search...
- https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
- https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...
[0] based on typical HN demographics, no assumptions here