Ask HN: Is there anything similar to DocumentCloud for non-journalists?

1 point · bglenn09 · 14y ago · 1 comment

I'm looking for a way to store PDF documents in the cloud and search them. DocumentCloud looks perfect, but I'm not in journalism, and I'm having trouble finding an obvious alternative. Do you guys like any other services, or know of a simple way to do this with NoSQL? I was looking into using a hosted MongoDB service, but I can't find any information on searching binary data. Thanks for any pointers.


Skywing · 14y ago

You're not going to be able to simply upload a PDF and search the raw file data for text. The text inside a PDF is usually compressed or encoded, so the raw bytes aren't readable. You'll have to either use a tool to extract the embedded text, or run OCR on the document if it's image-only. One tool I've used before and liked is Aspose. If you're letting users upload these PDFs, you'll also need some sort of distributed task queue, because the PDF processing is not something you want the user to have to wait on. I've used RabbitMQ for this and haven't had many issues. Once you've extracted the text (or OCR'd the document), you can store the text alongside the original file in a database like MongoDB. You'd probably also benefit from a full-text search engine like Elasticsearch.
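To make the shape of that pipeline concrete, here's a rough, standard-library-only sketch: uploads go onto a queue, a background worker extracts the text, and searches run against an index of the extracted text rather than the raw bytes. Everything in it is a stand-in I made up for illustration. `queue.Queue` plus a thread plays the role of a task queue like RabbitMQ, the in-memory inverted index plays the role of a search engine like Elasticsearch, and `extract_text` is a hypothetical placeholder that just returns text already attached to the document (a real version would call a PDF text extractor or an OCR tool).

```python
import queue
import re
import threading
from collections import defaultdict

# Inverted index: token -> set of document ids. A toy stand-in for a real
# full-text engine; it only supports exact-token AND queries.
index = defaultdict(set)
index_lock = threading.Lock()


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_text(doc):
    """Hypothetical placeholder for the slow step (text extraction or OCR).

    Assumption: each doc is a dict that already carries its text under
    "text". A real pipeline would read the PDF file here instead.
    """
    return doc["text"]


def worker(jobs):
    """Pull documents off the queue and index them in the background."""
    while True:
        doc = jobs.get()
        if doc is None:  # sentinel value: no more work
            jobs.task_done()
            break
        tokens = tokenize(extract_text(doc))
        with index_lock:
            for token in tokens:
                index[token].add(doc["id"])
        jobs.task_done()


def index_documents(docs):
    """Queue the documents and block until the worker has indexed them all.

    A web app would return to the user right after queuing; the join()
    is only here so the sketch is deterministic.
    """
    jobs = queue.Queue()
    threading.Thread(target=worker, args=(jobs,), daemon=True).start()
    for doc in docs:
        jobs.put(doc)
    jobs.put(None)
    jobs.join()


def search(query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    with index_lock:
        matches = [index.get(token, set()) for token in tokens]
        return set.intersection(*matches)
```

Calling `index_documents([{"id": "a.pdf", "text": "..."}])` and then `search("some words")` gives back the set of matching document ids. The point of the design is that the search never touches the PDF itself, only the text pulled out of it ahead of time.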
