Building A Full-Text Index In JavaScript

(garysieling.com)

70 points · olivernn · 13y ago · 11 comments

9 comments · 3 top-level

knowtheory · 13y ago · 4 in thread

This is pretty cool, but the fundamental problem is still that you (or someone else) have to load an entire PDF (or set of PDFs) before you can use the full-text index to search it.

If you're running a service (say, like DocumentCloud), you're way better off precomputing a full-text index on ingest and providing a search API than shunting substantial parts of your stored documents over to the client.

Definitely cool as a piece of gear, but not terribly practical from a client-side perspective, I'd think.
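To make the "precompute on ingest, then serve searches" idea concrete, here is a minimal, dependency-free sketch of an inverted index. It is not how DocumentCloud or lunr.js actually work; the names `createIndex`, `ingest`, and `search` are hypothetical, and a real service would add stemming, ranking, and persistent storage.

```javascript
// Hypothetical helper: split text into lowercase alphanumeric terms.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Hypothetical in-memory index: term -> Set of document ids.
function createIndex() {
  return { postings: new Map() };
}

// Called once per document at ingest time, so searches never re-read the text.
function ingest(index, id, text) {
  for (const term of tokenize(text)) {
    if (!index.postings.has(term)) index.postings.set(term, new Set());
    index.postings.get(term).add(id);
  }
}

// Return the sorted ids of documents that contain every query term.
function search(index, query) {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let result = null;
  for (const term of terms) {
    const ids = index.postings.get(term) || new Set();
    result = result === null
      ? new Set(ids)
      : new Set([...result].filter((id) => ids.has(id)));
  }
  return [...result].sort();
}
```

A search endpoint would then only need to call `search(index, query)` and return the matching ids, rather than shipping the documents themselves to the browser.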

garysieling · 13y ago

Yes, that is certainly true. The other issue I see with the technique is that if I tried to scale this, I'd probably hit some maturity issues with these libraries.

For what it's worth, it looks like DocumentCloud uses Open Calais, which is a Thomson Reuters product. I used to work there in a different division; they have a bunch of interesting products in this space.

knowtheory · 13y ago

Oh neat, what'd you do at Thomson Reuters?

I notice your blog is filled with NLP-related goodies. I've been meaning to screw around with the Stanford NER lib to see if I can train up some useful custom recognizers for particular document domains.

xaritas · 13y ago

> Definitely cool as a piece of gear, but not terribly practical from a client-side perspective, I'd think.

Perhaps for PDFs that are proprietary or sensitive. A related use case is transformation and extraction. I used this same technique recently for a client to turn VB6-generated PDF reports into HTML tables for preview, and to send the actual data to a service endpoint as JSON.

arafalov · 13y ago

Lunr.js does provide a way to pre-compile the index on the server side. Check out the discussion and in-progress implementation of pre-compilation for Jekyll: https://github.com/olivernn/lunr.js/issues/26
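The general shape of pre-compilation can be sketched without any library: build the index once in a build step, serialize it to JSON, and have the browser load that JSON instead of re-indexing. This is only an illustration of the pattern, not lunr.js's API or its serialization format; `buildIndex`, `serializeIndex`, `loadIndex`, and `lookup` are hypothetical names, and document ids are assumed to be unique.

```javascript
// Build step (e.g. run at site-generation time): term -> array of document ids.
function buildIndex(docs) {
  const postings = new Map();
  for (const { id, text } of docs) {
    // De-duplicate terms per document so each id is recorded once per term.
    const terms = new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
    for (const term of terms) {
      if (!postings.has(term)) postings.set(term, []);
      postings.get(term).push(id);
    }
  }
  return postings;
}

// Serialize the Map as an array of [term, ids] pairs so it survives JSON.
function serializeIndex(postings) {
  return JSON.stringify([...postings]);
}

// Client side: rebuild the Map from the shipped JSON, with no re-indexing.
function loadIndex(json) {
  return new Map(JSON.parse(json));
}

// Look up a single term in a loaded index.
function lookup(postings, term) {
  return postings.get(term.toLowerCase()) || [];
}
```

The trade-off is the one discussed upthread: the client downloads the serialized index rather than the documents, which helps only if the index is meaningfully smaller than the content it covers.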

Ygg2 · 13y ago · 1 in thread

Now all we need is for someone to port a LibreOffice editor to JavaScript :)

garysieling · 13y ago

Yeah, the thought crossed my mind. There are enough people trying to make online products like Google Docs, or places to post PowerPoint presentations, that it may have already happened somewhere internally. Or maybe everyone is just using LibreOffice/Muhimbi and doing it all on the server.

binarymax · 13y ago · 1 in thread

lunr.js looks pretty nice and seems very useful for tiny browser-based stuff. For something a bit more heavyweight, I've used natural node [1], which is quite good, though it's not available in the browser.

[1] https://github.com/NaturalNode/natural

garysieling · 13y ago

That one looks neat - it has some interesting NLP features like WordNet integration and Bayes classification.
