LinkedIn open sources IndexTank: search engine and service

(engineering.linkedin.com)

198 points · wvl · 14y ago · 31 comments

29 comments · 11 top-level

mattdeboard · 14y ago · 7 in thread

What is the differentiation between using this and using Solr? ElasticSearch?

What does "real-time" mean in this context? Is it indexing database content in real-time? Is it in reference to the look-ahead, predictive query completion LinkedIn has?

What would compel someone like me -- a dev who has ownership over the very significant search piece of my company's primary product -- to give this serious evaluation?

nl · 14y ago

I looked at it some before IndexTank was bought (and I've done a reasonable amount of Solr work).

The biggest conceptual difference seemed to be that IndexTank was specifically written to autoscale - it was designed from the ground up to run on cloud providers, and to instantiate new resources as needed. It also has no central point of failure.

Solr Cloud (and things like Solandra) deliver some of this functionality to Solr.

Argorak · 14y ago

Well, ElasticSearch is written with this in mind as well - so what's the huge difference between those?

sandGorgon · 14y ago

If you had to incorporate search today, would you use IndexTank or Solr?

gnubardt · 14y ago

I'd imagine they mean indexing (and being able to search on) data in real time, given LinkedIn's previous open source projects around real-time search (http://javasoze.github.com/zoie/).

Lucene (which Solr uses as its index) cannot expose newly indexed data immediately after it's added.

Lucene exposes IndexReaders for searches, which offer a snapshot view of the index. In order to search across new documents IndexReaders need to be re-opened, a somewhat expensive operation. Expensive enough to prevent it from happening after each document is added, especially if they're added frequently.

The latest version of Lucene supports "near real time" search, but afaik it's not widely used (with Solr).
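
The snapshot behaviour described above can be sketched with a toy model. This is not Lucene code: `ToyIndex`, `ToyReader`, and `open_reader` are hypothetical names chosen for illustration, and copying the document list merely stands in for the cost of reopening a reader.

```python
class ToyIndex:
    """Toy in-memory index; illustrates the snapshot idea only, not Lucene."""

    def __init__(self):
        self._docs = []

    def add(self, text):
        self._docs.append(text)

    def open_reader(self):
        # Copying the document list stands in for the (expensive) reopen.
        return ToyReader(tuple(self._docs))


class ToyReader:
    """Point-in-time view: sees only documents present when it was opened."""

    def __init__(self, snapshot):
        self._snapshot = snapshot

    def search(self, term):
        return [d for d in self._snapshot if term in d.lower().split()]


index = ToyIndex()
index.add("IndexTank is open source")
reader = index.open_reader()

index.add("Solr is open source too")
stale_hits = reader.search("solr")               # reader predates the second document
fresh_hits = index.open_reader().search("solr")  # a reopened reader sees it
```

The old reader keeps returning results from its snapshot until a new one is opened, which is why reopening after every single added document becomes costly under frequent writes.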

mattdeboard · 14y ago

Yeah, NRT is 4.0; our content is such that right now that kind of flexibility isn't required. (Once-a-day batch db writes that update the index in NRT via signaling)

nl · 14y ago

IndexTank is built on Lucene too. I'm not sure if it is the real time branch or not, though.

zfran · 14y ago

http://indextank.com/documentation/faq

emmett · 14y ago · 2 in thread

This is awesome news. Massively advances the current state of the art in open source search.

Definitely considering replacing our search backend at TwitchTV with this...

hajrice · 14y ago

Hey Emmett, we're one of the companies interested in continuing IndexTank's platform.

If you want to hear more, just ping me at emil@helpjuice.com

citricsquid · 14y ago

Related to your startup and not this post: you should work on the introduction/explanation video on your home page. Just from a quick watch, it has some problems. The lack of any script (or, if you had one, you didn't rehearse it) means the time I invest in watching your pitch as a potential customer is spent watching you think and decide what to do next. The video on your tour page (http://helpjuice.com/tour) isn't great, but it is much, much better as an introduction to your product.

toisanji · 14y ago · 2 in thread

I'd like to see how this compares to Lucene/Solr. With Solr it's easy to index hundreds of millions of docs, but it's a pain to write a custom scorer.

espeed · 14y ago

IndexTank provides real-time document indexing and its algorithm incorporates real-time metrics, like vote data. And it scales horizontally.

riffraff · 14y ago

Why did you find writing a custom scorer a pain? I've done it for raw Lucene and it's trivial (in my case I added real-time data to the formula using an external value source). I'm not sure why it would be harder for Solr (I've always gotten away with sort order until now :).

SlightGenius · 14y ago · 2 in thread

Does IndexTank still integrate social inputs?

"IndexEngine: a real-time fulltext search-and-indexing system designed to separate relevance signals from document text. This is because the life cycle of these signals is different from the text itself, especially in the context of user-generated social inputs (shares, likes, +1, RTs)."

diego · 14y ago

It integrates anything that can be represented as a number. Prices, number of badges, importance of titles, it doesn't matter. You can combine any of those inputs into a relevance formula that is evaluated at query time. Of course IndexTank won't find those inputs for you, you have to provide them.
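
As a rough sketch of what combining numeric inputs into a query-time formula could look like (this is not IndexTank's API or formula syntax; `text_score`, `votes`, `relevance`, and `rank` are made-up names, and the formula itself is an arbitrary example):

```python
import math

# Hypothetical documents: text_score stands in for a text-match score,
# and votes is a numeric signal supplied by the application.
docs = {
    "a": {"text_score": 1.0, "votes": 0},
    "b": {"text_score": 0.5, "votes": 99},
}


def relevance(doc):
    # Illustrative formula only: text match boosted by log-scaled votes.
    return doc["text_score"] * (1.0 + math.log10(1 + doc["votes"]))


def rank(docs, formula):
    # Scores are computed when the query runs, so the latest signal values are used.
    return sorted(docs, key=lambda doc_id: formula(docs[doc_id]), reverse=True)


before = rank(docs, relevance)
docs["a"]["votes"] = 999  # update a signal without touching the document text
after = rank(docs, relevance)
```

Because the formula runs at query time, changing a signal reorders results on the next query without reindexing the text, which is the separation of signals from text quoted above.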

mgkimsal · 14y ago

Are the historical values of those signals kept and queryable? Such that I could check document ranking with signals X, Y and Z today and 3 days ago and check the impact of the signal changes?

lobster_johnson · 14y ago · 2 in thread

Anyone know about how IndexTank's facets scale with the cardinality of the attribute? We tried using ElasticSearch's facet system for tags, but we have about 150k tags, and this does not play well with ES. (It's very stupid about how it caches them.)

santip · 14y ago

IndexTank categories are not designed for the tags use case and will not work properly for it. They're intended for a relatively small number of categories, where each document has a single value per category. The number of different values in a category can be large, but the number of categories cannot. If you want to implement something like tags, each tag would have to be its own category, because you'll want more than a single tag per document. We were in the process of designing a new feature to support this kind of use case; maybe we'll start a branch to implement it, and hopefully the community will collaborate.
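
A small sketch of the data model described here, under stated assumptions: the dictionaries, the `facet_counts` helper, and the `tag:` naming are all hypothetical and only illustrate "one value per category per document", not how IndexTank stores or counts categories.

```python
from collections import Counter

# Hypothetical documents: each category maps to exactly one value per document.
docs = [
    {"id": 1, "categories": {"type": "post", "lang": "en"}},
    {"id": 2, "categories": {"type": "post", "lang": "es"}},
    {"id": 3, "categories": {"type": "comment", "lang": "en"}},
]


def facet_counts(docs, category):
    # Count how many documents carry each value of the given category.
    return Counter(
        d["categories"][category] for d in docs if category in d["categories"]
    )


type_counts = facet_counts(docs, "type")

# Tags workaround: one category per tag, so a document can carry several tags,
# at the cost of the number of categories growing with the number of tags.
tagged = [
    {"id": 1, "categories": {"tag:search": "yes", "tag:java": "yes"}},
    {"id": 2, "categories": {"tag:search": "yes"}},
]
distinct_categories = {name for d in tagged for name in d["categories"]}
```

With around 150k tags, the workaround would mean around 150k categories, which is exactly the dimension the comment says must stay small.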

lobster_johnson · 14y ago

Thanks for clearing that up.

biznickman · 14y ago · 1 in thread

Great news but I'm still willing to pay for someone to manage the operational side of this :) Know of any solutions? I'm aware of websolr but their configuration process wasn't as simple as IndexTank

nestlequ1k · 14y ago

Same here. Indextank service and pricing was great. Hoping someone can match it.

alexro · 14y ago · 1 in thread

Last time I read about IndexTank I noticed that their query language isn't that sophisticated, it could basically find only matches. Did it improve, is it possible to do fuzzy matches?

ADD: also, does it support non-english languages at all?

nachopg · 14y ago

IndexTank right now supports prefix search, stemming, and a basic implementation of a Did You Mean feature. Regarding languages, it supports tokenization for every Western language, and not long ago we added support for CJK too.

gexla · 14y ago · 1 in thread

And a new startup offers a hosted IndexTank service in 3,2,1...

For anyone looking for a job at LinkedIn, making impactful contributions to this project could be a way in.

sycr · 14y ago

Yeah, really though.

The indextank repo proper is interesting (and useful) enough, but indextank-service (https://github.com/linkedin/indextank-service) made my jaw drop a little. It's a full administrative stack for deploying indextank as a service.

swah · 14y ago

Those kinds of services are mostly being written in Java these days, and everyone would agree they constitute awesomer software than another Javascript blablabla library... so how can Java be dead? I should learn Java...

fufulabs · 14y ago

In terms of ease of installation > working state, how does it compare to ElasticSearch or Solr?

iag · 14y ago

very impressive linkedin. Good move.
