If you only need very basic word search, ES is probably not worth the complexity in your stack, especially if you're already running a SQL database with decent plaintext search.
Where Elasticsearch shines is in complex queries: "Show me every match where this field contains 'extinction' within 10 words of 'impact crater' but NOT containing 'oceanic', where the publish date is within the last month, and one of the subjects is anthropology."
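A query like that maps fairly directly onto ES's query DSL. A rough sketch as a Python dict (the field names `body`, `published`, and `subjects` are made up for illustration):

```python
# Hedged sketch of the example query above; field names are hypothetical.
query = {
    "query": {
        "bool": {
            "must": [
                {
                    # 'extinction' within 10 words of 'impact crater'
                    "intervals": {
                        "body": {
                            "all_of": {
                                "max_gaps": 10,
                                "intervals": [
                                    {"match": {"query": "extinction"}},
                                    {"match": {"query": "impact crater"}},
                                ],
                            }
                        }
                    }
                },
                {"range": {"published": {"gte": "now-1M/d"}}},  # newer than a month
                {"term": {"subjects": "anthropology"}},
            ],
            "must_not": [{"match": {"body": "oceanic"}}],
        }
    }
}
```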
One application I worked on indexes a Postgres database into Elasticsearch for live front-end queries. We index every single field, sometimes hundreds of fields in a single index. ES does this easily. Thanks to Lucene's quasi-columnar/quasi-LSM tree storage, new indexed fields aren't very expensive, and searches -- even fairly complicated ones -- are very fast.
ES is also extremely fast at aggregations. Even complex multi-level aggregations (e.g. group by date, then multiple nested buckets by different fields with "top k" results for each) take just a few hundred milliseconds on large, million-document datasets.
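The nested-bucket example above might look roughly like this in the aggregations DSL (again as a Python dict; the `published` and `subjects` field names are assumptions):

```python
# Sketch: date buckets, then terms sub-buckets, then top-k hits per bucket.
agg = {
    "size": 0,  # we only want the aggregation results, not raw hits
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "published", "calendar_interval": "month"},
            "aggs": {
                "by_subject": {
                    "terms": {"field": "subjects", "size": 5},
                    "aggs": {
                        "top_docs": {"top_hits": {"size": 3}}  # top 3 per bucket
                    },
                }
            },
        }
    },
}
```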
Where ES has problems is in areas like replication, consistency and memory usage. It's very hard to tune ES; between JVM GC and caches, it's basically impossible to predict how much RAM ES will need, and OOMs are common. There's also still no way to ask for a consistent view of the index at query time; the best you can do is use "refresh=wait_for" on indexing, which is the wrong end to apply it at. I'd love a consistent, Raft-based ES.
1. Does it return relevant results?
2. Can it handle complex queries?
2) is only required in specific use cases, but when it's needed it's _really_ needed. 1) is the main measure users care about, and in my experience it's best evaluated by building a search in each system with the same corpus and giving it to subject-matter experts.
Without a good search engine you might get the results you needed, but mixed in with lots of other results. If you have to scroll to page 20 of the results to see the one you actually wanted, the search had recall but wasn't very precise.
Think of internet search engines pre-Google. With e.g. AltaVista you had great recall but extremely poor precision. You'd often be scrolling through multiple pages of results. Google turned that around by having great precision with similar recall. They made it so good that they could ship an "I'm Feeling Lucky" button.
The trick with search is to have great precision and still good enough recall. That's super hard because what is precise is very subjective and highly dependent on your usecases, data, languages, etc.
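To make the two measures concrete, here's a toy calculation for a single query, assuming we know which documents are truly relevant (the document IDs are invented):

```python
# Toy precision/recall for one query; relevance judgments are made up.
relevant = {"d1", "d2", "d3", "d4"}          # documents a human marked relevant
returned = ["d1", "d9", "d2", "d7", "d8"]    # the engine's top 5 results

hits = [d for d in returned if d in relevant]
precision = len(hits) / len(returned)  # fraction of results that are relevant
recall = len(hits) / len(relevant)     # fraction of relevant docs we surfaced
```

Here precision is 2/5 and recall is 2/4 — a real evaluation averages this over many queries, and the judgments themselves are the subjective, expensive part.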
This is why Elasticsearch is such a hugely complicated product: it includes a lot of solutions for essentially any use case you can imagine around search.
I have no experience with Redisearch; so I'll reserve my judgment. But this article is not doing it any favors.
There are competing options out there for Elasticsearch. Most of the serious ones also use Apache Lucene (e.g. Solr). Some of the upcoming ones are attempting to rebuild what Lucene does, and may or may not be good enough depending on your use case. There have been some Lucene ports over the years, including a C port; most of those have fallen behind or are no longer maintained.

The Java implementation is actually pretty good as-is and has had a lot of performance and optimization work done on it over the years. You'd be hard pressed to build something as good and as fast without essentially using the same algorithms and reinventing a lot of the same wheels.
IMHO the current effort to build a search engine in Rust makes a lot of sense. The language is uniquely suited to doing the kinds of things Lucene does and they seem to be pretty serious about doing things properly.
That's why some of these benchmarks (redis and the go search engine posted last week) seem a little apples/oranges to me.
I was under the impression that if you want to do auto-complete you need to handle misspellings, and that Elasticsearch is one of the best options for this.
I have a search use case. I want to create a simple language model where each token in the lexicon gets a unique ID (or ordinal). From that I can build a more sophisticated model where each document is represented as a vector as wide as the number of unique tokens. Then I can cluster those vectors and give each cluster a unique ID (or ordinal), producing an even more sophisticated language model, one with built-in semantic understanding. A natural-language data structure, if you will, with multiple layers. I want to store the entire WWW in such a model. So I'm building a language model framework that is not built on Lucene, because I'm not obliged to use ES in that capacity.
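The first two layers of that idea (token ordinals, then document vectors as wide as the vocabulary) can be sketched in a few lines; the corpus and the whitespace tokenizer here are toy assumptions:

```python
# Minimal sketch: tokens -> unique ordinals -> document count-vectors.
docs = ["impact crater extinction", "oceanic crater", "impact impact"]

vocab = {}                           # token -> unique ordinal
for doc in docs:
    for tok in doc.split():          # toy whitespace tokenizer
        vocab.setdefault(tok, len(vocab))

def vectorize(doc):
    vec = [0] * len(vocab)           # one slot per unique token
    for tok in doc.split():
        vec[vocab[tok]] += 1
    return vec

vectors = [vectorize(d) for d in docs]
# The next layer would cluster `vectors` and assign each cluster an ordinal.
```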
I feel you are wrong to call my use case simple and small scale.
If the "Multi-tenant indexing benchmark" is accurate it seems like it might be a robustness concern for ES. "Elasticsearch crashed after 921 indices and just couldn’t cope with this load." -- does that mean memory exhaustion or some other crash? If it's the latter, it seems like a quality problem more than a performance one.
This benchmark used 4605 shards (5 per index) on a single node, which is way above the recommended number.
Also, to prevent oversharding, the default number of shards per index has been changed to 1 in 7.0.
I think we can all agree that misusing a tool, after appropriate documentation has been published, shouldn't be considered a fault of the tool.
[0] https://www.elastic.co/guide/en/elasticsearch/guide/current/...
Agree with the general claim that this benchmark is poor though. A real study of complex searches with faceting, ranking and ordering against both databases in a distributed setup would be much more interesting.
...and then aggregate into time-based buckets, and within each bucket split the results by this field, and then...
I experimented with RediSearch using 20 GB of Reddit posts and I was very underwhelmed.
First, 20 GB of raw data explodes into 75 GB once it's in RediSearch with zero fault tolerance. While I'd expect some expansion with inverted indexes and word frequencies by document, a 3.75 multiple seems high.
And since this is Redis, it's all in RAM, including indexes and raw documents, all uncompressed. That's not cheap. Add replicas for fault tolerance and the RAM needed for a decent sized cluster could be 10x the size of the raw data.
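Rough arithmetic behind that 10x estimate (the single-replica setup is an assumption):

```python
# Back-of-the-envelope RAM math for the 20 GB Reddit corpus above.
raw_gb = 20
indexed_gb = 75                               # observed size after loading into RediSearch
expansion = indexed_gb / raw_gb               # 3.75x, all of it resident in RAM
replicas = 1                                  # assume one replica for fault tolerance
cluster_ram_gb = indexed_gb * (1 + replicas)  # 150 GB before operational headroom
```

150 GB for 20 GB of raw data is already 7.5x before any headroom, so ~10x for a comfortably provisioned cluster is plausible.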
Then the tooling and documentation is very limited. Redis Labs provides a Python client, but it doesn't support basic features like returning the score with each document, even though RediSearch provides this capability if you query it directly.
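For what it's worth, you can get scores back by bypassing the client and issuing the raw command yourself; the index name and query below are hypothetical, and with redis-py you'd hand the args to `execute_command`:

```python
# Raw RediSearch FT.SEARCH command including per-document scores.
# WITHSCORES asks the engine to return each document's score.
args = ["FT.SEARCH", "posts", "impact crater", "WITHSCORES", "LIMIT", "0", "10"]
# e.g. redis.Redis().execute_command(*args)  # requires a running RediSearch node
```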
Finally, I found stability issues with Redis when the RediSearch module is installed. Using the Python client provided by RedisLabs, certain queries would predictably crash every node in the cluster.
Redis itself is rock solid, but Redis with the RediSearch module feels fragile.
Overall, interesting concept but not ready for production use by any means.
- Show the code that runs the benchmark
- Give opportunities for everyone to recreate the benchmark
- Give opportunities for every technology to 'respond' and point out where the benchmark/tech configuration is wrong (ie "PRs welcome")
Otherwise, this just looks like cherry-picked data points, and even those I won't trust. Nor would I show this to any of my clients (whom I help select search engine technology). I dearly hope nobody makes real decisions based on this blog post until the code and configuration are opened up.
>RediSearch: Dedicated engine based on modern and optimized data-structures
>ElasticSearch: 20 years old Lucene engine
The implications made here make me actually angry.
RediSearch: new shiny thing built on top of Redis that is used in a couple of niche places.
I’ll take Lucene please
I've seen a lot of posts like this easily make it to the front page only because a lot of HN-ers are Redis fanboys (rightfully so: Redis is great). But then you read the post and it _appears_ to be marketing garbage.
No sane Elasticsearch engineer would make a new index for each product. They would just have a single index with a product_id field on each sub-item. If you needed product-level information, you would create a second index for that. You'd use two indices, not O(#products) indices.
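A sketch of that single-index design (field names and the example product ID are made up), with the per-product lookup done as a filter instead of a separate index:

```python
# One index for all sub-items, keyed by product_id (hypothetical mapping).
items_mapping = {
    "mappings": {
        "properties": {
            "product_id": {"type": "keyword"},  # exact-match filter/group key
            "name": {"type": "text"},
            "price": {"type": "float"},
        }
    }
}

# Searching within one product is then just a filtered query:
query = {
    "query": {
        "bool": {
            "filter": [{"term": {"product_id": "p-123"}}],
            "must": [{"match": {"name": "widget"}}],
        }
    }
}
```

The `filter` clause is cached and doesn't affect scoring, so the per-tenant restriction is cheap — no need for a tenant-per-index layout at all.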
They just created a botched benchmark by using ES incorrectly. It's like driving a car backwards and then complaining it has poor max speed. ES could easily handle this type of problem if done correctly.
It's hard to mistake documents for indices. Both original and the currently edited statement sound strongly suspect and make me question the benchmarking methodology used. What caused the ES to crash after indexing 921 documents? Why is comparing indexing speeds on a 1-node setup even a legit benchmarking test?
"wikidump" links to https://dumps.wikimedia.org/enwiki/latest/ , which has thousands of files, none of which are 5GB and make sense. That's a very poor corpus link!
It says "Feb 7, 2019", so it probably means https://dumps.wikimedia.org/enwiki/20190120/ or https://dumps.wikimedia.org/enwiki/20190201/ ... maybe. They don't have any obvious 5.3GB files.
Note: clustering is only available in RediSearch’s Enterprise version
https://redislabs.com/redis-enterprise/technology/redis-sear...
At least with ES I can build and play with the clustering of the nodes. This is probably why they only benchmarked a one-node ES: they would have to push their Enterprise software to make a cluster of RediSearch. Maybe I am wrong.
Raw latency is usually not the primary concern, and having everything in RAM can be a major cost problem, further compounded by the lack of the compression available in other persistent stores. The RESP protocol is also overloaded and hard to work with when dealing with JSON and search queries.
With that, it's just two points in space, which gives us little information from which to deduce "58% faster at X" or whatever.
This is why the only reliable benchmark is the one you do on your data.
PS: Crashes are never good though...
This is a massive misconfiguration of an elastic search cluster. 50k indices? 500 documents per index?
500 records per index at 5 shards/index is 100 records per shard.
Yeah, let's shard our data so much that we introduce tremendous amounts of disk i/o overhead!!!
Author should learn how to properly configure an ES cluster before posting ridiculous benchmarks like this.
What an utter pile of garbage benchmark this is.
To expand a little bit, the whole point of using multiple shards per index in an ES cluster is so that the shards spread across multiple nodes (servers) and distribute the load (disk i/o) and handle redundancy. ES automatically scales and reshuffles its shards across multiple nodes in the cluster to handle fault-tolerance as well. If one or more nodes go down, the cluster still has all of the data through replica shards etc...
Either way, in this particular case the data is so small that having 5 shards per index across 50k indices results in 250k shards for 5 GB of data.
5 GB / 250k shards = ~20 KB per shard.
Shards of ~20 KB each: a total cluster misconfiguration.
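The arithmetic above, spelled out (using decimal gigabytes):

```python
# Shard-size math for the benchmark's configuration.
total_bytes = 5 * 10**9            # ~5 GB corpus
indices = 50_000
shards_per_index = 5
total_shards = indices * shards_per_index     # 250,000 shards
bytes_per_shard = total_bytes / total_shards  # ~20 KB per shard
```

For comparison, Elastic's own guidance has historically suggested shards in the tens-of-gigabytes range — roughly six orders of magnitude larger than what this benchmark produced.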
The specific test deployment was multi-tenant anyway -- you can't account or optimize for what tenants are going to index.
So in other words:
"If your specific use case is supporting 50,000 customers each having around 500 documents and only needing basic text search queries and relevance is not a major concern, RedisLabs Search might give you better performance than ElasticSearch!"
(This is assuming there isn't a different way to configure ElasticSearch to work for this scenario, that gives similar performance.)
When you point out the flawed methodology you come across like a luddite or sour grapes or whatever else.
It's about minimizing the effort needed to find what you're looking for. Index construction speed, unless we're talking orders of magnitude, isn't really meaningful. I don't know if this is just a really clumsy attempt at "marketing" or what, but I can't imagine this is going to convince anyone to drop ES for this thing.
Lucene is a pretty rock-solid open source project that has been battle tested over those 20 years and had some of the best engineers in the world improve over a long time frame. That's an asset for Lucene!
Damn I feel old. I remember when Lucene was hot new kid in the block.