I quickly realized what a pain it is to use Elasticsearch for a simple app like mine.
Pain points:
1) You have to set up and recreate part of your database in Elasticsearch. So you essentially end up with two databases, which you now have to keep in sync.
2) I was getting unpredictable query results from Elasticsearch, which, after a few days and much head scratching, turned out to be because I was running out of memory.
3) When a user added a new event, it was not added to the Elasticsearch index automatically, and I could not figure out how to do this reliably. It only worked reliably after a sync of the entire Elasticsearch index, but since I only wanted to sync the index once a day, that made it next to useless for the Upcoming Events List and confused users as to why their event was not showing up. I gave up and just ended up implementing the Upcoming Events List directly from my database in Python.
4) Elasticsearch came with some security settings not set by default, and after a few months it was hacked. I had to download a new version and wasted more time.
I still use Elasticsearch, but only for search, not the upcoming events list. And I don't think it was worth the complexity it added to my project.
It’s similar to bringing an F1 car to a go-kart race and then being surprised you aren’t able to finish the race because you don’t have a pit crew able to maintain your vehicle.
I’ve built and owned large Elasticsearch clusters at Fortune 50 companies, providing log search as well as document search. Like anything, administering an ES cluster requires planning, engineering, and process/documentation.
I wouldn’t consider using ES without a dedicated ops team to help with its administration, except perhaps with a managed service like the one AWS provides.
It’s a very powerful tool; it was a mistake to think you can just casually throw it in your stack without fully understanding its complexity.
Dynamic mappings can mess things up really easily, so it's best to disable them in favor of a pre-defined static mapping for the type of documents you will be ingesting. What I've encountered in the past that usually causes things to break is when 90% of your documents contain a field called "Date" holding an ANSI date, but the other 10% contain "Null" (a string instead of an ANSI date). Since those documents don't match the dynamically generated mapping, they fail to be indexed.
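A minimal sketch of that advice, building the request body for an explicit (static) mapping. The index and field names here are made up for illustration; `"dynamic": "strict"` rejects documents with unmapped fields instead of guessing types, and `"ignore_malformed"` drops a bad "Date" value (e.g. the literal string "Null") rather than failing the whole document.

```python
import json

# Explicit mapping for a hypothetical "events" index.
mapping = {
    "mappings": {
        "dynamic": "strict",  # reject unmapped fields instead of inferring types
        "properties": {
            "Date": {
                "type": "date",
                "format": "strict_date_optional_time",
                "ignore_malformed": True,  # tolerate e.g. the string "Null"
            },
            "message": {"type": "text"},
        },
    }
}

body = json.dumps(mapping)  # body for: PUT /events
```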
Shard management is also critical, and this largely depends on the type of data you are indexing. If the data in ES is unique (not just a copy of a database you already have), you will want some sort of cross-region/DC replication strategy as well as a backup strategy.
Fortunately, both of these are pretty easy. ES has a tagging mechanism that lets you define things like regions and data centers (really, whatever you want), and shards can be routed based on rules defined over these tags.
A setup I've used in the past is five nodes in an LAX DC and five nodes in a LAS DC; any data ingested into LAX is replicated to shards in LAS, and vice versa.
Backup to S3 is rather trivial now, thanks to the built-in export options in newer versions of ES.
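For the S3 backup, the usual route is the snapshot API with an S3 repository (this needs the repository-s3 plugin installed). A sketch with hypothetical bucket and repository names:

```python
import json

# Register an S3 snapshot repository. Bucket/path names are illustrative.
repo = {
    "type": "s3",
    "settings": {"bucket": "my-es-backups", "base_path": "prod-cluster"},
}

body = json.dumps(repo)
# PUT /_snapshot/s3_backups                      registers the repository;
# PUT /_snapshot/s3_backups/snap-1?wait_for_completion=true
#                                                then snapshots all indices.
```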
With a little bit of planning, ES can be a great addition to your stack; just be sure you do the initial engineering so you can avoid a big headache in the future.
* DataDome is a security company that receives clients' web traffic in near real-time; a lot of traffic in some cases, with very specific numbers given, like daily peak loads.
* DataDome only retains records for 30 days, and the most attention is given to the most recent traffic, in order to detect attacks.
* an Elasticsearch deployment records all of the traffic downstream from Apache Flink; a new feature added to ES this year improves the management of ES indexing, and that solved problems DataDome was having. Things are better, so they wrote an engineering blog post!
* re-indexing is done nightly, implemented in a cloud environment that can handle the heavy work of rebuilding the index.
These numbers are impressive. Earlier criticisms of ES are being addressed, and ES is stable and a cornerstone of the architecture. A company called DataDome is providing real services in near real-time. Congratulations to the team and an interesting read.
> Storing 50 million of events per second
> A few numbers: our cluster stores [...], 15 trillion of events
> We provide up to 30 days of data retention to our customers. The first seven days of data were stored in the hot layer, and the rest in the warm layer.
15e12 / 50 MHz is 3.5 days.
I guess 50 MHz is the peak ingest rate.
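Spelling out the arithmetic behind that 3.5-day figure: at the headline ingest rate, the quoted total of stored events is only a few days' worth of data.

```python
events_stored = 15e12   # "15 trillion of events"
ingest_rate = 50e6      # "50 million of events per second"

days = events_stored / ingest_rate / 86400
# days comes out to about 3.47, well short of the stated 30-day retention,
# which is why the 50M/s number only makes sense as a peak rate.
```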
I'm not saying you shouldn't have written this post, but rather suggest you be fair to your readers (and yourself). Otherwise you could just make up random titles like "Writing 1 trillion log lines per second" (by storing 1,000,000 1-byte, newline-separated log lines per document).
> We have set “replica 0” in our indexes settings
> Now let’s assume that node 3 goes down:
> As expected, all shards from node 3 are moved to node 1 and node 2
No: there are no shards that can be moved, since the number of replicas was set to zero and one node went down. I'm not sure what they are trying to explain here.
> In order to resolve this issue, we introduced a job which runs each day in order to update the mapping template and create the index for the day of tomorrow, with the right number of shards according to the number of hits our customer received the previous day.
This is a very common use case (e.g. logging), but it's surprising that Elastic has nothing to automate this.
You can set an index template to be used on new indices that match a pattern, which is a very common thing to do. It sounds like what they did was modify the template daily, which is less common IME. It's not clear why they had to manually create the index, though. That should happen automatically.
It is, but how can you tell your template that you want to keep shard sizes under 50GB? You can't.
The best thing you can do (as they did) is update the template based on historical data, so that the new index will have shards that (hopefully) stay under 50GB.
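A rough sketch of that sizing logic. The 50 GB target is the rule of thumb discussed above; the per-document size estimate and function name are assumptions for illustration, not DataDome's actual numbers.

```python
import math

TARGET_SHARD_BYTES = 50 * 1024**3  # the ~50 GB-per-shard rule of thumb

def primary_shards_for_tomorrow(hits_yesterday: int, avg_doc_bytes: int) -> int:
    """Estimate tomorrow's index size from yesterday's traffic and pick a
    primary shard count so each shard (hopefully) stays under the target."""
    expected_bytes = hits_yesterday * avg_doc_bytes
    return max(1, math.ceil(expected_bytes / TARGET_SHARD_BYTES))

# A daily job would compute this per customer, then update that customer's
# index template before creating the next day's index.
```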
Ok, 3 nodes, each with one primary shard. No replicas. 1 node goes down, one shard is no longer found in the cluster, because it was in the missing node. That particular index, and in fact the whole cluster, are now RED.
Unless you discard that shard (force reroute, with accept_data_loss), nothing is going to be recovered and the missing shards will not be allocated anywhere.
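For reference, the "force reroute, with accept_data_loss" escape hatch is a cluster reroute command that re-creates the lost primary as an empty shard. Index and node names below are made up.

```python
import json

# allocate_empty_primary brings the index back to green by allocating a
# brand-new, empty primary on a surviving node; the old shard's data is gone.
reroute = {
    "commands": [
        {
            "allocate_empty_primary": {
                "index": "events-000001",  # hypothetical index name
                "shard": 0,
                "node": "node-1",          # hypothetical surviving node
                "accept_data_loss": True,
            }
        }
    ]
}

body = json.dumps(reroute)  # body for: POST /_cluster/reroute
```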
> introduction of a job which runs each day

This job has many purposes: 1) Because of rollover, our indexes are now suffixed by -000001, then -000002, etc. Regardless of the rollover, our applications post and get docs through the alias in front of these suffixed indexes. So in a daily index design, if you don't create the next day's index ahead of time, at 00:00:01 your application will create a new index with the alias name, and it won't be in rollover "mode". 2) Because we are using the ILM feature, we need to define the ILM rollover alias in the template, and it changes each day because our index names are "index_$date". 3) Performance: we have a lot of traffic, and if we do not create the next day's index in advance, we will see a lot of unassigned shards, a yellow cluster, etc.
In fact, yes, it's a common use case (daily-based indexes), but maybe not with rollover + ILM.
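The "create tomorrow's index ahead of time" step above amounts to bootstrapping the first rollover index (suffix -000001) with the write alias pointing at it, so that writes arriving at 00:00:01 land in a rollover-ready index. A sketch; the alias name and `index_$date` naming are illustrative, not DataDome's exact scheme.

```python
import json
from datetime import date, timedelta

# Build tomorrow's bootstrap index name in the index_$date-000001 style.
tomorrow = (date(2020, 1, 1) + timedelta(days=1)).strftime("%Y.%m.%d")
index_name = f"index_{tomorrow}-000001"

# is_write_index makes this index the rollover write target for the alias.
alias_body = {"aliases": {"events-write": {"is_write_index": True}}}

body = json.dumps(alias_body)  # body for: PUT /index_2020.01.02-000001
```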
Ah, got it. So maybe it would be best said as "for the following example, ignore any replicas".
> in fact, yes, it's a common use-case (daily based index)
But it is not automated by the Elastic folks. Do you have any intention of open-sourcing a portion of this job?
What is this 50M in the title?
Curiouser and curiouser.