https://aphyr.com/posts/317-call-me-maybe-elasticsearch
I've been hearing a lot of people talk about Elasticsearch lately. I get the same gut feeling I was getting about MongoDB back during the "Webscale" days.
You really want to use it in tandem with a storage DB like PostgreSQL, Cassandra, or MongoDB, where ES (or any Lucene-based indexer/DB) is used for text searching.
I personally like PostgreSQL and Cassandra, and would use them in tandem with ES. Solr, last I checked, was a bit complicated to cluster.
SolrCloud, with Zookeeper, is relatively new and not too difficult to set up.
In other words, if you're firehosing your primary data store into ElasticSearch, you'll want to know whether it's got all the data you pushed to it at any given time.
I suppose you could use some kind of heuristic to detect this, like posting a "checksum" document occasionally that contains the indexing state and thus acts as a canary that lets you detect loss. On the other hand, this document would be sharded, so you'd want one such document per shard. Is this a solved problem?
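To make the heuristic concrete, here's a minimal sketch of the canary idea, using a plain dict as a stand-in for a single shard (all names here are illustrative, not Elasticsearch APIs): after each batch, write a canary document recording how many documents the shard should hold, and have an audit job compare that against what's actually there.

```python
# Hypothetical per-shard canary: the "_canary" document records the
# expected document count, so a later audit can detect silent loss.

def write_batch(shard, docs, counter):
    """Index a batch of (doc_id, body) pairs, then refresh the canary."""
    for doc_id, body in docs:
        shard[doc_id] = body
        counter += 1
    shard["_canary"] = {"expected_docs": counter}
    return counter

def audit(shard):
    """Return True if the shard holds every document the canary expects."""
    expected = shard["_canary"]["expected_docs"]
    actual = len(shard) - 1  # exclude the canary itself
    return expected == actual

shard = {}
count = write_batch(shard, [("a", {}), ("b", {})], 0)
assert audit(shard)        # nothing lost

del shard["a"]             # simulate silent loss on this shard
assert not audit(shard)    # canary detects the discrepancy
```

In a real cluster you'd route one canary per shard (e.g. via routing keys) and run the audit out-of-band.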
Spew your log data into a standard syslog server, while also pumping it into Logstash.
Using Elasticsearch as your canonical log storage would be ridiculous.
From reading the code in Jepsen it looks like kill -9 is all that's being used to induce failures. So there's a real bug here: https://github.com/aphyr/jepsen/blob/master/elasticsearch/sr...
So given these claims:
> Per-Operation Persistence. Elasticsearch puts your data safety first. Document changes are recorded in transaction logs on multiple nodes in the cluster to minimize the chance of any data loss.
One would hope they at least flushed the user space buffers.
> by not fsync'ing each operation (though one can configure ES to do so).
It may not be default, but we've seen, again and again, how people are influenced by what they read about a database (e.g. MongoDB).
The lesson by now should be: always know your DB.
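For what it's worth, in later Elasticsearch releases (2.x onward) the fsync behavior is an index setting; these names are from the current docs, not the 1.x version under test here:

```yaml
# Fsync and commit the translog after every index/delete/update request
# (trades throughput for durability):
index.translog.durability: request

# Or fsync in the background on an interval, accepting that operations
# acked in between may be lost on a crash:
# index.translog.durability: async
# index.translog.sync_interval: 5s
```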
True, but Elasticsearch is not intended to be a permanent datastore.
https://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-...
or maybe they're spending that money on marketing and re-branding.
Maybe not so funny if your multiply redundant cluster loses data because a single node dies...
HDFS uses chain replication, so I would have expected that by the time the client got acknowledgement of a write, it would already be acknowledged by all replicas (3 by default). So even if there's a bug causing one of the nodes to go down without fsyncing, there shouldn't be any actual data loss.
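The expectation above can be sketched as a toy model of an HDFS-style write pipeline (names are illustrative, not HDFS APIs): the packet is forwarded down the chain of datanodes, and the ack only propagates back to the client from the tail, so an acked write implies every replica has accepted it.

```python
# Toy replication pipeline: ack returns only after every datanode in the
# chain has locally accepted the packet (durable fsync elided here).

def pipeline_write(packet, datanodes):
    """Forward packet down the chain; ack propagates back from the tail."""
    if not datanodes:
        return True  # tail reached: ack flows back upstream
    head, rest = datanodes[0], datanodes[1:]
    head.append(packet)  # local persist on this replica
    return pipeline_write(packet, rest)

dn1, dn2, dn3 = [], [], []
acked = pipeline_write(b"block-data", [dn1, dn2, dn3])

# By the time the client sees the ack, all three replicas hold the packet.
assert acked
assert dn1 == dn2 == dn3 == [b"block-data"]
```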
What 'written' actually means inside HDFS, I'm not sure; I'd assume flushed to the OS page cache at a minimum, and would hope for an fsync.
TSB 2015-51: Replacing DataNode Disks or Manually changing the Storage IDs of Volumes in a Cluster may result in Data Loss
Purpose (Updated: 4/22/2015)
In CDH 4, DataNodes are identified in HDFS with a single unique identifier. Beginning with CDH 5, every individual disk in a DataNode is assigned a unique identifier as well.
A bug discovered in HDFS, HDFS-7960, can result in the NameNode improperly accounting for DataNode storages for which Storage IDs have changed. A Storage ID changes whenever a disk on a DataNode is replaced, or if the Storage ID is manually manipulated. Either of these scenarios causes the NameNode to double-count block replicas, incorrectly determine that a block is over-replicated, and remove those replicas permanently from those DataNodes.
A related bug, HDFS-7575, results in a failure to create unique IDs for each disk within the DataNodes during upgrade from CDH 4 to CDH 5. Instead, all disks within a single DataNode are assigned the same ID. This bug by itself negatively impacts proper function of the HDFS balancer. Cloudera Release Notes originally stated that manually changing the Storage IDs of the DataNodes was a valid workaround for HDFS-7575. However, doing so can result in irrecoverable data loss due to HDFS-7960, and the release notes have been corrected.
Users affected:
Any cluster where Storage IDs change can be affected by HDFS-7960. Storage IDs change whenever a disk is replaced, or when Storage IDs are manually manipulated. Only clusters upgraded from CDH 4 or earlier releases are affected by HDFS-7575.
Symptoms:
If data loss has occurred, the NameNode reports "missing blocks" on the NameNode Web UI. You can determine to which files the missing blocks belong by using FSCK. You can also search for NameNode log lines like the following, which indicate that a Storage ID has changed and data loss may have occurred:

2015-03-21 06:48:02,556 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk_8271694345820118657_530878393 on 10.11.12.13:1004 size 6098

Impact:
The replacement of DataNode disks, or manual manipulation of DataNode Storage IDs, can result in irrecoverable data loss. Additionally, due to HDFS-7575, the HDFS Balancer will not function properly.
Applies To:
HDFS. All CDH 5 releases prior to 3/31/15, including: 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5; 5.1, 5.1.2, 5.1.3, 5.1.4; 5.2, 5.2.1, 5.2.3, 5.2.4; 5.3, 5.3.1, 5.3.2

Immediate action required:
Do not manually manipulate Storage IDs on DataNode disks. Additionally, do not replace failed DataNode disks when running any of the affected CDH versions.
Upgrade to CDH 5.4.0, 5.3.3, 5.2.5, 5.1.5, or 5.0.6.

See Also/Related Articles: Apache.org bugs HDFS-7575 and HDFS-7960
I was thinking of NodeJS.
But the comment is correct: ES is not a DB but an indexer and search engine.
edit:
Oh god, don't use it for storage. It indexes stuff.
You give it a document, and it'll store it in root-word form so you can fuzzy-search. It'll also do other NLP stuff to your document, such as removing stop words. Once it's indexed, you can store index values that point to your primary storage (Cassandra, PostgreSQL).
At least that's how I used it. If there is any better alternative I'd like to know about it.
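A minimal sketch of that pattern (a toy stand-in, not real Elasticsearch/Lucene code): the index stores analyzed terms mapping back to primary-store IDs, while the full documents live in the primary database.

```python
# Toy inverted index: analyzed terms -> IDs of rows in the primary store.

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}

def analyze(text):
    """Crude stand-in for Lucene analysis: lowercase, drop stop words.
    (Real analyzers also stem to root forms, handle punctuation, etc.)"""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def index_doc(inverted, primary_id, text):
    """Record which primary-store row each term points back to."""
    for term in analyze(text):
        inverted.setdefault(term, set()).add(primary_id)

def search(inverted, query):
    """Return primary-store IDs; fetch the actual rows from the real DB."""
    sets = [inverted.get(t, set()) for t in analyze(query)]
    return set.intersection(*sets) if sets else set()

idx = {}
index_doc(idx, "row-1", "The quick brown fox")
index_doc(idx, "row-2", "A quick tour of Postgres")
assert search(idx, "quick") == {"row-1", "row-2"}
assert search(idx, "quick fox") == {"row-1"}
```

The search results are just keys; the application then pulls the authoritative documents from Postgres or Cassandra.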
edit:
I highly recommend: http://www.manning.com/ingersoll/
Taming Text, by Ingersoll; it won a Dr. Dobb's award too.
As long as we can cash in our options before the lawsuits come in, we win! Just like my last job in finance, actually.
Plus SQL's gross. You can't even webscale with it, and old people like it, so it must suck.