https://aphyr.com/posts/317-call-me-maybe-elasticsearch
I've been hearing a lot of people talk about Elasticsearch lately. I get the same gut feeling I was getting about MongoDB back during the "Webscale" days.
You really want to use it in tandem with a storage DB like PostgreSQL, Cassandra, or MongoDB, where ES (or any Lucene-based indexer/DB) is used for text searching.
I personally like PostgreSQL and Cassandra, and would use them in tandem with ES. Solr, last I checked, was a bit complicated to cluster.
SolrCloud, with Zookeeper, is relatively new and not too difficult to set up.
In other words, if you're firehosing your primary data store into ElasticSearch, you'll want to know whether it's got all the data you pushed to it at any given time.
I suppose you could use some kind of heuristic to detect this, like posting a "checksum" document occasionally that contains the indexing state and thus acts as a canary that lets you detect loss. On the other hand, this document would be sharded, so you'd want one such document per shard. Is this a solved problem?
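To make the heuristic concrete, here's a minimal sketch of the canary idea, using a plain dict as a stand-in for a single shard (all names here are illustrative, not Elasticsearch APIs): after each batch, write a canary document recording how many documents the shard should hold, and have an audit job compare that against what's actually there.

```python
# Hypothetical per-shard canary: the "_canary" document records the
# expected document count, so a later audit can detect silent loss.

def write_batch(shard, docs, counter):
    """Index a batch of (doc_id, body) pairs, then refresh the canary."""
    for doc_id, body in docs:
        shard[doc_id] = body
        counter += 1
    shard["_canary"] = {"expected_docs": counter}
    return counter

def audit(shard):
    """Return True if the shard holds every document the canary expects."""
    expected = shard["_canary"]["expected_docs"]
    actual = len(shard) - 1  # exclude the canary itself
    return expected == actual

shard = {}
count = write_batch(shard, [("a", {}), ("b", {})], 0)
assert audit(shard)        # nothing lost

del shard["a"]             # simulate silent loss on this shard
assert not audit(shard)    # canary detects the discrepancy
```

In a real cluster you'd route one canary per shard (e.g. via routing keys) and run the audit out-of-band.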
Spew your log data into a standard syslog server, while also pumping it into Logstash.
Using Elasticsearch as your canonical log storage would be ridiculous.
From reading the code in Jepsen it looks like kill -9 is all that's being used to induce failures. So there's a real bug here: https://github.com/aphyr/jepsen/blob/master/elasticsearch/sr...
So given these claims:
> Per-Operation Persistence. Elasticsearch puts your data safety first. Document changes are recorded in transaction logs on multiple nodes in the cluster to minimize the chance of any data loss.
One would hope they at least flushed the user space buffers.
> by not fsync'ing each operation (though one can configure ES to do so).
It may not be default, but we've seen, again and again, how people are influenced by what they read about a database (e.g. MongoDB).
The lesson by now should be: always know your DB.
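For what it's worth, in later Elasticsearch releases (2.x onward) the fsync behavior is an index setting; these names are from the current docs, not the 1.x version under test here:

```yaml
# Fsync and commit the translog after every index/delete/update request
# (trades throughput for durability):
index.translog.durability: request

# Or fsync in the background on an interval, accepting that operations
# acked in between may be lost on a crash:
# index.translog.durability: async
# index.translog.sync_interval: 5s
```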
True, but Elasticsearch is not intended to be a permanent datastore.
https://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-...
or maybe they're spending that money on marketing and re-branding.
Maybe not so funny if your multiply redundant cluster loses data because a single node dies...
HDFS uses chain replication, so I would have expected that by the time the client got acknowledgement of a write, it would already be acknowledged by all replicas (3 by default). So even if there's a bug causing one of the nodes to go down without fsyncing, there shouldn't be any actual data loss.
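The expectation above can be sketched as a toy model of an HDFS-style write pipeline (names are illustrative, not HDFS APIs): the packet is forwarded down the chain of datanodes, and the ack only propagates back to the client from the tail, so an acked write implies every replica has accepted it.

```python
# Toy replication pipeline: ack returns only after every datanode in the
# chain has locally accepted the packet (durable fsync elided here).

def pipeline_write(packet, datanodes):
    """Forward packet down the chain; ack propagates back from the tail."""
    if not datanodes:
        return True  # tail reached: ack flows back upstream
    head, rest = datanodes[0], datanodes[1:]
    head.append(packet)  # local persist on this replica
    return pipeline_write(packet, rest)

dn1, dn2, dn3 = [], [], []
acked = pipeline_write(b"block-data", [dn1, dn2, dn3])

# By the time the client sees the ack, all three replicas hold the packet.
assert acked
assert dn1 == dn2 == dn3 == [b"block-data"]
```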
What 'written' actually means inside HDFS, I'm not sure; I'd assume flushed to the OS page cache at a minimum, and would hope for an fsync.
TSB 2015-51: Replacing DataNode Disks or Manually changing the Storage IDs of Volumes in a Cluster may result in Data Loss
Purpose (Updated: 4/22/2015)
In CDH 4, DataNodes are identified in HDFS with a single unique identifier. Beginning with CDH 5, every individual disk in a DataNode is assigned a unique identifier as well.
A bug discovered in HDFS, HDFS-7960, can result in the NameNode improperly accounting for DataNode storages for which Storage IDs have changed. A Storage ID changes whenever a disk on a DataNode is replaced, or if the Storage ID is manually manipulated. Either of these scenarios causes the NameNode to double-count block replicas, incorrectly determine that a block is over-replicated, and remove those replicas permanently from those DataNodes.
A related bug, HDFS-7575, results in a failure to create unique IDs for each disk within the DataNodes during upgrade from CDH 4 to CDH 5. Instead, all disks within a single DataNode are assigned the same ID. This bug by itself negatively impacts proper function of the HDFS balancer. Cloudera Release Notes originally stated that manually changing the Storage IDs of the DataNodes was a valid workaround for HDFS-7575. However, doing so can result in irrecoverable data loss due to HDFS-7960, and the release notes have been corrected.
Users affected:
Any cluster where Storage IDs change can be affected by HDFS-7960. Storage IDs change whenever a disk is replaced, or when Storage IDs are manually manipulated. Only clusters upgraded from CDH 4 or earlier releases are affected by HDFS-7575.
Symptoms:
If data loss has occurred, the NameNode reports "missing blocks" on the NameNode Web UI. You can determine to which files the missing blocks belong by using FSCK. You can also search for NameNode log lines like the following, which indicate that a Storage ID has changed and data loss may have occurred:

2015-03-21 06:48:02,556 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk_8271694345820118657_530878393 on 10.11.12.13:1004 size 6098

Impact:
The replacement of DataNode disks, or manual manipulation of DataNode Storage IDs, can result in irrecoverable data loss. Additionally, due to HDFS-7575, the HDFS Balancer will not function properly.
Applies To:
HDFS. All CDH 5 releases prior to 3/31/15, including: 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5; 5.1, 5.1.2, 5.1.3, 5.1.4; 5.2, 5.2.1, 5.2.3, 5.2.4; 5.3, 5.3.1, 5.3.2

Immediate action required:
Do not manually manipulate Storage IDs on DataNode disks. Additionally, do not replace failed DataNode disks when running any of the affected CDH versions.
Upgrade to CDH 5.4.0, 5.3.3, 5.2.5, 5.1.5, or 5.0.6.

See Also/Related Articles: Apache.org bugs HDFS-7575 and HDFS-7960
I was thinking of NodeJS.
But the comment is correct: ES is not a DB but an indexer and search engine.
edit:
Oh god, don't use it for storage. It indexes stuff.
You give it a document, and it'll store it in root-word form so you can fuzzy-search. It'll also do other NLP stuff to your document, such as removing stop words. Once it's indexed, you can store index values that point to your primary storage (Cassandra, PostgreSQL).
At least that's how I used it. If there is any better alternative I'd like to know about it.
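A minimal sketch of that pattern (a toy stand-in, not real Elasticsearch/Lucene code): the index stores analyzed terms mapping back to primary-store IDs, while the full documents live in the primary database.

```python
# Toy inverted index: analyzed terms -> IDs of rows in the primary store.

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}

def analyze(text):
    """Crude stand-in for Lucene analysis: lowercase, drop stop words.
    (Real analyzers also stem to root forms, handle punctuation, etc.)"""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def index_doc(inverted, primary_id, text):
    """Record which primary-store row each term points back to."""
    for term in analyze(text):
        inverted.setdefault(term, set()).add(primary_id)

def search(inverted, query):
    """Return primary-store IDs; fetch the actual rows from the real DB."""
    sets = [inverted.get(t, set()) for t in analyze(query)]
    return set.intersection(*sets) if sets else set()

idx = {}
index_doc(idx, "row-1", "The quick brown fox")
index_doc(idx, "row-2", "A quick tour of Postgres")
assert search(idx, "quick") == {"row-1", "row-2"}
assert search(idx, "quick fox") == {"row-1"}
```

The search results are just keys; the application then pulls the authoritative documents from Postgres or Cassandra.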
edit:
I highly recommend: http://www.manning.com/ingersoll/
Taming Text, by Ingersoll; it won a Dr. Dobb's award too.
As long as we can cash in our options before the lawsuits come in, we win! Just like my last job in finance, actually.
Plus SQL's gross. You can't even webscale with it, and old people like it, so it must suck.