It's solvable, of course, but it's a problem that pops up with any synchronization system, and I'm surprised nobody (apparently) has written one, because it requires a fairly good state machine that can also compute diffs. Once a store grows to a certain state, you do not want to trigger full syncs, ever.
The best, most trivial (in terms of complexity and resilience) solution I have found is to sync data in batches, give each batch an ID, and record each batch both in the target (e.g., ElasticSearch) and in a log that belongs to the synchronization process. The heuristic is then to compute the difference between the two logs to see how far you need to catch up.
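A minimal sketch of that heuristic, assuming both sides keep an ordered log of applied batch IDs (all names here are illustrative, not a real ElasticSearch API): the catch-up point is the end of the two logs' common prefix.

```python
# Illustrative sketch of the batch-ID heuristic above; the logs and names
# are hypothetical, not a real ElasticSearch API.

def common_prefix_len(sync_log, target_log):
    """Length of the shared prefix of the two batch-ID logs."""
    n = 0
    for ours, theirs in zip(sync_log, target_log):
        if ours != theirs:
            break
        n += 1
    return n

def batches_to_replay(sync_log, target_log):
    """Batch IDs the target is missing, assuming losses are only at the tail."""
    return sync_log[common_prefix_len(sync_log, target_log):]

# The sync process recorded four batches; the target only saw the first two.
print(batches_to_replay(["b1", "b2", "b3", "b4"], ["b1", "b2"]))  # ['b3', 'b4']
```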
This will only work in a sequential fashion; if ElasticSearch loses random documents, it won't be picked up by such a system. You could fix this by letting each log entry store the list of logical updates, checksummed; and then do regular (eg., nightly) "inventory checks".
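A rough sketch of what such a checksummed log entry and nightly inventory check could look like (every name here is hypothetical): the sync log stores a checksum per batch, and the check recomputes it from what the target actually holds.

```python
# Hypothetical inventory check: each log entry records a checksum over the
# logical updates in a batch; the nightly check recomputes the checksum from
# the target's actual contents and flags batches that no longer match.

import hashlib

def batch_checksum(doc_ids):
    """Order-independent checksum over the document IDs in one batch."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode())
    return digest.hexdigest()

def inventory_check(log, docs_in_target):
    """Return batch IDs whose documents in the target no longer match the log.

    `log` is a list of (batch_id, checksum) entries kept by the sync process;
    `docs_in_target(batch_id)` returns the IDs the target still has for it.
    """
    return [batch_id for batch_id, checksum in log
            if batch_checksum(docs_in_target(batch_id)) != checksum]

# Example: the target silently lost doc "d2" out of batch "b1".
log = [("b1", batch_checksum(["d1", "d2"])), ("b2", batch_checksum(["d3"]))]
target = {"b1": ["d1"], "b2": ["d3"]}
print(inventory_check(log, target.__getitem__))  # ['b1']
```

This catches exactly the failure mode above: random document loss that a purely sequential log comparison would never see.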
That's nice. In practice, though, ElasticSearch doesn't behave like an eventually consistent system; it behaves like a flawed fully consistent system. It doesn't self-repair enough to be eventually consistent. If you get out of sync by more than a few seconds, you're going to have to repair the system manually in some fashion. It never "catches up."
Additionally, in practice, most of the data loss (and real eventual-consistency behavior) you'll see in an ElasticSearch+primary-data-store system isn't coming from within ES; it's coming from the queues people typically use in the sync process. So to some degree you're going to need to handle this on an application-specific basis.
> How do you know what data is missing once the count is wrong?
In practice, people just ignore it or do a full re-index. Theoretically, you should be building Merkle trees.
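A toy version of the Merkle idea, flattened to a single level of buckets for brevity (all names illustrative): compare per-bucket digests between the source of truth and the index, and only scan document-by-document inside buckets that disagree. A real Merkle tree nests this so you descend level by level.

```python
# Toy Merkle-style diff: hash (doc_id, version) pairs into a small number of
# buckets, compare bucket digests between the two stores, and only re-scan
# the buckets whose digests differ.

import hashlib

def stable_bucket(doc_id, buckets):
    """Stable bucket assignment (Python's hash() is randomized per process)."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % buckets

def bucket_digests(docs, buckets=8):
    """`docs` maps doc_id -> version; returns one digest per bucket."""
    sums = [hashlib.sha256() for _ in range(buckets)]
    for doc_id in sorted(docs):
        sums[stable_bucket(doc_id, buckets)].update(
            f"{doc_id}:{docs[doc_id]}".encode())
    return [s.hexdigest() for s in sums]

def suspect_buckets(source, index, buckets=8):
    """Bucket numbers where the two stores disagree."""
    a, b = bucket_digests(source, buckets), bucket_digests(index, buckets)
    return [i for i in range(buckets) if a[i] != b[i]]

source = {"d1": 1, "d2": 1, "d3": 2}
index = {"d1": 1, "d3": 2}              # "d2" went missing in the index
print(suspect_buckets(source, index))   # exactly one bucket left to re-scan
```

The payoff is that "how do I find the missing data" stops being a full table scan and becomes a scan of only the mismatched buckets.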
> Once a store grows to a certain state, you do not want to trigger full syncs, ever.
This is not really true. You need to maintain enough capacity for full syncs, because someone will need to do schema changes and/or change linguistic features in the search index. That said, there are two distinct problems here:
1. Determine how to do an incremental update, given that only the tail of the stream of updated documents is missing. Not as simple as just counting.
2. Determine when you must give up and fall back to a full sync; this is when not just the tail is missing, and finding the difference is computationally non-trivial. You'll only want to do this once you're sure that you need to.
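The two steps above can be sketched as one decision, under the assumption that both sides expose an ordered log of applied update IDs (names hypothetical):

```python
# Incremental-vs-full decision: if the target's log is a prefix of the
# source's log, only the tail is missing and we can stream it (#1); if the
# logs diverge anywhere earlier, finding the exact difference is the hard
# part, and falling back to a full sync (#2) may be the honest answer.

def plan_sync(source_log, target_log):
    if source_log[:len(target_log)] == target_log:
        return ("incremental", source_log[len(target_log):])  # replay the tail
    return ("full", source_log)  # mid-log divergence: give up and resync

print(plan_sync(["u1", "u2", "u3"], ["u1", "u2"]))  # ('incremental', ['u3'])
print(plan_sync(["u1", "u2", "u3"], ["u1", "u3"]))  # ('full', ['u1', 'u2', 'u3'])
```

The prefix check is cheap; it's the "full" branch you want to hit as rarely as possible, which is why you'd confirm the divergence (e.g., via the checksums discussed earlier) before pulling that trigger.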
My point remains that ElasticSearch's consistency model means it's hard to even do #1, which is the day-to-day streaming updates.
My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue.
(In my experience, queues are terrible at this. One problem is that it's hard to express different priorities this way. If you have a batch job that touches 1 million database rows, you don't want these to fill your "real-time" queue with pending indexing operations. Using multiple queues leaves you open to odd inconsistencies when updates are applied out of order. And so on. Polling triggered by notifications tends to be better.)
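A minimal sketch of the notification-triggered-polling shape (the class and loop are illustrative, not from any library): notifications only set a flag, and the indexer then reads changed rows from the database at its own pace and priority, so a batch job touching a million rows coalesces into one wakeup instead of a million queue entries.

```python
# Notifications collapse into a single "something changed" flag; the poller
# then queries the primary store for changed rows in whatever order and
# priority it likes, instead of replaying a queue in arrival order.

import threading

class DirtyFlag:
    """Any number of change notifications coalesce into one pending wakeup."""

    def __init__(self):
        self._cond = threading.Condition()
        self._dirty = False

    def notify(self):
        with self._cond:
            self._dirty = True
            self._cond.notify()

    def wait(self, timeout=None):
        """Block until notified (or timeout); returns whether work is pending."""
        with self._cond:
            if not self._dirty:
                self._cond.wait(timeout)
            pending, self._dirty = self._dirty, False
            return pending

flag = DirtyFlag()
flag.notify(); flag.notify(); flag.notify()   # a burst of notifications...
print(flag.wait(timeout=0.1))                 # ...is a single wakeup: True
print(flag.wait(timeout=0.1))                 # nothing new pending: False
```

The indexer loop would then be something like `while True: if flag.wait(30): index(fetch_changed_rows())`, where `fetch_changed_rows` is a hypothetical query against the primary store (e.g., by an updated-at column), which is where ordering and prioritization live.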