It's solvable, of course, but it's a problem that pops up with any synchronization system, and I'm surprised nobody (apparently) has written one, because it requires a fairly good state machine that can also compute diffs. Once a store grows to a certain state, you do not want to trigger full syncs, ever.
The best, most trivial (in terms of complexity and resilience) solution I have found is to sync data in batches, give each batch an ID, and record each batch both in the target (e.g., ElasticSearch) and in a log that belongs to the synchronization process. The heuristic is then to compute the difference between the two logs to see how far you need to catch up.
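A minimal sketch of that heuristic, assuming both sides keep an ordered log of applied batch IDs (all names here are illustrative, not a real ElasticSearch API): the catch-up point is the end of the two logs' common prefix.

```python
# Illustrative sketch of the batch-ID heuristic above; the logs and names
# are hypothetical, not a real ElasticSearch API.

def common_prefix_len(sync_log, target_log):
    """Length of the shared prefix of the two batch-ID logs."""
    n = 0
    for ours, theirs in zip(sync_log, target_log):
        if ours != theirs:
            break
        n += 1
    return n

def batches_to_replay(sync_log, target_log):
    """Batch IDs the target is missing, assuming losses are only at the tail."""
    return sync_log[common_prefix_len(sync_log, target_log):]

# The sync process recorded four batches; the target only saw the first two.
print(batches_to_replay(["b1", "b2", "b3", "b4"], ["b1", "b2"]))  # ['b3', 'b4']
```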
This will only work in a sequential fashion; if ElasticSearch loses random documents, it won't be picked up by such a system. You could fix this by letting each log entry store the list of logical updates, checksummed; and then do regular (eg., nightly) "inventory checks".
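A rough sketch of what such a checksummed log entry and nightly inventory check could look like (every name here is hypothetical): the sync log stores a checksum per batch, and the check recomputes it from what the target actually holds.

```python
# Hypothetical inventory check: each log entry records a checksum over the
# logical updates in a batch; the nightly check recomputes the checksum from
# the target's actual contents and flags batches that no longer match.

import hashlib

def batch_checksum(doc_ids):
    """Order-independent checksum over the document IDs in one batch."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode())
    return digest.hexdigest()

def inventory_check(log, docs_in_target):
    """Return batch IDs whose documents in the target no longer match the log.

    `log` is a list of (batch_id, checksum) entries kept by the sync process;
    `docs_in_target(batch_id)` returns the IDs the target still has for it.
    """
    return [batch_id for batch_id, checksum in log
            if batch_checksum(docs_in_target(batch_id)) != checksum]

# Example: the target silently lost doc "d2" out of batch "b1".
log = [("b1", batch_checksum(["d1", "d2"])), ("b2", batch_checksum(["d3"]))]
target = {"b1": ["d1"], "b2": ["d3"]}
print(inventory_check(log, target.__getitem__))  # ['b1']
```

This catches exactly the failure mode above: random document loss that a purely sequential log comparison would never see.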
That's nice. In practice, though, ElasticSearch doesn't behave like an eventually consistent system; it behaves like a flawed fully consistent system. It doesn't self-repair enough to be eventually consistent. If you get out of sync by more than a few seconds, you're going to have to repair the system manually in some fashion. It never "catches up."
Additionally, in practice, most of the data loss (and real eventual-consistency behavior) you'll see in an ElasticSearch+primary-data-store system isn't coming from within ES; it's coming from the queues people typically use in the sync process. So to some degree you're going to need to handle this on an application-specific basis.
> How do you know what data is missing once the count is wrong?
In practice, people just ignore it or do a full re-index. Theoretically, you should be building Merkle trees.
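A toy version of the Merkle idea, flattened to a single level of buckets for brevity (all names illustrative): compare per-bucket digests between the source of truth and the index, and only scan document-by-document inside buckets that disagree. A real Merkle tree nests this so you descend level by level.

```python
# Toy Merkle-style diff: hash (doc_id, version) pairs into a small number of
# buckets, compare bucket digests between the two stores, and only re-scan
# the buckets whose digests differ.

import hashlib

def stable_bucket(doc_id, buckets):
    """Stable bucket assignment (Python's hash() is randomized per process)."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % buckets

def bucket_digests(docs, buckets=8):
    """`docs` maps doc_id -> version; returns one digest per bucket."""
    sums = [hashlib.sha256() for _ in range(buckets)]
    for doc_id in sorted(docs):
        sums[stable_bucket(doc_id, buckets)].update(
            f"{doc_id}:{docs[doc_id]}".encode())
    return [s.hexdigest() for s in sums]

def suspect_buckets(source, index, buckets=8):
    """Bucket numbers where the two stores disagree."""
    a, b = bucket_digests(source, buckets), bucket_digests(index, buckets)
    return [i for i in range(buckets) if a[i] != b[i]]

source = {"d1": 1, "d2": 1, "d3": 2}
index = {"d1": 1, "d3": 2}              # "d2" went missing in the index
print(suspect_buckets(source, index))   # exactly one bucket left to re-scan
```

The payoff is that "how do I find the missing data" stops being a full table scan and becomes a scan of only the mismatched buckets.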
> Once a store grows to a certain state, you do not want to trigger full syncs, ever.
This is not really true. You need to maintain enough capacity for full syncs, because someone will need to do schema changes and/or change linguistic features in the search index. That said, there are two distinct problems here:
1. Determine how to do an incremental update, given that only the tail of the stream of updated documents is missing. Not as simple as just counting.
2. Determine when you must give up and fall back to a full sync; this is when not just the tail is missing, and finding the difference is computationally non-trivial. You'll only want to do this once you're sure that you need to.
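The two steps above can be sketched as one decision, under the assumption that both sides expose an ordered log of applied update IDs (names hypothetical):

```python
# Incremental-vs-full decision: if the target's log is a prefix of the
# source's log, only the tail is missing and we can stream it (#1); if the
# logs diverge anywhere earlier, finding the exact difference is the hard
# part, and falling back to a full sync (#2) may be the honest answer.

def plan_sync(source_log, target_log):
    if source_log[:len(target_log)] == target_log:
        return ("incremental", source_log[len(target_log):])  # replay the tail
    return ("full", source_log)  # mid-log divergence: give up and resync

print(plan_sync(["u1", "u2", "u3"], ["u1", "u2"]))  # ('incremental', ['u3'])
print(plan_sync(["u1", "u2", "u3"], ["u1", "u3"]))  # ('full', ['u1', 'u2', 'u3'])
```

The prefix check is cheap; it's the "full" branch you want to hit as rarely as possible, which is why you'd confirm the divergence (e.g., via the checksums discussed earlier) before pulling that trigger.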
My point remains that ElasticSearch's consistency model means it's hard to even do #1, which is the day-to-day streaming updates.
My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue.
(In my experience, queues are terrible at this. One problem is that it's hard to express different priorities this way. If you have a batch job that touches 1 million database rows, you don't want these to fill your "real-time" queue with pending indexing operations. Using multiple queues leaves you open to odd inconsistencies when updates are applied out of order. And so on. Polling triggered by notifications tends to be better.)
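A minimal sketch of the notification-triggered-polling shape (the class and loop are illustrative, not from any library): notifications only set a flag, and the indexer then reads changed rows from the database at its own pace and priority, so a batch job touching a million rows coalesces into one wakeup instead of a million queue entries.

```python
# Notifications collapse into a single "something changed" flag; the poller
# then queries the primary store for changed rows in whatever order and
# priority it likes, instead of replaying a queue in arrival order.

import threading

class DirtyFlag:
    """Any number of change notifications coalesce into one pending wakeup."""

    def __init__(self):
        self._cond = threading.Condition()
        self._dirty = False

    def notify(self):
        with self._cond:
            self._dirty = True
            self._cond.notify()

    def wait(self, timeout=None):
        """Block until notified (or timeout); returns whether work is pending."""
        with self._cond:
            if not self._dirty:
                self._cond.wait(timeout)
            pending, self._dirty = self._dirty, False
            return pending

flag = DirtyFlag()
flag.notify(); flag.notify(); flag.notify()   # a burst of notifications...
print(flag.wait(timeout=0.1))                 # ...is a single wakeup: True
print(flag.wait(timeout=0.1))                 # nothing new pending: False
```

The indexer loop would then be something like `while True: if flag.wait(30): index(fetch_changed_rows())`, where `fetch_changed_rows` is a hypothetical query against the primary store (e.g., by an updated-at column), which is where ordering and prioritization live.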