Eventual consistency is embracing this philosophy of a lack of consistency for computer systems too, on the basis that maintaining actual consistency would be too expensive/complex/slow, which is frequently the case.
This of course, in principle, can lead to ever degrading consistency and since you can't assume everything is consistent, you also cannot really verify consistency in any other way than heuristically, as another commenter suggested.
Eventual consistency is a design driven by practical needs. It is never a path to reach complete data purity.
And this applies both to streaming and batch tasks alike.
Maintaining actual consistency is seldom more complex; the opposite is true. Eventual consistency can lead to mind-boggling complexity, because it's very hard to reason about your guarantees anymore, even the "eventual correctness" guarantee. In practice it's more often than not a handwavy "yeah, it's likely probably correct in many cases, and if you find something wrong, we'll take it as a bug and fix it. Or at least claim to fix it, because you know, it might be hard to reproduce". Good enough for use cases like advertising, I guess.
Too expensive/slow is the typical reason for eventual consistency - but the whole point of materialize.io is to challenge this "too expensive/slow" assumption.
How exactly is it challenging it? Spanner is too expensive.
If you instead implement low-latency systems where each step along a dataflow involves a round-trip through replicated highly available storage, Spanner if you like or even just Kafka, then 100% you might reasonably conclude that eventual consistency is the right call. This is roughly the situation that microservice implementors currently find themselves in. I don't think it is a great situation to be in, personally.
The value proposition with something like Materialize (and there are other options) is that you can get consistency and performance if you can express your computation as something more structured than imperative code that writes to and reads from storage. In our case, the "something" is SQL.
Hope that helps!
The fewer guarantees of correctness on your daily/weekly/whatever releases, the messier your downstream data is gonna be. Monday's data is partially missing due to a bug in the client; Tuesday's data is weird/nonrepresentative because of a server bug that caused 5% of sessions to get disconnected; Wednesday's data is good; Thursday's data is good but was a release day and the feature changed so it means different stuff...
That companies have collected more data than they can afford to process is a separate issue, I think.
It's great that your DB is ACID and anyone who queries it gets the latest and greatest, but in reality you also have out-of-date caches, ORM models that haven't been persisted, apps where users are modifying data that hasn't been pushed back to the server, and a million other examples.
I'm sure it's possible to create a consistent system but I'm also sure it's not practical. No one does it.
Instead of constantly fighting eventual consistency just learn to embrace it and its shortcomings. Design systems and write code that are resilient to splits in HEAD and provide easy methods to merge back to a single truth.
This is on top of regular "nope, can't do that" code that you would write in both systems.
Billions of dollars flow through fully consistent systems every day. The basic IT concept for smaller hedge funds is "buy the biggest MSSQL machine available on the planet and move on." The big ones have custom frameworks that resemble Frank's arguments here, though the abstractions are different.
And the result is exactly what grandparent was complaining about: sure your database server is full ACID, but a trader is looking at numbers on their screen that are out of date (and pressing the trade button on that basis), and that's what actually matters.
Oh, some people do. I used this EXACT phrase when I came in to fix an analytics system at a healthcare company that was plagued with analytics problems. They had 5 senior engineers, full-time, working on this system for years. It had persistent problems and could not be modified in any meaningful way. Upstream systems sent data through multiple SQS topics (duplicate and out-of-order data) fed into lambda, fed into a giant cache-db which tried to catch dupes and order data, fed into files, processed in batch. It was a horror show in complexity and cost (despite the near-free lambdas). A distributed set of large data streams was feeding into a singular database which was processed multiple times and put back in the same database. What's billions of inserts into an amazon postgres db, per hour? The company cloud infrastructure gave 0 other tools to work with. I shored up the batch processing (which had all kinds of try catch everywhere, despite a fixed schema) and went on to another company. Medical company software is always a ball of fail.
I was hoping for a happier ending there. What could have potentially fixed their situation?
We have consistency across our distributed system (~75 services currently) for all the fundamentals of our business. It is not difficult to do at all.
This article reads as though the author hadn't shifted mindset from "the database will solve it for me" to "I'm taking on the relevant subset of problems in my use case". This seems off given that they're trying to sell a streaming product. They claim their product avoids problems by offering "always correct" answers which requires a footnote at the very least but none was given.
Point of note: The consistency guarantee is that, upon processing to the same offset in the log and given that you have taken no other non-constant input, you will have the same computational result as all other processes executing semantically equivalent code.
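To make that guarantee concrete, here is a minimal sketch in plain Python. The event shape `(key, delta)` and the `replay` helper are made up for illustration and are not taken from any particular streaming product. The point is only that a pure fold over the log, stopped at a given offset, gives every reader the same state.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (key, delta): hypothetical event shape


def replay(log: Iterable[Event], upto_offset: int) -> Dict[str, int]:
    """Fold the first `upto_offset` events of the log into per-key totals."""
    state: Dict[str, int] = {}
    for offset, (key, delta) in enumerate(log):
        if offset >= upto_offset:
            break
        state[key] = state.get(key, 0) + delta
    return state


log = [("a", 1), ("b", 2), ("a", 3), ("b", -2)]

# Two independent readers of the same log, stopped at the same offset,
# take no other input and therefore compute the same state.
replica_1 = replay(log, 3)
replica_2 = replay(list(log), 3)
```

Anything non-deterministic inside the fold (wall-clock reads, random numbers, calls to other services) would count as "non-constant input" and break the equality.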
I take this sort of comment as abusive of the reader:
> What does a naive application of eventual consistency have to say about
>
> -- count the records in `data`
> select count(*) from data
>
> It’s not really clear, is it?
A naive application of eventual consistency declares that, along some equivalent of a Lamport timestamp across the offsets of shards in the stream, the system will calculate a count of records in `data` as of that offset. Given the ongoing transmission of events that can alter the set `data`, that value will continue changing as appropriate and in a manner consistent with the data it processes. The new answers will be given when the query is run again, or it may even issue an ongoing stream of updates to that value.
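As a rough illustration of "count as of an offset", the sketch below treats the input as a single ordered list of insert/delete changes and emits a new count after each one. The `(record_id, diff)` shape and the `count_updates` name are assumptions for the example; a real system would also have to handle multiple shards, which this ignores.

```python
from typing import Iterator, List, Tuple

Change = Tuple[str, int]  # (record_id, +1 insert / -1 delete): assumed shape


def count_updates(changes: List[Change]) -> Iterator[Tuple[int, int]]:
    """Yield (offset, count) after each change: count(*) as of that offset."""
    count = 0
    for offset, (_record_id, diff) in enumerate(changes):
        count += diff
        yield offset, count


changes = [("r1", +1), ("r2", +1), ("r1", -1), ("r3", +1)]
updates = list(count_updates(changes))
```

Each emitted pair is an answer that is correct for the prefix of the input it names, which is the "ongoing stream of updates" reading of the query.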
Maybe it got better as the article went on...
This is a good article from a high-profile author. The way you are criticizing it comes across as ignorant and narcissistic.
"This article reads as though the author hadn't shifted mindset from..."
You are claiming that the author is looking at the problem in the wrong way, and you came to that conclusion before even reading the entire post. It's okay to not be interested in a topic, but stating this in public does not add anything of substance to the discussion.
"I take this sort of comment as abusive of the reader"
This is just offensive, you're basically saying that the comment is stupid. It seems like you are looking for some validation of your intelligence, by the article or by the comment section here.
I'm sorry for expressing ignorant perspectives and behaving narcissistically. My intention was not to call the author stupid or ask for validation of my intellect but I can see how that comes out - thank you for identifying that. I'm sorry and thank you for making it through my comment and being willing to point my failure out to me. I really appreciate it.
I'm actually quite interested in the topic of stream processing and eventual consistency. I was one of the contributors to an open source stream processing based project that won architectural awards at a reasonably big conference, spent a little time working in a machine learning oriented startup based on stream processing with some of the creators of Apache Beam/DataFlow, and now my daily labor is implementing stream processing in another startup. This area of systems design (event sourcing specifically) has been a bit of a professional obsession and labor of love for me over the last five years or so now.
I very clearly failed to express any of that in my post. I further clearly failed to manage my emotional reaction to the article and threw anything useful I may have had an opportunity to add under the bus of my words. I'm sorry for that too.
Again - Thank you very much for helping me grow. I keep working on being a better human and appreciate the help.
Yeah, I was expecting to see what tradeoffs Materialize made to get an 'always correct' result. There is definitely something 'lost' for 'always correct' too.
I can only attribute this one-sided take to deviousness. Personally, I would avoid whatever this company is selling.
I'm familiar with streaming, as a concept, from the likes of Beam, Spark, Flink, Samza - they do computations over data, producing intermediate results consistent with the data seen so far. These results are, of course, not necessarily consistent with the larger world because there could be unprocessed or late events in a stream, but they are consistent with the part of the world seen so far.
The advantage of streaming is the ability to compute and expose intermediate snapshots of the world that don't rely on the stream closing (As many streams found in reality are not bounded, meaning intermediate results are the only realizable result set). These intermediate results can have value, but that depends on the problem statement.
To examine one of the examples, let's use example 2, which aligns with the idea that we actually don't have a traditional streaming problem. The question being asked is "What is the key which contains the maximum value". There is a difference between asking "What is the maximum so far today" and "What was the maximum result today". The tense change is important: in the former the user cares about the results as they exist in the present moment, whereas in the latter the user cares about a view of the world in a time frame that is complete. It seems like the idea of "consistent" is being conflated with "complete", wherein "complete" is not a guaranteed feature of an input stream.
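The two tenses can be sketched as two functions over the same input. This is plain Python with made-up names (`max_so_far`, `max_when_complete`) and a simplified event shape in which values are only appended and never retracted, so it is not a model of the article's example 2 in full.

```python
from typing import Iterable, Iterator, Optional, Tuple

Reading = Tuple[str, float]  # (key, value): hypothetical event shape


def max_so_far(stream: Iterable[Reading]) -> Iterator[Tuple[str, float]]:
    """Emit the current (key, value) maximum after every event."""
    best: Optional[Tuple[str, float]] = None
    for key, value in stream:
        if best is None or value > best[1]:
            best = (key, value)
        yield best


def max_when_complete(stream: Iterable[Reading]) -> Optional[Tuple[str, float]]:
    """Return the maximum only after the input has ended (bounded input)."""
    final = None
    for final in max_so_far(stream):
        pass
    return final


events = [("a", 3.0), ("b", 7.0), ("c", 5.0)]
intermediate = list(max_so_far(events))
final = max_when_complete(events)
```

On an unbounded stream `max_when_complete` never returns, which is the sense in which only the intermediate answers are realizable there.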
Could anyone clarify why the examples here aren't just a case of expecting bounded vs unbounded streams?
I think when folks say that eventual consistency is okay, they're thinking about simple aggregates - where transient incorrectness in the result is indistinguishable from noise.
But if you want to do joins, you really want to be able to reason about your unbounded streams causally - Flink, Beam, (and as another commenter points out, Firebase as well) provide stronger consistency guarantees on computations over unbounded streams.
This might still be fine, depending on your needs, but IMO a legitimate distinction.
> Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative dataparallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.
See also this friendlier (and lengthier) online book: https://timelydataflow.github.io/differential-dataflow/
Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?
And finally, how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (ie, a logical timestamp)?
This is covered by the CAP theorem. https://en.wikipedia.org/wiki/CAP_theorem
The basic solution is: If you need consistency and there's too much network failure (or delay), you'll have to pause operations and wait until the network is fixed.
If there's only a bit of network failure (or delay), consistency stays possible using quorum protocols such as Paxos and Raft.
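The "pause when too much of the network is gone, continue when only a bit is" behaviour comes from majority quorums. The sketch below is not Paxos or Raft; it only checks the arithmetic they rely on, with function names invented for the example: a node may proceed if it can reach a majority, and any two majorities share at least one node.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def can_make_progress(reachable: int, n: int) -> bool:
    """With fewer than a majority reachable, a consistent system must wait."""
    return reachable >= majority(n)


def every_pair_of_quorums_overlaps(n: int) -> bool:
    """Brute-force check that any two majority quorums intersect."""
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(a & b for a, b in combinations(quorums, 2))


five_node_ok = can_make_progress(3, 5)       # two nodes unreachable
five_node_stuck = can_make_progress(2, 5)    # three nodes unreachable
overlap = every_pair_of_quorums_overlaps(5)
```

The overlap is what lets a later quorum learn what an earlier one decided, so a minority partition cannot diverge on its own.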
> how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (ie, a logical timestamp)?
Implicit causality helps.
You're right that there may be no definite logical time, but it often doesn't matter.
When a program issues a read command, the logical timestamp is, implicitly, greater than the timestamp of all results previously received from the network that were inputs to produce the read command.
So the rest of the network "knows" something about the logical time of the read command. It's not an exact logical time, and if the timestamps aren't passed around, it might not even be an inequality. It's more like a logical property that relates dependent values.
If done right, that's enough to ensure strict consistency in observable results.
Unless the program issuing reads does wild things with value speculation. You may have heard how much things can go wrong with speculative execution...
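A Lamport-style clock makes the implicit ordering explicit. This is a minimal sketch with an invented `LamportClock` class; the comment above notes that real systems may not pass timestamps around at all, so treat the explicit numbers here as an illustration of the ordering property, not of how any given system implements it.

```python
class LamportClock:
    """Tiny logical clock: new commands are stamped after everything observed."""

    def __init__(self) -> None:
        self.time = 0

    def observe(self, received: int) -> None:
        """Merge the timestamp attached to a result we received."""
        self.time = max(self.time, received)

    def stamp(self) -> int:
        """Stamp a new outgoing command, strictly later than all observations."""
        self.time += 1
        return self.time


client = LamportClock()
for result_ts in (4, 9, 7):   # results that fed into the decision to read
    client.observe(result_ts)
read_ts = client.stamp()
```

A program that speculates on values it has not actually received would issue a read without the corresponding `observe`, which is where the ordering argument stops holding.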
Pushing a timestamp along with the max/variance change stream[1], and then using that timestamp to synchronize the join[2], would naturally produce a consistent output stream.
I quoted Flink because they have the best docs around. But it should be possible in most streaming systems. Disclaimer, I used to work for the fb streaming group and have collaborated with the Flink team very briefly.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11...
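The idea of a timestamp-synchronized join can be sketched without any streaming framework. The code below is plain Python, not Flink's API; the `(source, ts, value)` event shape and the `join_by_timestamp` name are assumptions, and it assumes each side produces at most one update per timestamp.

```python
from typing import Dict, Iterable, Iterator, Tuple

Update = Tuple[str, int, float]  # (source, timestamp, value): assumed shape


def join_by_timestamp(events: Iterable[Update]) -> Iterator[Tuple[int, float, float]]:
    """Emit (ts, max, variance) only once both sides have reported for ts."""
    pending: Dict[str, Dict[int, float]] = {"max": {}, "var": {}}
    for source, ts, value in events:
        other = "var" if source == "max" else "max"
        if ts in pending[other]:
            other_value = pending[other].pop(ts)
            if source == "max":
                yield ts, value, other_value
            else:
                yield ts, other_value, value
        else:
            # Hold this side back until the matching timestamp arrives.
            pending[source][ts] = value


interleaved = [
    ("max", 1, 10.0),
    ("max", 2, 12.0),   # max is ahead of variance here
    ("var", 1, 0.5),
    ("var", 2, 0.8),
]
joined = list(join_by_timestamp(interleaved))
```

The output never pairs the max at timestamp 2 with the variance at timestamp 1, which is the inconsistency an unsynchronized join could expose. The cost is buffering and added latency while one side waits for the other.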
If the title was something more honest such as “How product X solves for Y” I’d feel more compelled to put trust on the analysis being objective.
Is that a correct interpretation?