Like most of these issues, the solution requires experience to know when you are at the Goldilocks point (just right). This specific issue has a lot in common with managing database migrations in Django or any other migration system.
The ideal situation is to create migrations that can always be rolled back, but sometimes this is not possible to do operationally. For example, a schema change that restricts a field from NVARCHAR to INTEGER can only be generically rolled back if all of the unconvertible data is persisted. This can be mitigated by structuring the database to avoid these dead ends, and that instinct really is only gained by hard-won experience.
The problem with undoing operations via new events is the same thing -- unless you have foreknowledge of this kind of problem, it is very easy to accidentally create events that perform un-undoable actions. A very simple example of a problematic event would be something that modifies a foreign-key relationship -- let's say it's a digital asset in a game and you want the ability to transfer ownership of the item from one player to another.
The simple solution is an event like SetAssetOwner(asset_pk, player_pk). This would set the item's player foreign key field to the player's primary key. Easy. However, you have lost knowledge of where the asset came from and cannot undo this operation. A better solution would be to make an event SwapAssetOwner(asset_pk, owner_pk, recipient_pk). Yes, the owner_pk is technically redundant, but it provides a check against someone trying to steal an item with a maliciously crafted SetAssetOwner event that performs no checking. Even better, this operation can be reverted by sending the same event with the owner and recipient arguments reversed. Since these properties are part of the event message, they will be persisted in the event history, and all of the information needed to undo the event is self-contained.
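A minimal sketch of that idea in Python (the event name comes from the comment above; the dict-backed ownership state and `TransferError` are just illustrative):

```python
from dataclasses import dataclass


class TransferError(Exception):
    pass


@dataclass(frozen=True)
class SwapAssetOwner:
    asset_pk: int
    owner_pk: int       # technically redundant, but acts as a guard
    recipient_pk: int

    def inverse(self) -> "SwapAssetOwner":
        # Undo is just the same event with the roles reversed.
        return SwapAssetOwner(self.asset_pk, self.recipient_pk, self.owner_pk)


def apply_swap(owners: dict, event: SwapAssetOwner) -> None:
    # Reject events whose claimed owner doesn't match current state,
    # which defeats the maliciously crafted "steal an item" event.
    if owners.get(event.asset_pk) != event.owner_pk:
        raise TransferError("owner check failed")
    owners[event.asset_pk] = event.recipient_pk
```

Replaying the event history with `apply_swap`, or undoing a transfer via `event.inverse()`, needs nothing beyond what's in the event messages themselves.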
I think ideally you want something like what Rich Hickey describes with Datomic: an append-only database. You can see what the previous values for that row were, along with schema changes.
On the eventual consistency point, I've found you can get quite far with having the read model managing the race condition. This probably doesn't work everywhere, but in our system, multiple users can accept an invitation, so we have something like `InvitationAccepted{invitation_id, user_id}`. It's possible that multiple users might accept the invitation at roughly the same time, but the command-side doesn't really have to be concerned with this - it can happily allow multiple users to accept an invitation. It's up to the read model to ask, 'has this invitation already been accepted?' - if not, the acceptance is successful and will be indicated when queried, otherwise the acceptance is unsuccessful (and as a bonus, we can separately record who unsuccessfully tried to accept it). From the user's point of view, when they accept the invitation, they see a spinner until we confirm with the read model one way or the other (this could be done by polling the read model, but in our case we have an event sent back to the client).
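A rough sketch of what such a read model might look like (the names are made up; a real projection would be backed by a store rather than in-memory dicts):

```python
class InvitationReadModel:
    """Projection that resolves the 'who accepted first?' race."""

    def __init__(self):
        self.accepted_by = {}      # invitation_id -> winning user_id
        self.failed_attempts = []  # bonus: record who lost the race

    def on_invitation_accepted(self, invitation_id, user_id):
        # The command side happily emits multiple InvitationAccepted
        # events; only the first one observed here wins.
        if invitation_id in self.accepted_by:
            self.failed_attempts.append((invitation_id, user_id))
        else:
            self.accepted_by[invitation_id] = user_id

    def was_accepted_by(self, invitation_id, user_id):
        # What the client polls (or is notified about) after the spinner.
        return self.accepted_by.get(invitation_id) == user_id
```

The command side stays simple; all the race handling lives in the projection.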
Coming up with the event schema and versioning/granularity are hard. We have version numbers on all our events to make this a bit more manageable/explicit (`InvitationAccepted1`, for example). Storing events in a relational database does make it a bit easier to go back and edit/upgrade/delete them (sort-of cheating, but also relevant for GDPR). Also, I think we're going to end up suffering a bit from the 'whole system fallacy', but at the moment namespacing all the events (keeping in mind their expected volume) makes it a lot easier to manage.
I feel like you've skipped over the interesting part of your strategy here. If it's an eventually consistent system, what keeps the read model from having the wrong answer to this question?
In CQRS, eventual consistency does not mean we have multiple servers such that two servers give two different answers. It means there is a delay between the moment a command is issued and the moment the resulting event has propagated to all read models.
You need to handle the race condition at some point or another. From a CQRS point of view, a user accepting an invitation is just an event like any other event. What happens based on that event must account for the possibility that multiple users have accepted, and it's a rather straightforward thing to solve with an event store. The "accept" event with the lowest sequence number is the first.
Having the read side handle it would probably mean that when you have a read model for accepted invitations, you ignore all but the first "accepted" event for an "invitationId".
I explicitly keep a version in the event - which is a date. It's similar to how Stripe versions their API. We're also planning to handle the events in the same way as Stripe's API: each event has side effects; the side effects may change depending on the version, and each version has its own application logic (cascading, so 2018-08-01 runs all of 2018-07-30 plus its own changes).
This lets us replay events as they happened, run an event using two different versions and perform only the diffs etc.
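A toy sketch of how such cascading version handlers could work (the dates, decorator, and effect names are placeholders for illustration, not Stripe's actual mechanism):

```python
# Handlers keyed by version date; each version's handler runs its
# parent's logic first, so "2018-08-01" does everything "2018-07-30"
# does plus its own changes.
HANDLERS = {}


def version(date, parent=None):
    def register(fn):
        def handler(event, effects):
            if parent is not None:
                HANDLERS[parent](event, effects)
            fn(event, effects)
        HANDLERS[date] = handler
        return handler
    return register


@version("2018-07-30")
def base(event, effects):
    effects.append(("update_model", event["id"]))


@version("2018-08-01", parent="2018-07-30")
def with_notification(event, effects):
    effects.append(("send_notification", event["id"]))


def handle(event):
    # Dispatch on the version stored inside the event itself, so old
    # events can be replayed exactly as they originally ran.
    effects = []
    HANDLERS[event["version"]](event, effects)
    return effects
```

Replaying an old event with its original version, or running it under two versions and diffing the effect lists, falls out of this structure naturally.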
Our system is probably not a typical CQRS/event sourcing setup.
The event system itself is idempotent: an event carries all the input data necessary to run it (form data, relevant current state), so the system can run independently. This means that every event is typed such that its input data dictates what is necessary.
The event handler validates the event, returning errors if necessary.
Then, the event handler runs all side effects and returns operations to perform: update model X attributes to Y, insert new record Z.
In effect, we go:
User -> API -> (generate event) -> Event Handler -> (error|response) -> save event and side effects in a transaction.
This means our DB is a cache of the event and all previous data, so we're not really event sourcing — we're audit trailing.
The main benefits are:
1. Medical records are complex and we always need audit trails.
2. If a doctor submits a prescription, we can show all the side effects that happened, for visibility (e.g. this triggered a lab task, a push notification, sent this message). We can verify this in the UI and see what happened for each patient without relying on assumptions.
3. Engineers know that the API produces events and can look up exactly the side effects that happen when an event occurs (we're using Rails for the API logic right now and this isn't always obvious).
4. We can ensure that we validate when an event happens based on input and current state without complex code, catching edge cases.
5. We can choose to save the event and side effects or not. This lets us "preview" actions or "replay" actions without actually changing any world state (you toggle a "test" flag in the event which also means the event handlers know not to trigger outside side effects).
6. The "side effects" response from the event handler can be sent to a websocket observable and consumed by frontends, ensuring that the doctor UI always has an up to date version of patient data.
Random thoughts:
- It's really just a framework for the logic of an application controller that's typed and ensures everything is consistent. Plus, similar to Stripe, it allows us to version events and write migrations/upgrade paths etc.
- What about conflicts? We have a plan to use hashes of the previous data to ensure consistency with medical records: if you're modifying fields A, B, C, you send over a hash of the previous data for A, B, C alongside the request. If the event handler can't verify the hash the data must've changed in the meantime.
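A sketch of what that hash check might look like (field and function names are invented; a real system would compute the hash over the server-side record at read time):

```python
import hashlib
import json


class ConflictError(Exception):
    pass


def field_hash(record, fields):
    # Stable hash over just the fields the client is about to modify.
    subset = {f: record[f] for f in sorted(fields)}
    return hashlib.sha256(
        json.dumps(subset, sort_keys=True).encode()
    ).hexdigest()


def apply_update(record, changes, claimed_hash):
    # Reject the event if those fields changed since the client read them.
    if field_hash(record, changes.keys()) != claimed_hash:
        raise ConflictError("data must've changed in the meantime")
    record.update(changes)
```

The client sends `claimed_hash` alongside the request; a mismatch means someone else touched fields A, B, C first.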
----
We're producing events now but the handlers aren't yet in place, so this is currently still being planned. Essentially, we're using the API as authentication, authorization, routing/HTTP management, transaction/database management while the event/controller logic is being placed into a structured framework to ingest form data, current state and produce output.
"However the events in a event store are immutable and can’t be deleted, to undo an action means sending the command with the opposite action"
Well they don't have to be immutable. I don't see why you can't update/migrate events.
CQRS/ES is a pattern and doesn't need to be shoe-horned into every solution.
I work at Transport for London and we have a CQRS/ES system for managing data-driven design data. It is synchronous, and is incredibly useful for ensuring it has both strong business logic and a fast read side, while providing auditing for free.
We still have problems with EC and CAP, and they are not due to CQRS/ES. Make the system distributed and you will have to handle all the new scenarios. Those are the trade-offs.
Then you're giving up several of the benefits of CQRS, and might as well just not bother with the additional complexity.
> Well they don't have to be immutable. I don't see why you can't update/migrate events.
The events are your source of truth. When you "migrate" your source of truth, you're in somewhat dangerous waters (and if you're using CQRS for scale, a 1% failure to migrate data might be millions of records). You also now have the issue that your source of truth is being migrated and probably can't accept writes (or you need to emit writes in both the new format and the old format until whenever your changeover happens).
Even without asynchronous read/write, there are still benefits worth the (arguably, small) additional complexity. For instance, the ability to add new functionality without having to migrate existing data is amazing.
Another thing is strongly consistent models: some problem areas have valid requirements for a strongly consistent, normalized model used for command validation. This helps especially well when not all requirements are known up front and/or the business domain changes very frequently. A small business change may require completely redoing the aggregate roots and logic if you follow the standard approach, which is very expensive. A better decision could be to use a normalized SQL database instead of an aggregate root. Such an approach may be more flexible in certain cases and has its own benefits as well as costs and drawbacks.
But if strong consistency was abandoned because someone wrote general statements in favor of eventual consistency...
The solution is said to be CQRS, and nothing in the article shows how you solve that.
If you unwind the unnecessarily long-winded, florid poetic waxing, it all comes down to “dump it to a database, and query that”.
Yes.
In some cases, "dump" is a fold/reduce, and your database is just an in memory data structure, and depending on how much latency is permitted by your service level objectives you might cache the data structure as opposed to regenerating it every time.
There's no magic.
The pattern is analogous to what you would do if your book of record were an RDBMS, and you had to run graph queries. "Dumping" the data into a graph database and running the query there would be a tempting solution, no?
I think it would cover a few of the use cases that people are turning to event sourcing for (excluding scale).
While this clearly wouldn't work in high-volume cases (i.e. where you _actually_ need CQRS), it seems like this would be the simplest option for many systems. I see a lot of articles advocating for immediately jumping into CQRS, which seems like a big increase in architectural complexity.
Does anyone have opinions/experience on this approach?
It's been running in production for a couple of months now, and it's been working great! The main drawback I can think of is that it obviously didn't work well with Swagger out of the box; I had to do some custom handling to show all the available event types.
It solves the consistency problem because you can create your event inside a transaction, which will roll back if another event touching the same source is created simultaneously.
E.g. if you have these incompatible events in a ledger:
CreditAccount(account_id=123, amount=100)
DebitAccount(account_id=123, amount=60)
DebitAccount(account_id=123, amount=60)
You'd want one of the debit transactions to fail, assuming you want to preserve the invariant that the account's balance is never negative. You could put the `account_id` UUID in an `Event.source` field, which would allow you to lock the table for rows matching that UUID.
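Here's a minimal sketch of that idea using SQLite, whose single-writer model serializes the transactions for us; in Postgres you'd instead `SELECT ... FOR UPDATE` on rows matching `Event.source` inside the transaction. Table and column names are made up:

```python
import sqlite3


def append_event(conn, source, kind, amount):
    with conn:  # one transaction per appended event
        # Fold the existing ledger for this source to get the balance.
        (balance,) = conn.execute(
            "SELECT COALESCE(SUM(CASE kind WHEN 'credit' THEN amount "
            "ELSE -amount END), 0) FROM events WHERE source = ?",
            (source,),
        ).fetchone()
        delta = amount if kind == "credit" else -amount
        # Enforce the invariant before the event is appended.
        if balance + delta < 0:
            raise ValueError("balance would go negative")
        conn.execute(
            "INSERT INTO events (source, kind, amount) VALUES (?, ?, ?)",
            (source, kind, amount),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, kind TEXT, amount INTEGER)")
append_event(conn, "123", "credit", 100)
append_event(conn, "123", "debit", 60)
# A second DebitAccount(123, 60) would raise ValueError and roll back,
# leaving the ledger untouched.
```

Because the check and the append share a transaction, two simultaneous debits cannot both observe the pre-debit balance and both commit.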
TLDR: ORMs + large transactions resulted in a lot of unexpected complexity.

At some point you get really big transactions, because one write triggers five process managers which all trigger more writes, and so on and so on. Performance was not a problem, but I was surprised by the complexity of these big transactions in combination with an ORM.
I don't have a concrete example, but over the course of two years we encountered multiple bugs that took days to solve. There's one of these bugs that we fixed without ever identifying the root cause, to this day.
One real database that works like this is Datomic, which is competitive with SQL for most kinds of read-heavy data modeling loads that SQL and CQRS are used for.
If you do even the most basic research into Event Sourcing, you will find people warning of these trade-offs (foremost amongst them Greg Young, the person who originally set out the idea).
So, where does that happen? When getting written into the read model? Then what, it emits an event saying clearly that the previous action failed? All while there's a client waiting for results?
I'd love to see a worked example based on discrete resources, such as two people, a box, and a ball, where the ball can be held by either person or in the box, and a person can take the ball from the box but not from another person.
I find the concept intriguing but I don't really get it and I haven't been able to identify in any of the writing where "the buck stops".
I'm still working on event/message-driven systems (today with Elixir mostly), but I've started to make architectural compromises to move away from event sourcing especially. Event Sourcing + CQRS may be prescriptive, but it's very hard for new developers to pick up and understand the layers of abstraction underneath, e.g. Akka + the Persistence Journal. And I'm not sure I can trust many of the open source libraries outside of Akka, to be honest. I've had to dig into the depths of Postgres journal implementations and apply windowing to journal queries, for example, because they weren't burned in at the scale I was working with (partially because I inherited an application with a single entity in a context which had journals many millions of lines long - this highlights a design error, though, but hopefully you can see my point).
You don't _need_ to use these patterns, but you can still apply DDD and event/message-based abstractions, and publish events. An entity can write its state to a record and then apply the state in memory as well, without using a journal, given you handle exactly-once processing semantics correctly. This means there are knobs and dials. The problem with event sourcing in the greater picture is that it's descriptive of an approach, and there aren't many clear alternatives that people are talking about that work in similar system designs. If you have very long-lived entities, or only a few of them, it gets especially difficult to keep the system alive over time, but for those use cases it doesn't mean you should stop receiving commands and emitting events.
You always hear about the idea of the approach, never the reality of maintaining these systems, or the inappropriate uses. In one implementation, there is one entity that receives thousands of events a day, and lives forever. How do you maintain the journal while changing the code, and keep it alive over time? I've watched event sourcing and CQRS sink projects and teams. I've watched well-paid contractors unable to figure out how to cluster and scale these systems. The barrier to entry for people to become effective can be high, and you should understand the long view in terms of people required and cost over time, and validate the approach for your use case very carefully. Again, the fact that everyone talks about event sourcing and no closely related alternatives makes it seem like the gold standard or the only option, but there are other (simpler) ways to deal with your persistence in an overall similar architectural approach.
Suppose ES/CQRS is designed well, and a well-understood mechanism, would you still move away from it? For which scenarios? I can imagine exactly-once delivery not being a problem in all scenarios.
And what is the problem with long-lived entities? Solely the fact it takes longer to reconstruct current state?
And yes long recovery times are an issue unless you want to keep every entity in memory forever and ever, or you can tolerate extremely slow turnaround times. There are knobs and dials here that are subtleties that people won't think about until after they launch.
If you have a hammer everything looks like a nail kind of thing. There are other ways to handle problems that are only marginally different in how they persist and recover state, yet are far more usable for certain use cases. I'm extremely skeptical when I see someone gung-ho for event sourcing if they've never used it though. I tend to look at the problem very hard to see what else we could do for persistence while still maintaining the "source of truth" in memory.
ES specifically has a large list of drawbacks, so not everybody should use it. It would be great if people could talk about how there are many shades of gray between "CRUD-only, we forget everything that is not current" and ES, but in a rush to make a point, people act like those are the only known possibilities.
I find it useful to have a GenServer for complex entities like state machines modeling business processes. I'm fine with the simple entities using the database schema as the state definition. However with the complex entities I still find I run into the classic ORM problem where the database structure doesn't perfectly fit the domain's pure business logic representation.
1. Keep a traditional RDBM system. Place in here everything that needs consistency (e.g., the bank's accounts, balances and transactions) or has to be stored indefinitely.
2. That part of the system generates events. Those events are used to maintain Queryable stores that are better structured for your required queries. For instance, you could store per-client transaction lists in here. Or you could fill in an OLAP database, or whatever.
3. For each query-able store, implement a process that can populate it from the RDBMS data. This serves several purposes:
a) You can rebuild any such stores at any time. This removes the durability requirement from these stores, so they may be simpler and more efficient.
b) You can compare a rebuilt store against the current one. If there are any differences, there's either a bug in your event tracking code or a bug in the populator.
c) This consistency-check procedure is really useful during development and testing. When you make a change, it is really hard to get both the event-sourced store and the populator wrong but consistent with each other. The only times this happened to me were due to mis-specified or misunderstood requirements, which effectively rules out pure programming mistakes.
Of course, this system only works so long as your write volume can be ingested by your RDBMS. Luckily, this is the case on all of my current projects (and I would argue most projects out there). Notice that scaling reads should be much easier!
Finally, this architecture just accepts that read-errors may happen (e.g., a client doesn't see a transaction in their transactions list because some event got lost). This hasn't been a huge problem for us, since the mistake will be repaired on the next "reconciliation" process (along with a warning to get devs to investigate what happened). These reconciliations can be run as frequently or as sparsely as desired, or even on some user-initiated action (e.g.: the support guy has a button to "force reconciliation now" and instructions to press it whenever a customer complains about missing transactions).
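A toy sketch of the populator-plus-reconciliation idea from points 3a/3b (record shapes and names are invented):

```python
def rebuild_from_rdbms(transactions):
    """Populator: derive the per-client read store from the book of record."""
    store = {}
    for tx in transactions:
        store.setdefault(tx["client_id"], []).append(tx["id"])
    return store


def reconcile(current_store, transactions):
    """Compare the event-maintained store against a fresh rebuild.

    Returns the rebuilt store (which replaces the current one) and the
    list of clients whose data diverged, so devs can be warned.
    """
    rebuilt = rebuild_from_rdbms(transactions)
    diverged = []
    for client in set(current_store) | set(rebuilt):
        if current_store.get(client, []) != rebuilt.get(client, []):
            diverged.append(client)
    return rebuilt, diverged
```

A lost event shows up as a divergence on the next reconciliation run, and the rebuild repairs it at the same time.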
Because the events coalesce into Kafka, you can use a mechanism like Logstash to spew this log data into e.g. Elasticsearch and then use that for your queries. Or write to the DB here (in process) or there (CQRS style). There are different architectural approaches used, and they all yield good results, of very high reliability in processing, if slightly "eventual."
It works well. We do use rdbms too, but a lot of the time we avoid reading from it after initialization (or after encountering an exception causing an actor to restart). Depends on the data though. Some places I read on every command because I know the risk front is small and it's easier for a jr to get in there and understand what's happening. A good rule is that a single process should own that data so you don't have to worry much about consistency related concerns. In microservices good practices, each service should really own its own data completely but I believe that it's fair enough to say a bounded context owns its own data (even if in a shared rdbms) if building at a smaller scale.
In terms of code, I'm usually building with some variation of onion architecture these days, where all context is built on the outside (e.g. receiving a request, getting info from state or DB), and pure, 100% effect-free domain logic exists on the inside, receiving all of the data from the outside and turning out commands or events in response. This makes the important logic very testable, and easy to reason about. Core domain logic has no effects - not even logging - but only generates events/commands in response to commands and events. You never mock that stuff - it's the star of the show. Nor do you ever have to. The generated commands and events are later "applied" by the outer layer, logging, shipping them off to Kafka, and/or writing some stuff to a database.
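A minimal illustration of that split, with a made-up invitation command (all names here are invented for the sketch):

```python
# Pure core: no I/O, no logging; data in, commands/events out.
def decide(state, command):
    if command["type"] == "accept_invitation":
        if state.get("accepted_by") is not None:
            return [{"type": "acceptance_rejected", "user": command["user"]}]
        return [{"type": "invitation_accepted", "user": command["user"]}]
    return []


# Outer layer: "applies" the generated events, and would also do the
# effects (DB writes, Kafka, logging) that the core never touches.
def apply_events(state, events):
    for e in events:
        if e["type"] == "invitation_accepted":
            state["accepted_by"] = e["user"]
    return state
```

Testing `decide` needs no mocks at all: feed it state and a command, assert on the returned events.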
That's where I'm at today. I am working in realtime systems so it's a good fit for the approaches. I'm finding the software is turning out very well. We're working fairly quickly, the software is very encapsulated to a bounded context, the domain logic is clear in the core, easy to test and read, and change. Or even rewrite if needed.
Would only do this with a good team and a real problem where this would be useful.
An RDBMS goes very far.
Fowler's post is predictably excellent: https://www.martinfowler.com/bliki/CQRS.html
Until you are facing Facebook-scale big-data challenges, you don't need the madness of losing ACID. You can easily fit terabytes of data on a single server, and even partition by company, customer or similar in a way that still allows you to keep domain-level consistency.
Now, I put the events after my normal CRUD operations (i.e. I perform the CRUD write and save the event in a single transaction). It's super easy to operate and keeps the coding familiar and predictable.
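As a rough sketch of that pattern, with SQLite and hypothetical tables (the event name and payload shape are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "kind TEXT, payload TEXT)"
)


def update_customer(customer_id, name):
    # The CRUD write and the event append commit (or roll back) together,
    # so the log can never drift from the current-state tables.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO customers VALUES (?, ?)",
            (customer_id, name),
        )
        conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            ("CustomerUpdated", f'{{"id": {customer_id}}}'),
        )
```

Every mutation leaves both an up-to-date row and an append-only trail behind, with no extra infrastructure.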
----
I think the ideal ES database engine would work like this:
Have a log table, and an index/subtables for each validation (i.e. to check uniqueness, aggregates, counts, etc.)
So, if I have customer-related events, I have:
- Index on: code, name
- SubTable: code, name, isactive
All the other fields are not needed for validation, so they are recovered from the log.
In ACID:
- POST Command
- Validate data with the index,
- Save to log
- Emit blocking events (events that need to happen at the same time as the save)
- Commit
Eventual:
- Emit lazy events (events that don't need ACID, like filling external sources)
Wait, event sourcing / immutable data doesn't scale.
I was doing event sourcing in 2010, and loved it. It was incredible. But by the time I started hearing people call it "event sourcing" I had already moved on to:
State-based, graph CRDTs.
They combine the best of event sourcing with the best of distributed state replication, and are super scalable!
Now even the Internet Archive is running it (in 2014 I implemented it into a library that they are now using - https://github.com/amark/gun )
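GUN's graph CRDTs are much more involved, but the simplest state-based CRDT, a grow-only counter, shows the core trick: merge is an element-wise max, so replicas converge regardless of message order, duplication, or how many times you merge:

```python
class GCounter:
    """Grow-only counter: the 'hello world' of state-based CRDTs."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # per-node counts; each node bumps only its own

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Element-wise max: commutative, associative, and idempotent,
        # which is exactly what makes replication order-insensitive.
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())
```

Unlike an event log, replicas exchange whole states and merge them, so there's no journal to replay and no delivery guarantees to worry about.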
Consider you have a database where you store the account balance.
If you want to update the account balance you might update the row for that customer, e.g.
tblAccounts
| AccountHolderId | AccountBalance |
update tblAccounts set AccountBalance = @NewAccountBalance where AccountHolderId = @AccountHolderId;
In an EventSource database instead you wouldn't update the AccountBalance column. You would store something like:
AccountEvents
| AccountHolderId | AccountBalance + 100.00 |
| AccountHolderId | AccountBalance + 150.00 |
| AccountHolderId | AccountBalance - 80.00 |
Then if you want to get the current balance, you can just take the opening balance and add 100, add 150 and subtract 80.
Periodically you need to collapse these, as querying for the balance could end up requiring a scan through a large log of events. So you snapshot at some point in time. Assuming an opening balance of zero, we could snapshot the above to 170.00.
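In code, the fold-plus-snapshot looks something like this (here a snapshot is just a hypothetical `(balance, events_consumed)` pair):

```python
def current_balance(opening, events, snapshot=None):
    """Fold the event log into a balance, optionally from a snapshot."""
    if snapshot is not None:
        balance, consumed = snapshot  # resume from the collapsed point
    else:
        balance, consumed = opening, 0
    for delta in events[consumed:]:
        balance += delta
    return balance
```

With the log `[+100, +150, -80]` this folds to 170; snapshotting as `(170, 3)` means later queries only replay events appended after the snapshot.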
It feels like you get auditing / logging of all changes out of the box. Also, if you are working in functional programming, or a system where you constrain side effects / mutability as much as possible, you sort of eliminate your DB as a giant mutable object. But you also get the downsides this article talked about.
...Having written the above, I just examined the ERD for the COTS customer system in the office I'm in now. It stores not just balances but aged balances directly in the main customer table. Good grief.
That's all Event Sourcing is.
just think of it as bookkeeping, period. a ledger of all events. absolutely nothing to do with double entry (which for accounting; from wiki: "The double entry has two equal and corresponding sides known as debit and credit.")
I see this issue raised quite often. If consistency is paramount, you can make commands on certain aggregates synchronous all the way to updating the read model. There's nothing that says you MUST have a queue in between the event log for an aggregate and the logic that updates the read model. Use common sense.
Typically shines the most when pinpointing parts of the system that benefit from it, identifying a specific bounded context in DDD terms, but never on a whole system.
I feel like Greg Young has taken great pains to make this clear. This should be taken for granted when attempting CQRS.
Also, your events will be based on a SomethingCreated or SomethingUpdated, which has no business value at all. If the events are being designed like this, then it is clear you're not using DDD at all, and you're better off without event sourcing. Finally, depending on how synchronous the UI and the flow of the task are required to be, the eventual consistency can, and most of the time will, have a clunky feel to it and deliver a poor user experience.
If the read and write model are being updated asynchronously from the UI you're gonna have to adopt an optimistic caching scheme on the client. This is why GraphQL subscriptions are pretty much boilerplate for any client I build against a CQRS service. The Apollo client seems to handle this rather well.
Converting data between two different schemas while continuing the operation of the system is a challenge when that system is expected to be always available. Due to the very nature of software development, new requirements are bound to appear that will affect the schema of your events; that is inevitable.
I hereby give you permission to use the Strategy Pattern. Problem solved.
The events can’t be too small, nor too large; they have to be just right. Having the instinct to get this right requires extensive knowledge of the system, the business, and the consumer applications, and it’s very easy to choose the wrong design.
Greg Young and others have talked quite a bit about how to bound aggregates.
However, the events in an event store are immutable and can’t be deleted; undoing an action means sending a command with the opposite action.
This is why bookkeeping systems have the idea of "journal entries". I haven't implemented one for an event sourced system but I can see how this might work.
Overall great post. Really enjoyed that the author took the time to walk us through all of these issues. Most are non-trivial.