Now when people argue “because decoupling,” I hear, “You don’t get as much notification that you just broke a downstream system.”
https://www.datadoghq.com/knowledge-center/distributed-traci...
Unless you have a single monolith, you’re going to face issues with versioning whether it’s event based or API based. In each case you can usually add new properties to a message, but you can’t remove properties or change their types. If you need that, create a new version.
The author does a lot of videos on the event sourcing topic. Event driven I get. It works well in several applications I've helped to build over the last 15 years. But event sourcing? I truly don't get it. Yeah, I get that it's nice in terms of auditing to see every change to an entity and who made it, or to replay up to change x on date y, but that really is a niche requirement.
I'm not sure what point is being made here. It's good that you can do that - but are you implying that that's not possible in an API-driven system?
It's not just about auditing, it's also about transactionality and atomicity.
If you want to withdraw $5 from your account, the traditional approach of locking, updating everything, and unlocking (in other words, wrapping everything in a transaction) doesn't scale as well as simply recording the transaction as an event. Implementation-wise this withdrawal can involve updating two accounts and updating the audit/account transaction logs. We also want this to scale, since our bank has millions of customers all operating more or less concurrently. A distributed log (like Kafka) is easy to scale and easy to reason about: you just insert the transaction record.
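As a rough sketch of that idea (plain Python, hypothetical event names, not any particular framework): withdrawals and deposits are only ever appended to a log, and the balance is derived by folding over it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deposited:
    account: str
    amount: int  # cents

@dataclass(frozen=True)
class Withdrawn:
    account: str
    amount: int  # cents

log = []  # stand-in for a distributed log like Kafka

def record(event):
    log.append(event)  # the only write is an append; no locks on derived state

def balance(account: str) -> int:
    # Derived state: fold over the immutable event history
    total = 0
    for e in log:
        if e.account == account:
            total += e.amount if isinstance(e, Deposited) else -e.amount
    return total

record(Deposited("alice", 10_00))
record(Withdrawn("alice", 5_00))
```

The audit trail comes for free here: the log is the source of truth, and any account balance is just one view derived from it.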
Another driver/flavour for something like event sourcing is what some might call state-based or state-oriented programming. That is, instead of modifying state directly you synchronize state via events. This lets you, for example, build state machines around those events, which again leads to code that is easier to reason about (and test).
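A minimal sketch of that style (hypothetical order states, my example rather than the comment author's): state changes only through a table of allowed event transitions, which makes the machine trivially testable.

```python
# Allowed transitions: (current state, event) -> next state
TRANSITIONS = {
    ("new", "paid"): "awaiting_shipment",
    ("awaiting_shipment", "shipped"): "in_transit",
    ("in_transit", "delivered"): "done",
}

def apply_event(state: str, event: str) -> str:
    next_state = TRANSITIONS.get((state, event))
    if next_state is None:
        raise ValueError(f"event {event!r} is not valid in state {state!r}")
    return next_state

state = "new"
for event in ["paid", "shipped", "delivered"]:
    state = apply_event(state, event)
```

Invalid sequences fail loudly instead of silently corrupting state, which is much of the "easier to reason about" payoff.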
There are of course other ways to do auditability.
Event Sourcing + Projections provide a nice way to build multiple models/views from the same dataset. This can provide a lot of simplification for client code.
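For instance (a toy sketch, not tied to any specific framework): the same event stream can feed an order-centric view and a warehouse-centric view, and each consumer only ever sees the shape it needs.

```python
events = [
    {"type": "item_added", "order": 1, "sku": "A", "qty": 2},
    {"type": "item_added", "order": 1, "sku": "B", "qty": 1},
    {"type": "item_added", "order": 2, "sku": "A", "qty": 5},
]

# Projection 1: line items per order (what an order page wants)
per_order = {}
for e in events:
    per_order.setdefault(e["order"], []).append((e["sku"], e["qty"]))

# Projection 2: total demand per SKU (what a warehouse view wants)
per_sku = {}
for e in events:
    per_sku[e["sku"]] = per_sku.get(e["sku"], 0) + e["qty"]
```

Adding a third view later means writing another fold over the same events; no migration of existing models is needed.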
There are also other companies, which do the typical snapshot and roll up to the current time, when they start the services, that need the data without having access to the database.
That's not exactly an obscure feature exclusive to Datadog. Off the top of my head, both AWS and Azure support distributed tracing with dedicated visualization in their X-Ray and Application Insights services.
When you've never grown out of a single node domain but you do event driven "because scaling" or whatever, you've shot yourself in the foot amazingly hard.
But people often forget there are trade-offs to everything and if you don't have these hard problems, you're giving yourself only headaches.
My pet-peeve is "decoupling" - it's treated as holy with only benefits and no downsides. But it's actually again a level of complexity - unless you need it, tightly coupled code will be easier to write, read, debug etc.
As an event producer as long as you follow reasonable backwards-compatibility best practices then you should be pretty safe from breaking things downstream. As a consumer, follow defensive programming and allow for idempotency in case you need to reprocess an event. Pretty straightforward once you get the hang of things.
That can protect you from "downstream can't even read the message anymore" but it doesn't help you with the much more common "downstream isn't doing the right thing with the message anymore" problem. Schema evolution is kinda like schema'd RPC calls vs plain JSON: it will protect you from "oops, we sent eventId instead of event_id" type of errors, but won't prevent you from making logical errors. In a larger org, this can turn into delayed-discovery nightmares.
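To make the distinction concrete, here is a tolerant-reader consumer sketch (hypothetical field names). It survives wire-level evolution such as added optional fields, but nothing in it can catch a producer quietly changing what a field *means*.

```python
def handle(payload: dict):
    # Tolerant reader: require what you truly need, default the rest,
    # and silently ignore unknown fields.
    event_id = payload["event_id"]             # required: fail loudly if missing
    currency = payload.get("currency", "USD")  # optional field added later
    return event_id, currency

# An old producer (no currency) and a new one (extra fields) both parse fine...
old = handle({"event_id": "e1"})
new = handle({"event_id": "e2", "currency": "EUR", "brand_new_field": 1})
# ...but if the producer starts sending net amounts where it used to send
# gross, this code keeps "working" and nobody is alerted.
```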
A synchronous API call can give you back an error response and alert you immediately that something is wrong. The system notifies you directly.
A downstream event consumer may fail in ways entirely off of your team's radar. The downstream team starts getting alerts. Whether or not those alerts make it immediately obvious to them that it's your fault... that depends on a bunch of factors.
I don't know how this could be true. Events are things - nouns which can be backed-up, replicated, stored, queried, rendered, indexed and searched over.
I generally like event-driven architecture, but I need to admit that debuggability is sacrificed where it matters most.
And as a consumer, many independent tasks can be triggered by the same event.
I'm working on a system right now and because of events, it's very easy for me to write a handler for when a certain type of record is created in the database. My feature depends on knowing that new record was made so we can send some emails and do other things.
The people that wrote the code that creates the record, didn't have to do anything to support the feature.
But I agree that it's not the right solution for every problem. But there are certain problems it solves really well.
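A stripped-down in-process version of that wiring (hypothetical names, not the commenter's actual system): the code that emits `user_created` never learns that an email handler was added later.

```python
handlers = {}

def on(event_type):
    """Decorator: register a handler for an event type."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

def emit(event_type, payload):
    for fn in handlers.get(event_type, []):
        fn(payload)

sent = []

@on("user_created")  # added by a different team; the producer is untouched
def send_welcome_email(user):
    sent.append(user["email"])

# The record-creating code only ever does this:
emit("user_created", {"email": "a@example.com"})
```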
Right up until you need to change something about the event because the business logic it represents has changed. Then you suddenly need to track down all the systems that have been relying on it, including that one that nobody knows anything about and always forgets exists because some guy decided to implement the service in erlang and nobody who ever touched it even works at the company anymore.
Don't take it into consideration and you're fucked.
Source: previous "seniors" didn't take it into consideration, they left
Same issue as microservices: there are people who want to use the paradigm but not do the investment in monitoring/tooling.
Event driven architecture, to me is itself an antipattern.
It seems like a replacement for batch processing. Replayable messages are AWESOME. Until you encounter the complexity of getting a system to actually replay them consistently.
As for the author's video: while there was some truth in there, it was a little thin compared to the complexity of these architectures. I believe that even though Kafka acts the part of "dumb pipe", it doesn't stay dumb for long, and the n distributions of Kafka logs in your organization could be 1000x more expensive to maintain than a monolithic DB and a monolithic API.
Yes, it appears auditable, but is it? The big argument for replayability is that, unlike an API that falls over, there's no data loss. If you work with Kafka long enough you'll realize that data loss will become a problem you didn't think you had. You'll have to hire people to "look into" data loss problems constantly with Kafka. It's just too much infrastructure to even care about.
There's also something ergonomically wrong with event driven architecture. People don't like it. And it also turns people into robots who are "not responsible" for their product. There's so much infrastructure to maintain that people just punt everything back to the "enterprise Kafka team".
The whole point of microservices was to enable flexibility, smart services and dumb pipes, and effective CI/CD and devops.
We are nearing the end of microservices adoption, whether event or request driven. In mature organizations it seems to me that request driven is winning by a large margin over event driven.
It may be counterintuitive, but the time to market and cost to maintain of request driven architecture are way, way lower.
In my experience programmers are very happy to do everything in the application (something database people often complain about). What kind of problems do you see?
> If you work with Kafka long enough you’ll realize that data loss will become a problem you didnt think you had. You’ll have to hire people to “look into” data loss problems constantly with Kafka.
Not my experience at all, and I've used Kafka at a wide range of companies, from household-name scale to startups. Kafka is the boring just-works technology that everyone claims they're looking for.
I'm no fan of microservices, but Kafka is absolutely the right datastore most of the time.
Not to mention certain observability vendors bleeding you for all those logs you now need to keep an eye on it.
Absolutely agreed on every point
Also, people need to understand that "event driven" has nothing to do with "event sourcing". Just don't keep all the events until eternity, because you can (and because some people think you should because "kafka").
But when I've done that testing, Kafka hasn't been the problem.
The problem I've run into most is that ordering is a giant fucking pain in the ass if you actually want consistent replayability and don't have trivial partitioning needs. Some consumers want things in order by customer ID, other consumers want things in order by sold product ID, others by invoice ID? Uh oh. If you're thinking you could easily replay to debug, the size and scope of the data you have to process for some of those cases just exploded. Or you wrote N times, once for each of those, and then hopefully your multi-write transaction implementation was perfect!
[0] in fairness, a lot of applications also don't guarantee that they never drop requests at all, obviously. 500 and retry and hope that you don't run out of retries very often; if you do, it's just dropped on the ground and it's considered acceptable loss to have some of that for most companies/applications.
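The ordering tension described above can be sketched in a few lines (toy code; a Kafka-style system where ordering is only guaranteed within a partition, and the partition is chosen from a single key):

```python
from zlib import crc32

def partition(key: str, n: int = 3) -> int:
    # Kafka-style: same key -> same partition, so ordering holds per key only
    return crc32(key.encode()) % n

events = [
    {"customer": "c1", "product": "p9", "seq": 1},
    {"customer": "c2", "product": "p9", "seq": 2},
    {"customer": "c1", "product": "p9", "seq": 3},
]

# Partitioned by customer, c1's events stay ordered relative to each other...
by_customer = [partition(e["customer"]) for e in events]

# ...but both c1 and c2 touched product p9, possibly on different partitions,
# so a consumer needing per-product order has no guarantee without either a
# second product-keyed write or a global re-sort across all partitions.
p9_partitions = {partition(e["customer"]) for e in events if e["product"] == "p9"}
```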
In pretty much all projects I worked with in recent years, people chop up the functionality into small separate services and have the events be serialised, sent over the network and deserialised on the other side.
This typically causes an enormous waste of efficiency and consequently makes applications much more complex than they need to be.
I have many times worked with apps which occupied huge server farms when in reality the business logic would be fine to run on a single node if just structured correctly.
Add to that the amount of technology developers need to learn when they join the project or the amount of complexity they have to grasp to be able to be productive. Or the overhead of introducing a change to a complex project.
And the funniest of all, people spending significant portion of the project resources trying to improve the performance of a collection of slow nanoservices without ever realising that the main culprit is that the event processing spends 99.9% of the time being serialised, deserialised, in various buffers or somewhere in transit which could be easily avoided if the communication was a simple function call.
Now, I am not saying microservices is a useless pattern. But it is so abused that it might just as well be. I think most projects would be happier if the people simply never heard about the concept of microservices and instead spent some time trying to figure how to build a correctly modularised monolithic application first, before they needed to find something more complex.
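The serialization point a couple of paragraphs up is easy to demonstrate (toy sketch): splitting a plain function across a "service" boundary means every hop pays an encode/decode round trip for the exact same answer.

```python
import json

def business_logic(order: dict) -> dict:
    # The actual work: trivial
    return {"order_id": order["id"], "total": sum(order["items"])}

def as_nanoservice(order: dict) -> dict:
    # The same work behind a service boundary: encode -> (network) -> decode,
    # in both directions, on every single call
    wire_in = json.dumps(order)
    result = business_logic(json.loads(wire_in))
    return json.loads(json.dumps(result))

order = {"id": 7, "items": [3, 4]}
```

Same result, but the in-process call does none of the copying, buffering, or parsing; at scale, that overhead is often where the "slow nanoservices" time goes.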
Microservices make sense when there are very strong organizational boundaries between the parts (you'd have to reinterview to move from one team to the other), or if there are technical reasons why two parts of the code cannot share the same runtime environment (such as being written in different languages), and a few other less common reasons.
The MAIN reason for microservices was that you could have multiple teams work on their services independently from each other. Because coordinating work of multiple teams on a single huge monolithic application is a very complex problem and has a lot of overhead.
But, in many companies the development of microservices/agile teams is actually synchronised between multiple teams. They would typically have common release schedule, want to deliver larger features across multitude of services all at the same time, etc.
Effectively making the task way more complex than it would be with a monolithic application
I think it really matters what sort of application you are building. I do exactly this with my search engine.
If it was a monolith it would take about 10 minutes to cold-start, and it would consume far too much RAM to run a hot stand-by. This makes deploying changes pretty rough.
So the index is partitioned into partitions, each with about a minute start time. Thus, to be able to upgrade the application without long outages, I upgrade one index partition at a time. With 9 partitions, that's a rolling 10%-ish service outage.
The rest of the system is another couple of services that can also restart independently, these have a memory footprint less than 100MB, and have hot standbys.
This wouldn't make much sense for a CRUD app, but in my case I'm loading a ~100GB state into RAM.
Because deploying the whole monolith takes a long time. There are ways to mitigate this, but in $currentjob we have a LARGE part of the monolith that is implemented as a library; so whenever we make changes to it, we have to deploy the entire thing.
If it were a service (which we are moving to), it would be able to be deployed independently, and much, much quicker.
There are other solutions to the problem, but "µs are bad, herr derr" is just a trope at this point. Like anything, they're a tool, and can be used well or badly.
- on the service provider, the implementation provides the actual functionality,
- on the client, the implementation of the interface is just a stub connecting to the actual service provider.
Thus you can sort of provide separation of services as an implementation detail.
However in practice very few projects elect to do this.
Proj:
|-proj-api
|-proj-client
|-proj-service
Both proj-client and proj-service consume/depend-on proj-api so they are in sync of what is going on.
Now, you can switch the implementation of the service to gRPC if you wanted with full source compatibility. Or move it locally.
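In Python terms, the same layout might look like this sketch (hypothetical names): both sides depend only on the shared interface, so the transport is an implementation detail.

```python
from abc import ABC, abstractmethod

# proj-api: the shared interface both sides depend on
class Greeter(ABC):
    @abstractmethod
    def greet(self, name: str) -> str: ...

# proj-service: the real implementation
class GreeterService(Greeter):
    def greet(self, name: str) -> str:
        return f"hello, {name}"

# proj-client: a stub that forwards to whatever transport it is given
class GreeterStub(Greeter):
    def __init__(self, transport):
        self.transport = transport  # a local call here; could be gRPC or HTTP
    def greet(self, name: str) -> str:
        return self.transport(name)

service = GreeterService()
client = GreeterStub(service.greet)  # swap transports without touching callers
```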
The core orchestration of the system was done via events on the bus, and nobody had any idea what was happening when a bug occurred. People would pass bugs around, “my code did the right thing given the event it got”, “well, my code did the right thing too”, and nobody understood the full picture because everyone was stuck in their own silo. Event driven architectures encourage this: events decouple systems such that you don’t know or care what happens when you emit a message, until one day it’s emitted with slightly different timing or ordering or different semantics, and things are broken and nobody knows why.
The worst part is that the software is basically "take user input, do process A on it, then do process B on that, then do process C on that." It could have so easily been a simple imperative function that called C(B(A(input))), but instead we made events for "inputWasEmitted", "Aoutput", "Boutput", etc.
What happens when system C needs one more piece of metadata about the user input? 3 PR’s into 3 repos to plumb the information around. Coordinating the release of 3 libraries. All around just awful to work with.
Oh and this is a very high profile piece of software with a user base in the 9 figure range.
(Wild tangent: holy shit is hard to get iOS to accept “do process” in a sentence. I edited that paragraph at least 30 times, no joke, trying every trick I could to stop it correcting it to “due process”. I almost gave up. I used to defend autocorrect but holy shit that was a nightmare.)
can you not just pick the original spelling in the autocomplete menu above the keyboard?
If you have some logic A and B running on user input, I wouldn't be splitting that across different services.
I can attest to this case study being 100% true. Our platform has been using EventStore as our primary database for 9 years going strong, and I'm still very happy with it. The key thing is that it needs to be done right from the very beginning; you can't do major architecture reworks later on, and you need an architect who really knows what they're doing. Also, you can't half-ass it: event sourcing, CQRS, etc. all had to be embraced the entire time, no shortcuts.
I will say though, the biggest downside is that scaling is difficult since you can't always rely on snapshots of data; sometimes you need to event source the entire model, and that can get data heavy. If you're standing up a new projector, you could be going through tens of millions of events before it is caught up, which requires planning. It is incredible though being able to have every single state change ever made on the platform available; the data guys love it, and it makes troubleshooting way easier since there are no secrets about what happened. The biggest con is that most people don't really understand it intuitively, since it's a very different way of doing things, which is why so many companies end up fucking it up.
Like I get the "message bus" architecture when you have a bunch of services emitting events and consumers for differing purposes but I don't think I would feel comfortable using it for state tracking. Especially when it seems really hard to enforce a schema / do migrations. CQRS also makes sense for this but only when it functions as a WAL and isn't meant to be stored forever but persisted by everyone who's interested in it and then eventually discarded.
I also tried doing it in a property setting, where profit margins were tight. The effort needed wasn’t worth the cost, and clients didn’t really care about the value proposition anyway. We pretty much replaced the whole layer with a more traditional crud system.
In web or business systems it works well for some(!) parts. You just shouldn't do everything that way - but often people get too excited about a solution, and then they tend to overdo it and apply it everywhere, even when it's not appropriate.
Always choose the golden middle path and apply patterns where they fit well.
Event driven and CQRS "entities" made logic and processing much easier to create/test/debug.
Primary issues:
1. Making sure you focus on the "Nouns" (entities), not the "Verbs".
2. Kafka requiring polling for consumers sucks if you want to "scale to zero".
3. Sharding of event consumers can be complicated.
4. People have trouble understanding the concepts and keep wanting to write "ProcessX" type functions instead of state machines and event handlers.
5. Retry/replay is complicated; better to reverse/replay. Dealing with side effects in replay is also complicated (does a replay generate the output events which trigger state changes in other entities?)
Been running now for 6 years, minimal downtime except for maintenance/upgrades.
In the process of introducing major new entity and associated changes, most of the system unaffected due to the decoupling.
(No stake in this one way or another, just curious.)
These systems are working fine, but maybe a common ground:
* very few services
* the main throughput is "fact" events (so something that did happen)
* what you get as "Event carried state transfer" is basically the configuration. One service owns it, with a classical DB and a UI, but then exposes the configuration to the whole system with this kind of event (and all the consumers consume these read-only)
* usually you have to deal with eventual consistency a lot in this kind of setup (so it scales well, but there is a tradeoff)
The WAL is an event log, and when you squint at its internal architecture, you’ll see plenty of overlap with distributed event sourcing.
Our users are small-businesses with organisation numbers, and we mostly think of them as unique. But they strictly aren't, so we 'overwrote' some companies with other companies.
Once we detected and fixed the bug, we just replayed the events with the fixed code, and we hadn't lost any data.
Every use I've seen sent events after database transactions, with the event not part of the transaction. This means you can get both dropped events, and out of order events.
My current company has analytics driven by a system like that. I'm sure there's some corrupted data as a result.
The main issue being people just don't know how to build and test distributed systems.
It sounded kind of impossible, I said as much, and then proposed a different approach. The interviewer persisted and claimed that it could be done with 'the outbox pattern'.
I disagreed and ended the interview there. Later when I was chatting about it with a former colleague, he said "Oh, they solved the two generals problem?"
> Every use I've seen sent events after database transactions, with the event not part of the transaction.
Maybe this is what they were doing.
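For reference, a minimal outbox sketch (sqlite, hypothetical schema). Note what it does and doesn't buy: the event row commits atomically with the business row, and a relay publishes it later, giving at-least-once delivery (consumers must dedupe) rather than the exactly-once that would require solving the two generals problem.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     payload TEXT NOT NULL, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int, total: int) -> None:
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f"order_placed:{order_id}",))

published = []  # stand-in for the message broker

def relay() -> None:
    # A separate poller ships unsent rows. If it crashes after publishing
    # but before the UPDATE, the event is sent again: at-least-once.
    for row_id, payload in db.execute(
            "SELECT id, payload FROM outbox WHERE published = 0").fetchall():
        published.append(payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(1, 500)
relay()
```

This removes "event emitted but transaction rolled back" and "transaction committed but event dropped"; it does not remove duplicates or reordering between the outbox and downstream consumers.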
However I've seen some frameworks where you can do collision detection imperatively. For example:
if (sprite.collide(tilemap)) { doSomething(); }
These are generally smaller, less taxing frameworks (in this case I'm referring to HaxeFlixel), but they do exist!
So we ended up using protobufs over a local MQTT broker and adopted a macro-service architecture. This suited the project very well because it had a handful of obvious distinct parts and we took full advantage of Conway's law by making each devs work the part where their strengths and skills were maximized.
We made a few mistakes along the way but learned from them. Most of them relating to inter-service asynchronous programming. This article put words on concepts we learned through trial and errors, especially queries disguised as events.
I think it works well when it's the only thing that can work.
- Producer and consumer are decoupled. That's a good thing, right? Good luck finding the consumer when you need to modify the producer (the payload). People usually don't document these things
- Let’s use SNS/SQS because why not. Good luck reproducing producers and consumers locally in your machine. Third party infra in local env is usually an afterthought
- Observability. Or rather, the lack of it. It's never out of the box, so usually nobody cares about it until an incident happens
It sounds like your alternative is a producer that updates consumers using HTTP calls. That pushes a lot of complexity to the producer and the team that has to sync up with all of the other teams involved.
> Let’s use SNS/SQS because why not. Good luck reproducing producers and consumers locally in your machine
At work we pull localstack from a shared repo and run it in the background. I almost forget that it's there until I need to "git pull" because another team has added a new queue that my service is interested in. Just like using curl to call your HTTP endpoints, you can simply send a message to localstack with the standard aws cli
https://github.com/localstack/localstack
> Observability. Of rather the lack of it. It’s never out of the box, and so usually nobody cares about it until an incident happens
I think it depends on what type of framework you use. At work we use a trace-id field in the header when making HTTP calls or sending a message (sqs) which is propagated automatically downstream. This enables us to easily search logs and see the flow between systems. This was just configured once and is added automatically for all HTTP requests and messages that the service produces. We have a shared dependency that all services use that handles logging, monitoring and other "plumbing". Most of it comes out of the box from Spring, and the dependency just needs to configure it. The code imports a generic sns/http/jdbc producer and don't have to think about it
The amount of times I've come across someone who's inserted SQS into the mix to "speed things up"...
I just grep for the event's class name.
JavaScript
When I say increased, I mean we want the best answer but there are some answers the bank can’t know. If someone has transferred money into your account from another bank but we don’t know that yet, optimising for absolute correctness is pointless because the vast majority of wrong answers are baked in to the process. We can send you a message and you might read it a day later. Unless we delete the message from your phone, we can’t guarantee the message you read is fully consistent with our internal state.
Frankly our system is much better than the batch driven junk that is out of sync a second after it has executed. “Hey you have a reward.” “No I used it 2 hours ago you clowns.”
Note this isn’t cope. In some cases we started fully sync but relaxed it where there are tradeoffs that gave us better outcomes and we weren’t giving anything material up.
I've ended up in a lot of arguments about this while we were building larger distributed systems, because I come from more request/response oriented message-passing architectures, i.e. more synchronous ones. What I've found is that the event driven architecture did tend to lead to fewer abstractions and more leaked internal details. This isn't fundamental (you can treat events like an API) but was related to some details in our implementation (something along the lines of CDC).
Another problem with distributed systems that pass events through persistent queues is that if the consumer falls behind you start developing a lag. Yet another consideration is that the infrastructure to support this tends to carry some performance penalties (e.g. pushing an event through Kafka ends up being a lot more expensive than an RPC call). Overall it IMO makes for a lot of additional complexity, which you may need in some cases - but if you don't, you shouldn't pay the cost.
What I've come to realize is that in many ways those systems are equivalent. You can simulate one over the other. If you have an event based system you can send requests as events and then wait for the response event. If you have a request/response system you can simulate events over that.
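The first direction of that equivalence can be sketched with correlation IDs (toy in-memory "bus", hypothetical names): a request is just an event, and the caller waits for the matching response event.

```python
import uuid

bus = []  # stand-in for a topic

def send_request(payload: str) -> str:
    corr = str(uuid.uuid4())
    bus.append({"kind": "request", "corr": corr, "payload": payload})
    return corr  # the caller keeps this to find its answer later

def responder() -> None:
    # Some other service consumes requests and emits response events
    for req in [e for e in bus if e["kind"] == "request"]:
        bus.append({"kind": "response", "corr": req["corr"],
                    "payload": req["payload"].upper()})

def await_response(corr: str):
    for e in bus:
        if e["kind"] == "response" and e["corr"] == corr:
            return e["payload"]
    return None  # a real system would block or poll with a timeout

corr = send_request("ping")
responder()
```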
If we look at things like consensus protocols or distributed/persistent queues then obviously we would need some underlying resources (e.g. you might need a database behind your request/response model). So... Semantics. Don't know if others have a similar experience but when one system is mandated people will invent workarounds that end up looking like the other paradigm, which makes things worse.
There are things that conceptually fit well with an event driven architecture and then there are things that fit well with a request/response model. I'm guessing most large scale complex distributed apps would be best supporting both models.
I can recall software where I tried to wrestle a bunch of asynchronous things into looking more synchronous and then software where I really enjoyed working with a pure asynchronous model (Boost.Asio FTW). Usually the software where I want things to be synchronous is where for the most part I want to execute a linear sequence of things that depend on each other without really being able to use that time for doing other things vs. software where I want all things to happen at the same time all the time (e.g. being able to take in new connections over the network, serve existing connections etc.) and spinning threads for doing that is not a good fit (performance or abstraction-wise).
The locality of the synchronous model makes it easier to grok as long as you're ok with not being able to do something else while the asynchronous thing is going on. OTOH state machines, or statecharts to go further, which are an inherently asynchronous view, have many advantages (But are not Turing Complete).
I'd put it the other way: event driven architecture makes it safer to expose more internal details for longer, and lets you push back the point where you really need to fully decouple your API. I see that as an advantage; an abstract API is a means not an end.
> Another problem with distributed systems with persistent queues passing events is that if the consumer falls behind you start developing a lag.
Isn't that what you want? Whatever your architecture, fundamentally when you can't keep up either you queue or you start dropping some inputs.
> If you have a request/response system you can simulate events over that.
How? I mean you can implement your own eventing layer on top of a request/response system, but that's going to give you all the problems of both.
> If we look at things like consensus protocols or distributed/persistent queues then obviously we would need some underlying resources (e.g. you might need a database behind your request/response model).
Huh?
> Don't know if others have a similar experience but when one system is mandated people will invent workarounds that end up looking like the other paradigm, which makes things worse.
I agree that building a request/response system on top of an event sourcing system gives you something worse than using a native request/response system. But that's not a good reason to abandon the mandate, because building a true event-sourcing system has real advantages, and most of those advantages disappear once you start mixing the two. What you do need is full buyin and support at every level rather than a mandate imposed on people who don't want to follow it, but that's true for every development choice.
re: Huh. Sorry I was not clear there. What I meant is you can not create persistent queue semantics out of a request/response model without being able to make certain kinds of requests that access resources. Maybe that's an obvious statement.
re: mandate. I think I'm saying these sorts of mandates inevitably result in poor design. Even the purest of pure event sourcing systems actually use request/response, simply because that is the fundamental building block of systems. E.g. a Kafka client sends a produce request and waits for a response in order to inject something into a queue. The communication between Kafka nodes is based on messages. The basic building block of any distributed computer system is a packet (request) being sent from one machine to another and a response being sent back (e.g. TCP control messages). A mandate that says thou shalt build everything on top of event sourcing is sort of silly in this context, since it should be obvious that the building blocks of event sourced systems use request/response. Even without this nit-picking, restricting application developers to building only on top of this abstraction inevitably leads to ugliness. IMO anyway, having seen this mandate at work in large organizations. Use the right tool for your job is more or less what I'm saying - or, as the famous version goes, when all you have is a hammer, everything looks like a nail.
re: isn't that what you want. Well, if it is what you want then it is what you want, but many systems are OK with things just getting lost and not persisted. E.g. an HTTP GET request from a browser, in the absence of a network connection, is just lost; it's not persisted to be replayed later, so there is no way to build a lagging queue of HTTP GET requests that are yet to be processed. Again, maybe an obvious statement.
I’ve used it with a good degree of success in some data pipeline and spark stuff to have stuff automatically kick off, without heinous conditional orchestration logic. I also use evented stuff over channels in a lot of my rust code with great success.
However, echoing the sentiments of some other comments: most articles about event driven stuff seem to be either marketing blogspam or “we tried it and it was awful”. To be honest I look at a lot of those blog posts and about half the time my thoughts are “no wonder that didn’t work out, that’s an insane design” but is that just “you’re-doing-it-wrong-cope”?
Are there success stories out there that just aren't being written? Are there just no success stories? Is the architecture less forgiving of poor design, and this "higher bar of entry" torpedoes a number of projects? Is it more susceptible to "architecture astronauts", which dooms it? Is it actually decent, but requires a somewhat larger mindset change than most people bring to it, leading to half-baked implementations?
I can’t help but feel the underlying design has some kernels of some really good ideas, but the volume of available evidence sort of suggests otherwise.
Generally, I found that when using event systems you have to be really careful not to over-use them, even in small/single-player games. It's super hard to debug when everything is an event - if you go this route, you essentially end up in a situation where everything is "global" and can be reached from anywhere (might as well just go full singleton mode at that point). Additionally, I found it difficult to deal with event handlers which raise other events, or worse, async events, as it becomes really hard to ensure the correct order of invocations.
If you plan to use an event system, my advice would be (in Unity):
- Reference and raise events only on root Game Object scripts (e.g., have a root "Actor" script which subscribes/publishes events and communicates with its children via properties/C# events)
- Never subscribe or publish events in regular "child" components
- Use DI/service locator to fetch systems/global things and call them directly when possible from your "Actors"
Edit- I should say I never saw one in the wild, quick search found some academic projects https://scholar.google.com/scholar?q=event-driven+control+sy...
There is no avoiding it when dealing with, erm, events.
Events are things that happen that you cannot predict exactly when, where, and what.
The user clicked the mouse
The wind changed direction
Using Events to signal state change from one part of a system to another is a bad idea. Use a function call.
A rule of thumb: if the producer and the consumer are in the same system, then "Event Driven Architecture" is the antipattern.
I feel like a lot of teams out there can probably benefit from this simpler approach - it's probably what a lot of people are doing unwittingly.
> Commands only have a single consumer. There must be a single consumer. That’s it. They do not use the publish-subscribe pattern.
...oops.
Now the question is how much (more) time I want to spend on a(nother) rewrite.