We’ve been using Temporal quite successfully in Go (and more recently Python) for a little while now. It could do with being a bit easier to get up and running with, but day-to-day usage is very nice. I don’t think I could go back to plain out message queues, this paradigm is a real time saver.
The biggest challenge is deciding how many things are nails for the hammer that is Temporal. You tend to start out using it to replace an existing mess of task orchestration; but then you realise its actually a pretty good fit for any write operation that can’t neatly work in a single database transaction (because it’s hitting multiple services, technologies, third parties etc).
You have to be careful to keep your workflows deterministic, but once you get used to the paradigm, it’s enjoyable.
Durable execution systems run our code in a way that persists each step the code takes. If the process or container running the code dies, the code automatically continues running in another process with all state intact, including call stack and local variables.
Durable execution makes it trivial or unnecessary to implement distributed systems patterns like event-driven architecture, task queues, sagas, circuit breakers, and transactional outboxes. It’s programming on a higher level of abstraction, where you don’t have to be concerned about transient failures like server crashes or network issues.
aka start from scratch and write things a certain way
But how fast is this? IIRC each little insert in my DB was taking like 5ms, which would add up quickly if I were to spam it everywhere; I assume durable execution layers are better optimized for that. Do they really only snapshot before and after async JS calls, treating all other lines as hermetic and thus able to be rerun?
Starting a workflow is currently ~40ms, and I think we’ll be able to get down to 10ms this year. How long it takes to complete depends on how many persisted steps it takes (and whether it has to wait on an external event). The only steps that are persisted are workflow api calls like sleep(), startChildWorkflow(), or calling code that might fail (ie “Activity”, like a network request).
Ok, that's what I was wondering. Makes a lot more sense this way.