Show HN: Restate – Low-latency durable workflows for JavaScript/Java, in Rust

(restate.dev)

185 points · sewen · 2y ago · 109 comments

We'd love to share our work with you: Restate, a system for workflows-as-code (durable execution), with SDKs in JS/Java/Kotlin and a lightweight runtime built in Rust/Tokio.

https://github.com/restatedev/ https://restate.dev/

It is free and open: the SDKs are MIT-licensed, and the runtime is under a permissive BSL (basically just the minimal Amazon defense). We worked on it for a bit over a year. A few points I think are worth mentioning:

- Restate's runtime is a single binary, self-contained, no dependencies aside from a durable disk. It contains basically a lightweight integrated version of a durable log, workflow state machine, state storage, etc. That makes it very compact and easy to run both on a laptop and a server.

- Restate implements durable execution not only for workflows: the core building block is durable RPC handlers (or event handlers). It adds a few concepts on top of durable execution, like virtual objects (which turn RPC handlers into virtual actors), durable communication, and durable promises. Here are more details: https://restate.dev/programming-model

- A core design goal for the APIs was to keep a familiar style. An app developer should look at Restate examples and say "hey, that looks quite familiar". You can let us know if that worked out.

- Basically every operation (handler invocation, step, ...) goes through a consensus layer, for a high degree of resilience and consistency.

- The lightweight log-centric architecture still gives Restate good latencies: for example, around 50ms roundtrip (invoke to result) for a 3-step durable workflow handler (Restate on EBS with fsync for every step).

We'd love to hear what you think of it!

yaj54 · 2y ago

how do tools like this handle evolving workflows? e.g., if I have a "durable workflow" that sleeps for a month and then performs its next actions, what do I do if I need to change the workflow during that month? I really like the concept but this seems like an issue for anything except fairly short workflows. If I keep my data and algorithms separate I can modify my event handling code while workflows are "active."

p10jkle · 2y ago

I wrote two blog posts on this! It's a really hard problem

https://restate.dev/blog/solving-durable-executions-immutabi...

https://restate.dev/blog/code-that-sleeps-for-a-month/

The key takeaways:

1. Immutable code platforms (like Lambda) make things much more tractable - old code being executable for 'as long as your handlers run' is the property you need. This can also be achieved in Kubernetes with some clever controllers

2. The ability to make delayed RPCs and span time that way allows you to make your handlers very short running, but take action over very long periods. This is much superior to just sleeping over and over in a loop - instead, you do delayed tail calls.
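
To make the "delayed tail call" idea in point 2 concrete, here is a minimal in-memory sketch. It does not use the Restate SDK: `Scheduler`, `call_later`, and `remind` are hypothetical names, and the virtual clock stands in for a durable queue of delayed requests. Each handler run is short, and the only thing left "in flight" between runs is a single scheduled request.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class DelayedCall:
    due: float
    seq: int
    handler: Callable = field(compare=False)
    payload: dict = field(compare=False)


class Scheduler:
    """Toy stand-in for a durable queue of delayed requests (virtual clock, in memory)."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0

    def call_later(self, delay, handler, payload):
        # Only this small request message waits; no handler is sleeping.
        self._seq += 1
        heapq.heappush(self._queue, DelayedCall(self.now + delay, self._seq, handler, payload))

    def run(self):
        while self._queue:
            call = heapq.heappop(self._queue)
            self.now = call.due
            call.handler(self, call.payload)


MONTH = 30 * 24 * 3600  # seconds
sent = []


def remind(scheduler, payload):
    """Short-running handler: do one unit of work, schedule the next call, return."""
    sent.append((scheduler.now, payload["attempt"]))
    if payload["attempt"] < 3:
        scheduler.call_later(MONTH, remind, {"attempt": payload["attempt"] + 1})


scheduler = Scheduler()
scheduler.call_later(0, remind, {"attempt": 1})
scheduler.run()
```

Because every run starts from a fresh request, a new version of `remind` could pick up the next scheduled call without having to replay a month-old partial execution.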

delusional · 2y ago

> Immutable code platforms (like Lambda) make things much more tractable

My job is admittedly very old-school, but is that actually doable? I don't think my stakeholders would accept a version of "well we can't fix this bug for our current customers, but the new ones won't have it". That just seems like a chaos nobody wants to deal with.

yaj54 · 2y ago

ah! this took me a second to grok, but from #2 above: "we just want to send the email service a request that we want to be processed in a month. The thing that hangs around ‘in-flight’ wouldn’t be a journal of a partially-completed workflow, with potentially many steps, but instead a single request message."

I'll have to think through how much that solves, but it's a new insight for me - thanks!

I like that you're working on this. seems tricky, but figuring out how to clearly write workflows using this pattern could tame a lot of complexity.

rockostrich · 2y ago

My org solved this problem for our use case (handling travel booking) by versioning workflow runs. Most of our runs are very short-lived, but there are cases where we have a run that lasts for days because of some long-running polling process, e.g. waiting on a human to perform some kind of action.

If we deploy a new version of the workflow, we just keep around the existing deployed version until all of its in-flight runs are completed. Usually this can be done within a few minutes but sometimes we need to wait days.

We don't actually tie service releases 1:1 with the workflow versions just in case we need a hotfix for a given workflow version, but the general pattern has worked very well for our use cases.

p10jkle · 2y ago

Yeah, this is pretty much exactly how we propose it's done (Restate services are inherently versioned, you can register new code as a new version and old invocations will go to the old version).

The only caveat is that we generally recommend that you keep handlers to just a few minutes, and use delayed calls and our state primitives to have effects that span longer than that. E.g., to poll repeatedly, a handler can delayed-call itself over and over, and to wait for a human, we have awakeables (https://docs.restate.dev/develop/ts/awakeables/)

More discussion: https://restate.dev/blog/code-that-sleeps-for-a-month/

delusional · 2y ago

Conceptually I think the only thing these tools add on to the mental model of separation of data and logic is that they also store the name of the next routine to call. The name is late bound, so migration would amount to switching out the implementation of that procedure.

pavel_pt · 2y ago

Restate also stores a deployment version along with other invocation metadata. FaaS platforms like AWS Lambda make it very easy to retain old versions of your code, and Restate will complete a started invocation with the handlers that it started with. This way, you can "drain" older executions while new incoming requests are routed to the latest version.
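
A rough sketch of the "pin each invocation to the version it started with, then drain" idea described above. This is an illustration only: `Registry` and its methods are hypothetical names, not Restate's API, and everything lives in memory.

```python
class Registry:
    """Toy model: every invocation is pinned to the deployment version it started on."""

    def __init__(self):
        self.versions = {}   # version number -> handler function
        self.latest = 0
        self.in_flight = {}  # invocation id -> pinned version

    def register(self, handler):
        # A new deployment never replaces an old one; it gets the next version number.
        self.latest += 1
        self.versions[self.latest] = handler
        return self.latest

    def start(self, invocation_id):
        # New invocations always go to the latest registered version.
        self.in_flight[invocation_id] = self.latest

    def resume(self, invocation_id):
        # Retries and replays run on the pinned version, not on the latest one.
        version = self.in_flight[invocation_id]
        return self.versions[version](invocation_id)

    def complete(self, invocation_id):
        del self.in_flight[invocation_id]

    def drained(self, version):
        # An old version can be retired once nothing in flight is pinned to it.
        return version not in self.in_flight.values()


registry = Registry()
v1 = registry.register(lambda inv: f"{inv} handled by v1")
registry.start("booking-1")
v2 = registry.register(lambda inv: f"{inv} handled by v2")
registry.start("booking-2")
```

Here `booking-1` keeps resuming on `v1` until it completes, at which point `registry.drained(v1)` becomes true and the old deployment can be removed.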

You still have to ensure that all versions of handler code that may potentially be activated are fully compatible with all persisted state they may be expected to access, but that's not much different from handling rolling deployments in a large system.

p10jkle · 2y ago

not necessarily - we store the intermediary states of your handler, so it can be replayed on infrastructure failures. if the handler changes in what it does, those intermediary states (the 'journal') might no longer match this. the best solution is to route replayed requests to the version of the code that originally executed the request, but: 1. many infra platforms don't allow you to execute previous versions 2. after some duration (maybe just minutes), executing old code is dangerous, e.g. because of insecure dependencies.

rubyfan · 2y ago

There’s a lot of jargon in this, is there a lay person explanation of what problem this solves?

p10jkle · 2y ago

Our goal is to make it easier to write code that handles failures - failed outbound api calls, infrastructure issues like a host dying, problems talking between services. The primitive we offer is that we guarantee that your handlers always run to completion (whether to a result or a terminal error)

The way we do that is by writing down what your code is doing, while it's doing it, to a store. Then, on any failure, we re-execute your code, fill in any previously stored results, so that it can 'zoom' back to the point where it failed, and continue. It's like a much more efficient and intelligent retry, where the code doesn't have to be idempotent.
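
The "write it down, then re-execute and fill in stored results" mechanism can be sketched in a few lines. This is a simplified in-memory model, not the Restate SDK: `Journal`, `Context.run`, and `invoke` are hypothetical names, and a real system would persist the journal durably.

```python
class Journal:
    def __init__(self):
        self.entries = []  # results of completed steps, in execution order


class Context:
    def __init__(self, journal):
        self.journal = journal
        self.position = 0

    def run(self, action):
        if self.position < len(self.journal.entries):
            # Replay: the step already completed, so reuse its recorded result.
            result = self.journal.entries[self.position]
        else:
            # First execution of this step: perform it, then record the result.
            result = action()
            self.journal.entries.append(result)
        self.position += 1
        return result


calls = {"reserve": 0, "charge": 0}
crash_once = {"armed": True}


def reserve():
    calls["reserve"] += 1
    return "reservation-1"


def charge():
    calls["charge"] += 1
    return "payment-1"


def handler(ctx):
    reservation = ctx.run(reserve)
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash after step 1")
    payment = ctx.run(charge)
    return reservation, payment


def invoke(handler, journal):
    # On failure, re-execute from the top with the same journal.
    while True:
        try:
            return handler(Context(journal))
        except RuntimeError:
            continue


journal = Journal()
result = invoke(handler, journal)
```

After the simulated crash, the second attempt skips `reserve` (its result comes from the journal) and only `charge` actually runs. The sketch does not cover a crash between performing an action and recording its result, which is the case the replies below raise.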

delusional · 2y ago

> where the code doesn't have to be idempotent

Is that true? I don't think that makes any theoretical sense, since I'm pretty sure the whole thing relies on transparent retries for external calls.

If I complete some action that can't be retried and then die before writing it to the log (completing an action non-atomically), there would seem to be no way for this to recover without idempotency.

johtso · 2y ago

Doesn't anything involving requests to other services inherently have to be idempotent because there's still a chance of a communication error resulting in an unknown outcome of the action? You don't know if the "widget order" was successfully placed or not, and therefore there's no way to know if that action can safely be tried again.
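
One common way to handle the "unknown outcome" case is an idempotency key that the caller generates once and reuses on every retry, so the receiving service can deduplicate. A minimal sketch, assuming a hypothetical `WidgetService` that supports such keys (nothing here is specific to Restate):

```python
import uuid


class WidgetService:
    """Hypothetical downstream service that deduplicates on an idempotency key."""

    def __init__(self):
        self.orders = {}  # idempotency key -> order record
        self.fail_next_response = False

    def place_order(self, key, item):
        if key not in self.orders:
            # The first time this key is seen, the order is actually placed.
            self.orders[key] = {"id": f"order-{len(self.orders) + 1}", "item": item}
        if self.fail_next_response:
            # The order exists, but the caller never learns about it.
            self.fail_next_response = False
            raise ConnectionError("response lost")
        return self.orders[key]["id"]


def place_with_retry(service, item, attempts=3):
    key = str(uuid.uuid4())  # generated once, reused on every retry
    for _ in range(attempts):
        try:
            return service.place_order(key, item)
        except ConnectionError:
            continue
    raise RuntimeError("gave up after repeated connection errors")


service = WidgetService()
service.fail_next_response = True
order_id = place_with_retry(service, "widget")
```

The first attempt places the order but loses the response; the retry carries the same key, so the service returns the existing order instead of creating a second one.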

fire_lake · 2y ago

This assumes that the APIs work this way?

What if the first call is to get a resource that expires and then the last call fails?

Now it will retry but with an expired resource (first call is saved).

rubyfan · 2y ago

So non-response time bound workloads that need to reliably dispatch other processes to completion?

Would a good example be something like, automated highway toll collecting? i.e. I drive past a scanner on the highway, my license plate is scanned and several state bound collection events need to be triggered until the toll is ultimately collected?

corytheboyd · 2y ago

What if the code was changed by the time it is retried? I imagine it would have to throw away its memorized instructions, and because the code isn’t idempotent…

BenoitP · 2y ago

For context (because he's too good to brag) OP is among the original creators of Apache Flink.

Question for OP: I'd bet Flink's Statefuns comes into Restate's story. Could you please comment on this? Maybe Statefuns were sort of a plugin, and you guys wanted to rebase to the core of a distributed function?

sewen (OP) · 2y ago

Thank you!

Yes, Flink Stateful Functions were a first experiment to build a system for the use cases we have here. Specifically in Virtual Objects you can see that legacy.

With Stateful Functions, we quickly realized that we needed something built for transactions, while Flink is built for analytics. That manifests in many ways, maybe most obviously in the latency: Transactional durability takes seconds in Flink (checkpoint interval) and milliseconds in Restate.

Also, we could give Restate a very different dev ex, more compatible with modern app development. Flink comes from a data engineering side, very different set of integrations, tools, etc.

mikeqq2024 · 2y ago

Does the efficiency come from the raft implementation of distributed transactions or something else?

pavel_pt · 2y ago

I hope @sewen will expand on this but from the blog post he wrote to announce Restate to the world back in August '23:

> Stateful Functions (in Apache Flink): Our thoughts started a while back, and our early experiments created StateFun. These thoughts and ideas then grew to be much much more now, resulting in Restate. Of course, you can still recognize some of the StateFun roots in Restate.

The full post is at: https://restate.dev/blog/why-we-built-restate/

hintymad · 2y ago

I'm not sure "In Rust" serves any marketing value. A product's success rarely, if at all, has to do with the use of a programming language. I understand the arguments made by Paul Graham on the effectiveness of programming languages, but specifically for a workflow manager, a user like me cares literally zero about which programming language the workflow system uses even if I have to hack into the internals of the system, and latency really matters a lot less than throughput.

tempaccount420 · 2y ago

You are free to ignore it. Personally I like to see new projects be made in Rust, because it means they're easier to contribute to than projects in other unmanaged non-GC languages.

threeseed · 2y ago

Having spent a lot of time recently writing Rust it's a major negative for me.

It's a terrible language for concurrency and transitive dependencies can cause panics which you often can't recover from.

Which means the entire ecosystem is like sitting on old dynamite waiting to explode.

JVM really has proven itself to be by far the best choice for high-concurrency, back-end applications.

swyx · 2y ago

it does if it makes HNers click upvote...

hamandcheese · 2y ago

Is this a competitor to Temporal? I admit that I have never used either, but it strikes me as odd that these things bring their own data layer. Is the workload not possible using a general purpose [R]DBMS?

pavel_pt · 2y ago

Disclaimer: I work on Restate together with @p10jkle.

You can absolutely do something similar with an RDBMS.

I tend to think of building services in state machines: every important step is tracked somewhere safe, and causes a state transition through the state machine. If doing this by hand, you would reach out to a DBMS and explicitly checkpoint your state whenever something important happens.

To achieve idempotency, you'd end up peppering your code with prepare-commit type steps where you first read the stored state and decide, at each logical step, whether you're resuming a prior partial execution or starting fresh. This gets old very quickly and so most code ends up relying on maybe a single idempotency check at the start, and caller retries. You would also need an external task queue or a sweeper of some sort to pick up and redrive partially-completed executions.
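
For comparison, this is roughly what the hand-rolled version looks like: explicit checkpoints in a database plus a sweeper that redrives partially completed executions. The table layout, step names, and function names are illustrative assumptions, and SQLite stands in for whatever DBMS you would actually use.

```python
import sqlite3

STEPS = ["reserve", "charge", "ship"]


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE progress (order_id TEXT PRIMARY KEY, done INTEGER NOT NULL)")
    return db


def run_order(db, order_id, execute_step):
    # Read the stored state to decide whether this is a fresh start or a resume.
    row = db.execute("SELECT done FROM progress WHERE order_id = ?", (order_id,)).fetchone()
    if row is None:
        db.execute("INSERT INTO progress VALUES (?, 0)", (order_id,))
        db.commit()
    done = row[0] if row else 0
    for index in range(done, len(STEPS)):
        execute_step(STEPS[index])
        # Checkpoint after every step so a later resume skips completed work.
        db.execute("UPDATE progress SET done = ? WHERE order_id = ?", (index + 1, order_id))
        db.commit()


def sweep(db, execute_step):
    # The "sweeper": find partially completed executions and redrive them.
    rows = db.execute("SELECT order_id FROM progress WHERE done < ?", (len(STEPS),)).fetchall()
    for (order_id,) in rows:
        run_order(db, order_id, execute_step)


executed = []
crash = {"armed": True}


def flaky_step(step):
    if step == "charge" and crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated crash")
    executed.append(step)


db = open_db()
try:
    run_order(db, "order-1", flaky_step)
except RuntimeError:
    pass  # the process "died" mid-execution
sweep(db, flaky_step)
```

Even in this tiny form, the business logic is tangled up with state reads, checkpoints, and a separate recovery path, which is the boilerplate a journal-plus-SDK approach is meant to absorb.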

The beauty of a complete purpose-built system like Restate is that it gives you a durable journal service that's designed for the task of tracking executions, and also provides you with an SDK that makes it very easy to achieve the "chain of idempotent blocks" effect without hand-rolling a giant state machine yourself.

You don't have to use Restate to persist data, though you can - and you get the benefit of having the state changes automatically commit with the same isolation properties as part of the journaling process. But you could easily orchestrate writes into external stores such as RDBMS, K-V, queues with the same guaranteed-progress semantics as the rest of your Restate service. Its execution semantics make this easier and more pleasant as you get retries out of the box.

Finally, it's worth mentioning that we expose a PostgreSQL protocol-compatible SQL query endpoint. This allows you to query any state you do choose to store in Restate alongside service metadata, i.e. reflect on active invocations.

sewen (OP) · 2y ago

That's definitely a good question. A few thoughts here (I am one of the authors). The "bring your own data layer" has several goals:

(1) it is really helpful in getting good latencies.

(2) it makes it self-contained, so easy to start and run anywhere

(3) There is a simplicity in the deeply integrated architecture, where consensus of the log, fencing of the state machine leaders, etc. goes hand in hand. It removes the need to coordinate between different components with different paradigms (pub-sub-logs, SQL databases, etc) that each have their own consistency/transactions. And coordination avoidance is probably the best one can do in distributed systems. This ultimately leads also to an easier to understand behavior when running/operating the system.

(4) The storage is actually pluggable, because the internal architecture uses virtual consensus. So if the biggest ask from users would be "let me use Kafka or SQS FIFO" then that's doable.

We'd love to go about this the following way: We aim to provide an experience that users would end up preferring to maintaining multiple clusters of storage systems (like Cassandra + ElasticSearch + X server and Y queues) through this integrated design. If that turns out to not be what anyone wants, we can still relatively easily work with other systems.

AhmedSoliman · 2y ago

Nothing prevents you from using your own data layer, but part of the power of Restate is the tight control over the short-term state and the durable execution flow. This means that you don't need to think a lot about concurrency control, dirty reads, etc.

sharkdoodoo · 2y ago

I understand the need for writing this as an SDK over existing languages for adoption reasons, but in your opinion would a programming language purposely built for such a paradigm make more sense?

slinkydeveloper · 2y ago

(Disclaimer: I work at Restate on SDKs) This is a very interesting point. I did some investigation myself, and so far I'm torn on whether a novel language would really make such a big difference for durable execution engines like Restate.

Let me elaborate: first of all, what would be the killer feature that justifies creating a whole new PL for durable execution? From what I can tell, the thing that IMO can really make a difference would be the ability to completely hide durable execution from the user, by being able to take snapshots of the execution at any point in time and then record those in the engine transparently. Now let's say such a language exists, and it can also take those snapshots reasonably fast, it is still quite a problem to establish where it's logically safe to take a snapshot, and when the execution cannot continue because you need to wait for acknowledgment of stored results. Say for example you have the following code:

    val resultA = callA()
    val resultB = callB(resultA)

Both A and B do some non-deterministic operation, e.g. they perform HTTP calls to some other systems. Now let's say that callB() completed, but before you got the HTTP response, your code for whatever reason crashes. If you didn't take any snapshot between callA() and callB(), you will completely lose forever the fact that B was invoked with resultA, and the next time you re-execute A, it might generate a result that is different from the one that was generated the first time. Due to this problem, you would still need to somehow manually define some "safepoints" where it's safe to take those snapshots. Meaning that we can't really hide the durable execution from the user, as you would still need some statement like "snapshot_here" to tell the engine where it's safe to snapshot or not.

In our SDKs we effectively implement that, by taking the safe approach of always waiting for storage acknowledgement when you execute two consecutive ctx.run().

But happy to be proven wrong!

sharkdoodoo · 2y ago

Oh wow, thanks for the depth in your reply. well I don't know anything about programming languages but something just made me ask this question out of curiosity. I may have to play around with restate a bit

p10jkle · 2y ago

Super interesting question! If we were inventing modern tech from scratch, I think there's space for this, definitely. Our goal though is that people can use their primitives in the systems they have already, which means Java, Go, Python, TS support are all table stakes

ko_pivot · 2y ago

Being fairly familiar with Temporal, I definitely appreciate your cleaner architectural choices. Add a Go SDK and I’ll definitely give this a try.

p10jkle · 2y ago

Someone already contributed an MVP; in the next few months we'll be adopting it fully and upgrading it to 1.0 (we hired the awesome Azmy after he built it) https://github.com/muhamadazmy/restate-sdk-go

abtinf · 2y ago

I think this is a point worth highlighting much more prominently in your marketing.

In my mind, this moved restate from “huh, that’s cool” to “during tomorrow’s standup, I’m going to ask one of my engineers to build a poc.”

caust1c · 2y ago

When you do consider adopting it fully, I highly recommend trying to make the state handling as transparent as possible to the end consumer. For example, implementing an HTTP Client that wraps a http.RoundTripper versus what that SDK provides.

Evaluating a selection of these durable workflow SDKs for Go, I'm not keen on being tightly coupled to a vendor and the implementation shouldn't be that crazy to fit into existing Go interfaces.

senorrib · 2y ago

Looks very interesting, but calling it Open Source is misleading. BSL is not "minimal Amazon defense". It effectively prevents any meaningful dynamic functionality to be built on top of it without a commercial subscription.

stsffap · 2y ago

We tried to design the additional usage grant (https://github.com/restatedev/restate/blob/39f34753be0e27af8...) to be as permissive as possible. Our intention is only to prevent the big cloud service providers from offering Restate as a managed service, as has happened in the past with other open source projects. If you find the additional usage grant still too restrictive, then let us talk about how to adjust it to enable you while still maintaining our initial intention.

senorrib · 2y ago

Our use case is to allow users to customize workflows based on a few building blocks. Think of an ERP that would allow users to add or remove steps or different paths to their payroll workflow, for example.

The wording in the additional grant labels software like this as an Application Platform Service -- which is fair, and perhaps intended, but we're still not a big cloud service provider.

bilalq · 2y ago

Could you share details on limits to be mindful of when designing workflows? Some things I'd love to be able to reference at a glance:

1. Max execution duration of a workflow

2. Max input/output payload size in bytes for a service invocation

3. Max timeout for a service invocation

4. Max number of allowed state transitions in a workflow

5. Max Journal history retention time

stsffap · 2y ago

1. There is no maximum execution duration for a Restate workflow. Workflows can run only for a few seconds or span months with Restate. One thing to keep in mind for long-running workflows is that you might have to evolve the code over its lifetime. That's why we recommend writing them as a sequence of delayed tail calls (https://news.ycombinator.com/item?id=40659687)

2. Restate currently does not impose a strict size limit for input/output messages by default (though it has the option to limit it to protect the system). Nevertheless, it is recommended to not go overboard with the input/output sizes because Restate needs to send the input messages to the service endpoint in order to invoke it. Thus, the larger the input/output sizes, the longer it takes to invoke a service handler and send the result back to the user (increasing latency). Right now we do issue a soft warning whenever a message becomes larger than 10 MB.

3. If the user does not specify a timeout for their call to Restate, then the system won't time it out. Of course, for long-running invocations it can happen that the external client fails or its connection gets interrupted. In this case, Restate allows you to re-attach to an ongoing invocation or to retrieve its result if it completed in the meantime.

4. There is no limit on the max number of state transitions of a workflow in Restate.

5. Restate keeps the journal history around for as long as the invocation/workflow is ongoing. Once the workflow completes, we will drop the journal but keep the completed result for 24 hours.

sewen (OP) · 2y ago

For many of those values, the answer would be "as much as you like", but with awareness for tradeoffs.

You can store a lot of data in Restate (workflow events, steps). Logged events move quickly to an embedded RocksDB, which is very scalable per node. The architecture is partitioned, and while we have not finished all the multi-node features yet, everything internally is built in a partitioned scalable manner.

So it is less a question of what the system can do, maybe more what you want:

- if you keep tens of thousands of journal entries, replays might take a bit of time. (Side note, you also don't need that, Restate's support for explicit state gives you an intuitive alternative to the "forever running infinite journal" workflow pattern some other systems promote.)

- Execution duration for a workflow is not limited by default. It is more a question of how long you want to keep instances of older versions of the business logic around.

- History retention (we do this only for tasks of the "workflow" type right now) can be as long as you are willing to pay for in storage. RocksDB is decent at letting old data flow down the LSM tree and not get in the way.

Coming up with the best possible defaults would be something we'd appreciate some feedback on, so would love to chat more on Discord: https://discord.gg/skW3AZ6uGd

The only one where I think we need (and have) a hard limit is the message size, because this can adversely affect system stability, if you have many handlers with very large messages active. This would eventually need a feature like out-of-band transport for large messages (e.g., through S3).

bilalq · 2y ago

I still haven't gotten around to adopting Restate yet, but it's on the radar. One thing that Step Functions probably has over Restate is the diagram visualization of your state machine definition and execution history. It's been really neat to be able to zero in on a root cause at the conceptual level instead of the implementation level.

One big hangup for me is that there's only a single node orchestrator as a CDK construct. Having a HA setup would be a must for business critical flows.

I stumbled on Restate a few months ago and left the following message on their discord.

> I was considering writing a framework that would let you author AWS Step Functions workflows as code in a typesafe way when I stumbled on Restate. This looks really interesting and the blog posts show that the team really understands the problem space.

> My own background in this domain was as an early user of AWS SWF internally at AWS many, many years ago. We were incredibly frustrated by the AWS Flow framework built on top of SWF, so I ended up creating a meta Java framework that let you express workflows as code with true type-safety, arrow function based step delegations, and leveraging Either/Maybe/Promise and other monads for expressiveness. The DX was leaps and bounds better than anything else out at the time. This was back around 2015, I think.

> Fast-forward to today, I'm now running a startup that uses AWS Step Functions. It has some benefits, the most notable being that it's fully serverless. However, the lack of type-safety is incredibly frustrating. An innocent looking change can easily result in States.Runtime errors that cannot be caught and ignore all your catch-error logic. Then, of course, is how ridiculous it feels to write logic in JSON or a JSON-builder using CDK. As if that wasn't bad enough, the pricing is also quite steep. $25 for every million state transitions feels like a lot when you need to create so many extra state transitions for common patterns like sagas, choice branches, etc.

> I'm looking forward to seeing how Restate matures!

p10jkle · 2y ago

A visualisation/dashboard is a top priority! Distributed architecture (to support multiple nodes for HA and horizontal scaling) is being actively worked on and will land in the coming months

bilalq · 2y ago

That's exciting!

Out of curiosity, have you explored the possibility of a serverless orchestration layer? That's one of the most appealing parts of Step Functions. We have many large workflows that run just a couple times a day and take several hours alongside a few short workflows that run under a minute and are executed more frequently during peak hours. Step Functions ends up being really cost effective even through many state transitions because most of the time, the orchestrator is idle.

Coming from an existing setup where everything is serverless, the fixed cost to add serverful stuff feels like a lot. For an HA setup, it'd be 3 EC2 instances and 3 NAT gateways spread across 3 AZs. Then multiply that for each environment and dev account, and it ends up being pretty steep. You can cut costs a bit by going single AZ for non-prod envs, but still...

I couldn't find a pricing model for Restate Cloud, but I'm including "managed services" under the definition of serverless for my purposes. Maybe that offering can fill the gap, but then it does raise security concerns if the orchestration is not happening on our own infra.
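
For a sense of scale on the per-transition side of that trade-off, here is the arithmetic using the $25 per million state transitions figure quoted earlier in this thread. The workload numbers in the example are made up for illustration only.

```python
PRICE_PER_MILLION_TRANSITIONS = 25.00  # USD, the figure quoted earlier in this thread


def monthly_cost(executions_per_day, transitions_per_execution, days=30):
    """Cost in USD of the state transitions alone, ignoring any other charges."""
    transitions = executions_per_day * transitions_per_execution * days
    return transitions * PRICE_PER_MILLION_TRANSITIONS / 1_000_000


# Hypothetical workload: 10,000 executions per day, 40 transitions each.
# 10,000 * 40 * 30 = 12,000,000 transitions per month.
example = monthly_cost(10_000, 40)
```

With those assumed numbers the transitions come to 12 million per month, i.e. $300; a workload that runs only a couple of times a day would cost a small fraction of that, which is the idle-orchestrator advantage described above.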

aleksiy123 · 2y ago

Looks really awesome. Always been looking for some easy to use async workflows + cronjobs service to use with serverless like Vercel.

Also something about this area always makes me excited. I guess it must be the thought of having all these tasks just working in the background without having to explicitly manage them.

One question I have is does anyone have experience for building data pipelines in this type of architecture?

Does it make sense to fan out on lots of small tasks? Or is it better to batch things into bigger tasks to reduce the overhead.

stsffap · 2y ago

While Restate is not optimized for analytical workloads it should be fast enough to also use it for simpler analytical workloads. Admittedly, it currently lacks a fluent API to express a dataflow graph but this is something that can be added on top of the existing APIs. As @gvdongen mentioned a scatter-gather like pattern can be easily expressed with Restate.

Regarding whether to parallelize or to batch, I think this strongly depends on what the actual operation involves. If it involves some CPU-intensive work like model inference, for example, then running more parallel tasks will probably speed things up.

gvdongen · 2y ago

Here is a fan-out example for async tasks: https://docs.restate.dev/use-cases/async-tasks#parallelizing... First, a number of tasks are scheduled, and then their results are collected (fan-in). This probably comes closest to what you are looking for. Each of those tasks gets executed durably, and their execution tracked by Restate.
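
The shape of the fan-out/fan-in pattern, stripped of any durability, looks roughly like this. It is a plain in-process sketch using Python's standard thread pool, not the Restate example linked above; `process` is a placeholder for the real per-task work.

```python
from concurrent.futures import ThreadPoolExecutor


def process(task):
    # Placeholder for the real per-task work (an API call, a transformation, ...).
    return task * task


def fan_out_fan_in(tasks, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan-out: schedule every task without waiting for the others.
        futures = [pool.submit(process, task) for task in tasks]
        # Fan-in: collect the results in submission order.
        return [future.result() for future in futures]


results = fan_out_fan_in([1, 2, 3, 4])
```

In a durable execution system, each scheduled task and each collected result would additionally be tracked by the runtime, so a crash between fan-out and fan-in does not lose or repeat completed tasks.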

netvarun · 2y ago

Feedback: everybody’s question is going to be why this over Temporal? I’ve noticed you answered a little bit of that below. My advice would be to write a detailed blog post, maybe on how both systems compare from installation to use cases and administration, etc. I’ve been following your blog, and while I think y’all are doing interesting stuff, I still haven’t wrapped my head around how exactly Restate is different from Temporal, which is a lot better funded, has almost every unicorn using it, and is fully permissively licensed.

sewen (OP) · 2y ago

That blog post should exist, agree. Here is an attempt at a short answer (with the caveat that I am not an expert in Temporal).

(1) Restate has latencies that to the best of my knowledge are not achievable with Temporal. Restate's latencies are low because of (a) its event-log architecture and (b) the fact that Restate doesn't need to spawn tasks for activities, but calls RPC handlers.

(2) Restate works really well with FaaS. FaaS needs essentially a "push event" model, which is exactly what Restate does (push event, call handler). IIRC, Temporal has a worker model that pulls tasks, and a pull model is not great for FaaS. Restate + AWS Lambda is actually an amazing task queue that you can submit to super fast and that scales out its workers virtually infinitely automatically (Lambda).

(3) Restate is a self-contained single binary that you download and start and you are done. I think that is a vastly different experience from most systems out there, not just Temporal. Why do app developers love Redis so much, despite its debatable durability? I think what they love is how insanely lightweight it is, and this is what we want to replicate (with proper durability, though).

(4) Maybe most importantly, Restate does much more than workflows. You can use it for just workflows, but you can also implement services that communicate durably (exactly-once RPC), maintain state in an actor-style manner (via virtual objects), or ingest events from Kafka.

This is maybe not the first thing you build, but it shows you how far you can take this if you want: It is a full app with many services, workflows, digital twins, some connect to Kafka. https://github.com/restatedev/examples/tree/main/end-to-end-...

All execution and communication is async, durable, and reliable. I think that kind of app would be very hard to build with Temporal. If you did build it, you'd probably rely on some really weird quirks around signals (for example, when building the state maintenance of the digital twin), and the result would not be something most app developers would find intuitive.
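To make the "exactly-once RPC" idea in point (4) a bit more concrete, here is a minimal sketch of one common technique: deduplicating calls by an idempotency key. This is not a description of how Restate implements it. `DedupingCaller` is a hypothetical name, and the in-memory `Map` stands in for what would have to be a durable store that is updated atomically with the call.

```typescript
// Hypothetical in-memory stand-in for a durable store of completed calls.
class DedupingCaller<R> {
  private readonly completed = new Map<string, R>();

  // Runs `handler` at most once per idempotency key. A repeated delivery
  // with the same key gets the recorded result instead of a second execution.
  call(idempotencyKey: string, handler: () => R): R {
    if (this.completed.has(idempotencyKey)) {
      return this.completed.get(idempotencyKey) as R;
    }
    const result = handler();
    this.completed.set(idempotencyKey, result);
    return result;
  }
}
```

The sketch ignores the hard part: a crash between running the handler and recording its result. Closing that gap is what requires a durable log or a transactional store.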

netvarun2y ago

Thanks for the detailed answer - please turn it into a blog post! Excited to see competition and different architectural approaches to tackle durable execution. Wishing you all the very best!

mikelnrd2y ago· 2 in thread

Hi. I'm excited to try this out. Does the TypeScript library for writing Restate services run in Deno? And how about in a Cloudflare Worker? These aren't quite Node.js environments, but they both offer compatibility layers that make most Node.js libraries work. Just wondering if you know whether the SDK will run in those runtimes? Thanks

p10jkle2y ago

Hey! I managed to get a POC running on Cloudflare Workers. I had to make some small changes to the SDK, e.g. to remove the http2 import, convert the Cloudflare request type into the Lambda request type, and add some methods to the Buffer type. I suspect similar things would be needed on Deno platforms. We have it on our todo list (scheduled within weeks, not months) to make it possible to import a version of the library that just works out of the box on these platforms. I think if we had someone with a use case asking for it, we would happily build that even sooner - maybe come chat in our Discord? https://discord.gg/skW3AZ6uGd

Once the http2 stuff is removed, our library does nothing particularly odd that wouldn't work on all platforms, but I'm sure there will be some papercuts until we are actively testing against these targets

tonyhb2y ago

Disclaimer: I work for Inngest (https://www.inngest.com), which works in the same area and released 2 years ago.

The restate API is extremely similar to ours, and because of the similarities both Restate and Inngest should work on Bun, Deno, or any runtime/cloud. We most definitely do, and have users in production on all TS runtimes in every cloud (GCP, Azure, AWS, Vercel, Netlify, Fly, Render, Railway, Cloudflare, etc).

akbirkhan2y ago· 2 in thread

Nice! Excited about tools that make using microservices easier.

Question tho, when will you guys have Python support? I’m an ML researcher and can tell you that most of my work is now pipelines between different services, e.g. chaining multiple LLM services. The big bottleneck is when one service returns an error and crashes the full chain.

Big fan of this work nevertheless. Just think you have alpha on the table

pavel_pt2y ago

We don't have specific plans for our next SDK to build, but Python definitely comes up often - thank you for the input!

p10jkle2y ago

Probably one of our two most requested languages. We absolutely are going to do it, probably in the next 6-12 months :)

jamifsud2y ago· 2 in thread

Any plans for a Python SDK? We’re actively looking for a platform like this but our stack is TS / Python!

stsffap2y ago

We are actively looking for feedback on what SDK to develop next. Quite a few people have voiced interest in Python so far. This will make it more likely that we might tackle this soonish. We'll keep you posted.

mikeqq20242y ago

Python SDK +1

mnahkies2y ago· 2 in thread

Do you have anything comparing and contrasting with temporal?

I'm particularly interested in the scaling characteristics, and how your approach to durable storage (seems no external database is required?) differs

stsffap2y ago

We will create a more detailed comparison to Temporal shortly. Until then @sewen gave a nice summarizing comparison here: https://news.ycombinator.com/item?id=40660568.

And yes, Restate does not have any external dependencies. It comes as a single self-contained binary that you can easily deploy and operate wherever you are used to running your code.

mnahkies2y ago

Nice thanks. The ability to use cloud functions/lambdas is certainly intriguing and something I'd hoped would be possible with temporal when I first discovered it.

In a multi-replica / horizontally scaled setup:

- Does each replica get its own independent storage volume?

- How much state is replicated between each volume if so?

- What does a typical workload's journal look like in terms of storage size? How often does this get compacted / is there archival to cold storage?

- How do you manage upgrades to restate in the case of a change to the on-disk format? (Or is this designed to be static between releases)

The fact that it's without external dependencies also makes me wonder if it would be encouraged to have multiple independent deployments of restate managing different independent services/workflows to avoid noisy neighbors and single points of failure - seems like it might be lightweight enough that this is practical?

whoiskatrin2y ago· 2 in thread

The cloud setup was super fast! I used it for an existing app + restate TS sdk, really took a few steps to get things up and running! Looking forward to more support for nextjs/node

pavel_pt2y ago

Appreciate the feedback! What kind of support do you wish for, if there was one thing you would prioritize?

whoiskatrin2y ago

Pull handlers would make integration much easier, I think

dovys2y ago· 2 in thread

Handling durability for RPCs is a neat idea. Can you do chained rollbacks? I.e., if an RPC down the call stack fails, can it revert the whole stack instead of retrying?

gvdongen2y ago

Here is another example in the examples repo which does compensation. There is also a Java one https://github.com/restatedev/examples/blob/main/basics/basi...

p10jkle2y ago

We talk a bit about compensations in this post: https://restate.dev/blog/graceful-cancellations-how-to-keep-... The gist is that you can just use catch statements and put the rollback logic in them. Restate guarantees that handlers run to the end, so there's no risk that execution somehow won't reach the catch statement due to an infra failure. So catch, rethrow, and then the compensations will run all the way up the stack.
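The catch-and-rethrow pattern described above can be sketched in plain TypeScript as follows. The step names (`reserveFlight`, `cancelFlight`, `chargeCard`) are hypothetical, and the snippet has no durability of its own. The guarantee that the catch block is actually reached after an infrastructure failure is what the comment attributes to Restate, not something this code provides.

```typescript
// Records what happened, so the order of actions and undos is visible.
const log: string[] = [];

// Hypothetical steps: an action, its matching undo, and a step that fails.
function reserveFlight(): void { log.push("flight reserved"); }
function cancelFlight(): void { log.push("flight cancelled"); }
function chargeCard(): void { throw new Error("card declined"); }

// Inner handler: has nothing of its own to undo, it just fails.
function pay(): void {
  chargeCard();
}

// Outer handler: undoes its own work in the catch block, then rethrows
// so that callers further up the stack can run their compensations too.
function bookTrip(): void {
  reserveFlight();
  try {
    pay();
  } catch (err) {
    cancelFlight(); // compensation for reserveFlight
    throw err; // propagate the failure up the stack
  }
}
```

Each level only knows how to undo its own work, which keeps the rollback logic next to the action it reverses.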

jiehong2y ago· 2 in thread

This seems interesting!

I couldn’t find an equivalent of the codec server in temporal that basically encrypts all data in the event log. Is there something similar?

stsffap2y ago

Currently, Restate does not support this functionality out of the box. Since Restate does not need access to input/output messages or state (it ships it as bytes to the service endpoint), you could add your own client-side encryption mechanism. In the foreseeable future, Restate will probably add a more integrated solution for it.
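As a rough idea of what "your own client-side encryption mechanism" could look like, here is a minimal sketch that seals a payload with AES-256-GCM using Node's built-in `crypto` module before it would be handed to the runtime as bytes. The function names and the byte layout (IV, then auth tag, then ciphertext) are assumptions of this sketch, not a Restate feature or format. Key management is left out entirely.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts a payload with AES-256-GCM. `key` must be 32 bytes.
// Layout assumed by this sketch: 12-byte IV | 16-byte auth tag | ciphertext.
function encryptPayload(key: Buffer, plaintext: Buffer): Buffer {
  const iv = randomBytes(12); // fresh IV for every message
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

// Reverses encryptPayload; throws if the data or the tag was tampered with.
function decryptPayload(key: Buffer, sealed: Buffer): Buffer {
  const iv = sealed.subarray(0, 12);
  const tag = sealed.subarray(12, 28);
  const ciphertext = sealed.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

Because the runtime only ever sees the sealed bytes, this kind of wrapper would sit in the service and client code, at the points where inputs, outputs, and state are serialized.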

p10jkle2y ago

We haven’t built any client side encryption tools yet. I don’t think it would be particularly difficult to do an MVP. If it’s very important to your use case, come chat to us in Discord? https://discord.com/invite/skW3AZ6uGd

sharkdoodoo2y ago· 2 in thread

Are there any theoretical underpinnings in the design of restate? Any papers/references. Thanks!

stsffap2y ago

Restate is built as a sharded replicated state machine similar to how TiKV (https://tikv.org/), Kudu (https://kudu.apache.org/kudu.pdf) or CockroachDB (https://github.com/cockroachdb/cockroach) are designed. Instead of relying on a specific consensus implementation, we have decided to encapsulate this part into a virtual log (inspired by Delos https://www.usenix.org/system/files/osdi20-balakrishnan.pdf) since it makes it possible to tune the system more easily for different deployment scenarios (on-prem, cloud, cost-effective blob storage). Moreover, it allows for some other cool things like seamlessly moving from one log implementation to another. Apart from that the whole system design has been influenced by ideas from stream processing systems such as Apache Flink (https://flink.apache.org/), log storage systems such as LogDevice (https://logdevice.io/) and others.

We plan to publish a more detailed follow-up blog post where we explain why we developed a new stateful system, how we implemented it, and what the benefits are. Stay tuned!
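A toy model may help picture the "state machine on top of a virtual log" idea mentioned above. The sketch below is only inspired by that description: `Log`, `InMemoryLog`, `VirtualLog`, `switchTo`, and `replayCounter` are hypothetical names, not the API of Restate, Delos, or LogDevice, and there is no replication or sharding. It shows just two properties: the state machine depends only on a log interface, and the backing log can be swapped while readers still see one continuous sequence.

```typescript
// Hypothetical minimal log interface; the state machine depends only on this.
interface Log<E> {
  append(entry: E): number; // returns the entry's sequence number
  read(fromSeq: number): E[]; // entries with sequence number >= fromSeq
}

// One possible backing implementation: an in-memory array.
class InMemoryLog<E> implements Log<E> {
  private readonly entries: E[] = [];
  append(entry: E): number {
    this.entries.push(entry);
    return this.entries.length - 1;
  }
  read(fromSeq: number): E[] {
    return this.entries.slice(fromSeq);
  }
}

// Virtual log: a chain of segments, each backed by its own log.
// Assumes every backing log passed in is empty and not shared.
class VirtualLog<E> implements Log<E> {
  private readonly segments: { base: number; log: Log<E> }[];
  private nextSeq = 0;

  constructor(initial: Log<E>) {
    this.segments = [{ base: 0, log: initial }];
  }

  append(entry: E): number {
    const active = this.segments[this.segments.length - 1];
    active.log.append(entry);
    return this.nextSeq++;
  }

  // Seal the active segment and direct all later appends to `next`.
  switchTo(next: Log<E>): void {
    this.segments.push({ base: this.nextSeq, log: next });
  }

  read(fromSeq: number): E[] {
    const out: E[] = [];
    for (const { base, log } of this.segments) {
      // Translate the global sequence number into the segment's local one.
      out.push(...log.read(Math.max(0, fromSeq - base)));
    }
    return out;
  }
}

// A deterministic state machine: replaying the same log yields the same state.
function replayCounter(log: Log<number>): number {
  return log.read(0).reduce((sum, delta) => sum + delta, 0);
}
```

In a real system each segment could be a different implementation (local disk, a consensus group, blob storage), which is the tuning flexibility the comment refers to.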

AhmedSoliman2y ago

It’s a mixed bag of design ideas. There is definitely inspiration from LogDevice (disclaimer: I am one of the LogDevice designers) and from Delos (for Bifrost, our distributed log design). You can read about Delos in https://www.usenix.org/system/files/osdi20-balakrishnan.pdf

_1tan2y ago· 2 in thread

Cool, congrats on launching! Could this replace Jobrunr?

stsffap2y ago

From a quick glance at what JobRunr does (especially running asynchronous/delayed background tasks), it seems that Restate would be a very good fit for it as well. Restate will also handle persistence for you w/o having to deploy & operate a separate RDBMS or NoSQL store. Note that I am not a JobRunr expert, though.

p10jkle2y ago

Thanks! I'm not familiar with JobRunr, but we can definitely help with orchestrating async tasks (as well as sync RPC calls), especially if it's important that they run to completion

magnio2y ago· 1 in thread

How does Restate compare with Apache Airflow or Prefect?

sewenOP2y ago

Disclaimer, I am not an Airflow expert and even less of a Prefect expert.

One difference is that Airflow seems geared towards heavier operations, like in data pipelines. In contrast, Restate does not spawn any tasks by default; it acts more like a proxy/broker for RPC or event handlers and adds durable retries, journaling, the ability to make durable RPCs, etc.

That makes it quite lightweight: if the handler is fast in a running container, the whole thing results in super fast turnaround times (milliseconds).

You can also deploy the handlers on FaaS and basically get the equivalent of spawning a (serverless) task per step.

The other difference would be the way the logic is defined: handlers are regular code that can maintain state and make exactly-once calls to other handlers.
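The combination of "durable retries" and "journaling" can be illustrated with a small sketch: each step's result is recorded, and when the handler is retried, already-recorded steps are replayed from the journal instead of being executed again. `Journal` and `Context` are hypothetical names for this illustration. The journal here is an in-memory array, whereas a real system would persist it, and this is not the Restate SDK's actual interface.

```typescript
// Hypothetical journal: results of completed steps, in execution order.
type Journal = unknown[];

class Context {
  private readonly journal: Journal;
  private cursor = 0;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // Runs `step` and records its result. On a retry, a step whose result is
  // already in the journal is skipped and the recorded result is returned.
  run<T>(step: () => T): T {
    if (this.cursor < this.journal.length) {
      return this.journal[this.cursor++] as T;
    }
    const result = step();
    this.journal.push(result);
    this.cursor++;
    return result;
  }
}
```

A retry simply creates a new `Context` over the same journal and calls the handler again from the top. This only works if the handler issues its steps in the same order on every attempt.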

qwertyuiop_2y ago· 1 in thread

Looks cool. Just out of curiosity, where did you find the template for your homepage? is there a content framework you are using ?

p10jkle2y ago

I don't think it's a template, I'm afraid! It's a Webflow site though

johtso2y ago· 1 in thread

The label "Sign in with your corporate ID" for GitHub sign in seems a little odd..

p10jkle2y ago

I think it's a Cognito default - will take a look!

sewenOP2y ago

A few links worth sharing here:

- Blog post with an overview of Restate 1.0: https://restate.dev/blog/announcing-restate-1.0-restate-clou...

- Restate docs: https://docs.restate.dev/

- Discord, for anyone who wants to chat interactively: https://discord.com/invite/skW3AZ6uGd

azmy2y ago

I have been following this project for a while; I tried an older version and it was already amazing. I am so excited to try this version out, especially with the cloud offering!

p10jkle2y ago

Hey all, I work with @sewen, and I focus on the cloud platform which also launched today (https://restate.dev/blog/announcing-restate-cloud-early-acce...) Happy to answer any questions :)

AhmedSoliman2y ago

"Virtual Objects" is a cool concept, the name might not reflect the power it brings though. Luckily, the documentation seems to explain it well.

swyx2y ago

techcrunch announcement here as well https://techcrunch.com/2024/06/12/restate-raises-7m-for-its-...

exabrial2y ago

Looks awesome! Have you ever considered EPL for an Amazon defense?

yaj542y ago· 8 in thread

p10jkle2y ago

I wrote two blog posts on this! It's a really hard problem

https://restate.dev/blog/solving-durable-executions-immutabi...

https://restate.dev/blog/code-that-sleeps-for-a-month/

The key takeaways:

delusional2y ago

> Immutable code platforms (like Lambda) make things much more tractable

yaj542y ago

I'll have to think through how much that solves, but it's a new insight for me - thanks!

I like that you're working on this. seems tricky, but figuring out how to clearly write workflows using this pattern could tame a lot of complexity.

rockostrich2y ago

We don't actually tie service releases 1:1 with the workflow versions just in case we need a hotfix for a given workflow version, but the general pattern has worked very well for our use cases.

p10jkle2y ago

Yeah, this is pretty much exactly how we propose it's done (Restate services are inherently versioned: you can register new code as a new version, and old invocations will go to the old version).

More discussion: https://restate.dev/blog/code-that-sleeps-for-a-month/
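The versioning behaviour described above (new code registered as a new version, old invocations continuing on the old one) can be modelled with a small registry. This is a hedged sketch with hypothetical names (`VersionedService`, `register`, `start`, `resume`); it does not reflect how Restate stores or routes deployments internally.

```typescript
type Handler = (input: string) => string;

// Hypothetical registry: each registered deployment gets the next version number.
class VersionedService {
  private readonly versions: Handler[] = [];
  private readonly pinned = new Map<string, number>();

  // Registers new code as a new version and returns its version number.
  register(handler: Handler): number {
    this.versions.push(handler);
    return this.versions.length - 1;
  }

  // A new invocation is pinned to the latest version at the time it starts.
  start(invocationId: string): void {
    if (this.versions.length === 0) {
      throw new Error("no version registered");
    }
    this.pinned.set(invocationId, this.versions.length - 1);
  }

  // Resuming (e.g. after a long sleep or a retry) uses the pinned version,
  // even if newer versions were registered in the meantime.
  resume(invocationId: string, input: string): string {
    const version = this.pinned.get(invocationId);
    if (version === undefined) {
      throw new Error(`unknown invocation ${invocationId}`);
    }
    return this.versions[version](input);
  }
}
```

The trade-off is that old versions have to stay deployed for as long as invocations pinned to them are still in flight.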

delusional2y ago

pavel_pt2y ago

p10jkle2y ago

rubyfan2y ago· 6 in thread

There’s a lot of jargon in this, is there a lay person explanation of what problem this solves?

p10jkle2y ago

delusional2y ago

> where the code doesn't have to be idempotent

Is that true? I don't think that makes any theoretical sense, since I'm pretty sure the whole thing relies on transparent retries for external calls.

If I complete some action that can't be retried and then die before writing it to the log (completing an action unatomically) there would seem to be no way for this to recover without idempotency.

johtso2y ago

fire_lake2y ago

This assumes that the APIs work this way?

What if the first call is to get a resource that expires and then the last call fails?

Now it will retry but with an expired resource (first call is saved).

rubyfan2y ago

So non-response time bound workloads that need to reliably dispatch other processes to completion?

corytheboyd2y ago

What if the code was changed by the time it is retried? I imagine it would have to throw away its memorized instructions, and because the code isn’t idempotent…

BenoitP2y ago· 3 in thread

For context (because he's too good to brag) OP is among the original creators of Apache Flink.

sewenOP2y ago

Thank you!

Yes, Flink Stateful Functions were a first experiment to build a system for the use cases we have here. Specifically in Virtual Objects you can see that legacy.

Also, we could give Restate a very different dev ex, more compatible with modern app development. Flink comes from a data engineering side, very different set of integrations, tools, etc.

mikeqq20242y ago

Does the efficiency come from the raft implementation of distributed transactions or something else?

pavel_pt2y ago

I hope @sewen will expand on this but from the blog post he wrote to announce Restate to the world back in August '23:

The full post is at: https://restate.dev/blog/why-we-built-restate/

hintymad2y ago· 3 in thread

tempaccount4202y ago

You are free to ignore it. Personally I like to see new projects be made in Rust, because it means they're easier to contribute to than projects in other unmanaged non-GC languages.

threeseed2y ago

Having spent a lot of time recently writing Rust it's a major negative for me.

It's a terrible language for concurrency and transitive dependencies can cause panics which you often can't recover from.

Which means the entire ecosystem is like sitting on old dynamite waiting to explode.

JVM really has proven itself to be by far the best choice for high-concurrency, back-end applications.

swyx2y ago

it does if it makes HNers click upvote...

hamandcheese2y ago· 3 in thread

pavel_pt2y ago

Disclaimer: I work on Restate together with @p10jkle.

You can absolutely do something similar with a RDBMS.

sewenOP2y ago

That's definitely a good question. A few thoughts here (I am one of the authors). The "bring your own data layer" has several goals:

(1) it is really helpful in getting good latencies.

(2) it makes it self-contained, so easy to start and run anywhere

(4) The storage is actually pluggable, because the internal architecture uses virtual consensus. So if the biggest ask from users would be "let me use Kafka or SQS FIFO" then that's doable.

AhmedSoliman2y ago

sharkdoodoo2y ago· 3 in thread

I understand the need for writing this as an SDK over existing languages for adoption reasons, but in your opinion would a programming language purposely built for such a paradigm make more sense?

slinkydeveloper2y ago

    val resultA = callA()
    val resultB = callB(resultA)

In our SDKs we effectively implement that, by taking the safe approach of always waiting for storage acknowledgement when you execute two consecutive ctx.run().

But happy to be proven wrong!

sharkdoodoo2y ago

p10jkle2y ago

ko_pivot2y ago· 3 in thread

Being fairly familiar with Temporal, I definitely appreciate your cleaner architectural choices. Add a Go SDK and I’ll definitely give this a try.

p10jkle2y ago

abtinf2y ago

I think this is a point worth highlighting much more prominently in your marketing.

In my mind, this moved restate from “huh, that’s cool” to “during tomorrow’s standup, I’m going to ask one of my engineers to build a poc.”

caust1c2y ago

Evaluating a selection of these durable workflow SDKs for Go, I'm not keen on being tightly coupled to a vendor and the implementation shouldn't be that crazy to fit into existing Go interfaces.

senorrib2y ago· 2 in thread

stsffap2y ago

senorrib2y ago

The wording in the additional grant labels software like this as an Application Platform Service -- which is fair, and perhaps intended, but we're still not a big cloud service provider.

bilalq2y ago· 2 in thread

Could you share details on limits to be mindful of when designing workflows? Some things I'd love to be able to reference at a glance:

1. Max execution duration of a workflow

2. Max input/output payload size in bytes for a service invocation

3. Max timeout for a service invocation

4. Max number of allowed state transitions in a workflow

5. Max Journal history retention time

stsffap2y ago

4. There is no limit on the max number of state transitions of a workflow in Restate.

5. Restate keeps the journal history around for as long as the invocation/workflow is ongoing. Once the workflow completes, we will drop the journal but keep the completed result for 24 hours.

sewenOP2y ago

For many of those values, the answer would be "as much as you like", but with awareness of the tradeoffs.

So it is less a question of what the system can do, maybe more what you want:

- Execution duration for a workflow is not limited by default. It is more a question of how long you want to keep instances of older versions of the business logic around.

Coming up with the best possible defaults would be something we'd appreciate some feedback on, so would love to chat more on Discord: https://discord.gg/skW3AZ6uGd

bilalq2y ago· 2 in thread

One big hangup for me is that there's only a single node orchestrator as a CDK construct. Having a HA setup would be a must for business critical flows.

I stumbled on Restate a few months ago and left the following message on their discord.

> I'm looking forward to seeing how Restate matures!

p10jkle2y ago

A visualisation/dashboard is a top priority! Distributed architecture (to support multiple nodes for HA and horizontal scaling) is being actively worked on and will land in the coming months

bilalq2y ago

That's exciting!

aleksiy1232y ago· 2 in thread

Looks really awesome. Always been looking for some easy to use async workflows + cronjobs service to use with serverless like Vercel.

Also something about this area always makes me excited. I guess it must be the thought of having all these tasks just working in the background without having to explicitly manage them.

One question I have: does anyone have experience with building data pipelines in this type of architecture?

Does it make sense to fan out on lots of small tasks? Or is it better to batch things into bigger tasks to reduce the overhead.

stsffap2y ago

gvdongen2y ago
