Distributed coroutines with a native Python extension and Dispatch (opens in new tab)

(stealthrocket.tech)

47 pointspelletier2y ago29 comments

29 comments

22 comments · 8 top-level

faizshah2y ago· 5 in thread

I feel like temporal’s approach of replaying the computation is probably better than trying to serialize the running computation. Serializing the running coroutine gets ugly as is shown here with things like file handles and making pickle a central part of your compute platform is a little scary from an AppSec pov.

That being said, I like the idea and the blog post is wonderfully written.

achille-roussel2y ago

Dispatch takes a different approach with different trade-offs, but you're right that capturing the local program scope brings interesting challenges.

Serializing file handles doesn't work, but in our experience, programs rarely run into constructs where this becomes a problem, and when it happens, there are mitigation measures that are usually easy to implement (small restructure of the program, capturing resource metadata to reconstruct them later, etc...).

We have a few features on the roadmap to help mitigate the security implications as well, including allowing users to store their program state in a S3 bucket that they own. Our scheduler can operate with only metadata about the program, so splitting the two can be an effective model to mitigate those risks.

richieartoul2y ago

Storing the data in the customer's S3 bucket is a really good idea IMO. Means they just have to trust you to run the coordination reliably, but not necessarily to keep their data secure which is a much easier bar to clear.

traverseda2y ago

What about combining it with rpyc for network transparent object proxies? For resources that can't be picked.

sitkack2y ago

With the right abstractions, running computation is just data.

paulddraper2y ago

They abstraction being the universe.

kgeist2y ago· 5 in thread

I understand that it's a very interesting engineering problem - to create distrubuted coroutines and make them work in Python, but I'm not sure I understand why I would want to use this concept in our projects. I understand the idea of coroutines as a way to increase throughput (the CPU isn't stalled waiting for I/O to finish) - Go already has that. However, what are the advantages of rescheduling an already running coroutine on a different machine? Shouldn't it be the job of the load balancer, to choose the least busy machine, before any coroutine is run? Isn't it expensive to serialize coroutines and transfer them between machines? Also, I wonder what observability is like. If a coroutine crashes, what its stacktrace will look like?

zelphirkalt2y ago

> However, what are the advantages of rescheduling an already running coroutine on a different machine?

Your original machine might not have the required computational resources to run the job quickly enough.

> Shouldn't it be the job of the load balancer, to choose the least busy machine, before any coroutine is run?

It is not always simple or possible to know, when what part of a computation will be finished and as such cannot easily be perfectly planned. If you mean a load balancer as in traefik or similar, then that entails serializing your intermediate results or writing your code in a specific way, so that computation can be split using a load balancer.

> Isn't it expensive to serialize coroutines and transfer them between machines?

Probably, but not taking advantage of idle cores on another machine might be more expensive (in terms of time needed to finish the computation).

Also, I wonder what observability is like. If a coroutine crashes, what its stacktrace will look like?

No idea, have not used it.

This kind of thing is what Erlang excels at. Serializing functions and their entire environments is a difficult to solve problem. I think in Python it probably means moving into a whole different space of types and objects, because Python has not been developed with such a thing in mind from the start, while Erlang has.

kgeist2y ago

>Probably, but not taking advantage of idle cores on another machine might be more expensive (in terms of time needed to finish the computation).

If there are idle cores on another machine, why not just point the load balancer to schedule new jobs/coroutines on those idle machines? Why reschedule existing coroutines on another machine if they run just fine on the current node? (i.e. the node is at full capacity, and it's fine) I can see that the problem can arise if a coroutine wants to spawn new coroutines, and if we schedule them on the same node which is at full capacity, they can end up waiting for a while before they are run... It makes sense to schedule new jobs on different machines, but why reschedule existing coroutines? You have to serialize/deserialize stuff and send over the wire, and there are several gotchas as explained in the article (pickling arbitrary objects doesn't sound very reliable/safe, and judging from the article, can break in future versions of Python), plus there's a lot of magic involved which will probably make it harder to investigate things in production... I'd personally just stick with local coroutine sheduler + global load balancer which picks nodes for new jobs + coroutines, when created, only receive plain values as arguments (string, int, float and basic structures) to be reliably serializable and so that it was transparent what's going on... (i.e. do NOT store internal coroutine state, assume they are transient). Maybe I don't understand the idea.

1 more reply

achille-roussel2y ago

Your comment on this being solved by inventing new programming languages like Erlang is right on point.

Our take is that distributed coroutines can be brought to general-purpose programming languages like Python so that engineering teams can adopt these features incrementally in their existing applications instead of adding a whole new language or framework.

In my opinion, the value is in being able to reuse the software tools and processes you're familiar with, major shifts are rarely the right call.

achille-roussel2y ago

> Go already has that. However, what are the advantages of rescheduling an already running coroutine on a different machine?

Say your coroutine is performing a transactional operation with multiple steps, but for some reason the second step starts getting rate limits from an upstream API, you need to enter a retry loop to reschedule the operation a bit later. What if your program stops then? Without capturing the state somewhere, it remains volatile in memory and the operation would be lost when the program stops.

This scenario occurs in a wide range of systems, from simple async jobs for email delivery to complex workflows with fan-out and fan-in.

> Isn't it expensive to serialize coroutines and transfer them between machines?

We only capture the local scope of the call stack, so the amount of data that we need to record is pretty small and stays in the order of magnitude of the data being processed (e.g., the request payload).

> I wonder what observability is like. If a coroutine crashes, what its stacktrace will look like?

We maintain stack traces across await points of coroutine calls, even if the concurrent operations are scheduled on different program instances. When an error occurs, the stack traces show the entire code path that led to the error.

There's definitely a lot of opportunities for creating powerful introspection tools, but since Dispatch is just code, you can also use your typical metrics, logs, and traces to instrument your distributed coroutines.

kgeist2y ago

>We only capture the local scope of the call stack, so the amount of data that we need to record is pretty small and stays in the order of magnitude of the data being processed

What if the local stack references a deep object hierarchy? The classical "I wanted a banana, but got the gorilla holding the banana and the entire jungle"

>What if your program stops then? Without capturing the state somewhere, it remains volatile in memory and the operation would be lost when the program stops.

You persist coroutines on disk?

1 more reply

gigatexal2y ago· 2 in thread

I don't understand how this is priced -- or how it works. Why do I need to sign up with my Github account? Why do I need an API key?

achille-roussel2y ago

We've just launched in developed preview, so no pricing yet, but this is coming!

A good starting point to answer your other questions is the "getting started" section in our documentation https://docs.dispatch.run/dispatch/getting-started

If you have more questions, join our Discord! We're always happy to chat and get feedback on what's difficult to grasp.

gigatexal2y ago

For me this is DOA unless it can be run locally without having to call into some hosted API in the cloud.

I took a gander at the source I'll wait till/if there's ever a 100% local version of this. But good luck all the same!

1 more reply

halfcat2y ago· 1 in thread

When would one want to use this, compared to, say, Celery?

achille-roussel2y ago

Celery is one of those popular solutions that is commonly used. However, it comes with a complex framework for describing workflows, which is often difficult to master, especially for younger engineers.

With Dispatch, you just write code, as if you would write a local program, and our scheduler can manage the distribution and state of the execution. Testing the code especially gets really simple.

Having a hosted scheduler also means that you don't need to deploy any extra infrastructure like RabbitMQ or Redis.

theharshpat2y ago· 1 in thread

Interesting! I read some blogs. How "dispatch" compares to something like temporal?

achille-roussel2y ago

Thanks for joining our Discord and asking the question there as well! I'll reply here as well for the HN community.

People have often tried to use Temporal to solve similar problems. We think Dispatch is a very different solution because we don’t require you to architect your system in terms of a rigid framework of activities and workflows... You just write regular code with simple functions, enabling incremental adoption into existing applications and making the code easy to test.

A friend of mine once described Dispatch as "Temporal without the galaxy brain", it's a pretty good way of phrasing it in my opinion :)

pelletierOP2y ago

This is part 2 of https://news.ycombinator.com/item?id=39696423

pryz2y ago

Distributed coroutines are a perfect fit for Python! We're excited to explain how they work and what you can do with them!

Also happy to answer any questions :)

sitkack2y ago

How does this compare to systems like Ray and Dask?

j / k navigate · click thread line to collapse

29 comments

22 comments · 8 top-level

faizshah2y ago· 5 in thread

That being said, I like the idea and the blog post is wonderfully written.

achille-roussel2y ago

Dispatch takes a different approach with different trade-offs, but you're right that capturing the local program scope brings interesting challenges.

richieartoul2y ago

traverseda2y ago

What about combining it with rpyc for network transparent object proxies? For resources that can't be picked.

sitkack2y ago

With the right abstractions, running computation is just data.

paulddraper2y ago

They abstraction being the universe.

kgeist2y ago· 5 in thread

zelphirkalt2y ago

> However, what are the advantages of rescheduling an already running coroutine on a different machine?

Your original machine might not have the required computational resources to run the job quickly enough.

> Shouldn't it be the job of the load balancer, to choose the least busy machine, before any coroutine is run?

> Isn't it expensive to serialize coroutines and transfer them between machines?

Probably, but not taking advantage of idle cores on another machine might be more expensive (in terms of time needed to finish the computation).

Also, I wonder what observability is like. If a coroutine crashes, what its stacktrace will look like?

No idea, have not used it.

kgeist2y ago

>Probably, but not taking advantage of idle cores on another machine might be more expensive (in terms of time needed to finish the computation).

1 more reply

achille-roussel2y ago

Your comment on this being solved by inventing new programming languages like Erlang is right on point.

In my opinion, the value is in being able to reuse the software tools and processes you're familiar with, major shifts are rarely the right call.

achille-roussel2y ago

> Go already has that. However, what are the advantages of rescheduling an already running coroutine on a different machine?

This scenario occurs in a wide range of systems, from simple async jobs for email delivery to complex workflows with fan-out and fan-in.

> Isn't it expensive to serialize coroutines and transfer them between machines?

> I wonder what observability is like. If a coroutine crashes, what its stacktrace will look like?

kgeist2y ago

>We only capture the local scope of the call stack, so the amount of data that we need to record is pretty small and stays in the order of magnitude of the data being processed

What if the local stack references a deep object hierarchy? The classical "I wanted a banana, but got the gorilla holding the banana and the entire jungle"

>What if your program stops then? Without capturing the state somewhere, it remains volatile in memory and the operation would be lost when the program stops.

You persist coroutines on disk?

1 more reply

gigatexal2y ago· 2 in thread

I don't understand how this is priced -- or how it works. Why do I need to sign up with my Github account? Why do I need an API key?

achille-roussel2y ago

We've just launched in developed preview, so no pricing yet, but this is coming!

A good starting point to answer your other questions is the "getting started" section in our documentation https://docs.dispatch.run/dispatch/getting-started

If you have more questions, join our Discord! We're always happy to chat and get feedback on what's difficult to grasp.

gigatexal2y ago

For me this is DOA unless it can be run locally without having to call into some hosted API in the cloud.

I took a gander at the source I'll wait till/if there's ever a 100% local version of this. But good luck all the same!

1 more reply

halfcat2y ago· 1 in thread

When would one want to use this, compared to, say, Celery?

achille-roussel2y ago

With Dispatch, you just write code, as if you would write a local program, and our scheduler can manage the distribution and state of the execution. Testing the code especially gets really simple.

Having a hosted scheduler also means that you don't need to deploy any extra infrastructure like RabbitMQ or Redis.

theharshpat2y ago· 1 in thread

Interesting! I read some blogs. How "dispatch" compares to something like temporal?

achille-roussel2y ago

Thanks for joining our Discord and asking the question there as well! I'll reply here as well for the HN community.

A friend of mine once described Dispatch as "Temporal without the galaxy brain", it's a pretty good way of phrasing it in my opinion :)

pelletierOP2y ago

This is part 2 of https://news.ycombinator.com/item?id=39696423

pryz2y ago

Distributed coroutines are a perfect fit for Python! We're excited to explain how they work and what you can do with them!

Also happy to answer any questions :)

sitkack2y ago

How does this compare to systems like Ray and Dask?

j / k navigate · click thread line to collapse