undefined | Better HN

0 pointsdevxpy7y ago0 comments

I believe that since the Advent of zeromq, parallelism is possible in almost any language, including python.

My library lets you do parallelism in a unique way, where you do message passing parallelism without being explicit about it.

https://github.com/pycampers/zproc/

0 comments

16 comments · 3 top-level

TickleSteve7y ago· 6 in thread

You make some extremely large claims about ZProc, what advantages does it have over every other message-passing library for every other language ever built? (including the other zeromq bindings?)

TBH, you're claims sound like you've just "discovered" message-passing, of which many, many languages, runtimes and operating systems have been using for many years/decades. (https://en.wikipedia.org/wiki/Message_passing)

In other words... its not a revolution.

ZProc seems to simply be a simple library to pickle data structures thru a central (pubsub?) server.

This is not the way to get remotely close to "high performance". What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved).

zbentley7y ago

> What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved)

Minor point of pedantry which I'll state because it's an often-overlooked timesaver for folks developing on multiprocessing: not only is MP potentially faster for transferring data between processes compared to this solution, but it can also be way, way faster in situations where you have all your data before creating your processes/pool and just want to farm it out to your MP processes without waiting for it all to be chunked/pickled/unpickled.

Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.

This pattern can be used to totally bypass all considerations of performance/CPU/etc. for pickling/unpickling data and lends a massive speed boost in certain situations--e.g. a massive dataset is read into memory at startup, and then ranges of that dataset are processed in parallel by a pool of MP processes, each of which will return a relatively small result-set back to the parent, or each of which will write its processed (think: data scrubbing) range to a separate file which could be `cat`ed together, or written in parallel with careful `seek` bookkeeping.

Unix-ish OSes only, though (unless the fork() emulation in WSL works for this--I have not tested that).

* Technically it's O(N) for the size of data you have in memory at process pool start, because fork() can take time, but the multiplier is small enough in practice compared to sending data to/from MP processes via queues or whatever that it might as well be constant.

blattimwind7y ago

> Because of copy-on-write fork magic

Note that this works for big objects, but not for small objects. E.g. if you fork-share a large list of integers or dicts or something like that, then you don't get any memory usage benefits, because every access will cause a refcount-write and that will copy the whole page containing the object.

> * Technically it's O(N) for the size of data you have in memory at process pool start

It's not quite that simple; sharing n pages can take very little time or a bit more time; it depends on how the pages are mapped; sharing a large mapping doesn't take longer than a small mapping.

1 more reply

srean7y ago

> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.

Have you tried this or got it working ? The fly in the ointment is the reference count. Add a reference and BOOM you suddenly have a huge copy. It can be made to work efficiently in certain cases but takes a lot of care.

2 more replies

devxpyOP7y ago

Performance doesn't equal Better software.

In fact, I think Performance centric development is a lesser known evil.

> have all your data before creating your processes/pool

Zproc exposes the required API for this (Nothing new, just the python API) :)

https://zproc.readthedocs.io/en/latest/api.html#zproc.Proces... (args and kwargs)

> a massive dataset

Wouldn't you be better off using a Database for that kind of work?

> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant

time

Any resources on how to implement that?

1 more reply

devxpyOP7y ago

> "high performance"

I never claimed it to be performant!

"Above all, ZProc is written for safety and the ease of use."

(Read here - https://github.com/pycampers/zproc?files=1#faq)

> It's not a revolution

I totally agree. It's just a better way of doing things zmq already perfected. Like, tell me if you've ever seen a python object that has a `dict` API, but does message passing in the background.

> central (pubsub?) server.

Central server, yes. It uses PUB-SUB for state watching and REQ-REP for everything else.

> you've just "discovered" message-passing

Guess you're right? 2 years is a peanut on the time scale...

P.S. Thanks for all the feedback, I've been dying to hear something for a while now.

TickleSteve7y ago

I would suggest you don't make dramatic claims for a subject that has decades of theory behind it with a huge amount of nuance depending on the exact workload and characteristics of the machines in question.

Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism. If you wish to know more, investigate:

- Smalltalk and Erlang (for message passing languages).

- QNX (for a message-passing OS)

- mpiPY (for a message-passing Python library, mpi is the grandfather of message passing libraries that runs everywhere).

- Occam & the transputer for an example of a hardware-mp implementation (actually its Communicating Sequential Processes, but for your purposes it would be enlightening).

- golang for a modern-day implementation of CSP.

- Python implementation of CSP (https://github.com/futurecore/python-csp)

- Discussion about MP (http://wiki.c2.com/?MessagePassingConcurrency, for more just google it)

Basically, its great that you want to learn about concurrency & parallelism, but you've come to a gun fight with a blunt butter knife.

2 more replies

adw7y ago· 4 in thread

> My library lets you do parallelism in a unique way

That's a big claim which you don't really back up as much as you need to. Unique is an extremely high bar in this very busy field.

There are several other similar red flags on the linked GitHub; I think your enthusiasm is running away from you a little. You might want to dial the ten-dollar language back a bit – it made me immediately suspicious ("utterly perfect", for example is another danger phrase).

It's the combination of grandiose language + solution-in-search-of-a-problem which leads to that.

If you're going to sell hard, what I would want to see is a large, complex, high-traffic system which makes extensive use of this; if you compare and contrast with Ray, which I've also only just encountered in this thread, there's a real problem (distributed hyperparameter optimization) which they've built a solution for with the library, and that immediately lends it credibility; I know the system can be used for something because it has been.

devxpyOP7y ago

'utterly perfect ' are not my words

http://zguide.zeromq.org/page:all#Multithreading-with-ZeroMQ

Thought linking it there would make it better, but I'll just remove it...

And you do make a good point. It doesn't really solve anything technically. But would you agree that it exposes a better API for doing much of the same stuff?

adw7y ago

I wouldn't know without using it. That's where "software using this library" is a really useful bit of social proof. Think of Django; even without looking at the code you have a lot of evidence that it can conveniently solve a wide range of real problems.

1 more reply

devxpyOP7y ago

Update - I hopefully made the language a little better?

https://github.com/pycampers/zproc/blob/master/README.md

adw7y ago

I think this is quite a lot better! Nice work.

heavenlyblue7y ago· 3 in thread

>> Zproc uses a Server, which is responsible for storing and communicating the state. >> >> This isolates our resource (state), eliminating the need for locks.

So you've just invented a new name for a coordinator process and called it a new fashion in computation?

TickleSteve7y ago

No, he's reinvented multiprocessing... pickling data structures across multiple processes.

Just without the 'niceties'.

zbentley7y ago

You're probably right, but see my comment above: not only is MP possibly superior at being a picking/arbitrating server, but it also supports taking advantage of copy-on-write semantics on Unix-ish systems to transfer memory to children at startup in constant time with no pickling/unpickling necessary.

1 more reply

devxpyOP7y ago

Great, now just "stating" how things work is equivalent to inventing them!

j / k navigate · click thread line to collapse

0 comments

16 comments · 3 top-level

TickleSteve7y ago· 6 in thread

You make some extremely large claims about ZProc, what advantages does it have over every other message-passing library for every other language ever built? (including the other zeromq bindings?)

In other words... its not a revolution.

ZProc seems to simply be a simple library to pickle data structures thru a central (pubsub?) server.

zbentley7y ago

> What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved)

Unix-ish OSes only, though (unless the fork() emulation in WSL works for this--I have not tested that).

blattimwind7y ago

> Because of copy-on-write fork magic

> * Technically it's O(N) for the size of data you have in memory at process pool start

It's not quite that simple; sharing n pages can take very little time or a bit more time; it depends on how the pages are mapped; sharing a large mapping doesn't take longer than a small mapping.

1 more reply

srean7y ago

2 more replies

devxpyOP7y ago

Performance doesn't equal Better software.

In fact, I think Performance centric development is a lesser known evil.

> have all your data before creating your processes/pool

Zproc exposes the required API for this (Nothing new, just the python API) :)

https://zproc.readthedocs.io/en/latest/api.html#zproc.Proces... (args and kwargs)

> a massive dataset

Wouldn't you be better off using a Database for that kind of work?

> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant

time

Any resources on how to implement that?

1 more reply

devxpyOP7y ago

> "high performance"

I never claimed it to be performant!

"Above all, ZProc is written for safety and the ease of use."

(Read here - https://github.com/pycampers/zproc?files=1#faq)

> It's not a revolution

I totally agree. It's just a better way of doing things zmq already perfected. Like, tell me if you've ever seen a python object that has a `dict` API, but does message passing in the background.

> central (pubsub?) server.

Central server, yes. It uses PUB-SUB for state watching and REQ-REP for everything else.

> you've just "discovered" message-passing

Guess you're right? 2 years is a peanut on the time scale...

P.S. Thanks for all the feedback, I've been dying to hear something for a while now.

TickleSteve7y ago

Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism. If you wish to know more, investigate:

- Smalltalk and Erlang (for message passing languages).

- QNX (for a message-passing OS)

- mpiPY (for a message-passing Python library, mpi is the grandfather of message passing libraries that runs everywhere).

- Occam & the transputer for an example of a hardware-mp implementation (actually its Communicating Sequential Processes, but for your purposes it would be enlightening).

- golang for a modern-day implementation of CSP.

- Python implementation of CSP (https://github.com/futurecore/python-csp)

- Discussion about MP (http://wiki.c2.com/?MessagePassingConcurrency, for more just google it)

Basically, its great that you want to learn about concurrency & parallelism, but you've come to a gun fight with a blunt butter knife.

2 more replies

adw7y ago· 4 in thread

> My library lets you do parallelism in a unique way

That's a big claim which you don't really back up as much as you need to. Unique is an extremely high bar in this very busy field.

It's the combination of grandiose language + solution-in-search-of-a-problem which leads to that.

devxpyOP7y ago

'utterly perfect ' are not my words

http://zguide.zeromq.org/page:all#Multithreading-with-ZeroMQ

Thought linking it there would make it better, but I'll just remove it...

And you do make a good point. It doesn't really solve anything technically. But would you agree that it exposes a better API for doing much of the same stuff?

adw7y ago

1 more reply

devxpyOP7y ago

Update - I hopefully made the language a little better?

https://github.com/pycampers/zproc/blob/master/README.md

adw7y ago

I think this is quite a lot better! Nice work.

heavenlyblue7y ago· 3 in thread

>> Zproc uses a Server, which is responsible for storing and communicating the state. >> >> This isolates our resource (state), eliminating the need for locks.

So you've just invented a new name for a coordinator process and called it a new fashion in computation?

TickleSteve7y ago

No, he's reinvented multiprocessing... pickling data structures across multiple processes.

Just without the 'niceties'.

zbentley7y ago

1 more reply

devxpyOP7y ago

Great, now just "stating" how things work is equivalent to inventing them!

j / k navigate · click thread line to collapse