For example, some high-level options include Popen, multiprocessing.Process, multiprocessing.Pool, futures.ProcessPoolExecutor, and huge frameworks like Ray.
multiprocessing.Process includes some pickling magic, and you can pick between multiprocessing.Pipe and multiprocessing.Queue, but you need to use either multiprocessing.connection.wait() or select.select() to also watch the process sentinel in case the process crashes. Which one? Well, connection.wait() will not be interrupted by an OS signal. It's unclear why I would ever use connection.wait() then. Is there some tradeoff I don't know about?
For my use cases, process reuse would have been nice, to hold on to network connections and such (useful even for a single process). Then you're looking at either multiprocessing.Pool or futures.ProcessPoolExecutor. They're very similar, except some bug fixes have gone into futures.ProcessPoolExecutor but not multiprocessing.Pool because...??? For example, if your subprocess exits uncleanly, multiprocessing.Pool will just hang, whereas futures.ProcessPoolExecutor will raise a BrokenProcessPool and the pool will refuse to do any more work (both of these are unreasonable behaviors IMO). Timing out and forcibly killing the subprocess is its own adventure with each of these too. After some time period passes I no longer care about the result, and the worker may be stuck in C code, so I just want to whack the process and move on, but that is far from trivial with these.
What a nightmarish mess! So much for "There should be one--and preferably only one--obvious way to do it"...my God.
(I probably got some details wrong in the above rant, because there are so many to keep track of...)
My learning: there is no "easy way to [process] parallelism" in Python. There are many different ways to do it, and you need to know all the nuances of each and how they address your requirements to know whether you can reuse existing high-level impls or you need to write your own low-level impl.
Process is low-level and is almost never what you want. Pool is "mid-level", and usually isn't what you want. ProcessPoolExecutor is usually what you want, it is the "one obvious way to do it". That's not at all clear from the docs though.
The one obvious way to do it, in general, is: subprocess.run for running external processes, subprocess.Popen for async interaction with external processes, and concurrent.futures.ProcessPoolExecutor for Python multiprocessing.
Your other complaints about actually using the multiprocessing stuff are 100% valid. Error handling, cancellation, etc. is all very difficult. Passing data back and forth between the main process and subprocesses is not trivial.
But I do want to emphasize that there is a somewhat-well-defined gradient of lower- and higher-level tools in the standard library, and your "obvious way to do it" should usually start at the higher end of that gradient.
You might also want to look into the third-party Joblib library, which makes process parallelism a lot less painful for the straightforward use case of "run a function on a large amount of data, using multiple OS processes."
Imagining I'm a newbie to Python concurrency, I Googled "concurrency in Python" and picked the first result from the official docs: https://docs.python.org/3/library/concurrency.html It's a list of everything except asyncio, and the first item on the list is the low-level `threading` :S At least that page mentions ThreadPoolExecutor, queue, and asyncio as alternatives, but I'm still lost on what the correct way is.
You didn't mention the recommended high-level option for subprocesses, `subprocess.run`.
There are other things I didn't mention that get thrown around too such as os.system() and os.fork().
Content: basically how to use ThreadPoolExecutor
Comment: Concurrency and parallelism aren't easy in Python.
How is this off-topic?
I like Python in general, but I avoid it for any kind of concurrent programming other than simple fan-out-fan-in.
I use multiprocessing and I am looking forward to the GIL removal.
I would really like library writers and parallelism experts to think about modelling computation in such a way that arbitrary programs, written in this notation, can be sped up without async, parallelism, or low-level synchronization primitives spreading throughout the codebase and increasing its cognitive load for everybody.
If you're doing business programming and using Python threads or processes directly, I think we're operating at the wrong level of abstraction, because our tools are not sufficiently abstract. (It's not your error; it's just not ideal where our industry is at.)
I am not an expert, but parallelism, coroutines, and async are a hobby I journal about all the time. I think a good approach to parallelism is to split your program into a tree dataflow and never synchronize. Shard everything.
If I have a single integer value whose update throughput I want to scale across the hardware threads of my multicore, SMT CPU, I can split the integer by that number and apply updates in parallel. (If you have £1000 in a bank account and 8 hardware threads, you split the account into 8 sub-accounts holding £125 each; then you can serve 8 transactions simultaneously.) Periodically, those threads can post their values to another buffer (a ringbuffer), and a thread servicing that ringbuffer can sum them all for a global view. This provides an eventually consistent view of the integer without slowing down throughput.
Unfortunately multithreading becomes a distributed system and then you need consensus.
I am working on barriers inspired by bulk synchronous parallel where you have parallel phases and synchronization phases and an async pipeline syntax (see my previous HN comments for notes on this async syntax)
My goal would be that business logic can be parallelised without you needing to worry about synchronization.
Of course that's not true for everything, and depending on the domain tree dataflows can also be great. I remember them being very popular in GPGPU tasks because synchronization is very costly there.
Probably 20% of the effort shown in this post could have been expended to just write something very similar in Golang, and it would have taken less time, too. Because the way I see it this is trying to emulate futures / promises (and it looks like it's succeeding, at least on the surface). That can spiral out of comfortable maintainable code territory pretty quickly.
But especially for something as trivial as a crawler, I don't see the appeal of Python. There are plenty of languages with lower friction for doing parallel stuff nowadays (Golang, Elixir, Rust if you want to cry a bit, hell, even Lua has some parallel libraries nowadays, Zig, Nim...).
It happened to me and many other former colleagues.
Though obviously, everyone decides for themselves when that point comes -- or if it comes at all.
What TFA doesn't say is that process pools are quite fragile: certainly on Mac and Windows, but on Linux too. They rely on pickling, which is also fragile.
That said, asyncio works surprisingly well if what you want is non-blocking execution and you're happy with one CPU. But no parallel speedup.
I wish Python had similar solutions.
Parallelism in a Notebook isn't for everyone, but how would these changes affect it?
Delusion level: max.
You have to be in a very, very bad place when this marginal improvement over the absolute horror-show that bare Process offers seemed "pretty decent".
Python doesn't have good tools for parallelism / concurrency. It doesn't have average tools. It doesn't have even bad tools. It has the worst. Though, unfortunately, it's not the only language in this category :(
> It's not the only language in this category
Soo....not the worst? :) Or tied for it?
What do you find difficult/wrong with pool executors?
Also, you reference "Process", but FYI the article talks about multiple threads, not multiple processes.
And they're still the worst version of this pattern: despite using multiple OS-level threads, with all the associated overhead, the GIL prevents most real parallelism from happening. And if you want full parallelism, you have to use multiprocessing.Pool, which adds pickling overhead and compatibility problems.
Yeah... I know, it's hard to imagine that there could be more than one worst. But, as I have to practice these things with my 4 year old, I become more patient with adults who don't get the concept too.
Imagine you are in a class and the teacher gives everyone a pencil and a sheet of paper. Now you want to find out who has the shortest pencil. All the students compare their pencils, and it turns out that several pencils are exactly the same length, and those are simultaneously the shortest ones. So more than one student has the shortest pencil.
But it doesn't end there. Not every set with a "greater than" relation is totally ordered. In such sets it's possible to have multiple distinct minimal elements. Trivially, in a set with no order at all, every element is minimal.
> What do you find difficult/wrong with pool executors?
Difficult? -- I don't know.
Wrong? -- Well, it's pretty worthless... does it make it wrong? -- That's up to you to decide.
The idea of threads is bad for many reasons; one in particular is how exceptions in threads are handled. But this isn't unique to Python. Python just made a bad decision to use threads in a language that's supposed to be "safe". Python's thread implementation craps its pants when dealing with many aspects of threads. For example, thread-local variables: since threads are objects in Python, you'd expect thread-local variables to be properties on those objects... well, the actual mechanism is just idiotic and nothing like what you would expect. When it comes to interacting with "native" code from Python, you'd expect some interaction with Python's scheduler so that the native code can portion out its own execution, allow Python to interrupt it, etc., but there's nothing of the kind.
Even though we haven't even gotten to pools yet: pools, obviously, don't address any of the thread-related problems. If anything, they amplify them. Specifically, the pool from the concurrent.futures package is worse than its relative from the multiprocessing package because it uses "futures". The whole idea of "futures" is somehow broken in Python because of the never-ending bugs related to deadlocking. It's been repeatedly "fixed", but every now and then deadlocks still happen. Here's the latest one I know of: https://bugs.python.org/issue46464 .
I've gone down the rabbit hole of trying to make a native module work with Python threads once... there's no good way to do it, and pools, be they from concurrent.futures or from multiprocessing, are both very bad for many reasons. I was hoping to give users the ability to control how parallel my native code is through the tools Python already exposes, but that turned out to be such a disaster that I've given up on the idea. Python's thread wrappers are worthless for native code that wants to actually execute concurrently; they are only designed to execute Python code, non-concurrently. Like I already mentioned, Python has no infrastructure to communicate its scheduling decisions to native code, no thread-safety in memory allocation, and the code is overall poorly written (as in: missing const, other imprecise typing, memory-inefficient data structures)... there are no benefits to using it vs rolling your own. Only struggle with bad decisions.
How do you rank C, Perl, JavaScript, PHP, ... parallelism compared to execution pool + futures here? The absolute MAX WORST?
Trivially, in a collection that has no "worse than" relation you can define one that doesn't compare them at all, and declares them all "incomparable" -- which, again, would make them all worst.
Bonus question: can you imagine a collection where there is no worst element?
> How do you rank C, Perl, JavaScript, PHP
Well, none of these languages has its own parallelism / concurrency story. (Except Perl 5, maybe? I'm not really familiar with the language.) They all rely on the system running them to do the parallelism.
So... all of these will go roughly into the same bin as Python?
Some languages have libraries that would allow them to do better (eg. you have PThreads in C), but that's not the function of the language.
C and Java threads are better than Python because, uh, they can actually run in parallel. Rust adds convenience and safety on top, plus its own event loops. Golang has Goroutines. Erlang has some very powerful solution that I don't remember.
IDK about PHP and Perl, barely touched them. Maybe they're worse than Python for this. Everything else isn't. Python was not originally built with these use cases in mind, which is totally fine, but I'm not going to pick Python if I'm doing complex concurrency/parallelism. For simple process pools, Python is good enough.
> When a request is waiting on the network, another thread is executing.
I'm guessing this is the meat, but what controls that? What other operations allow the GIL to switch to another thread?
All blocking I/O functions in the standard library do this: they release the GIL while blocked, so other threads can run.
    from multiprocessing import Pool

    def f(x):
        return x*x

    if __name__ == '__main__':
        with Pool(5) as p:
            print(p.map(f, [1, 2, 3]))
This forks 5 worker processes, and f(x) runs fully in parallel across the inputs. The inputs and outputs are sent between processes via pickling.