How to share a NumPy array between processes (opens in new tab)

(superfastpython.com)

67 pointsjasonb052y ago16 comments

16 comments

13 comments · 5 top-level

arjvik2y ago· 3 in thread

Good content, but over SEO optimized. Would be nice to hear about the actual efficiency of using these methods.

For instance, does fork() copy the page of memory containing the array? I believe it's Copy-on-Write semantics, right? What happens when the parent process changes the array?

Then, how do Pipe and Queue send the array across processes? Do they also pickle and unpickle it? Use shared memory?

ikhatri2y ago

This is a super common occurrence in training loops for ML models because PyTorch uses multiprocessing for its dataloader workers. If you want to read more, see the discussion in this issue: https://github.com/pytorch/pytorch/issues/13246#issuecomment...

As you’ve pointed out fork() isn’t ideal for a number of reasons and in general it’s preferred to use torch tensors directly instead of numpy arrays so that you are not forced into using fork()

There’s also this write up which I found to be quite useful for details: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multip...

ahoka2y ago

“I believe it's Copy-on-Write semantics, right? What happens when the parent process changes the array?”

Yes and you just answered your question.

boppo12y ago

For a programming novice: Is CoW where a 'child' process operates on a copy of data stored elsewhere in memory, but you have some mechanism to always update the copy any time the parent process modifies/writes to the original data?

1 more reply

KeplerBoy2y ago· 3 in thread

Actually ran into this problem this week, toyed around with multiprocessing.shared_memory (which seems to also rely on mmaped-files, right?) and decided to just embrace the GIL.

Multiprocessing is not needed when all of your handful subprocesses are just calling Numpy-code and release their gil anyways.

Also some/most Numpy functions are multithreaded (depending on the BLAS implementation, linked against), take advantage of that and schedule huge operations and just let the interpreter sit idle waiting for that result.

blitzar2y ago

I have seen this movie before, it always ends the same way.

Look to go parallel for speed increase, burn a bunch of time and decide its way too hard and just wait the extra few minutes instead.

KeplerBoy2y ago

It's actually the opposite. My current job is making a piece of code faster my predecessor tried to make faster using multiprocessing. I now try to pull stuff back into a single process.

The code for example loads a file from disk in a separate process to have the main process available for number crunching. The problem: This data is then sent to the main process and another copy is created which takes time. Doing the same with threads is way faster, since data i/o operations are releasing the gil anyways.

fabioz2y ago

Actually, you get the speed increase if using threads on python with numpy because numpy will release the GIL internally, so, operations using numpy will actually run in parallel (i.e.: c libraries can release the GIL and do get the performance benefits when using multiple processors -- although yes, when you get back to running python code that won't be parallel due to the GIL).

2 more replies

westurner2y ago· 2 in thread

Though deprecated probably in favor of more of a database/DBMS like DuckDB, the arrow plasma store holds handles to objects as a separate process:

  $ plasma_store -m 1000000000 -s /tmp/plasma

Arrow arrays are like NumPy arrays but they're made for zero copy e.g. IPC Interprocess Communication. There's a dtype_backend kwarg to the Pandas DataFrame constructor and read_ methods:

df = pandas.Dataframe(dtype_backend="arrow")

The Plasma In-Memory Object Store > Using Arrow and Pandas with Plasma > Storing Arrow Objects in Plasma https://arrow.apache.org/docs/dev/python/plasma.html#storing...

Streaming, Serialization, and IPC > https://arrow.apache.org/docs/python/ipc.html

"DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB" (2021) https://duckdb.org/2021/12/03/duck-arrow.html

Simon_O_Rourke2y ago

What's the latency delta of reading from a local database instance versus reading from memory - is there much additional overhead to slow things down?

westurner2y ago

With most databases, a `SELECT * FROM tblname` has to serialize a copy of the query result and let the ODBC or similar database driver unserialize, so the query result unavoidably gets copied at least once at query time.

Plasma and e.g. DuckDB do zero copy so that there is no: unmarshal of the complete database file into RAM, table scan in order to unindexed query, and then marshal/unmarshal (serialize/deserialize) of the query result for the database driver (which could allow access to a DB/DBMS over a local pipe(), a TCP socket, HTTP(S), protobufs over HTTPS)

If the memory is held in RAM by the plasma store, I'm not sure which queries are possible to service by handing an object reference to where the full non-virtual table is allocated in RAM (on only one node)? Presumably if there's no filtering or transformation in the query, and the query does not access "virtual" or "materialized" tables

pplonski862y ago

I was searching for similar article. Im working on AutoML python package where I use different packages to train ML models on tabular data. Very often the memory is not properly released by external packages so the only way to manage memeory is to execute training in separate processes.

schoetbi2y ago

There is also Apache Arrow that uses a similar use case. Maybe this is worth considering: https://arrow.apache.org/docs/python/memory.html#on-disk-and...

j / k navigate · click thread line to collapse