Presentation on the new Python GIL

(dabeaz.com)

77 points · voxcogitatio · 16y ago · 27 comments


fauigerzigerk · 16y ago · 5 in thread

I have a slightly weird and probably completely unworkable "idea" that has nothing to do with this particular GIL improvement but with the broader GIL issue.

The problem with removing the GIL seems to be garbage collection. I wonder why it's not possible to introduce a new type of object reference that exempts referenced objects from being garbage collected.

Then multiple Python interpreters could be started in separate threads, each with its own unmodified GIL, and the only thing they could access would be their own thread local data and those special shared objects.

What this amounts to is basically an implementation of a multi process architecture on top of a multi threaded architecture. The crucial difference is that the memory shared among the interpreters could hold pointers and thus proper in memory data structures and not just a BLOB into which everything has to be serialized as in the case of conventional shared memory.

Of course the shared objects would have to be manually deleted.

One issue I see is that when such a special object is created, all the objects it creates recursively would also have to be allocated in that pool of special objects. But I think it should be possible to use some kind of global flag to special case the allocator.

Well, there are probably a huge number of issues with this kind of trickery and it's definitely not a long-term solution. But I'd love to use Python much more than I do, and the GIL issue is what prevents that for me at the moment.

haberman · 16y ago

> The problem with removing the GIL seems to be garbage collection.

I'm not sure this is the real problem; after all, reference counts can be atomically incremented and decremented.

The real problem is that primitive operations in Python (like "foo.bar") cannot safely be performed in C without locking, because you need the hash table to remain consistent while you are doing lookups and/or insertions. This forces you to either wrap all such operations in locks (which has been tried, and slows down the single-threaded case by something like 2x) or reimplement them with lock-free data structures. The latter could be an interesting experiment; you could probably implement a tree-based map using RCU.

wmf · 16y ago

I think Erlang and MS Singularity use this technique to some extent.

http://doi.acm.org/10.1145/1029873.1029875 http://doi.acm.org/10.1145/1146809.1146813 http://research.microsoft.com/apps/pubs/default.aspx?id=6748...

thwarted · 16y ago

As does perl. You need to declare variables as shared, there is apparently some performance hit, but it's mostly localized to the shared variables, so reasonably manageable. I've actually found it easier to work with than implicit shared data across all threads because you actually need to think about the exposed surface area and how you can minimize it.

viraptor · 16y ago

People write (http://www.julmar.com/blog/mark/PermaLink,guid,3670d081-0276...) that even .NET itself has a similar kind of GC (partially concurrent, only address fixup stops the world). Unfortunately I can't check how it compares to Singularity.

Erwin · 16y ago

I'm pretty sure someone built a Python like that, but it did not work out. Someone whose name I don't recall from the #python freenode channel. I think perhaps C modules were one of the issues, or sharing of global scopes (which then need fine-grained locks to synchronize access).

BTW, there's one multi-processing shared objects module here: http://poshmodule.sourceforge.net/

st3fan · 16y ago · 5 in thread

"""Does it work? Yes it's better: Sequential 23.5 seconds. Threaded 24.0 seconds."""

How is that better? It is an improvement over earlier versions of Python. Better would be if the threaded version would actually be twice as fast.

mitchellh · 16y ago

If you look at the code he used to test this, it's completely CPU-bound, meaning that it requires the GIL to run. And since the GIL is, well, a "GIL," that means that only one thread can run at any single time.

Therefore, the expected behavior of running these two threads would be to have identical runtimes. That would be the "perfect" case with a GIL. So with this "better" GIL it's actually near-perfect in this _simple_ case. (Like the presenter said, there need to be more tests to study the behavior under heavier tasks.)

What you're describing, with the threaded being "twice as fast" assumes that the threads will allow two python-instruction-bound tasks to run concurrently, which would require having no GIL.
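A minimal sketch of the kind of test being discussed, assuming a GIL build of CPython. The names `count_down`, `time_sequential`, and `time_threaded` and the loop size are made up for illustration and are not taken from the presentation. Under a well-behaved GIL the two timings would be expected to come out roughly equal, rather than the threaded run being twice as fast.

```python
import threading
import time


def count_down(n):
    """Pure-Python busy loop: CPU-bound, so it needs the GIL the whole time."""
    while n > 0:
        n -= 1
    return n


def time_sequential(n):
    """Run the loop twice, one call after the other, in the main thread."""
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    return time.perf_counter() - start


def time_threaded(n):
    """Run the same two loops in two threads and wait for both to finish."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000  # illustrative size; raise it for more stable timings
    print(f"sequential: {time_sequential(n):.2f}s")
    print(f"threaded:   {time_threaded(n):.2f}s")
```

The absolute numbers depend on the machine and interpreter version, so no expected output is given here.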

jnoller · 16y ago

An improvement over the old implementation is better. "Fixed and Perfect" would be free-threading, where it would be twice as fast with threads.

I'll take incremental, concrete improvements anytime.

Luyt · 16y ago

If you really want to speed up your program, you have to parallelize and distribute the workload over multiple cores. Python has a module in the standard library for this, it's called 'multiprocessing'. See http://docs.python.org/library/multiprocessing.html
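As a rough illustration of that suggestion, here is a sketch using `multiprocessing.Pool` from the standard library. The worker function `count_down` and the workload sizes are hypothetical; the point is only that each task runs in its own process with its own interpreter and its own GIL.

```python
from multiprocessing import Pool


def count_down(n):
    """CPU-bound worker; each pool process runs it under its own GIL."""
    while n > 0:
        n -= 1
    return n


if __name__ == "__main__":
    # Two worker processes, one task each (sizes are illustrative).
    with Pool(processes=2) as pool:
        results = pool.map(count_down, [2_000_000, 2_000_000])
    print(results)
```

Arguments and return values are pickled and sent between processes, so this works best when the tasks share little data.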

rbanffy · 16y ago

It won't help if you have to do stuff over a single shared data structure.


whatusername · 16y ago

From Slide 7: Sequential (Single Core) : 24.6s

Threaded (Dual Core) : 45.5s

Threaded (With a CPU core disabled) : 38.0s

23.5 / 24.0 is an improvement to both tests - and a significant improvement to the Threaded one.
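To put numbers on that comparison, a quick calculation using only the figures quoted in this thread (the Slide 7 timings for the old behavior and the 23.5 / 24.0 pair for the new one). The variable names are just for illustration.

```python
# Timings in seconds, as quoted in the thread.
old_sequential, old_threaded = 24.6, 45.5   # Slide 7, dual core
new_sequential, new_threaded = 23.5, 24.0   # figures quoted for the new GIL

# How much slower the threaded run is than the sequential run.
old_ratio = old_threaded / old_sequential
new_ratio = new_threaded / new_sequential

# How much faster the threaded run became between the two sets of figures.
threaded_speedup = old_threaded / new_threaded

print(f"old: threaded takes {old_ratio:.2f}x the sequential time")   # 1.85x
print(f"new: threaded takes {new_ratio:.2f}x the sequential time")   # 1.02x
print(f"threaded run is {threaded_speedup:.2f}x faster than before") # 1.90x
```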

sagarm · 16y ago · 4 in thread

Does anyone know why multiple kernel threads are used at all? If only one thread can run at a time, why not just use a single kernel thread?

kingkilr · 16y ago

Because CPython doesn't want to invent its own scheduler (and various other things that your operating system already does for you). Further, multiple threads can be active at the same time if they're doing any operations that release the GIL (such as I/O), and Python can't know whether your thread will do any I/O before you create it.
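A small sketch of the I/O point, using `time.sleep` as a stand-in for a blocking call (it releases the GIL while waiting). The helper names `fake_io` and `run_threads` are hypothetical. Four threads that each wait 0.2 seconds would be expected to finish in well under the 0.8 seconds a serial run would need.

```python
import threading
import time


def fake_io(seconds):
    """Stand-in for blocking I/O: sleeping releases the GIL while it waits."""
    time.sleep(seconds)


def run_threads(count, seconds):
    """Start `count` threads that each block for `seconds`; return wall time."""
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_threads(4, 0.2)
    print(f"4 threads x 0.2s of waiting took {elapsed:.2f}s of wall time")
```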

scott_s · 16y ago

However, the approach explained in this presentation does just that. The solution looks like a crude operating system process scheduler.


haberman · 16y ago

Just to expand on the previous reply: a lot of C libraries have a blocking model where you call a function that blocks until it is complete. The simplest example is fread(). If the interpreter doesn't support multiple threads, you have to do a lot of gymnastics to make sure you never block the interpreter in an extension.

It can be done (this is how Ruby pre-1.9 works), but people complain about it a lot. Matz's rationale for why he added native OS threads to Ruby 1.9 was "people seem to like them."

In response to my fread() example you may be tempted to say that you can just put the underlying fd in non-blocking mode with fcntl(2). This is unfortunately unsafe (as I recall), though I cannot find the reference right now. Accessing the fd directly with read() and write() in non-blocking mode is of course safe, but then you have to do your own buffering.

Also, while poll() and select() give you most of what you need to make your I/O non-blocking, I recall a case where a pipe will be write-ready, but if you try to write too much data at the same time the write will block anyway, instead of returning a short count of bytes written. This was on RHEL3 Linux, so things may have changed since then.
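For reference, a sketch of the behavior one would normally expect from a non-blocking pipe, assuming a POSIX system and Python 3.5+ (for `os.set_blocking`). The function name `fill_pipe` and the chunk size are made up for this example. The first large write should return a short count, and later writes should raise `BlockingIOError` once the pipe is full instead of blocking.

```python
import os
import select


def fill_pipe(chunk_size=1 << 20):
    """Write to a non-blocking pipe until it is full.

    Returns (bytes accepted by the first write, total bytes accepted).
    """
    read_fd, write_fd = os.pipe()
    try:
        # Equivalent in effect to setting O_NONBLOCK with fcntl(2).
        os.set_blocking(write_fd, False)

        # An empty pipe should be reported as write-ready right away.
        _, writable, _ = select.select([], [write_fd], [], 1.0)
        if write_fd not in writable:
            raise RuntimeError("pipe never became write-ready")

        payload = b"x" * chunk_size
        first = os.write(write_fd, payload)  # expected to be a short count
        total = first
        while True:
            try:
                total += os.write(write_fd, payload)
            except BlockingIOError:
                break  # pipe is full; the call returned instead of blocking
        return first, total
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    first, total = fill_pipe()
    print(f"first write accepted {first} bytes, pipe held {total} bytes in all")
```

The exact byte counts depend on the kernel's pipe capacity, so none are shown here.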

neilc · 16y ago

> I recall a case where a pipe will be write-ready, but if you try to write too much data at the same time the write will block anyway, instead of returning a short count of bytes written

That just sounds like a bug.


ShabbyDoo · 16y ago · 2 in thread

Knowing little of Python, is it the case that the GIL was not initially deemed a bad scheme because multi-processor x86 machines were a rarity in the mid-90's (when Python was young)? I presume that Sun desired Java to be multiprocessor friendly because it had multi-cpu SPARC machines at the time of Java's design?

And, for those using Python for high-volume production systems, how do you cope? Do you run one process (say a Zope instance) per core and load balance?

jart · 16y ago

Keep in mind the GIL is only a problem when you a) use threads and b) perform CPU-bound operations (unless you need soft real time)

When writing CPU-bound code, there are many libraries like NumPy that can do your heavy lifting in native code (thus avoiding the GIL)

So it really isn't as bad as it seems. This patch is good because it prevents performance from falling apart if one of your threads takes up too much CPU.

zepolen · 16y ago

Yep, a process per core.

The beauty is that since your app has to be written to be multiprocess-capable, you can also extend past one computer when the need arises.

You pay a memory tax overhead though.
