The problem with removing the GIL seems to be garbage collection. I wonder why it's not possible to introduce a new type of object reference that exempts referenced objects from being garbage collected.
Then multiple Python interpreters could be started in separate threads, each with its own unmodified GIL, and the only thing they could access would be their own thread local data and those special shared objects.
What this amounts to is basically an implementation of a multi-process architecture on top of a multi-threaded architecture. The crucial difference is that the memory shared among the interpreters could hold pointers, and thus proper in-memory data structures, and not just a BLOB into which everything has to be serialized, as in the case of conventional shared memory.
Of course the shared objects would have to be manually deleted.
One issue I see is that when such a special object is created, all the objects it creates recursively would also have to be allocated in that pool of special objects. But I think it should be possible to use some kind of global flag to special case the allocator.
Well, there is probably a huge number of issues with this kind of trickery and it's definitely not a long-term solution. But I'd love to use Python much more than I do, and the GIL issue is what prevents that for me at the moment.
I'm not sure this is the real problem; after all, reference counts can be atomically incremented and decremented.
The real problem is that primitive operations in Python (like "foo.bar") cannot safely be performed in C without locking, because you need the hash table to remain consistent while you are doing lookups and/or insertions. This forces you to either wrap all such operations in locks (which has been tried, and slows down the single-threaded case by something like 2x) or reimplement them with lock-free data structures. The latter could be an interesting experiment; you could probably implement a tree-based map using RCU.
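To make the "wrap every operation in a lock" option concrete, here is a Python-level analogy. It is only a sketch: the real lookups happen in C inside the interpreter, and `LockedNamespace` and its methods are hypothetical names invented for illustration, not anything CPython provides.

```python
import threading


class LockedNamespace:
    """Hypothetical attribute table where every operation takes one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}

    def get(self, name):
        # Even a plain lookup pays for acquiring and releasing the lock.
        with self._lock:
            return self._table[name]

    def set(self, name, value):
        with self._lock:
            self._table[name] = value

    def update(self, name, fn, default=None):
        # Read-modify-write must hold the lock across both steps,
        # otherwise two threads can overwrite each other's result.
        with self._lock:
            self._table[name] = fn(self._table.get(name, default))


if __name__ == "__main__":
    ns = LockedNamespace()
    ns.set("bar", 0)
    ns.update("bar", lambda v: v + 1)
    print(ns.get("bar"))
```

The cost shows up in the single-threaded case too: every `get` and `set` acquires the lock even when no other thread exists, which is the kind of overhead the slowdown mentioned above comes from.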
http://doi.acm.org/10.1145/1029873.1029875 http://doi.acm.org/10.1145/1146809.1146813 http://research.microsoft.com/apps/pubs/default.aspx?id=6748...
BTW, there's one multi-processing shared objects module here: http://poshmodule.sourceforge.net/
How is that better? It is an improvement over earlier versions of Python. Better would be if the threaded version were actually twice as fast.
Therefore, the expected behavior of running these two threads would be to have identical runtimes. That would be the "perfect" case with a GIL. So with this "better" GIL it's actually near-perfect in this _simple_ case. (Like the presenter said, there need to be more tests to study the behavior under heavier tasks.)
What you're describing, with the threaded version being "twice as fast", assumes that the threads will allow two Python-instruction-bound tasks to run concurrently, which would require having no GIL.
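For anyone who wants to try this themselves, here is a rough sketch of the kind of test being discussed: the same CPU-bound function run twice in sequence and then in two threads. The function names and the loop size are my own assumptions, not the presenter's exact code. On a build of CPython with a GIL, the threaded time should at best match the sequential time.

```python
import threading
import time


def count(n):
    # Pure-Python CPU-bound loop; the running thread holds the GIL.
    while n > 0:
        n -= 1


def run_sequential(n):
    """Run the task twice, one after the other; return elapsed seconds."""
    start = time.perf_counter()
    count(n)
    count(n)
    return time.perf_counter() - start


def run_threaded(n):
    """Run the task in two threads at once; return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=count, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5_000_000  # arbitrary size chosen for illustration
    print(f"sequential: {run_sequential(n):.2f}s")
    print(f"threaded:   {run_threaded(n):.2f}s")
```

The absolute numbers depend on the machine and the interpreter version, so only the ratio between the two lines is meaningful.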
I'll take incremental, concrete improvements anytime.
Threaded (Dual Core) : 45.5s
Threaded (With a CPU core disabled) : 38.0s
23.5 / 24.0 is an improvement to both tests - and a significant improvement to the Threaded one.
It can be done (this is how Ruby pre-1.9 works), but people complain about it a lot. Matz's rationale for why he added native OS threads to Ruby 1.9 was "people seem to like them."
In response to my fread() example you may be tempted to say that you can just put the underlying fd in non-blocking mode with fcntl(2). This is unfortunately unsafe (as I recall), though I cannot find the reference right now. Accessing the fd directly with read() and write() in non-blocking mode is of course safe, but then you have to do your own buffering.
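The "do your own buffering" part looks roughly like this in Python. This is a minimal sketch that assumes a Unix-like system and a pipe or socket fd; `NonBlockingReader` is a made-up name, and a real version would also need to handle partial lines at EOF and other errors.

```python
import os


class NonBlockingReader:
    """Hypothetical line reader over a raw fd in non-blocking mode."""

    def __init__(self, fd):
        self.fd = fd
        os.set_blocking(fd, False)  # same effect as setting O_NONBLOCK via fcntl
        self.buf = bytearray()
        self.eof = False

    def fill(self):
        """Read whatever is available right now; return bytes read, never block."""
        try:
            chunk = os.read(self.fd, 4096)
        except BlockingIOError:
            return 0  # nothing available yet
        if not chunk:
            self.eof = True  # writer closed its end
            return 0
        self.buf += chunk
        return len(chunk)

    def readline(self):
        """Return one complete buffered line, or None if none is ready."""
        idx = self.buf.find(b"\n")
        if idx < 0:
            return None
        line = bytes(self.buf[: idx + 1])
        del self.buf[: idx + 1]
        return line


if __name__ == "__main__":
    r, w = os.pipe()
    reader = NonBlockingReader(r)
    os.write(w, b"hello\npartial")
    reader.fill()
    print(reader.readline())  # a complete line
    print(reader.readline())  # None: "partial" has no newline yet
```

You would call `fill()` whenever poll() or select() reports the fd readable, then drain complete lines with `readline()`.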
Also, while poll() and select() give you most of what you need to make your I/O non-blocking, I recall a case where a pipe will be write-ready, but if you try to write too much data at the same time the write will block anyway, instead of returning a short count of bytes written. This was on RHEL3 Linux, so things may have changed since then.
That just sounds like a bug.
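For comparison, this is what I would expect on the write side when the pipe's write end is explicitly put in non-blocking mode: an oversized write returns a short count instead of blocking. It is a sketch under that assumption, on a Unix-like system with a default-sized pipe; `write_nonblocking` is a hypothetical helper, and it says nothing about what RHEL3 did.

```python
import os


def write_nonblocking(fd, data):
    """Try to write without blocking; return bytes accepted (0 if the pipe is full)."""
    os.set_blocking(fd, False)
    try:
        return os.write(fd, data)
    except BlockingIOError:
        return 0


if __name__ == "__main__":
    r, w = os.pipe()
    payload = b"x" * (1 << 22)  # far larger than a default pipe buffer
    sent = write_nonblocking(w, payload)
    print(f"accepted {sent} of {len(payload)} bytes")
```

The caller then has to keep the unwritten tail and retry once the fd is reported writable again.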
And, for those using Python for high-volume production systems, how do you cope? Do you run one process (say a Zope instance) per core and load balance?
When writing CPU-bound code, there are many libraries like NumPy that can do your heavy lifting in native code (thus avoiding the GIL).
So it really isn't as bad as it seems. This patch is good because it prevents performance from falling apart if one of your threads takes up too much CPU.
The beauty is that since your app has to be written to be multiprocess-capable, you can also extend past one computer when the need arises.
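A minimal sketch of the one-process-per-core approach using the standard library's multiprocessing module. The task `cpu_task` and the input sizes are placeholders I made up; a real deployment would more likely be separate server processes behind a load balancer.

```python
import multiprocessing


def cpu_task(n):
    # Placeholder CPU-bound work: sum of squares below n.
    return sum(i * i for i in range(n))


def run_parallel(inputs, workers=None):
    """Spread the inputs over worker processes, each with its own interpreter and GIL."""
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(cpu_task, inputs)


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    print(run_parallel([10_000, 20_000, 30_000], workers=2))
```

Arguments and results are pickled and sent between processes, which is why the same design can later be stretched across machines.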
You pay a memory overhead, though.