That is not true. It is an often repeated misconception. It makes it sound like Python creators were just incompetent and just stuck threads in there even though they are completely useless. In fact Python's threads work well for IO concurrency. I used them and saw great speedup when accepting and handling simultaneous socket connections. Yes you won't get CPU concurrency, but if your server is not CPU bound you might not notice much of a difference.
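A minimal sketch of the IO-concurrency point, assuming `time.sleep` as a stand-in for a blocking socket read (both release the GIL while waiting). The names `fake_io` and `run_threaded` are made up for illustration:

```python
import threading
import time


def fake_io(delay, results, index):
    # time.sleep stands in for a blocking socket read; like blocking I/O
    # in CPython, it releases the GIL while the thread waits.
    time.sleep(delay)
    results[index] = index


def run_threaded(n_tasks=10, delay=0.2):
    results = [None] * n_tasks
    threads = [
        threading.Thread(target=fake_io, args=(delay, results, i))
        for i in range(n_tasks)
    ]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Elapsed wall-clock time for all waits together.
    return results, time.monotonic() - start


if __name__ == "__main__":
    results, elapsed = run_threaded()
    print(f"{len(results)} waits of 0.2s each finished in {elapsed:.2f}s")
```

Run one after another, the ten waits would take about two seconds. With threads they overlap, so the total should be close to a single wait.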
IO concurrency is real concurrency. In 8 years using Python for fun and professionally I probably wrote more IO concurrent code than CPU concurrent code. Even then for CPU concurrent code I would have had to drop into C using an extension (and there you can release the GIL anyway).
Now, the obvious follow-up is that in the case of IO concurrency you are often better off using gevent or eventlet. You get lighter-weight threads (memory-wise) and fewer chances of synchronization bugs (since greenlet-based green threads will switch only at IO concurrency points: socket reads, sleeps, and explicit waits on green semaphores and locks).
That's not true. Most applications, with the possible exceptions of proxies, are also CPU bound.
Take for example a web service that receives JSON documents. The act of parsing JSON documents is CPU bound. The act of creating a response is CPU bound. In between you can also have IO bound operations, like fetching data from a MySQL database or a Memcached instance, however in the process of creating the final response you also need to transform the data received and that's also CPU bound.
As a real world example, I worked on a web-service written in Scala and running on the JVM. Initially it was running on only 8 Heroku dynos and these instances were receiving over 30,000 requests per second of real traffic. These Heroku instances are of course under-powered, because on my modest laptop the same web server is able to handle more than 10,000 requests per second.
And yes, asynchronous I/O lets you easily have 100,000 connections per server. But if you need throughput, then the CPU starts being a bottleneck.
Of course, my problem with Python and why I migrated away from it is that in truth Python sucks for asynchronous I/O too. But that's another story.
JSON is parsed in C with CPython, or in assembly with PyPy.
As a real world example, I've done realtime image processing of a gigabyte per second worth of data on a single machine with asynchronous python. It was IO bound, we had more CPU to spare. Hell, we even had some GPUs sitting there not doing anything because they weren't needed.
If you're doing real performance computing, then taking advantage of GPUs/DSPs or other hardware is where it's at anyway. Python is quite good as a glue language for interfacing to these things.
Now this is IO concurrency but it is real concurrency. Adding CPU concurrency would be very nice. It might speed things up a bit, or it might not. It really depends.
As an example, consider haproxy. The little proxy that could. It handles large numbers of connections concurrently and it is single-threaded in its default configuration. I've heard of 100k connections. It deals with IO concurrency. Chances are, making it multi-threaded might not dramatically improve its performance (it might even slow it down).
That is a common misconception. And it seems to me that nowadays most of the concurrency people deal with (at least in the server and web back-end world) is heavily IO bound. Yet everyone automatically defaults to their CS 102 algorithms class and thinks about solving graph problems in parallel or multiplying matrices. So concurrency is automatically implied to be CPU concurrency.
So long as you don't have any CPU bound threads competing for the GIL ;)
If you have a CPU bound thread, it may be worth paying the performance penalty of separating some of the program flow into different processes.
Here's a demo in Python: https://gist.github.com/bdarnell/1073945
With FD passing, you can have multiple processes, related or unrelated, pulling incoming connections from the same socket. You use the FD passing to share the listening socket.
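A rough sketch of FD passing, assuming Python 3.9+ on a Unix system (where `socket.send_fds` and `socket.recv_fds` are available). To keep it self-contained, both ends of the AF_UNIX channel live in one process here. In real use the receiver would be a separate process. The function name `pass_listener_demo` is made up for illustration:

```python
import socket


def pass_listener_demo():
    # The listening socket we want to share; port 0 lets the OS pick one.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    port = listener.getsockname()[1]

    # An AF_UNIX socket pair stands in for the channel between two processes.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # send_fds wraps sendmsg() with an SCM_RIGHTS ancillary message.
    socket.send_fds(sender, [b"listener"], [listener.fileno()])
    msg, fds, _flags, _addr = socket.recv_fds(receiver, 1024, 1)

    # The receiving side gets a new descriptor for the same kernel socket.
    shared = socket.socket(fileno=fds[0])

    # A client connects to the port; the accept happens on the received copy.
    client = socket.create_connection(("127.0.0.1", port))
    conn, _peer = shared.accept()
    client.sendall(b"ping")
    data = conn.recv(4)

    for s in (client, conn, shared, listener, sender, receiver):
        s.close()
    return msg, data


if __name__ == "__main__":
    print(pass_listener_demo())
```

The received descriptor refers to the same listening socket in the kernel, so whichever holder calls `accept()` first gets the connection.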
It could even work as expected for a while (since the kernel gets to arbitrarily decide what port to deliver incoming requests to) only to intermittently fail later.
I would however like SO_REUSEPORT to run experiments: Right now we use iptables/tc to direct some traffic at "new versions" of some of our systems so we can run tests with live data, but connection tracking for localhost is lame. I'd much rather use SO_REUSEPORT.
Only if it has the same uid as the other one. It'd also be trivial to check whether the other processes listening to your port are "friendly" (as in "you don't want both Apache and Nginx listening on port 80").
An event-based system can be more performant in some cases and slower in others. If there is not much opportunity for the CPU to do any work, then an event-based system will often outperform threads. One example is proxies. I already gave haproxy as an example, so I'll repeat it here as well. It is single-threaded and event-based by default. It is certainly performant. Why? Because in a simplified model it just shuffles data from one socket to another. Pretty straightforward. Introducing multiple threads and context switches might just thrash caches around and actually make it worse (I have seen that happen).
Now add some CPU work in there. Say, make each connection compute something or serialize some JSON. Like in those benchmarks: they use a DB driver to get a row, serialize it and return. OK, there is some work. Now it is more likely that multi-threading will help. But again, one can surely tweak CPU affinities, thread pool sizes, hyper-threading BIOS settings and DB driver types to really change things up. Threads take up memory. Not an insignificant amount. Now, I like green threads, Erlang's processes and Go's goroutines because they are lightweight. (At least Erlang's processes map N:M to CPUs for parallel execution on the host machine.)
So I guess my point is you are right that event-based systems are not always and strictly more performant. But I also think that in certain cases they can beat multi-threaded code (thread memory size, context switches, cache thrashing). That benchmark there, I wouldn't take it too seriously, just like I wouldn't take the Language Shootout too seriously.
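To make the "one thread shuffling bytes between sockets" model concrete, here is a small sketch using Python's standard `selectors` module. It is an echo loop rather than a real proxy, `socket.socketpair()` stands in for accepted network connections, and the name `echo_loop` is made up for illustration:

```python
import selectors
import socket


def echo_loop(n_clients=5):
    sel = selectors.DefaultSelector()
    clients = []
    for i in range(n_clients):
        # socketpair() stands in for an accepted network connection.
        client, server_side = socket.socketpair()
        server_side.setblocking(False)
        sel.register(server_side, selectors.EVENT_READ)
        client.sendall(f"hello {i}".encode())
        clients.append(client)

    handled = 0
    # One thread, one loop: wake up only for sockets that are readable.
    while handled < n_clients:
        for key, _events in sel.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(1024)
            conn.sendall(data)  # shuffle the bytes straight back
            sel.unregister(conn)
            conn.close()
            handled += 1

    replies = [c.recv(1024) for c in clients]
    for c in clients:
        c.close()
    sel.close()
    return replies


if __name__ == "__main__":
    print(echo_loop())
```

There are no locks and no per-connection stacks here. The cost is that any CPU-heavy work inside the loop stalls every other connection, which is the trade-off discussed above.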
Threads / processes:
* Run some code from A
* Save state, context switch
* Run some code from B
* Save state, context switch
* Deal with locking, synchronisation, etc
vs
* Run some code.
There are absolutely no instances where [num threads] > [num cores] is as efficient as not using more threads than cores.
This is an example of a major downfall with free software: a developer decides he needs a feature, so he implements it without taking any effort to see what has been done before and, more importantly, why.
It leads to the project sprouting thousands of new features while nothing achieves the polish and completeness of the original idea because the developer moved on to something newer and shinier.
I can't find the original blog post where I read the idea, but I did find one on Coding Horror: http://www.codinghorror.com/blog/2008/01/the-magpie-develope...
The Linux kernel solves this by having Linus, who has the long term perspective and the commitment to keep the project moving forward. I'm not claiming he's perfect, just that having him is the correct solution to the problem. Obviously here is someone who thinks the 3.9 kernel has a new feature he needs all the while ignoring past socket work.
It's also not possible to occasionally listen and unlisten: that causes the hash modulus to change, sending traffic to the wrong sockets and (most likely) resetting all existing connections.
Let's say that we are running a server on a port which uses this option to allow multiple processes to bind to it. What's to prevent a rogue process, perhaps with malicious intent, from starting up and siphoning off requests willy nilly? Sounds like a great way to implement a hard to detect MITM attack.
What would be nicer, I think, is if socket reusing was bound not only to the same uid but also to the process listening to it.
That being said, this option can simplify things -- removing the necessity of having some moving part to distribute connections across completely independent processes.
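For reference, a minimal sketch of what using the option looks like from Python, assuming a Linux kernel with SO_REUSEPORT support (3.9+) and a Python build that exposes `socket.SO_REUSEPORT`. Both sockets are opened in one process here for brevity. Normally each would belong to a separate worker process. The helper names are made up for illustration:

```python
import socket


def reuseport_listener(port, host="127.0.0.1"):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(), and on every socket that shares the port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(16)
    return s


def demo():
    first = reuseport_listener(0)        # port 0: let the kernel choose
    port = first.getsockname()[1]
    second = reuseport_listener(port)    # same port, second socket
    same_port = second.getsockname()[1] == port
    first.close()
    second.close()
    return same_port


if __name__ == "__main__":
    print("two listeners on one port:", demo())
```

Without the option set on both sockets, the second `bind()` would fail with "Address already in use".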
Because this is a Linux kernel feature involving sharing a socket amongst multiple OS processes, and is therefore only interesting to talk about if you are using multiple OS processes. It's not a generalized primer on all techniques of handling IO.
(Btw, there is another interesting forking-for-client-connection pattern in Erlang. Instead of forking off and handling the client connection in a separate process, handle the client connection in the accepting process but fork off another process to continue accepting. In general it's just a process pool, which should be easier to set up with this new feature.)
And as a result, the user can configure the prefetch pool.
[0] http://en.wikipedia.org/wiki/Thundering_herd_problem [1] http://stackoverflow.com/questions/15636319/why-is-accept-mu... [2] http://uwsgi-docs.readthedocs.org/en/latest/articles/Seriali...
Implementing the prefork model by spawning unrelated processes (as opposed to forking from a common parent process) is likely to consume more memory: each process is unrelated and does not share copy-on-write memory pages with the other processes.
In fact I have seen issues where gunicorn failed miserably simply because it did not handle a bad import in a child process. Tornado as of the latest version I had used (2.0 I think) did not have any ability to check for dead child processes. I am sure there are more examples of this done wrong than right.
This is an interesting option for several use cases but you still need a parent process to monitor things. Perhaps at some point upstart or systemd will get good enough to monitor multiple processes per daemon in real time. Until then, meh.
Edit: actually, one cool thing you can do with this is code reloading. You simply have your parent process start more workers that attach to the same socket, then kill the old ones. That way the idea of code or config reloading doesn't need to be baked into every part of the worker.
The article suggests you let http://supervisord.org/ (or similar) take care of these things.
Does the kernel use some sort of round-robin approach to assigning client sockets to processes waiting on accept()? This is one area where I'd imagine a dedicated master process would be beneficial, as it could implement "smarter" load balancing based on the health and response times of its child processes.
See Appendix C to his December 15, 1993 book on TCP/IP.
1993.
If I understand SO_REUSEPORT right, you let the kernel decide everything (access control, receiving process, timing, etc.) in exchange for not having your own process doing the same thing. Since that simplistic approach is the kind of thing that can be implemented in about 100 lines of user-space code doing file-descriptor sharing with sendmsg/recvmsg via AF_UNIX sockets, I don't see the benefit of pushing that complexity into the kernel. Especially since, if you want to exercise any greater level of control, you'll just have to roll your own AF_UNIX-based code anyway.
You can use setrlimit to prevent that. Plus, your application is likely to have direct control over forking anyway.