Jumping from 35,000 connections per node on an EC2
Small instance to over half a million on a single
EC2 Large...
You moved the goalpost. I think the conclusion would make a lot more sense if you compared apples to apples when it comes to your server... either test everything on a small or on a large.We hit 120k clients at the time the process was killed by the kernel, but had about 500mb free memory left when the OOM kill occurred. Here's why (and please correct me if I'm mistaken - I'm not a kernel expert, but have done a lot of reading in this area over the past couple months):
Small EC2 instances are restricted to running a 32-bit kernel. The 32-bit Linux kernel (2.6 series) allocates memory into three zones, with the first and smallest slot reserved for DMA (~16mb), the next portion reserved for kernel functions ("low memory", ~896mb), and the rest allocated as "high memory" for userspace. There are very limited options to configure this allocation, aside from recompiling and maintaining a custom kernel for our purposes, which we do not want to do. The Hugemem kernel allocates low memory differently, but is no longer being actively recommended (that I can see), and only makes sense for servers with > 4GB anyway.
Because TCP sockets are allocated in low memory, and its size is relatively fixed, 120k sockets will exhaust low memory despite about 500mb of high memory being free and unallocated. At this point, the kernel has no memory left allocated to itself to do work, so the OOM killer steps in and shoots the process.
The 64-bit kernel makes no distinction between "low" and "high" memory, and does not suffer this problem. We switched to an EC2 Large instance in order to avoid this limitation and properly test the service. In the end, if Amazon were to offer a 64-bit Small instance, it would be ideal for our application and would push our cost per client even lower, though it's currently at a very acceptable spot.
While the metrics in this post focus on connections per node, our ultimate goal involves both maximizing the number of connections per instance, in addition to minimizing the number of instances required (essentially, we want to [safely] maximize density across the board). The sentence you quote refers to this cumulative goal rather than the particular comparison of implementations and instance types. I should've been more clear.
Anyhow, pardon this omission from the original post. At some point, I might write a bit more about the TCP/kernel-level issues we ran into if anyone's interested.
A small instance costs 8.5¢/hr, so if it handles 35,000 connections with Eventlet that's 0.00024¢/hr/conn. Put another way, that's $57.60 to handle a million connections for a day.
A large instance costs 34¢/hr, so at 500,000 connections that's 0.000068¢/hr/conn, or $16.32 for a million connections for a day.
So the NIO implementation apparently saves UA 72% over the Eventlet implementation. Impressive!
(This is where I wait for someone from UA to show up and correct my numbers...)
Kidding - I would be very interested in more on why Scala failed as well. Both Scala and Erlang are implementations of the Actor model and should theoretically be very well suited to this.
At the end of the day storing sockets in a hashmap is a pretty compact data structure. If you can minimize synchronization/locking, threads shared state is extremely efficient from a CPU and memory standpoint. Maybe not so efficient from a developers time/sanity standpoint though. :-)
Do note that this article is purely about an edge server though. Its job is essentially to hold sockets and communicate with internal queues. The message passing model is alive and well, albeit at a higher level.
Offtopic: I'm not an expert on functional programming (just getting my feet wet; forgive me for being so naive), but after reading Scala By Example I have this nagging feeling that a functional implementation of an algorithm might actually run slower than an imperative implementation of the same algorithm. I mean, you're copying data all around the place, creating new instances of objects instead of modifying them in place, using a ton of recursion everywhere. Could anyone offer some perspective on this?
"...we spent several hours fanning out that spike to include three versions"
http://www.metabrew.com/article/a-million-user-comet-applica...
A connection that's open isn't problematic at all, assuming it's being used - lots of setups/teardowns can be an issue (which is why polling apps will consistently take up a large portion of your battery life). The main issue is if the app is using a radio when it shouldn't be. Luckily, we can control this pretty well with our handling code and have been able to balance out needs pretty well between delivering a message as quickly as possible and not destroying battery life.
If you're an Android dev, we'd love your feedback on it in general :)
What does Scala do to make the number of connections 50% less?
FOR SCIENCE