EDIT: Oh wait... since the problem is huge pages, this is MUCH better, since we can disable it. And I just verified that it works:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
This means a free latency improvement during fork for all the Redis users out there (roughly 20x in my tests), just from the line above.
The doc recommends adding "transparent_hugepage=never" to the kernel boot line in the "/etc/grub.conf" file.
[1] http://oracle-base.com/articles/linux/configuring-huge-pages...
[2] https://support.oracle.com/epmos/faces/DocContentDisplay?id=...
[3] http://blog.couchbase.com/often-overlooked-linux-os-tweaks
EDIT: Actually, in the Couchbase case they talk about "page allocation delays", which look potentially related. I started the investigation mainly because the Stripe graph looked suspicious for a modern EC2 instance type that has good fork times, so there had to be something more (unless the test was performed with tens of GB of data).
echo never > /sys/kernel/mm/transparent_hugepage/defrag
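To check which mode is active after running either command: the sysfs files list every available mode and put the active one in square brackets, for example "always madvise [never]". Here is a minimal read-only sketch. The helper names thp_active_mode and thp_report are made up for illustration, and the default directory is the one used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: extract the bracketed (active) mode from a THP sysfs
# line, e.g. "always madvise [never]" -> "never".
thp_active_mode() {
    case "$1" in
        *\[*\]*) ;;
        *) return 1 ;;          # no bracketed value found
    esac
    mode=${1#*\[}               # drop everything up to and including "["
    mode=${mode%%\]*}           # drop "]" and everything after it
    printf '%s\n' "$mode"
}

# Hypothetical helper: report the active mode of the "enabled" and "defrag"
# knobs, skipping files that do not exist. Only reads, never writes.
thp_report() {
    dir=${1:-/sys/kernel/mm/transparent_hugepage}
    for knob in enabled defrag; do
        if [ -r "$dir/$knob" ]; then
            printf '%s: %s\n' "$knob" "$(thp_active_mode "$(cat "$dir/$knob")")"
        else
            printf '%s: not available\n' "$knob"
        fi
    done
}

thp_report
```

If both knobs report "never" after the two echo commands, the setting took effect. It does not survive a reboot unless it is also set at boot time.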
(The child should be able to inherit all open fds?)
After fork, the parent and the child have write-protected references in their page tables pointing at the same data. (It makes very little difference whether said write-protected reference is an actual entry in the page tables or just in the logical VMA data.)
As soon as one of the processes tries to modify the data, the data must be copied so that the original data doesn't change (because the other process can still see it). That means that the kernel needs to find somewhere to put the copy, and that's almost certainly the expensive part with THP if defragmentation is on. Once the kernel finds a place to copy to, it can update the vm data structures (page tables or otherwise) to give the writer a private copy.
[1] Minor nitpick: for hugepages, there aren't actually any PTEs involved, but that's really just a terminology issue. It's actually an entry in a higher level table in the tree.
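The visible result of that rule is easy to show from a shell, even though the page-level copying itself stays hidden. A rough sketch, assuming a shell such as bash or dash where a command substitution runs in a forked child:

```shell
#!/bin/sh
# Parent's data before the fork.
data="original"

# The command substitution runs in a forked child (true for bash and dash).
# The child's write goes to its own private copy of the memory.
child_view=$(
    data="modified by child"
    printf '%s\n' "$data"
)

# The parent still sees the value it had before the fork.
printf 'child saw:  %s\n' "$child_view"
printf 'parent has: %s\n' "$data"
```

The parent keeps "original" because the child's write triggered a private copy. How much memory is copied per write is the page size, which is where THP comes in.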
This seems to be a similar problem to how to do replication while also serving requests - on one side you want to preserve consistency, on the other side you want to make a full snapshot. It shouldn't matter how long the replication process takes (i.e. how slow fork is), seconds or days, as long as the incoming changes aren't faster than the replication rate. In this case though, your replication target is the disk.
It seems like there should never have to be a full lock of everything.
The problem is fork() is a blocking system call, and the side effects of fork also cause some latency when accessing the COW memory.
there should never have to be a full lock of everything.
There's no software lock involved from the Redis side, but fork() behavior has the effect of a pseudo-lock (because the system call blocks the entire process from running until it returns).
On the other hand your malloc case is using 4kB pages. If a request has to wait for 1000 other in-flight requests to be executed, that means waiting for at most 1000 pages (4MB) to be COW'd, which would take on the order of 0.1-1.0 milliseconds. This is why the latency is much lower.
tl;dr: smaller pages (4kB vs 2MB) allow finer granularity of the COW mechanism, and lead to lower latencies.
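The arithmetic behind that comparison, as a small sketch. It assumes the same worst case as above (1000 queued requests, each faulting on a different page) and that copy time scales linearly with bytes copied. cow_kb is a made-up helper name.

```shell
#!/bin/sh
# Worst case: each of N queued requests faults on a different page,
# so N whole pages are copied before the last request is served.
cow_kb() {  # usage: cow_kb <requests> <page_size_kb>; prints kB copied
    echo $(( $1 * $2 ))
}

small_kb=$(cow_kb 1000 4)        # 4 kB pages
huge_kb=$(cow_kb 1000 2048)      # 2 MB pages
ratio=$(( huge_kb / small_kb ))

echo "4 kB pages: ${small_kb} kB copied"
echo "2 MB pages: ${huge_kb} kB copied (${ratio}x more)"

# Scale the 0.1-1.0 ms estimate (100-1000 microseconds) by the same ratio,
# assuming copy time grows linearly with the number of bytes copied.
echo "estimated wait with 2 MB pages: $(( 100 * ratio ))-$(( 1000 * ratio )) microseconds"
```

Under those assumptions the 2 MB case copies about 2 GB instead of 4 MB, and the wait grows from 0.1-1.0 ms to roughly 51-512 ms.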
EDIT: you were exactly right. This is what happens: there are 50 clients in the benchmark, with many queued requests, and since the benchmark is designed to touch all the keys evenly, every client served in a given event loop cycle has a high chance of hitting a page fault. This seemed unrealistic to me, since I saw the spike in a single event-loop cycle, but that is how it actually works. Thanks!
Open source is good. It's individuals like antirez who make it great.
I should have spotted this by looking at the split between user and system CPU: when the THP problem happened SYS% spiked, but I was looking at a dozen processes pegged at 100% CPU.
I eventually tracked this down via "perf top" which lets you see what everything on the system is doing: both kernel and user space symbols are tracked. I saw significant kernel-space work in an odd memory-related symbol and found someone else with the same problem.
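For the user versus system split, a quick check that needs no extra tools is the first line of /proc/stat, whose fields after "cpu" start with user, nice and system time. A rough sketch with a made-up helper name, sys_share. The counters are cumulative since boot, so to catch a spike you would sample twice and compare the differences.

```shell
#!/bin/sh
# Hypothetical helper: given the aggregate "cpu" line from /proc/stat, print
# system time as an integer percentage of busy time (user + nice + system).
sys_share() {
    # Split the line into fields: $1="cpu" $2=user $3=nice $4=system ...
    set -- $1
    [ $# -ge 4 ] || return 1
    busy=$(( $2 + $3 + $4 ))
    [ "$busy" -gt 0 ] || return 1
    echo $(( 100 * $4 / busy ))
}

# Only meaningful on Linux; skipped where /proc/stat is absent.
if [ -r /proc/stat ]; then
    sys_share "$(head -n 1 /proc/stat)"
fi
```

A high percentage on a workload that should be mostly user-space computation is the hint to go look at kernel symbols with "perf top".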
># uname -a
Linux ip-10-214-7-159 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
># grep TRANSPARENT /boot/config-3.2.0-36-virtual
CONFIG_TRANSPARENT_HUGEPAGE=y
# CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
># echo never > /sys/kernel/mm/transparent_hugepage/enabled
bash: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
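The write fails because the sysfs directory is absent on that machine, even though the kernel config has THP compiled in. A guarded version avoids the error. This sketch also tries /sys/kernel/mm/redhat_transparent_hugepage, the path used by RHEL 6 kernels. first_existing_dir is a made-up helper name, and the sketch makes no claim about why the directory is missing on this instance.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that is an existing directory.
first_existing_dir() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

if thp_dir=$(first_existing_dir \
        /sys/kernel/mm/transparent_hugepage \
        /sys/kernel/mm/redhat_transparent_hugepage); then
    echo "THP controls found in $thp_dir"
    # As root: echo never > "$thp_dir/enabled"
else
    echo "no THP sysfs directory found; the running kernel does not expose THP controls"
fi
```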