Redis latency spikes and the Linux kernel: a few more details (opens in new tab)

(antirez.com)

173 pointsmickeyben11y ago44 comments

44 comments

32 comments · 9 top-level

antirez11y ago· 15 in thread

Unfortunately I just spotted that I'm totally wrong, the huge pages apparently are only used by jemalloc: I just mis-read the outputs since this seemed so obvious. So on the contrary, it appears that the high latency is due to the huge pages thing for some unknown reason. So actually it is malloc that, while NOT using huge pages, is going much faster. I've no idea about what is happening here, so please disregard the blog posts conclusions (all the rest is hopefully correct).

EDIT: Oh wait... since the problem is huge pages, this is MUCH better, since we can disable it. And I just verified that it works:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This means, free latency upgrade for all the Redis users out there during fork... (more or less 20x in my tests) just with this line above.

sandGorgon11y ago

this is a recommended setting even on Oracle [1] [2] and Couchbase [3]

the doc recommends adding "transparent_hugepage=never" to the kernel boot line in the "/etc/grub.conf" file.

[1] http://oracle-base.com/articles/linux/configuring-huge-pages...

[2] https://support.oracle.com/epmos/faces/DocContentDisplay?id=...

[3] http://blog.couchbase.com/often-overlooked-linux-os-tweaks

antirez11y ago

Yes, never reported by Redis users so far, maybe since in Redis it does not have the same impact it has in other databases, but a more subtle latency issue apparently.

EDIT: Actually in the CouchBase case they talk about "page allocation delays" that looks related, potentially. I started the investigation mainly because the Stripe graph looked suspicious with a modern EC2 instance type that has good fork times, so there was something more indeed (unless the test was performed with tens of GB of data).

antirez11y ago

Replying to myself for an update. The hypothesis of a a few users here is correct, while hard to believe. The latency spike is due to the fact that even with the 50 clients of the benchmark, it is possible to touch all the huge pages composing the process in the space of a single event loop. This is why I was observing this initial spike and nothing more. This seemed unrealistic to me with 50 clients, but I remembered that one of the Redis optimizations is to try to serve multiple requests for the same client in the same event loop cycle if there are many in queue, so this is what happens: 50 clients x N-queued-requests = enough requests to touch all the memory pages, or at least a very significant percentage to block for a long time and never again.

evan_miller11y ago

Have you tried disabling only defrag? The performance issues I'm familiar with are generally related to defrag running at allocation time. I would be curious to see your results while enabling THP but:

echo never > /sys/kernel/mm/transparent_hugepage/defrag

antirez11y ago

Awesome hint, I'll try it. However if huge pages means 2MB copy-on-write for each fault, you want still them turned off for Redis. Also does not help with jemalloc memory reclaiming.

nonane11y ago

That makes sense. Since jemalloc is using huge pages, even if it's one byte inside a huge page changes, the kernel needs to copy an entire 2mb per modified entry over (worst case localilty). Whereas on a non-hugepage allocation it'll only be 4kb. Thats why you saw movs in the stack trace - the kernel was too busy copying over the entire huge page over. With small pages, it can be more granular about it's copying.

antirez11y ago

I don't think that's the reason actually, there is a big spike and then all the other COWs are ok, also copying 2MB is no way near to the 300 ms I was observing. Not sure if for some reason most pages where copied, or it was some other issue at all. It could be the code that fragments the huge page into multiple pages maybe? MOV -> fault -> split of page -> COW of a single 4k page.

1 more reply

zwily11y ago

Disabling transparent hugepages has been common advice in the Postgres world for awhile, and for the same reason. I suspect anything that forks a lot should have them off.

jsmthrowaway11y ago

MongoDB, too, which also requests disabling NUMA since MongoDB has no idea what NUMA is.

xtacy11y ago

Apparently, similar issues (+ fixes) also apply for Hadoop workloads: http://structureddata.org/2012/06/18/linux-6-transparent-hug...

daviesliu11y ago

Could your confirm that the MOV is copying PTEs? I doubt that maybe it's copying the heap, because of copy-on-write. After fork(), the parent write to heap randomly, so it will copy more data with jemalloc than libc (because of huge page)

nteon11y ago

After the latency spike does the smaps output still list the same amount of AnonHugePages?

antirez11y ago

Can't test it right now, but seems smart, to check if they got split into smaller pages.

2 more replies

Ono-Sendai11y ago

Why do you fork?

seiji11y ago

See the "How it works" section under "Snapshotting" at http://redis.io/topics/persistence

1 more reply

jbert11y ago· 3 in thread

Dumb question - if problem is a latency spike in the parent, any chance you can continue serving in the child and let the parent do the BGSAVE (and take the latency spike)?

(The child should be able to inherit all open fds?)

nonane11y ago

The problem happens when one of the processes (parent or child) tries to modify a page. Modifying the page triggers the copy-on-write mechanism and blocks the process making the modifications - it doesn't matter if it's parent or child. In Redis' case, the BGSAVE process doesn't do any writes - so it's good. It's always the server process that does the writes.

antirez11y ago

I'm not sure but probably both sides are blocked when copying the shared PTEs.

amluto11y ago

This issue has nothing to do with copying the PTE [1] and everything to do with copying the data.

After fork, the parent and the child have write-protected references in their page tables pointing at the same data. (It makes very little difference whether said write-protected reference is an actual entry in the page tables or just in the logical VMA data.)

As soon as one of the processes tries to modify the data, the data must be copied so that the original data doesn't change (because the other process can still see it). That means that the kernel needs to find somewhere to put the copy, and that's almost certainly the expensive part with THP if defragmentation is on. Once the kernel finds a place to copy to, it can update the vm data structures (page tables or otherwise) to give the writer a private copy.

[1] Minor nitpick: for hugepages, there aren't actually any PTEs involved, but that's really just a terminology issue. It's actually an entry in a higher level table in the tree.

1 more reply

makmanalp11y ago· 2 in thread

Pardon my ignorance - but isn't this a problem that's existed in database software in general, and one that's been dealt with in the past?

This seems to be a similar problem to how to do replication while also serving requests - on one side you want to preserve consistency, on the other side you want to make a full snapshot. It shouldn't matter how long the replication process takes (i.e. how slow fork is), seconds or days, as long as the incoming changes aren't faster than the replication rate. In this case though, your replication target is the disk.

It seems like there should never have to be a full lock of everything.

seiji11y ago

(i.e. how slow fork is)

The problem is fork() is a blocking system call, and the side effects of fork also cause some latency when accessing the COW memory.

there should never have to be a full lock of everything.

There's no software lock involved from the Redis side, but fork() behavior has the effect of a pseudo-lock (because the system call blocks the entire process from running until it returns).

makmanalp11y ago

Thank you, this clears up the issue a bit more in my mind.

mrb11y ago· 1 in thread

Your jemalloc case is using 1564672 kB, this means you only have 764 hugepages. Since the RAM access patterns in the parent process are random, you only need ~1000 accesses to trigger multiple COW (copy-on-write) totaling most of the ~1.5GB. redis-benchmark simulates 50 clients by default and you use a pipelining factor of 4, so you have 200 in-flight requests at any one time. I am not familiar with how the redis server prioritizes the requests to be served, but if it is like most network servers it is not deterministic (whatever comes first from the network event loop). So while on average a request may wait for 200 other ones to be served, a small fraction of them will have to wait longer because of this non-deterministic factor. It is not unimaginable that some have to wait for 500, maybe 1000 other requests. Waiting for 1000 requests means waiting for a COW of most of the ~1.5GB which would take a few hundred milliseconds and would absolutely explain your latency spike.

On the other hand your malloc case is using 4kB pages. If a request has to wait for 1000 other in-flight requests to be executed, that means waiting for at most 1000 pages (4MB) to be COW'd, which would take on the order of 0.1-1.0 milliseconds. This is why the latency is much lower.

tl;dr: smaller pages (4kB vs 2MB) allow finer granularity of the COW mechanism, and lead to lower latencies.

antirez11y ago

This was my first thought as well, but actually this is not what is happening AFAIK, and the performance hit is likely due to inefficient huge page allocation. There are reasons I believe this, but I'm actually checking in a more systematic way right now before saying random things.

EDIT: you were exactly right. This is what happens, there are 50 clients in the benchmark, with many queued requests, so indeed since the benchmark is designed to touch all the keys evenly, what happens is that every client served in a given event loop cycle has a big chance to get a page fault. This seemed unrealistic to me, since I saw the spike in a single event-loop cycle, but it is how is working actually. Thanks!

diem_perdidi11y ago· 1 in thread

I find the back-and-forth ("<this is wrong>", "EDIT", "<wrong advice>"...) to be quite interesting and instructive, better than just editing the whole post with the 'correct' explanation and workaround. Thanks for that!

sswaner11y ago

I agree! I also appreciate the effort over the past few days to dig into this after it was originally reported. It is fascinating to watch the analysis unfold.

msandford11y ago· 1 in thread

This is why open-source software is so rad. I suspect that you'd never see an Oracle customer complaining publicly and then an Oracle engineer publicly tracking down the source of their complaints.

mjfisher11y ago

I'm a huge fan of antirez in particular for this. So many other devs would have shrugged off the occasional latency as acceptable, or a poor measure of true performance, or blamed it on nebulous EC2 performance issues. Which happened, to a greater or less extent. But three blog posts later he's still digging in to the root causes and has eventually (through no small effort) come up with the cause, and a fix.

Open source is good. Its individuals like antirez who make it great.

gpvos11y ago

PTE = Page Table Entry

Erwin11y ago

I had a similar transparent huge pages issues some time back -- the symptoms were incomprehensible spikes in CPU usage sometimes. Apparently this had something to do with memory allocation pressure when my Python processes restarted. However I thought these were perhaps weird power throttling issues which meant CPU was performing at a lower performance (explaining why a normal startup took 1 cpu minute instead of 0:10 seconds).

I should have spotted this by looking at the split between user and system CPU: when the THP problem happened SYS% spiked, but I was looking at a dozen processes pegged at 100% CPU.

I eventually tracked this down via "perf top" which lets you see what everything on the system is doing: both kernel and user space symbols are tracked. I saw significant kernel-space work in an odd memory-related symbol and found someone else with the same problem.

matkam11y ago

I'm having trouble figuring out how to disable transparent hugepages without a reboot:

  ># uname -a
  Linux ip-10-214-7-159 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

  ># grep TRANSPARENT /boot/config-3.2.0-36-virtual
  CONFIG_TRANSPARENT_HUGEPAGE=y
  # CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
  CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y

  ># echo never > /sys/kernel/mm/transparent_hugepage/enabled
  bash: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

j / k navigate · click thread line to collapse

44 comments

32 comments · 9 top-level

antirez11y ago· 15 in thread

EDIT: Oh wait... since the problem is huge pages, this is MUCH better, since we can disable it. And I just verified that it works:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This means, free latency upgrade for all the Redis users out there during fork... (more or less 20x in my tests) just with this line above.

sandGorgon11y ago

this is a recommended setting even on Oracle [1] [2] and Couchbase [3]

the doc recommends adding "transparent_hugepage=never" to the kernel boot line in the "/etc/grub.conf" file.

[1] http://oracle-base.com/articles/linux/configuring-huge-pages...

[2] https://support.oracle.com/epmos/faces/DocContentDisplay?id=...

[3] http://blog.couchbase.com/often-overlooked-linux-os-tweaks

antirez11y ago

Yes, never reported by Redis users so far, maybe since in Redis it does not have the same impact it has in other databases, but a more subtle latency issue apparently.

antirez11y ago

evan_miller11y ago

echo never > /sys/kernel/mm/transparent_hugepage/defrag

antirez11y ago

Awesome hint, I'll try it. However if huge pages means 2MB copy-on-write for each fault, you want still them turned off for Redis. Also does not help with jemalloc memory reclaiming.

nonane11y ago

antirez11y ago

1 more reply

zwily11y ago

Disabling transparent hugepages has been common advice in the Postgres world for awhile, and for the same reason. I suspect anything that forks a lot should have them off.

jsmthrowaway11y ago

MongoDB, too, which also requests disabling NUMA since MongoDB has no idea what NUMA is.

xtacy11y ago

Apparently, similar issues (+ fixes) also apply for Hadoop workloads: http://structureddata.org/2012/06/18/linux-6-transparent-hug...

daviesliu11y ago

nteon11y ago

After the latency spike does the smaps output still list the same amount of AnonHugePages?

antirez11y ago

Can't test it right now, but seems smart, to check if they got split into smaller pages.

2 more replies

Ono-Sendai11y ago

Why do you fork?

seiji11y ago

See the "How it works" section under "Snapshotting" at http://redis.io/topics/persistence

1 more reply

jbert11y ago· 3 in thread

Dumb question - if problem is a latency spike in the parent, any chance you can continue serving in the child and let the parent do the BGSAVE (and take the latency spike)?

(The child should be able to inherit all open fds?)

nonane11y ago

antirez11y ago

I'm not sure but probably both sides are blocked when copying the shared PTEs.

amluto11y ago

This issue has nothing to do with copying the PTE [1] and everything to do with copying the data.

[1] Minor nitpick: for hugepages, there aren't actually any PTEs involved, but that's really just a terminology issue. It's actually an entry in a higher level table in the tree.

1 more reply

makmanalp11y ago· 2 in thread

Pardon my ignorance - but isn't this a problem that's existed in database software in general, and one that's been dealt with in the past?

It seems like there should never have to be a full lock of everything.

seiji11y ago

(i.e. how slow fork is)

The problem is fork() is a blocking system call, and the side effects of fork also cause some latency when accessing the COW memory.

there should never have to be a full lock of everything.

There's no software lock involved from the Redis side, but fork() behavior has the effect of a pseudo-lock (because the system call blocks the entire process from running until it returns).

makmanalp11y ago

Thank you, this clears up the issue a bit more in my mind.

mrb11y ago· 1 in thread

tl;dr: smaller pages (4kB vs 2MB) allow finer granularity of the COW mechanism, and lead to lower latencies.

antirez11y ago

diem_perdidi11y ago· 1 in thread

sswaner11y ago

I agree! I also appreciate the effort over the past few days to dig into this after it was originally reported. It is fascinating to watch the analysis unfold.

msandford11y ago· 1 in thread

This is why open-source software is so rad. I suspect that you'd never see an Oracle customer complaining publicly and then an Oracle engineer publicly tracking down the source of their complaints.

mjfisher11y ago

Open source is good. Its individuals like antirez who make it great.

gpvos11y ago

PTE = Page Table Entry

Erwin11y ago

I should have spotted this by looking at the split between user and system CPU: when the THP problem happened SYS% spiked, but I was looking at a dozen processes pegged at 100% CPU.

matkam11y ago

I'm having trouble figuring out how to disable transparent hugepages without a reboot:

  ># uname -a
  Linux ip-10-214-7-159 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

  ># grep TRANSPARENT /boot/config-3.2.0-36-virtual
  CONFIG_TRANSPARENT_HUGEPAGE=y
  # CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
  CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y

  ># echo never > /sys/kernel/mm/transparent_hugepage/enabled
  bash: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

j / k navigate · click thread line to collapse