The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.
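For flavor, a crontab entry in that spirit might look like the sketch below; the service name and exact schedule are made up for illustration, not taken from the story:

# Hypothetical crontab entry: restart the leaky service at 03:00 on
# every third day of the month (note: */3 in the day-of-month field
# resets at month boundaries, so it only approximates "every three days").
0 3 */3 * * systemctl restart leaky-service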
We couldn't fix the leak because the team that built it had been let go and we were understaffed, so nobody had the time to learn how it worked. It wasn't a critical enough piece of infrastructure to justify a rewrite, but a few of our features depended on it.
Turns out we had accidentally stretched one server VLAN too wide, to roughly 600 devices in a single VLAN behind a single switch. The servers had more-or-less all-to-all traffic, and that was enough to generate so many ARP requests and replies that the switch's supervisor policer started dropping them at random; after ten failed retries for one server, the switch just gave up and dropped it from the ARP table.
Of course the control plane policer is global for the switch, so every device connected to the switch was susceptible, not just the ones in the overextended VLAN.
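To get a feel for the scale, here is a rough back-of-envelope sketch in Node; the one-minute refresh interval is an assumption for illustration, not a number from the story:

// Back-of-envelope ARP load for an all-to-all VLAN (assumed numbers).
const hosts = 600;            // devices sharing the VLAN
const refreshSeconds = 60;    // assumed ARP cache revalidation interval
const requestsPerSecond = (hosts * (hosts - 1)) / refreshSeconds;
console.log(requestsPerSecond); // ~5990 ARP requests/sec, each one broadcast

Control-plane policers are often budgeted for only a few hundred ARP packets per second, so random drops at anything like that volume would be unsurprising.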
It turned out that the Debian database host had stale ARP entries (an IP address pointing at a MAC address that no longer existed), caused by frequent reuse of the same IP addresses.
Debian has a default ARP cache size that's larger than Amazon Linux's (I think the limit is disabled entirely on AL?).
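On a Linux box you can inspect the neighbor (ARP) cache and its size limits directly; a quick sketch, with eth0 standing in for whatever interface is involved:

ip neigh show                              # list cached IP-to-MAC mappings
ip neigh flush dev eth0                    # clear entries on one interface
sysctl net.ipv4.neigh.default.gc_thresh3   # hard upper bound on cache entries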
As for the tooling we used to track it down, it was tcpdump. We saw SYNs getting sent, but no ACKs coming back. A few more tcpdump flags (-e shows the hardware addresses) and we discovered the mismatched MAC addresses.
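Something along these lines; the interface and database IP are placeholders:

# Watch the handshake toward the DB host; -e prints the link-level (MAC)
# addresses on each frame, -n skips name resolution.
tcpdump -i eth0 -e -n 'host 10.0.0.5 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0'

With -e, the destination MAC on the outgoing SYNs is visible, which is what exposes a stale ARP entry.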
// Fix Slow Memory Leaks
// Exit with a nonzero code after 24 hours; a process supervisor
// (systemd, pm2, etc.) restarts the process with a fresh heap.
setTimeout(() => process.exit(1), 1000 * 60 * 60 * 24);