TranquilMarmot 3y ago

We used to have a horribly written Node process that was running in a Mesos cluster (using Marathon). It had a memory leak and would start to fill up memory after about a week of running, depending on what customers were doing and how heavily they were hitting it.

The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.

We couldn't fix the leak because the team that made it had been let go and we were understaffed, so nobody had the time to go and learn how it worked to fix it. It wasn't a critical enough piece of infrastructure to rewrite, but it was needed for a few features that we had.

bombolo 3y ago

The USA at some point had an anti-missile system that needed periodic reboots because it was originally designed for short deployments, so the floating point variable for the clock would start to lose precision after a while.

mlindner 3y ago

Fixed point, not floating point. https://en.wikipedia.org/wiki/Fixed-point_arithmetic
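
A rough sketch of the fixed-point drift being discussed. The figures are assumptions taken from the commonly cited write-ups of the Patriot incident, not from this comment: a clock counting tenths of a second, the constant 0.1 truncated to 23 fractional bits, and roughly 100 hours of uptime. The function names are made up for illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23     # assumed number of fractional bits in the register
TICKS_PER_SECOND = 10  # assumed: the clock counts tenths of a second


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 chopped (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after `hours_up` hours of counting ticks."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    for hours in (8, 100):
        print(f"{hours:>3} h up: {clock_drift_seconds(hours):.4f} s of drift")
```

Under these assumptions the error per tick is tiny (about 9.5e-8 s), but it only ever grows with uptime, which is why a reboot resets the problem.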

gcr 3y ago

Floating point clocks do lose precision after long enough time though; see https://randomascii.wordpress.com/2012/02/13/dont-store-that...

Storing coordinates in floating point, for example, is what causes the "farlands" world generation behavior in Minecraft.
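
To make the precision loss concrete, here is a small sketch assuming a hypothetical clock that stores elapsed seconds in a 32-bit float. It measures the gap between a value and the next representable float32 above it; the helper names are illustrative only.

```python
import struct


def as_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def smallest_visible_step(t: float) -> float:
    """Gap between positive finite t and the next float32 above it."""
    bits = struct.unpack("<I", struct.pack("<f", t))[0]
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - as_float32(t)


if __name__ == "__main__":
    for seconds in (1.0, 86_400.0, 1_000_000.0):
        print(f"t = {seconds:>11.1f} s, step = {smallest_visible_step(seconds)} s")
```

After one day (86,400 s) the step is already 1/128 of a second, so a 1 ms increment added to such a clock is simply lost.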

saagarjha 3y ago

Which, of course, led to people dying when the drift was too great.

perihelions 3y ago

Context for those unfamiliar:

https://www-users.cse.umn.edu/~arnold/disasters/patriot.html

https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dha...

https://hn.algolia.com/?query=patriot%20missile (HN threads)

aftbit 3y ago

I once managed a cluster of worker servers with an @reboot cronjob that scheduled another reboot after $(random /8..24/) hours. They took jobs from a RabbitMQ queue and launched Docker containers to run them, but had some kind of odd resource leak that would lead to the machines becoming unresponsive after a few days. The whole thing was cursed, honestly, but that random reboot script got us through for a few more years until it could be replaced with a more modern design.
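
The original script isn't shown, so this is only a guess at its shape: pick a random delay between 8 and 24 hours at boot and hand it to `shutdown -r`. The function names and the choice of minute granularity are assumptions; the sketch builds the command but does not run it.

```python
import random


def pick_reboot_delay_minutes(rng: random.Random,
                              min_hours: int = 8,
                              max_hours: int = 24) -> int:
    """Choose a delay, in whole minutes, between min_hours and max_hours."""
    return rng.randint(min_hours * 60, max_hours * 60)


def reboot_command(delay_minutes: int) -> list[str]:
    """Build (but do not execute) a delayed reboot command."""
    return ["shutdown", "-r", f"+{delay_minutes}"]


if __name__ == "__main__":
    delay = pick_reboot_delay_minutes(random.Random())
    # A real @reboot job would pass this list to subprocess.run().
    print(" ".join(reboot_command(delay)))
```

Randomizing the delay presumably kept the whole cluster from rebooting at the same moment, though the comment doesn't say that explicitly.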

throwawaaarrgh 3y ago

This is a feature in many HTTPDs, WSGI/FastCGI apps, possibly even in K8s. After X requests/time, restart worker process. Old tricks are the best tricks ;)
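
A toy sketch of the "restart after X requests" idea, in the spirit of settings such as gunicorn's `max_requests`. The class and function here are hypothetical and run in a single process; real servers fork a fresh worker process instead of constructing a new object.

```python
from typing import Callable, Iterable


class RecyclingWorker:
    """Handles requests until it has served `max_requests` of them."""

    def __init__(self, handler: Callable, max_requests: int) -> None:
        self.handler = handler
        self.max_requests = max_requests
        self.served = 0

    @property
    def exhausted(self) -> bool:
        return self.served >= self.max_requests

    def handle(self, request):
        if self.exhausted:
            raise RuntimeError("worker should have been replaced")
        self.served += 1
        return self.handler(request)


def serve(requests: Iterable, make_worker: Callable[[], RecyclingWorker]):
    """Route requests through workers, swapping in a fresh one when needed.

    Returns the list of results and the number of workers started.
    """
    worker = make_worker()
    workers_started = 1
    results = []
    for request in requests:
        if worker.exhausted:
            worker = make_worker()  # any leaked state dies with the old worker
            workers_started += 1
        results.append(worker.handle(request))
    return results, workers_started
```

With a limit of 4 requests per worker, 10 requests end up spread across 3 workers, and whatever the first two leaked is gone by the time the third starts.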

AndyMcConachie 3y ago

"Have you tried turning it off and on again?"

aeyes 3y ago

You don't even need that, the kernel OOM killer would take care of this eventually. Unless it's something like Java where the garbage collector would begin to burn CPU.

j3th9n 3y ago

The OOM killer doesn't restart (randomly, unless configured) killed processes; it just kills.
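
Since the kill and the restart are separate jobs, something else has to do the restarting. Below is a minimal supervisor sketch, assuming a hypothetical `supervise` helper and a restart cap; in practice this role is usually filled by an init system or an orchestrator rather than a hand-rolled loop.

```python
import subprocess
import sys

# Demo command that always fails, standing in for a process that got killed.
FAILING_COMMAND = [sys.executable, "-c", "raise SystemExit(1)"]


def supervise(command: list[str], max_restarts: int = 3) -> list[int]:
    """Run `command`, restarting it whenever it exits with a non-zero code.

    Gives up after `max_restarts` restarts and returns every exit code seen.
    On Linux, a process killed by SIGKILL (as the OOM killer does) shows up
    here as return code -9.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(command).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
    return exit_codes


if __name__ == "__main__":
    print(supervise(FAILING_COMMAND, max_restarts=2))
```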

ce4 3y ago

Unless the OOM killer kills the wrong process. Ages ago we had a userspace filesystem (GPFS) that was of course one of the oldest processes around and it consumed lots of RAM. When the OOM killer started looking for a target, of course one of the mmfsd processes was selected and it resulted in instantaneous machine lockup (any access to that filesystem would be blocked forever in the system call, which depended on the userspace daemon to return, alas never returning). Was funny to debug.

pkolaczk 3y ago

If it's deployed in K8s, it would be restarted automatically after dying.
