The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.
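For flavor, a crontab entry in that spirit might look like the sketch below; the service name and exact schedule are made up for illustration, not taken from the story:

# Hypothetical crontab entry: restart the leaky service at 03:00 on
# every third day of the month (note: */3 in the day-of-month field
# resets at month boundaries, so it only approximates "every three days").
0 3 */3 * * systemctl restart leaky-service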
We couldn't fix the leak because the team that built it had been let go and we were understaffed, so nobody had the time to learn how it worked. It wasn't a critical enough piece of infrastructure to justify a rewrite, but a few of our features depended on it.
Turns out we had accidentally stretched one server VLAN too wide, to roughly 600 devices in a single VLAN behind a single switch. The servers had more-or-less all-to-all traffic, and that was enough to generate so many ARP requests and replies that the switch's supervisor policer started dropping them at random; after ten failed retries for one server, the switch just gave up and dropped it from the ARP table.
Of course the control plane policer is global for the switch, so every device connected to the switch was susceptible, not just the ones in the overextended VLAN.
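To get a feel for the scale, here is a rough back-of-envelope sketch in Node; the one-minute refresh interval is an assumption for illustration, not a number from the story:

// Back-of-envelope ARP load for an all-to-all VLAN (assumed numbers).
const hosts = 600;            // devices sharing the VLAN
const refreshSeconds = 60;    // assumed ARP cache revalidation interval
const requestsPerSecond = (hosts * (hosts - 1)) / refreshSeconds;
console.log(requestsPerSecond); // ~5990 ARP requests/sec, each one broadcast

Control-plane policers are often budgeted for only a few hundred ARP packets per second, so random drops at anything like that volume would be unsurprising.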
It turned out that the Debian database host had stale ARP entries (an IP address pointing at a MAC address that no longer existed), caused by frequent reuse of the same IP addresses.
Debian has a default ARP cache size that's larger than Amazon Linux's (I think the limit is disabled entirely on AL?).
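On a Linux box you can inspect the neighbor (ARP) cache and its size limits directly; a quick sketch, with eth0 standing in for whatever interface is involved:

ip neigh show                              # list cached IP-to-MAC mappings
ip neigh flush dev eth0                    # clear entries on one interface
sysctl net.ipv4.neigh.default.gc_thresh3   # hard upper bound on cache entries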
As for the tooling we used to track it down, it was tcpdump. We saw SYNs getting sent, but no ACKs coming back. A few more tcpdump flags (-e shows the hardware addresses) and we discovered the mismatched MAC addresses.
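Something along these lines; the interface and database IP are placeholders:

# Watch the handshake toward the DB host; -e prints the link-level (MAC)
# addresses on each frame, -n skips name resolution.
tcpdump -i eth0 -e -n 'host 10.0.0.5 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0'

With -e, the destination MAC on the outgoing SYNs is visible, which is what exposes a stale ARP entry.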
// Fix Slow Memory Leaks
// Exit with a nonzero code after 24 hours; a process supervisor
// (systemd, pm2, etc.) restarts the process with a fresh heap.
setTimeout(() => process.exit(1), 1000 * 60 * 60 * 24);