
0 points · floatingatoll · 5y ago

If your oldest request was queued 5+ seconds ago in a near-realtime system (such as Slack), CPU usage isn't your biggest problem.

Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.
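To make the point above concrete, here is a minimal sketch of a scaling decision that checks the age of the oldest queued request before it looks at CPU. This is not Slack's autoscaler; the `ClusterStats` type, the `scaling_decision` function, and the CPU thresholds are all hypothetical, and only the 5-second figure comes from the post.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    # Hypothetical metrics snapshot; field names are illustrative only.
    cpu_utilization: float       # 0.0 to 1.0, averaged across the cluster
    oldest_request_age_s: float  # seconds the oldest queued request has waited


def scaling_decision(
    stats: ClusterStats,
    cpu_low: float = 0.30,         # assumed threshold, not from the post
    cpu_high: float = 0.75,        # assumed threshold, not from the post
    max_queue_age_s: float = 5.0,  # the "5+ seconds" figure from the post
) -> str:
    """Return 'scale_up', 'scale_down', or 'hold'."""
    # Queue age is checked first: a stale queue blocks any downsizing,
    # no matter how idle the CPUs look.
    if stats.oldest_request_age_s >= max_queue_age_s:
        return "scale_up"
    if stats.cpu_utilization >= cpu_high:
        return "scale_up"
    if stats.cpu_utilization <= cpu_low:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    # Low CPU but a 6-second-old request: a CPU-only rule would downsize here.
    print(scaling_decision(ClusterStats(cpu_utilization=0.10, oldest_request_age_s=6.0)))
```

The ordering is the design choice: workers that are blocked or starved can show low CPU while requests pile up, so the queue-age check has to be able to veto a scale-down.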


3 comments · 1 top-level

nicoburns · 5y ago

> The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

What's the first?

floatingatoll (OP) · 5y ago

Non-randomized wallclock integers.

For example: “sleep 60 seconds”, “cron 0 * * * * command”, “X-Retry-After: 300”

Found in: recurring jobs, backoff algorithms, oauth tokens.

Found in: ops-created tasks, dev-released software.
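One common way to randomize fixed intervals like the ones listed above is to add jitter. The sketch below is an illustration under assumptions, not something from the thread: `jittered_delay` and `cron_minute_offset` are hypothetical helper names, and the 20% spread is an arbitrary choice.

```python
import random
import zlib


def jittered_delay(base_seconds: float, spread: float = 0.2, rng=random) -> float:
    """Pick a delay uniformly within +/- spread of the base interval.

    With base_seconds=60 and spread=0.2 the result lies in [48, 72], so a
    fleet of clients that would all wake at the same second spreads out.
    """
    low = base_seconds * (1 - spread)
    high = base_seconds * (1 + spread)
    return rng.uniform(low, high)


def cron_minute_offset(job_name: str, period_minutes: int = 60) -> int:
    """Derive a stable per-job minute from the job name.

    The same job always gets the same minute, but different jobs no longer
    all fire at minute 0 of the hour.
    """
    return zlib.crc32(job_name.encode("utf-8")) % period_minutes


if __name__ == "__main__":
    print(f"sleep for {jittered_delay(60):.1f}s instead of exactly 60s")
    minute = cron_minute_offset("nightly-report")  # hypothetical job name
    print(f"cron: {minute} * * * * command")
```

The random variant suits sleeps and retry delays; the hashed variant suits cron schedules, where the offset needs to stay the same from run to run.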

encoderer · 5y ago

I'm building something at Cronitor to help detect those hot-spots! If you want to learn more, email me: shane at cronitor.io

