Maybe the Github Actions infrastructure isn't run like that.
edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"
Which makes me think a small amount of random issues which happen even though nothing is broken, is normal everywhere. Especially once move things around on a network, there's potential for a lot more random errors.
This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because bit flips, for all intents and purposes, cannot happen. The RAM itself detects and corrects for them.
I wouldn't expect bit flips to be a significant contributor to enterprise problems.
It does require constant tuning and adjustment though.
If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!
If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.
Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
As others have said, follow-the-sun type models do exist, usually staffed by people in their normal working hours (EMEA, Americas, APAC) but this means you've still got to cover the weekend and public holidays (which there are a lot of when you factor in plenty of different countries).
Where you need a quick response you can have a core ops/noc team that looks at things with lower thresholds and shorter windows, and their job is to do the initial triage and then page the appropriate team earlier than they would have been alerted by their own alert thresholds/monitoring.
Actually clicking the button to change the status on a public status page is a whole different topic that becomes very political in certain companies.
But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.
I'm sure you're not in ops. Or in a dev org of a service with decent request rates.
What you're asking for is a service to fail silently. There's no way a service with a decent request rate to have 0 500s. Not when it still sees development.
A 50 year old bank API? Maybe...