you also don't want your automation guessing at what the problem is, or what the effects are. you want real info from a real person even if it isn't given to you the millisecond you look for it.
this is why status pages aren't updated by automation. if they're updated by a person, you know that people know about the problem and that people are working on it, which is good. but while they figure out what's going on, you see a "green" status page.
this is normal.
(this is for future readers, more than the person I am replying to.)
Approached that way, a status page is almost useless, since it is not reliable and is only updated after I've already found out about the problem via other sources.
I am perfectly happy with a status page that shows the, mm, status of the service. It could be as simple as "not reachable", "slower than usual", or similar generic information (a traffic light). I disagree that a status page has to show the why of the error, although of course that would be nice.
Actually, it looks like the metrics part of Reddit's status page broke over two weeks ago.
With proper reporting it's trivial to know which subsystem is experiencing problems, if any. It doesn't have to be very granular, just "normal", "experiencing issues", "offline". If reporting doesn't work, you should be alerted it doesn't work, and if alerting doesn't work, there needs to either be out-of-band alerting for that or someone monitoring the status at all times.
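As a minimal sketch of what that coarse-grained reporting could look like, here's a toy mapping from per-subsystem check results onto the three states. The subsystem names, error-rate threshold, and check results are made up for illustration; real values would come from your monitoring system:

```python
from enum import Enum

class Status(Enum):
    NORMAL = "normal"
    ISSUES = "experiencing issues"
    OFFLINE = "offline"

def subsystem_status(error_rate: float, reachable: bool) -> Status:
    """Map raw check results onto the three coarse states."""
    if not reachable:
        return Status.OFFLINE
    if error_rate > 0.05:  # arbitrary example threshold
        return Status.ISSUES
    return Status.NORMAL

# Hypothetical check results per subsystem; in practice these would be
# produced by probes or metrics queries, not hard-coded.
checks = {
    "api":      {"error_rate": 0.01, "reachable": True},
    "webhooks": {"error_rate": 0.12, "reachable": True},
    "pages":    {"error_rate": 0.00, "reachable": False},
}

for name, result in checks.items():
    print(name, "->", subsystem_status(**result).value)
```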
Manual overrides for status pages should exist for when the automation doesn't work of course.
At my last job we had a big Grafana screen in the office that we watched, and we usually saw problems before the alerting kicked in - it had about a minute of delay. Outside office hours, the on-call engineer received alerts. It wasn't technically or organisationally complex.
"The whole point" (as you put it) of status pages was to publish high-level monitoring data to users. The monitoring process should occur outside the system that is being monitored, perhaps even on a different cloud.
Eventually, many companies realized this revealed expensive SLA violations and ended that level of transparency.
Your status page can and should report important metrics to users, like elevated error rates. Most status pages used to.
As a user, you often don't know if the vendor's system is really down or if there's something wrong with your own system.
At least that's what AWS Health[1] looks like to me.
Seems like a huge spike in load.
Spikes in request latency can be caused by a bunch of things, including more traffic, but in my experience it's usually a missing optimization for some data structure that only gets triggered after N items, or a new deploy containing code that wasn't as optimal as its author thought. This is especially true in distributed systems, where sub-optimal code in one part can cascade performance issues to other parts of the system.
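As a toy illustration of the "triggered after N items" case (purely illustrative, not tied to any particular incident): a linear scan over a list is invisible at small sizes and only shows up in latencies once the collection grows, whereas a set lookup stays flat.

```python
import time

def timed_lookups(container, probes):
    """Time membership checks: O(n) per lookup for a list, O(1) for a set."""
    start = time.perf_counter()
    for p in probes:
        _ = p in container
    return time.perf_counter() - start

for n in (1_000, 100_000, 1_000_000):
    items = list(range(n))
    probes = range(0, n, max(1, n // 1_000))  # ~1000 membership checks
    as_list = timed_lookups(items, probes)
    as_set = timed_lookups(set(items), probes)
    print(f"n={n:>9}: list {as_list:.4f}s vs set {as_set:.4f}s")
```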
How would I know? If my website doesn't have any monitoring and I use a payment system, shouldn't I automatically be notified when that payment system is down? What if it's down for a week? I think service-providing companies should always announce outages and even suspected outages.
For this reason I believe they would not be pointless if they were simply status pages, instead of "incident response pages". My hypothesis for why they are this way instead is that it would be too much transparency for some companies, for PR and legal reasons.
Those GitHub badges... they are as ugly as it gets.
But soon after, the legal/executive team apparently took ownership of them; the status pages no longer automatically show downtime/response time, and notice of when things are actually down can take a while.
So I think it's nice that there is at least one place where I can see if it's a problem on my end, or if it's global. It helps to remove some frustration at least.
However, I have a feeling that most companies are set up to download 50 MiB of dependencies on every run, so a website being down breaks the entire thing.
Now, 30 minutes later, I've refreshed the issue and see that my reply and the comment I was replying to (by another user) are both gone. Hopefully it's eventually consistent and these comments will re-appear later.
{ "code": 500, "message": "internal server error" }
Is anyone having any luck? Any workaround to fix it?
EDIT: Seems to be a routing issue. I've enabled a UK VPN and it's working fine now.
For engaged, happy engineers it's the equivalent of getting a surprise snow day. When you're grown up and have to go dig your car out of the snow, it's just a normal day with extra steps.
Not if you self-host Git
Self-hosting everything else GitHub does is harder, which is why they are building out all of those things: they don't want people to be able to move elsewhere so easily.
Hopefully these constant outages make more developers pissed off that issues are not stored in git as well, and get them to start working on tooling to solve this shitty problem once and for all.
P2P/Local First software for everyone! \o/
You can self-host the whole of GitHub, can't you?
edit: oopsie I misread.
Not a huge problem, unless it lasts for hours or, gasp, days.