(Edited now that the status page has been updated).
I was under the impression that GitLab uses gitlab.com for their own work. Surely someone would have noticed within seconds that it was down?
Why have the misleading "updated a few seconds ago" text if it doesn't update on complete failure? :)
The delay in updating the status page is a result of our Incident Management process [0]. We have a Communications Manager on Call (CMOC) who leads communication throughout an incident; one of their responsibilities is updating the status page. The gap between noticing the issue and updating the status page is the time it takes for the CMOC to get alerted, assess the situation, and write the communication that is shared on the status page.
I'm not sure how the "updated a few seconds ago" messages are generated but I'll try to find out once the incident has been resolved.
0 - https://about.gitlab.com/handbook/engineering/infrastructure...
At first glance it looks like everything is operational with no issues.
Also, most alerting systems check multiple times before declaring a public outage; often 2 to 3 failures, some seconds apart, are needed.
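For illustration, here is a minimal sketch of that kind of debounce as an external checker. The URL, interval, and threshold are all placeholder assumptions, not how any particular monitoring product actually works:

```sh
#!/bin/sh
# Minimal external health check: only declare an outage after
# 3 consecutive failures, spaced 30 seconds apart.
# URL, threshold, and interval are placeholders.
URL="https://example.com/health"
THRESHOLD=3
INTERVAL=30

fails=0
while true; do
  if curl -fsS --max-time 5 "$URL" > /dev/null 2>&1; then
    fails=0
  else
    fails=$((fails + 1))
    echo "check failed ($fails/$THRESHOLD)"
  fi
  if [ "$fails" -ge "$THRESHOLD" ]; then
    echo "declaring outage"  # page someone / update the status page here
    fails=0
  fi
  sleep "$INTERVAL"
done
```

A single failed probe is often just a network blip, which is exactly why these systems wait for consecutive failures before going public.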
1. External engineers will start to automate recovery/mitigation processes around your status page if it has real-time status.
2. You now need to bug-test your status page thoroughly because of #1. It basically becomes an actual API (see the sketch below).
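To make #1 concrete, here is the kind of thing people end up writing against a status page. The endpoint, the JSON shape, and the mirror URL are purely hypothetical, invented for illustration and not any real status provider's API:

```sh
#!/bin/sh
# Hypothetical automation built on top of a status page: poll a
# status endpoint and fail over to a mirror when it reports trouble.
# STATUS_URL and the .status field are made up for this sketch.
STATUS_URL="https://status.example.com/api/status.json"

state=$(curl -fsS "$STATUS_URL" | jq -r '.status')
if [ "$state" != "operational" ]; then
  echo "primary degraded ($state), switching remote to mirror"
  git remote set-url origin git@mirror.example.com:team/repo.git
fi
```

Once scripts like this exist in the wild, a bug in the status page (stale data, a wrong state) triggers real failovers, which is why it has to be tested like any other API.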
I guess status pages should now have a way to pull data from a public, crowd-sourced status page?
https://status.gitlab.com/ is updated. Edit: https://status.gitlab.com/pages/incident/5b36dc6502d06804c08...
Maybe some common servers?
Just look at GNOME [0]. They are doing it right.
GitLab is a perfect example. They had database issues and had to restore from backups already.
But I have fat-fingered a lot of self-hosted stuff in my time.
Also, at gitlab.com's scale, the problems they face are very different from those of a typical deployment.
It is like maintaining your own car versus using the train.
On average, if you can fix your car (or hire a good mechanic, i.e. consulting), you will probably have a better experience than when public transport breaks down and you are powerless to do anything about it.
I would rather run a business that depends on my car than on the train.
Yes, I could also fix it if the server were mine, but more than likely I'll be busy doing my actual job (which does not involve fiddling with self-hosted GitLab instances), so I'll take my chances with the GitLab engineering team. They do fix things, and my being busy, asleep, sick, or travelling has no impact on their response. I intend to keep it that way.
Spoken as someone who has never taken a train, I suppose? Transit at scale can handle maintenance much better than a single vehicle and/or mechanic, and it does so proactively and on a schedule. And when things get really bad (catastrophic failure of some component you can't just "fix" on the spot), public transit will organise a backup (a new train or a bunch of buses) to get you to your destination.
Need to do a launch? Build it and push it.
Need to share a change with someone so they can review it? `git diff` and send a patch via email. Want to use a server? Spin up a server, add users and keys, and push up to it.
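As a sketch of that workflow (the repo paths, hostnames, and email addresses are placeholders):

```sh
# Share a change for review as an emailed patch.
git format-patch -1 HEAD                        # writes 0001-....patch
git send-email --to=reviewer@example.com 0001-*.patch

# Or host it yourself: a bare repo on any box you can SSH into.
ssh user@myserver.example.com 'git init --bare ~/repos/project.git'
git remote add origin user@myserver.example.com:~/repos/project.git
git push -u origin main
```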
GitLab, GitHub and these hosted solutions haven't always existed. They're convenient, but not an OMGWTF moment... unless of course you don't have backups.
Can you link the issue please? :)
For context, Prometheus and observability will be handled with Opstrace in the future [0]. I'd like to learn about your use-case and see which troubles you have been running into. Thanks!
grrr... I am stuck with my job now... :(