For those who are interested, on the first Wednesday of each month, I write a blog post on our availability. Most recent one is here: https://github.blog/2021-03-03-github-availability-report-fe...
Thank you for not doing that.
Even so, it's always possible for an engineer to submit a schema change which is detrimental to performance. For example, dropping an important index, or changing it such that some necessary column is no longer present. Linters simply cannot catch some classes of these problems, as they're application/workload-specific. Usually they must be caught in code review, but people make mistakes and could approve a bad change.
Disclosure: I'm the author of Skeema, but have not worked for or with GitHub in any capacity.
[1] https://github.com/github/gh-ost
[2] https://github.blog/2020-02-14-automating-mysql-schema-migra...
I was futzing around with the description for a PR and hitting save wouldn't update it, yet clicking edit would show the text I expected to see.
Suspecting something was up, I checked GitHub Status, but it was green across the board. Assuming enough other people hit the same chain of events, could it provide a reliable enough indicator of an issue?
Sure, my previous decent-sized company (~1000+ devs) had that exact metric available.
Visits to the status page, generally, that is. Now whether you could actually correlate that to an increase in errors for a particular component, not so much ;)
I'm sure it's totally feasible, but it requires a certain amount of discipline to keep logging/metrics standards consistent across all your applications.
Even worse, some applications would return a shared error page, but internally it was logged, I believe, as a 301 redirect until someone spotted it :)
> Now whether you could actually correlate that to an increase in errors for a particular component, not so much ;)
Yep, makes sense. I was picturing a broad "Something Bad Happened, Go Investigate" notification. But I imagine the sensitivity would have to be tuned, especially to account for massive traffic increases from places like HN.
> Even worse, some applications would return a shared error page but internally, I believe it was logged as a 301 redirect until someone spotted it :)
Yikes!
I wonder if reliability has become less of a priority. As somebody with little to no experience running things at scale, I find myself attributing this to some form of “move fast and break things”.
Often git operations were unaffected though.
That was the case when they were the small and hungry startup.
Meanwhile, they've been acquired by a giant corporation with a less-than-stellar reputation for reliability and quality. So it's actually most likely a case of "move slow and break things".
That is unfair! Minesweeper never crashed, and the print spooler is not up for debate here ;)
What do other folks use to avoid this situation? Have a Gitlab instance or similar that you can pull from instead for CI?
1. Use as little of the configuration language provided by the CI as possible (prefer shellscripts that you call in CI instead of having each step in a YAML config for example)
2. Make sure static content is in a Git repository (same or different) that is also available on multiple SCM systems (I usually use GitHub + GitLab mirroring + a private VPS that also mirrors GitHub)
3. Have a bastion host for doing updates, make CI push changes via the bastion host, and give at least four devs access to it (if you're at that scale), requiring multisig of 2 of them to access
Now when the service goes down, you just need 2 developers to sign the login for the bastion host, then manually run the shellscript locally to push your update. You'll always be able to update now :)
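The mirroring in step 2 can be scripted in a few lines. A minimal sketch, assuming remotes named `github`, `gitlab`, and `vps-backup` (the names are placeholders for whatever your setup uses):

```shell
# Push the full ref namespace to every configured mirror, so no single
# host outage locks you out. Remote names here are assumptions.
mirror_all() {
    for remote in github gitlab vps-backup; do
        if git remote get-url "$remote" >/dev/null 2>&1; then
            # --mirror pushes branches, tags, and deletions in one go.
            git push --mirror "$remote"
        else
            echo "skipping '$remote' (remote not configured)" >&2
        fi
    done
}
```

Run it from a post-commit hook or a cron job so the mirrors never drift far behind the primary.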
Multiple remotes can help and are certainly something you should have as a backup. However, I don't think they solve the root cause, which is how the CI is configured.
I'm a firm proponent of keeping your CI as dumb as possible. That's not to say unsophisticated; I mean it should be decoupled as much as possible from the how of the actions it's taking.
If you have a CI pipeline that consists of Clone, Build, Test, and Deploy stages, then I think your actual CI configuration should look as close as possible to the following pseudocode:
    stages:
      - clone: git clone $REPO_URL
      - build: sh ./scripts/build.sh
      - test: sh ./scripts/test.sh
      - deploy: sh ./scripts/deploy.sh
Each of these scripts should be something you can run on anything from your local machine to a hardened bastion, at least given the right credentials/access for the deploy step. They don't have to be shell scripts; they could be npm scripts or makefiles or whatever, as long as all the CI is doing is calling one with very simple or no arguments. This doesn't rule out using CI-specific features, such as an approval stage. Just don't mix CI-level operations with project-level operations.
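As a concrete sketch of what one of those scripts could look like (paths and variable names here are illustrative, not from any particular project):

```shell
# scripts/build.sh (illustrative) -- a CI-agnostic build step. Inputs
# come from the environment with sane defaults, so the exact same
# invocation works on a laptop, a bastion host, or any CI runner.
build() {
    out_dir="${BUILD_DIR:-build}"
    mkdir -p "$out_dir"
    # Stand-in for your real compile/package step:
    echo "build stamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$out_dir/stamp.txt"
    echo "artifact written to $out_dir/stamp.txt"
}
```

The CI stage then reduces to `sh ./scripts/build.sh`, and a developer debugging a broken pipeline runs exactly the same command locally.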
As a side benefit this helps avoid a bunch of commits that look like "Actually really for real this time fix deployment for srs" by letting you run these stages manually during development instead of pushing something you think works.
More importantly though, it makes it substantially easier to migrate between CI providers, recover from a CI/VCS crash, or onboard someone who's responsible for CI but maybe hasn't used your specific tool.
Or take your local copy and use git-fu to create a bare repo of it that you can compress and put somewhere like S3. Then download it in CI and check out from that.
Or just tarball your app source (who cares about git) and do the same: S3, with a direct path to the asset.
All of this is potentially useless info, though. Hard to say without understanding how your CI works. If all you need is the source code, there are a half dozen ways to get it into CI without git.
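The bare-repo snapshot takes only a couple of git commands. A sketch, with the actual upload left as a comment since it depends on your storage (the bucket path is a placeholder):

```shell
# Snapshot: a bare clone has the full history but no working tree, so
# it compresses down to something cheap to park in object storage.
snapshot_repo() {
    src="$1"; out="$2"   # "$out" is a name relative to the current dir
    git clone --quiet --bare "$src" "$out.git"
    tar -czf "$out.tar.gz" "$out.git"
    rm -rf "$out.git"
    # e.g.: aws s3 cp "$out.tar.gz" "s3://your-bucket/repo-snapshot.tar.gz"
}

# Restore: unpack the bare repo, then clone a working tree out of it.
restore_repo() {
    tarball="$1"; dest="$2"   # assumes the tarball sits in the current dir
    tar -xzf "$tarball"
    git clone --quiet "${tarball%.tar.gz}.git" "$dest"
}
```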
In 2021, basic business continuity plans for software companies should incorporate these sorts of concerns. You should have a published procedure somewhere that a person could follow for producing the final build artifacts of your software on any machine once backups are made available. Situations like these are why I check in 100% of my dependencies to source control as well.
The anti-pattern to watch out for is long, complex scripts that live in your CI system’s config file. These are hard to test and replicate when you need to.
Too many times I've suggested that people self-host, or at least keep a self-hosted backup, but once again some think 'going all in on GitHub' is worth it. (It really is not.)
Don't read too much into it and comment freely as normal. In the end, it's just internet points.
Gitlab, mirrored repo basically.
Don't use Microsoft?
Where is the acknowledgment of a problem, root-cause analysis, and followup for new practices and engineering to prevent issues? Who is responsible for these issues and what are they doing to make it right? What positions are you hiring for _right now_ to get to work making your service reliable?
I don't know how many times [0] I have to say this: just get a self-hosted backup rather than 'going all in on GitHub' or 'centralising everything'.
With that out of the way... GH has had a lot of issues in recent months, more than in the past. I would hope those things are on a road to being fixed.
Is there a pass-through proxy for git? Or a leader-follower arrangement that is nice, with a proxy server?
You can set up a cronjob to sync them, or some have built-in config to do the mirroring [4].
I used Google's mirroring option before. It was fine, but we never had to use it (local copies were sufficient when GH was slow one day).
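For the cronjob route, a minimal sketch of the sync step, assuming the local copy was made with `git clone --mirror` and has a second remote named `backup` (both assumptions):

```shell
# Sync a mirror clone: pull everything from the primary, replay it onto
# the fallback host. Repo path, remote name, and schedule are placeholders.
sync_mirror() {
    repo="$1"
    git -C "$repo" fetch --prune --quiet   # fetch all refs from the primary
    git -C "$repo" push --mirror backup    # mirror them onto the fallback
}
# crontab example (every 15 minutes):
# */15 * * * * /usr/local/bin/sync-mirror.sh /srv/mirrors/myrepo.git
```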
[1] https://cloud.google.com/source-repositories
[2] https://aws.amazon.com/codecommit/
[3] https://azure.microsoft.com/en-us/services/devops/repos/
[4] https://cloud.google.com/source-repositories/docs/mirroring-...
Believe it or not, we have higher service availability hosting GitLab ourselves than GitHub does.
They could use some dogfooding, and a new website.
My heart can't handle another rollercoaster of unicorns for long...