GitHub availability this week

(github.com)

203 points · tanoku · 13y ago · 59 comments

46 comments · 15 top-level

cwb71 · 13y ago · 8 in thread

The part of this post that really blew my mind:

  We host our status site on Heroku to ensure its availability
  during an outage. However, during our downtime on Tuesday
  our status site experienced some availability issues.

  As traffic to the status site began to ramp up, we increased
  the number of dynos running from 8 to 64 and finally 90.
  This had a negative effect since we were running an old
  development database addon (shared database). The number of
  dynos maxed out the available connections to the database
  causing additional processes to crash.

Ninety dynos for a status page? What was going on there?

wfarr · 13y ago

At the time of the outage, the status site was seeing upwards of 30,000 requests per minute.

As we scaled up dynos, we would see temporary performance improvements until the status site would stop responding again. In the short term, this led to us massively increasing dynos as quickly as we could, as it appeared that CPU burn was a significant cause of the slowness (at the time). This was in part caused by all the dynos repeatedly crashing. That's how we ended up going from 8 previously to 90.

Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.
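A rough sketch of why adding dynos made things worse, assuming each dyno holds a fixed number of database connections and the shared database caps the total. `CONNECTIONS_PER_DYNO` and `DB_CONNECTION_LIMIT` are made-up values for illustration; the post-mortem does not state the real limits.

```python
# Hypothetical numbers: the post-mortem does not give the real limits.
CONNECTIONS_PER_DYNO = 1   # assumed: one DB connection per dyno process
DB_CONNECTION_LIMIT = 20   # assumed cap on the shared development database


def dynos_without_connection(dynos: int,
                             per_dyno: int = CONNECTIONS_PER_DYNO,
                             limit: int = DB_CONNECTION_LIMIT) -> int:
    """Return how many dynos cannot get their database connections."""
    supported = limit // per_dyno   # dynos the database can actually serve
    return max(0, dynos - supported)


for count in (8, 64, 90):
    starved = dynos_without_connection(count)
    print(f"{count} dynos: {starved} cannot connect")
```

Under these assumed limits, every dyno past the cap has no connection to work with, so scaling out only adds crashing processes.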

ashray · 13y ago

What prevented you from just caching the status page and then refilling the cache manually every X seconds? I'm sure a status that is a few seconds old, given the system-wide meltdown, wouldn't have been an unreasonable compromise?
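A minimal sketch of that idea, assuming the page can be rendered by a single function call. `TTLCache` and `render_status_page` are hypothetical names, and the 5-second TTL is only an example value for "X".

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Cache one rendered page and rebuild it at most once per `ttl` seconds."""

    def __init__(self, render: Callable[[], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._render()   # the only place the backend is hit
            self._expires_at = now + self._ttl
        return self._value


def render_status_page() -> str:
    # Hypothetical stand-in for the real database-backed render.
    return "<h1>All systems operational</h1>"


cache = TTLCache(render_status_page, ttl=5.0)
page = cache.get()   # first call renders; calls within the next 5 s reuse it
```

This version is not thread-safe and caches per process, so a real deployment would probably want a shared cache or a lock around the refresh.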

adgar · 13y ago

30,000 req/minute is 500 qps. That's... just not a lot for a large service.

mbell · 13y ago

Anyone tested S3's static page hosting under heavy load? I would think you could just update the static file as a result of some events fired by your internal monitoring process.
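Something like the following could do the "update the static file on events" part. It is only a sketch: `render_status` and `publish` are hypothetical names, the page layout is invented, and the upload to S3 is left out so the example stays self-contained.

```python
import html
import os
import tempfile


def render_status(services: dict[str, str]) -> str:
    """Render a minimal static status page from {service: state}."""
    rows = "\n".join(
        f"<li>{html.escape(name)}: {html.escape(state)}</li>"
        for name, state in sorted(services.items())
    )
    return f"<!doctype html>\n<title>Status</title>\n<ul>\n{rows}\n</ul>\n"


def publish(services: dict[str, str], path: str) -> None:
    """Write the page atomically so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(render_status(services))
    os.replace(tmp_path, path)   # atomic rename on the same filesystem


# Called by a (hypothetical) monitoring hook whenever a service changes state.
out_dir = tempfile.mkdtemp()
publish({"git": "degraded", "web": "ok"}, os.path.join(out_dir, "index.html"))
```

The serving path then never touches a database; only the monitoring process writes, and only when something changes.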

dustym · 13y ago

We use S3 behind CloudFront with a 1-second max-age to serve The Verge liveblog. It's been nothing but rock solid. We essentially create a static site and push up JSON blobs. See here:

http://product.voxmedia.com/post/25113965826/introducing-syl...

WestCoastJustin · 13y ago

S3 is great for static content. I was taking the AWS ops course and the instructor mentioned some very large organizations redirect their site to S3 when under DDoS so they can remain online. In fact, he said that AWS recommended this solution to them?! Can you fathom someone who is under DDoS, and you tell them, hey, just redirect that our way ;)

moe · 13y ago

"Heavy load"?

30 kRPM is 500 hits/sec. Nginx will serve >2000/sec from a m1.small. For S3 that is about the equivalent of a mosquito fart.
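The arithmetic, spelled out. The 2,000 requests/sec figure is just the lower bound quoted above for Nginx on an m1.small, not a measurement.

```python
REQUESTS_PER_MINUTE = 30_000        # figure quoted in the thread
NGINX_REQUESTS_PER_SECOND = 2_000   # lower bound claimed above for one m1.small

requests_per_second = REQUESTS_PER_MINUTE / 60
utilisation = requests_per_second / NGINX_REQUESTS_PER_SECOND

print(f"{requests_per_second:.0f} req/s")           # 500 req/s
print(f"{utilisation:.0%} of one small instance")   # 25% of one small instance
```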

biot · 13y ago

Use Jekyll and push the site to S3:

https://github.com/mojombo/jekyll/wiki

https://github.com/laurilehmijoki/jekyll-s3#readme

cschep · 13y ago · 5 in thread

Interesting to read about github using MySQL instead of Postgres. Anyone know why? I am just curious because of all the MySQL bashing I hear in the echo chamber.

technoweenie · 13y ago

Mostly because of legacy reasons, at this point.

boundlessdreamz · 13y ago

That sounds like you would have chosen differently if you had to choose now. Is that so?

lonnyk · 13y ago

Do you have a source for this information?

autotravis · 13y ago

They use both, according to Zach Holman (http://zachholman.com/talk/unsucking-your-teams-development-...)

technoweenie · 13y ago

The only postgres we use is from internal Heroku apps. We use Mongo in a few places too.

aaronblohowiak · 13y ago · 4 in thread

If Github hasn't gotten their custom HA solution right, will you?

Digging into their fix, they disabled automatic failover, so all DB failures will now require manual intervention. While this addresses the particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Also, the misconfiguration of their MySQL replica upon switching masters is tied to their (stopgap) approach to preventing the hot failover. So, the second problem was due to a misuse or misunderstanding of maintenance mode.

How is it possible that the slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?
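One item such a checklist could automate is confirming that every replica follows the new master. A minimal sketch, assuming the replica's status has already been fetched as a dict keyed like the output of MySQL's `SHOW SLAVE STATUS`; `check_replica`, the hostnames, and the 30-second lag threshold are hypothetical.

```python
from typing import Mapping


def check_replica(status: Mapping[str, object], expected_master: str,
                  max_lag_seconds: int = 30) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if status.get("Master_Host") != expected_master:
        problems.append(
            f"replica follows {status.get('Master_Host')!r}, "
            f"expected {expected_master!r}"
        )
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append(f"{thread} is {status.get(thread)!r}")
    lag = status.get("Seconds_Behind_Master")
    if not isinstance(lag, int) or lag > max_lag_seconds:
        problems.append(f"replication lag is {lag!r}")
    return problems


# Hypothetical row, shaped like one result of SHOW SLAVE STATUS.
row = {
    "Master_Host": "db-old.example.internal",
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": 0,
}
print(check_replica(row, expected_master="db-new.example.internal"))
```

Run against every replica right after a failover, a non-empty result would flag a replica still pointed at the old master within minutes rather than a day.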

There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!

jaggederest · 13y ago

It blows my mind that they aren't simply using Jekyll to generate and update the status page. I mean... they wrote it, right?

rhizome · 13y ago

I think people tend to overestimate the value of nines to the user. It's chiefly a management/VC/busybody metric that has gained importance mainly due to it being a high level and easy to understand abstraction. "Well how much was it down?" Then they spend zillions on failover software, hardware and talent that could be supplanted by one fewer nine and a simpler architecture.

And really, just to get a dig in here, I believe Arrington shares a big part of the blame for this state of affairs with all of his Dvorak-caliber ignorant harping about Twitter back in the day.

autotravis · 13y ago

"There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!"

Seriously, why would a status page need to query a db?

gsibble · 13y ago

I assume that the status server is not actively checking every Github server/service whenever someone pings it. It probably polls the servers every X seconds. The best place to store that type of data is in a DB.

Where else would you put it?

cagenut · 13y ago · 3 in thread

I'd like to welcome the github ops/dbas to the club of people who've learned the hard way that automated database failover usually causes more downtime than it prevents.

Here's sort of the seminal post on the matter in the mysql community: http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-s...

Though it turns into an MMM pile-on, the tool doesn't matter so much as the scenarios. Automated failover is simply unlikely to make things better, and likely to make things worse, in most scenarios.

ghshephard · 13y ago

Automated database failover is absolutely mandatory for HA environments (as in, there is no way to run a 5 9s system without it) but, poorly done, results in actually reducing your uptime (which is a separate concept from HA).

I've been in a couple of environments in which developers have successfully rolled out automated database failover, and my takeaway is that it's usually not worth the cost. With very, very few exceptions, most organizations can take the downtime of several minutes to do manual failover.

In general, when rolling out these operational environments, they are only ready when you've found and demonstrated 10-12 failure cases, and come up with workarounds.

In other words - if you can't demonstrate how your environment will fail, then it's not ready for an HA deployment.

Xorlev · 13y ago

Every HA deployment I've done, the HA manager inevitably had issues to begin with. It takes time, patience, and a few late nights.

aaronblohowiak · 13y ago

automated failover in the case of too much load is usually not what you want to do. automated failover in the case of hw/network failure is usually what you want to do. differentiating the former from the latter is left as an exercise for the reader.
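One possible take on that exercise, as a sketch only. It assumes several independent observers each probe the primary, and it treats "reachable but not answering queries" as overload rather than death. `Probe`, `Action`, and `decide` are hypothetical names, and the rule itself is an assumption, not anything described in the post-mortem.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FAIL_OVER = "fail over"
    PAGE_HUMAN = "page a human"
    NONE = "do nothing"


@dataclass(frozen=True)
class Probe:
    """One observer's view of the primary (all fields are assumptions)."""
    tcp_connect_ok: bool   # could the observer open a socket at all?
    query_ok: bool         # did a trivial query return before the timeout?


def decide(probes: list[Probe]) -> Action:
    if not probes:
        return Action.PAGE_HUMAN   # no evidence: never act automatically
    unreachable = sum(not p.tcp_connect_ok for p in probes)
    slow = sum(p.tcp_connect_ok and not p.query_ok for p in probes)
    if unreachable == len(probes):
        return Action.FAIL_OVER    # every observer agrees the host is gone
    if unreachable or slow:
        return Action.PAGE_HUMAN   # alive but struggling, or observers disagree
    return Action.NONE


# A loaded primary: the socket opens but queries time out, so no failover.
print(decide([Probe(True, False), Probe(True, False)]))
```

Under this rule a schema migration that only slows the primary down would page someone instead of triggering a failover.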

jluxenberg · 13y ago · 3 in thread

"16 of these repositories were private, and for seven minutes from 8:19 AM to 8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the repository's list of collaborators or team members"

ouch!

_f75i · 13y ago

One of those repos was mine. :( Fortunately it was a fresh Rails app without anything important. However, it does make me rethink the security of storing my code on github.

mckoss · 13y ago

I store proprietary code on github, but I would never recommend storing actual secrets (like keys or passwords).

code0 · 13y ago

I am really curious about the technical reasons how this might have happened.

andrewljohnson · 13y ago · 2 in thread

The lack of any negative response on this thread is a testament both to the thoroughness of the post-mortem, and the outstanding quality of GitHub in general.

In GitHub we trust. I can't imagine putting my code anywhere else right now.

gbog · 13y ago

I like github too but please remember that things come and go. Some time ago it was SourceForge that was hot.

mckoss · 13y ago

... but never as well loved.

druiid · 13y ago · 2 in thread

Well, I have to say... replication related issues like this are why I/we are now using a Galera backed DB cluster. No need to worry about which server is active/passive. You can technically have them all live all the time. In our case we have two live and one failover that only gets accessed by backup scripts and some maintenance tasks.

Once we got the kinks worked out it has been performing amazingly! Wonder if GitHub looked into this kind of a setup before selecting the cluster they did.

aaronblohowiak · 13y ago

any details on the kinks you worked out?

druiid · 13y ago

Sure. Maybe I should do a writeup for it on my blog at some point in the near future :).

The two main issues we encountered both had to do with search for products/categories on our sites. The first was that Galera/WSREP doesn't support MyISAM replication (It has beta support, but I wouldn't trust it). This meant that we had to transition our fulltext data to something else. The something else in this case was Solr which has been a much better solution anyway (fulltext based search was legacy anyway so this I can kind of count as a win).

The second issue and the one that was causing random OOM crashes was partly due to a bug, partly due to the way the developer responsible for the search changes implemented things. The bug part is that galera doesn't specifically differentiate between a normal table and a temp table. When you have very very small/fast temporary tables that are created and truncated before the creation of the table is replicated across the cluster it can leave some of these tables open in memory (memory leak whoo!). We were able to fix for this and have been happy ever since.

If there's any interest I can do a larger writeup about actual implementation of the cluster, caveats and the like.

gbog · 13y ago · 2 in thread

Genuine question: github is built upon git, which is a rock solid system for storing data, and in these reports we read that github relies a lot on MySQL, so... Did the github guys ponder using git as their data store? Just an example: in git one can add comments on commits, would it be possible to use it for the github comment function? Or maybe it is?

holman · 13y ago

Generally, Git will be way too slow for that. Git is typically our bottleneck, since you're dealing with so much overhead and disk access to perform functions.

Databases are best for, well, performing relational queries. In the case of commenting on a commit, if you store them only in the repository it becomes non-trivial to ask "show me all of the comments by this user" unless you have an intermediary cache layer (in which case you're back where you started).
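To make the "all comments by this user" point concrete, here is a small sketch using SQLite as a stand-in for MySQL. The `comments` table, its columns, and the sample rows are invented for illustration and are not GitHub's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (
        id         INTEGER PRIMARY KEY,
        author     TEXT NOT NULL,
        repo       TEXT NOT NULL,
        commit_sha TEXT NOT NULL,
        body       TEXT NOT NULL
    );
    CREATE INDEX comments_by_author ON comments (author);
""")
conn.executemany(
    "INSERT INTO comments (author, repo, commit_sha, body) VALUES (?, ?, ?, ?)",
    [
        ("alice", "repo-a", "1a2b3c", "Looks good"),
        ("bob",   "repo-a", "1a2b3c", "Needs tests"),
        ("alice", "repo-b", "4d5e6f", "Typo in README"),
    ],
)

# One indexed lookup, no matter how many repositories the comments span.
rows = conn.execute(
    "SELECT repo, commit_sha, body FROM comments WHERE author = ? ORDER BY id",
    ("alice",),
).fetchall()
print(rows)
```

If the comments lived only inside each repository, answering the same question would mean opening every repository the user ever commented in.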

gbog · 13y ago

Thanks for answering. Tell me if I'm wrong but MySQL would be behind a caching layer anyway, so the choice would be between cached git or cached git + mysql.

In git, logging commits on a file from an author is also a kind of join, and it is surprisingly fast, so using git as a data store is a weird idea that I cannot take out of my head.

WestCoastJustin · 13y ago · 1 in thread

Here are the makings of a bad week (Monday of all things)

- MySQL schema migration causes high load, automated HA solution causes cascading database failure

- MySQL cluster becomes out of sync

- HA solution segfaults

- Redis and MySQL become out of sync

- Incorrect users have access to private repositories!

Cleanup and recovery takes time, all I can say is, I'm glad it was not me who had that mess to clean up. I'm sure they are still working on it too!

This brings to mind some of my bad days... The OOM killer decides your Sybase database is using too much memory. A hardware error on the DRBD master causes silent data corruption (this took a lot of recovery time on TBs of data). I've been bitten by a MySQL master/slave pair becoming out of sync. That is a bad place to be in... do you copy your master database to the slaves? That takes a long time even on a fast network.

cageface · 13y ago

This kind of thing is one of the main reasons I prefer to do app development instead of backend work now. I don't get calls at 3am any more.

akoumjian · 13y ago · 1 in thread

I would love to know more about this two pass migration strategy.

jnewland · 13y ago

We use https://github.com/soundcloud/large-hadron-migrator/

pbiggar · 13y ago

I know that they have to be apologetic like this, but the simple fact is that GitHub's uptime is fantastic.

I run http://CircleCi.com, and so we have upwards of 10,000 interactions with GitHub per day, whether API calls, clones, pulls, webhooks, etc. A seriously seriously small number of them fail. They know what they're doing, and they do a great job.

jyap · 13y ago

"As traffic to the status site began to ramp up, we increased the number of dynos running from 8 to 64 and finally 90."

Wait, why isn't there some caching layer? E.g., generate a static page or use Varnish.

This part makes no sense at all.

At most you're then firing up another 5 dynos (or none) to handle the traffic. 90 is ridiculous.

dumbluck · 13y ago

This was the awesome kind of explanation about what went wrong and what was learned that I wish everyone would do.

donavanm · 13y ago

Update strategy of master first is interesting. I've always seen it the other way: update standby, flip to standby, verify, update original master. Auto-increment DB keys once again cause horribleness. Nothing new there I suppose. And as mentioned, the multi-dyno + DB-read status page is craaaazy. Why oh why isn't this a couple of static objects? Automagically generate and push if you want. Give 'em a 60-second TTL and call it a day. Put them behind a different CDN & DNS than the rest of your site for bonus points.

lokotecla1 · 13y ago

What is this page for? I'm new.
