We host our status site on Heroku to ensure its availability
during an outage. However, during our downtime on Tuesday
our status site experienced some availability issues.
As traffic to the status site began to ramp up, we increased
the number of dynos running from 8 to 64 and finally 90.
This had a negative effect since we were running an old
development database addon (shared database). The number of
dynos maxed out the available connections to the database
causing additional processes to crash.
Ninety dynos for a status page? What was going on there?
As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led us to add dynos as quickly as we could, since at the time it appeared that CPU burn was a significant cause of the slowness. That CPU burn was in part caused by all the dynos repeatedly crashing. That's how we ended up going from 8 to 90.
Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.
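A rough sketch of why adding dynos made things worse: the number of connections grows linearly with the dyno count, while the database's connection cap stays fixed. The cap of 20 and the single connection per dyno are assumed numbers for illustration only, not figures from the post-mortem.

```python
# Assumed values for illustration; the real limits are not given in the thread.
ASSUMED_CONNECTION_LIMIT = 20
ASSUMED_CONNS_PER_DYNO = 1


def connections_needed(dynos, conns_per_dyno=ASSUMED_CONNS_PER_DYNO):
    """Total database connections requested by a given number of dynos."""
    return dynos * conns_per_dyno


def max_dynos(connection_limit=ASSUMED_CONNECTION_LIMIT,
              conns_per_dyno=ASSUMED_CONNS_PER_DYNO):
    """Largest dyno count that still fits under the connection limit."""
    return connection_limit // conns_per_dyno


def over_limit(dynos, connection_limit=ASSUMED_CONNECTION_LIMIT):
    """True if this many dynos would exhaust the database connections."""
    return connections_needed(dynos) > connection_limit


if __name__ == "__main__":
    # The dyno counts mentioned in the post: 8, then 64, then 90.
    for dynos in (8, 64, 90):
        print(dynos, connections_needed(dynos), over_limit(dynos))
```

Under these assumptions, every dyno past `max_dynos()` just adds another process that fails to get a connection and crashes.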
http://product.voxmedia.com/post/25113965826/introducing-syl...
30 kRPM is 500 hits/sec. Nginx will serve >2000/sec from an m1.small. For S3 that is about the equivalent of a mosquito fart.
Digging into their fix, they disabled automatic failover -- so all DB failures will now require manual intervention. While that addresses this particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Their MySQL replica's misconfiguration upon switching masters is also tied to their (stopgap) approach to preventing the hot failover. So, the second problem was due to a misuse/misunderstanding of maintenance-mode.
How is it possible that the slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?
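One item such a checklist might contain is an automated check of each replica after failover. Below is a minimal sketch, assuming the row returned by MySQL's `SHOW SLAVE STATUS` has already been fetched into a dict; the function name, host names, and lag threshold are hypothetical, and this is not what GitHub actually runs.

```python
def check_replica(status, expected_master, max_lag_seconds=60):
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` is a dict keyed by the column names MySQL reports
    (Master_Host, Slave_IO_Running, Slave_SQL_Running,
    Seconds_Behind_Master). An empty list means the replica looks healthy.
    """
    problems = []

    # The failure from the post-mortem: replica pointed at the wrong master.
    actual_master = status.get("Master_Host")
    if actual_master != expected_master:
        problems.append(
            "replicating from %r, expected %r" % (actual_master, expected_master)
        )

    # Both replication threads must be running.
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append("%s is %r" % (thread, status.get(thread)))

    # MySQL reports NULL (None here) when lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("replication lag unknown")
    elif lag > max_lag_seconds:
        problems.append("replication lag is %d seconds" % lag)

    return problems


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    row = {
        "Master_Host": "db-old.example.com",
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 0,
    }
    print(check_replica(row, expected_master="db-new.example.com"))
```

Run against every replica right after a failover and again on a schedule, a check like this would flag a wrong master within minutes rather than a day.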
There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!
And really, just to get a dig in here, I believe Arrington shares a big part of the blame for this state of affairs with all of his Dvorak-caliber ignorant harping about Twitter back in the day.
Seriously, why would a status page need to query a db?
Where else would you put it?
Here's sort of the seminal post on the matter in the mysql community: http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-s...
Though it turns into an MMM pile-on, the tool doesn't matter so much as the scenarios. Automated failover is simply unlikely to make things better, and likely to make things worse, in most scenarios.
I've been in a couple of environments in which developers have successfully rolled out automated database failover, and my takeaway is that it's usually not worth the cost. With very, very few exceptions, most organizations can take the several minutes of downtime needed to do a manual failover.
In general, when rolling out these operational environments, they are only ready when you've found and demonstrated 10-12 failure cases, and come up with workarounds.
In other words - if you can't demonstrate how your environment will fail, then it's not ready for an HA deployment.
ouch!
In GitHub we trust. I can't imagine putting my code anywhere else right now.
Once we got the kinks worked out it has been performing amazingly! Wonder if GitHub looked into this kind of a setup before selecting the cluster they did.
The two main issues we encountered both had to do with search for products/categories on our sites. The first was that Galera/WSREP doesn't support MyISAM replication (It has beta support, but I wouldn't trust it). This meant that we had to transition our fulltext data to something else. The something else in this case was Solr which has been a much better solution anyway (fulltext based search was legacy anyway so this I can kind of count as a win).
The second issue, and the one that was causing random OOM crashes, was partly due to a bug and partly due to the way the developer responsible for the search changes implemented things. The bug part is that Galera doesn't specifically differentiate between a normal table and a temp table. When you have very, very small/fast temporary tables that are created and truncated before the creation of the table is replicated across the cluster, it can leave some of these tables open in memory (memory leak whoo!). We were able to fix this and have been happy ever since.
If there's any interest I can do a larger writeup about actual implementation of the cluster, caveats and the like.
Databases are best for, well, performing relational queries. In the case of commenting on a commit, if you store them only in the repository it becomes non-trivial to ask "show me all of the comments by this user" unless you have an intermediary cache layer (in which case you're back where you started).
In git, logging commits on a file from an author is also a kind of join, and it is surprisingly fast, so using git as a data store is a weird idea that I cannot take out of my head.
- MySQL schema migration causes high load, automated HA solution causes cascading database failure
- MySQL cluster becomes out of sync
- HA solution segfaults
- Redis and MySQL become out of sync
- Incorrect users have access to private repositories!
Cleanup and recovery take time. All I can say is, I'm glad it was not me who had that mess to clean up. I'm sure they are still working on it too!
This brings to mind some of my bad days.. OOM killer decides your Sybase database is using too much memory. Hardware error on DRBD master causes silent data corruption (this took a lot of recovery time on TBs of data). I've been bitten by MySQL master/slave becoming out of sync. That is a bad place to be in.. do you copy your master database to the slaves? That takes a long time even on a fast network.
I run http://CircleCi.com, and so we have upwards of 10,000 interactions with GitHub per day, whether API calls, clones, pulls, webhooks, etc. A seriously seriously small number of them fail. They know what they're doing, and they do a great job.
Wait, why isn't there some caching layer? eg. Generate a static page or use Varnish.
This part makes no sense at all.
At most you're then firing up another 5 dynos (or none) to handle the traffic. 90 is ridiculous.
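For what "generate a static page" could look like in practice, here is a minimal sketch, assuming a job that rewrites one HTML file whenever the status changes, so web requests never touch a database. The function names, page layout, and file path are all hypothetical.

```python
import html
import os
import tempfile
from datetime import datetime, timezone


def render_status(state, message, now=None):
    """Build the full status page as a string; user text is HTML-escaped."""
    now = now or datetime.now(timezone.utc)
    return (
        "<!doctype html>\n"
        "<title>Status</title>\n"
        "<h1>%s</h1>\n"
        "<p>%s</p>\n"
        "<p>Updated %s</p>\n"
    ) % (html.escape(state), html.escape(message), now.isoformat())


def write_atomically(path, content):
    """Write to a temp file in the same directory, then rename over `path`.

    os.replace is atomic on the same filesystem, so the web server
    never serves a half-written page.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    # mkstemp creates the file owner-only; make it readable by the web server.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # Hypothetical output path; point the web server (or an S3 sync) at it.
    write_atomically("index.html", render_status("degraded", "Investigating DB issues"))
```

The resulting file can be served by Nginx or pushed to S3, which is the setup the comments above are arguing for.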