Coinbase Incident Post Mortem: June 25–26, 2019 (opens in new tab)

(blog.coinbase.com)

127 pointsjessepollak6y ago28 comments

28 comments

21 comments · 8 top-level

kvlr6y ago· 5 in thread

TLDR: MongoDB

Sounds like a maintenance error and an application design error that led to a flood of traffic. Doesn't sound MongoDB specific.

mbell6y ago

I don't think there is enough information to come to a conclusion here, but I doubt your proposed failure mode.

Based on my experience with largish data in mongo, my guess would be that the database size was >> than ram and, due at least in part to mongo's design, when the master failed over the memory state of the new master didn't have the working set present in RAM. This lead to a huge inrush of disk IO resulting in all sorts of bad until the RAM state sorted itself out.

1 more reply

Hydraulix9896y ago

There’s definitely some truth to parent’s snark:

http://hackingdistributed.com/2013/01/29/mongo-ft/

danbmil996y ago

Really, rolling out the Mongo spite from '13? Isn't there a statute of limitations on this sort of pgres fanboy balderdash?

Haters gonna hate, hate, hate

1 more reply

H8crilA6y ago

Honestly I'm a bit surprised they don't use a more mature, fully managed solution. They can't have so much traffic to not justify getting some traditional relational database (like Oracle).

redis_mlc6y ago· 5 in thread

> We take uptime very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency

No you don't.

- If you did, you'd hire a DBA team and they would be familiar with the various jobs in your environment. But first your founders would have to have respect for Operations, which will take a dozen more major outages.

The other major Coinbase outages have also been database-related, namely missing indexes.

- If you did, you wouldn't be doing major database (or other production) changes at 3 pm in the afternoon.

So let's cut to the chase. You prioritize features over Operations, and as a result guinea-pig your users. Just like any other SF startup. So just admit that to your end-users.

Klathmon6y ago

>you wouldn't be doing major database (or other production) changes at 3 pm in the afternoon.

Coinbase isn't dedicated to one timezone. They fully support tons of countries and have customers all over the place.

Things like this have to happen at some point, and there are benefits to doing stuff like this during "work" hours (like having all of your staff online and available)

dmitriid6y ago

> Coinbase isn't dedicated to one timezone. They fully support tons of countries and have customers all over the place.

What timezone are their developers in? That’s the important question in this situation.

Devs deploying changes that affect customers at the end of the devs’ workday are reckless.

kerkeslager6y ago

> Things like this have to happen at some point, and there are benefits to doing stuff like this during "work" hours (like having all of your staff online and available)

That sounds like a benefit to Coinbase and not to any of their customers.

3 more replies

trykondev6y ago

I think this is exactly correct, and I hate that this behavior of offloading testing onto users seems more and more to just be the expectation rather than the exception.

Coinbase had a number of issues when cryptocurrencies really exploded in 2017, and at that time I felt more willing to give them the benefit of the doubt because the landscape of cryptocurrencies had shifted so dramatically and I could empathize with their struggles to keep up. Two years later, there aren't any excuses anymore in my mind. As the parent comment says -- it's just not a priority.

I've happily moved off the platform to other options which have given me no trouble whatsoever.

keyle6y ago

> You prioritize features over Operations, and as a result guinea-pig your users. Just like any other SF startup. So just admit that to your end-users.

Wow, I'm not from any "SF startup" but I find that side jab quite cunning.

viraptor6y ago· 1 in thread

> We have ensured that failovers for this cluster may only be initiated during rare, scheduled downtime, when there will be no impact on customers.

I hope all their hardware crashes are also scheduled when there will be no impact... This seems a bit backwards - unless you constantly exercise the instantaneous failover, how do you know it works?

Edit: Actually it's worse - if you don't test the instant failover under a full load, how do you know it's still instant then?

rednixion6y ago

1: Sounds like they are saying that until the changes to remove the cluster out of the request path, failover cannot be trusted so the only immediate action that can be taken is to disable the automatic failover logic and prevent any live maintenance jobs needing to take it offline from running. Not unreasonable depending on the timeline to implement the changes when comparing the risk of a cluster killing hardware failure (assuming that there is at least a one node tolerance) vs behavior inside of your buy/sell pipeline being "undetermined"; no failover is probably a great way to keep the sprint work priority at the appropriate level.

2. Loadtesting? Accurate end-to-end loadtests are painful to bootstrap from nothing and requires an environment that can be expensive, worth their weight in gold depending on how critical downtime is or if everything in your platform is pulling from the same resource pool(ie appliance based deployment).

twblalock6y ago· 1 in thread

I suppose this is a good time to ask whether Coinbase thinks they are a bank, or a brokerage, or both.

If they are a bank, this isn't the end of the world. I've had online banking outages at "normal" banks. It is still a bad thing, but there are other ways I can get my money, like going to a branch.

On the other hand, if Coinbase is like a brokerage, this is really bad. And let's face it, most use of crypto is for investment and speculation purposes. For trades to fail for half an hour is really bad. If they are running this thing like a startup on MongoDB (seriously?) I don't see how anyone who puts their money in can have any confidence of getting it back out.

wakeywakeywakey6y ago

> If they are running this thing like a startup on MongoDB (seriously?)

Do you base this on recent info (latest versions w/ Jepsen tests)? If so, what specifically makes Mongo a "startup" db?

drefanzor6y ago· 1 in thread

So basically price alerts lagged the system.

noitsnot6y ago

Completely shut it down. It's funny they DDoS'd themselves.

cle6y ago

Lots of architectural and cultural problems IMO. Mixing different kinds of queries on the same cluster, no auto scaling on neither cluster nor web server, they seem to be okay with customer-impacting maintenance events (seriously?!), and their "fix" for an event caused by cache misses is to add more caching, which will make their system even harder to understand and predict, increasing the likelihood of more severe and byzantine failures.

It's often an unpopular opinion around here, but this is why I prefer simple hosted databases with limited query flexibility for high volume and high availability services (Firestore, DynamoDB, etc.). It's harder to be surprised by expensive queries, and you won't have to fiddle with failovers, auto scaling, caching, etc. Design your system around their constraints and it will have predictable performance and can more easily scale under unexpected load.

staticassertion6y ago

> Before the incident began, a background job performed an aggregation for a large number of accounts, causing excessive reads into cache and cache evictions for the cluster’s storage engine cache.

I thought this was interesting. I think that caches can be so dangerous in an incident - suddenly operations that are almost always constant time are executing in a much different complexity, and worst is that this tends to happen when you get backups (since old, uncached data is suddenly pushing recent data out).

I think chaos engineering may be a good solution here, in lieu of better architectures - see what happens when you clear your cache every once in a while, how much your load changes, how your systems scale to deal with it.

nocitrek6y ago

Great post mortem. Well detailed, good work ethic there.

j / k navigate · click thread line to collapse

28 comments

21 comments · 8 top-level

kvlr6y ago· 5 in thread

TLDR: MongoDB

oconnor6636y ago

Sounds like a maintenance error and an application design error that led to a flood of traffic. Doesn't sound MongoDB specific.

mbell6y ago

I don't think there is enough information to come to a conclusion here, but I doubt your proposed failure mode.

1 more reply

Hydraulix9896y ago

There’s definitely some truth to parent’s snark:

http://hackingdistributed.com/2013/01/29/mongo-ft/

danbmil996y ago

Really, rolling out the Mongo spite from '13? Isn't there a statute of limitations on this sort of pgres fanboy balderdash?

Haters gonna hate, hate, hate

1 more reply

H8crilA6y ago

Honestly I'm a bit surprised they don't use a more mature, fully managed solution. They can't have so much traffic to not justify getting some traditional relational database (like Oracle).

redis_mlc6y ago· 5 in thread

> We take uptime very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency

No you don't.

The other major Coinbase outages have also been database-related, namely missing indexes.

- If you did, you wouldn't be doing major database (or other production) changes at 3 pm in the afternoon.

So let's cut to the chase. You prioritize features over Operations, and as a result guinea-pig your users. Just like any other SF startup. So just admit that to your end-users.

Klathmon6y ago

>you wouldn't be doing major database (or other production) changes at 3 pm in the afternoon.

Coinbase isn't dedicated to one timezone. They fully support tons of countries and have customers all over the place.

Things like this have to happen at some point, and there are benefits to doing stuff like this during "work" hours (like having all of your staff online and available)

dmitriid6y ago

> Coinbase isn't dedicated to one timezone. They fully support tons of countries and have customers all over the place.

What timezone are their developers in? That’s the important question in this situation.

Devs deploying changes that affect customers at the end of the devs’ workday are reckless.

kerkeslager6y ago

> Things like this have to happen at some point, and there are benefits to doing stuff like this during "work" hours (like having all of your staff online and available)

That sounds like a benefit to Coinbase and not to any of their customers.

3 more replies

trykondev6y ago

I think this is exactly correct, and I hate that this behavior of offloading testing onto users seems more and more to just be the expectation rather than the exception.

I've happily moved off the platform to other options which have given me no trouble whatsoever.

keyle6y ago

> You prioritize features over Operations, and as a result guinea-pig your users. Just like any other SF startup. So just admit that to your end-users.

Wow, I'm not from any "SF startup" but I find that side jab quite cunning.

viraptor6y ago· 1 in thread

> We have ensured that failovers for this cluster may only be initiated during rare, scheduled downtime, when there will be no impact on customers.

I hope all their hardware crashes are also scheduled when there will be no impact... This seems a bit backwards - unless you constantly exercise the instantaneous failover, how do you know it works?

Edit: Actually it's worse - if you don't test the instant failover under a full load, how do you know it's still instant then?

rednixion6y ago

twblalock6y ago· 1 in thread

I suppose this is a good time to ask whether Coinbase thinks they are a bank, or a brokerage, or both.

If they are a bank, this isn't the end of the world. I've had online banking outages at "normal" banks. It is still a bad thing, but there are other ways I can get my money, like going to a branch.

wakeywakeywakey6y ago

> If they are running this thing like a startup on MongoDB (seriously?)

Do you base this on recent info (latest versions w/ Jepsen tests)? If so, what specifically makes Mongo a "startup" db?

drefanzor6y ago· 1 in thread

So basically price alerts lagged the system.

noitsnot6y ago

Completely shut it down. It's funny they DDoS'd themselves.

cle6y ago

staticassertion6y ago

> Before the incident began, a background job performed an aggregation for a large number of accounts, causing excessive reads into cache and cache evictions for the cluster’s storage engine cache.

nocitrek6y ago

Great post mortem. Well detailed, good work ethic there.

j / k navigate · click thread line to collapse