undefined | Better HN

0 pointsmalux856y ago0 comments

Sometimes you just have to cut them some slack. Have you engineered a highly available cluster before? I'm not talking about the hot-standby postgres master that gets called on once every 2 years, but I'm talking about a 180 node Cassandra cluster thats doing 15,000 writes a second 24/7 and peaking at 60,000 writes a second every day, and you have to do node replacements every week or two because of the high load.

Or I'm talking about a 200 node hadoop cluster thats doing the electrical metering and billing for 8 million people, and is NOT allowed to stop.

Or the trading platform thats running sub millisecond trades and downtime means 300,000 $ USD per minute.

These are systems I have engineered over the last 10 years, and I can say: These things are complex and have failures in 1000 different ways, and while you're monitoring 999 of them that one thing you're not looking at is festering under the surface (your monitoring system is tracking IRQ hardware interrupt response times, right???)

Part of being in a team is everyone pulling together, and yes it's stressful at the time, but even very good management cant see all ends, just like very good engineering cant predict everything. I don't think it's useful to start pointing the finger at management and "asking some pointed questions at leadership" because sometimes everyone is doing their best. Yes we should analyse our failures so we can do better, but your tone is very accusatory, and I believe that a better approach is an all inclusive chat about how we can do better, and management saying "great job engineering" for fixing it, and giving them a break after the stressful event.

0 comments

TheCondor6y ago

Does the duration of their downtime suggest a “1/1000” unmonitored oversight? Or is it more like a threshold that was meet and probably could/should have been observed?

And FWIW, they have down time every day and weekend, at least in a virtual sense; the load does drop off in a very real sense too. You are spiritually correct, they should pull together and sort it out, and they owe nobody money here (don’t use a discount broker if you want some sort of guarantee about trades) but as a general rule you should ever feel too sorry for banker under just about any circumstances. The harshest lesson here, for everybody, was the only thing they would do for you was give you some commission free trades but that won’t work with this one, so a non-apology is what you get.

mumblemumble6y ago

I think you may be focusing on the finger instead of the thing that it's pointing at.

The post reads to me like all those examples were meant to be concrete examples to drive home a more general argument that complex systems are, well, complex, and that there's an element of hubris in taking potshots from the peanut gallery.

tmpz226y ago

I think the original point in this sub-thread boils down to: basic micro-level human error like typos + bad configuration deploys is completely understandable (to a certain extent), but macro level failures that happen by ignoring obvious trends and best practices is malfeasance.

Personally I don't think Robinhood will ever release a full honest post-mortem and so we'll never know (and never be able to judge fairly).

If the system failed by virtue of being too complex, that is also malfeasance because any devops/SRE worth their salt (as might be expected at a 7 BILLION DOLLAR company) should smell unnecessary complexity from a mile away and slowly refactor it away over the course of several years - which looking at Robinhoods downtime history they never did.

The closest example to Robinhoods engineering woes is Reddit, which throughout its early history made fairly poor infrastructure and data modeling decisions but have since repaired and improved on. We should hold Robinhood to higher expectations then Reddit for obvious reasons. Them having similar engineering capability to circa ~2012 start-up reddit is INEXCUSABLE.

eyegor6y ago

As with any big system, spinning it up is much harder than bringing it down. After an outage, they have to stay offline to audit their systems to ensure that all the nodes are synchronized, all queued trades have been processed, and no accounts are in invalid states. I'm sure they could have restarted in a matter of minutes, but the risk is ridiculously high.

raiyu6y ago

No doubt there are many complex systems and they inevitably go down. Every provider has suffered meaningful outages.

I think the issue here isn’t so much that the system went down but the blog post.

It’s very light on details and doesn’t go far enough in terms of re-establishing trust with the customers that were affected. Which by the looks of it is everyone attempting any trade most of the day on Monday.

luckylion6y ago

On the other hand, they've had plenty of time and resources to do just that in a reliable fashion, it's not like it's one guy in his bedroom (I hope!). It's not like they are volunteers doing this open source for the community, they are getting paid (very well, I assume) to run the system. And Management is getting paid (even better, I assume) to make sure the priorities are right and correct decisions are taken. "Who could've known there might be a lot more traffic" sounds like somebody failed in Management, and engineering might have failed by not foreseeing the issue and/or informing Management.

Sure, don't burn people at the stake, but "hey, it's hard, don't blame them, they are doing their best" doesn't cut it for me. I'm sure they're expecting to be paid and not for someone to "do their best" to pay them.

malux85OP6y ago

Can you give me a concrete example of a massive distributed system that has zero downtime?

Because the largest distributed system I have seen and worked on was at Apple (or maybe DFP at Google) - and even though they had some of the smartest people in the world and literally billions of dollars behind them, there were still an endless list of problems and downtime events.

Spoiler alert: It doesn't exist.

luckylion6y ago

The point isn't that "a system cannot fail", the point is "if the system fails, it's no big deal, shit happens, cut them some slack" is a weird way to look at it for corporate systems, especially in sensitive areas.

If you're running a HA system and you only need one nine to express your availability percentage, sure, sure, you have the smartest people etc and you're doing such a great job, and yeah, yeah, show me one system that has 100% uptime etc.

1 more reply

unicornmama6y ago

Google doesn’t target zero downtime. The marginal cost is too high. For important services (like Search page and ads) they aim for 5 nines uptime (99.999%), which translates to 5 minutes of downtime per year.

https://en.m.wikipedia.org/wiki/High_availability

2 more replies

C1sc0cat6y ago

Dialcom (Telecom Gold) in the UK was pretty close to 100% Almost survived the big storm of 87 - unfortunately the modems where on the UPS.

We built an entire new DC and had Tottenham Court Road dug up in case the Thames flooded.

In fact any big telecom will have down times for a switch (central office) measured in generations

1 more reply

frockington16y ago

Can you name a reputable brokerage that was down all of Monday and Tuesday this week?

Spoiler alert: it doesn't exist

1 more reply

Ntrails6y ago

> Or the trading platform thats running sub millisecond trades and downtime means 300,000 USD per minute.

I mean, I'll bite. Assuming you only traded 6 hours a day (ie US) that'd be a 27bn dollar a year strategy, and the only way for returns to be linear and trading to be sub milli is market making/arbitrage.

That is a lot of half spreads...

2 more replies

techie1286y ago

Kudos, these are moderate sized systems you've built over your career. There are lot bigger and more mission critical systems in the world and you might build them one day.

I understand GP's tone wasn't exactly nice here. But here's the rub with RH's outage. RH is unfortunately in an industry (Finance, Healthcare, Aviation, Food, etc.) where people _need_ to trust them to be successful. The consequences of failure in these industries is very catastrophic not only for them but their clients. Sure failures happen but the scale at which RH has failed and the lukewarm response they've put out has pissed off people. I don't recall any brokerage, old or new, that has failed so catastrophically and has responded to it so poorly. If you think you have a worse example, I am all ears.

kasey_junk6y ago

Hard to judge _worse_ in this context but while I was trading all of CME globex was down for 4 hours canceling active orders.

https://www.profit-loss.com/cme-hit-by-globex-outage/

I don’t remember them offering any apology or explanation at all.

That’s an exchange mind you where things like the global price of oil and s&p futures trade. Not a small boutique brokerage.

Further they have planned downtime every week & at that point still had planned daily downtime I think.

I think Robinhood screwed up. I think they should learn a hard lesson. But people thinking that trading is some high reliability industry haven’t spent any time in it.

The scary thing to me is are healthcare, aviation & food the same?

vsareto6y ago

Part of AWS's sell with elasticity is only spending what you need, but those industries have redundancies or unused capacity.

Someone in one of these threads said there's a hidden DNS within VPCs that can fail and isn't scaled, so if that's true, they might just have to architect around that unless they can get AWS to change it. It's on RH for not knowing that but it's also kind of on AWS too.

But as far as what you can do, you can really only split your cash across brokerages if you want to engineer the same redundancy yourself. Otherwise, RH would need to route everything to another exchange to keep satisfying orders, and even that is just another system that could fail. Keeping all of your money in one brokerage doesn't seem ideal if you want to completely avoid downtime. Doing the same redundancy yourself with those industries isn't really practical.

wonderwonder6y ago

Boeing's failures have killed hundreds of people. Governments still pay them and people still fly on their planes. Stores sell salmonella contaminated products all the time and people still shop there. RH's failure pales in comparison. Crypto exchanges fail all the time, people still use them. RH may lose a few customers in the short term but I see no reason they wont bounce back, they provide a product people like and the majority of people dont like change and will stay with them provided stability returns soon.

Non technical people dont want a technical apology, they just want an 'our bad, working on it' which is what was provided. The company will be fine. Should they be is another question all together.

raiyu6y ago

Technically people still fly the old Boeing planes that don't crash. The 737MAX is still not in service, and there is a likelihood that it may never go back into service. All future orders are cancelled, and there isn't a clear pathway to the plane re-certified and more importantly for people to trust them again.

High trust systems require just that, high trust. And once broken it's hard to re-establish.

Crypto exchanges certainly have their fair issues of downtime, but don't forget that crypto exchanges for a long time operated purely for early adopters as crypto wasn't something that everyone traded. There was also less availability of competition, because again the industry was newer and there were fewer choices.

And certainly Coinbase helped to popularize crypto trading and they had their fair issues, but I don't believe they had an outage of this exact magnitude, and again they were in an early adopter area where mistakes are seen as part of the process. If not expressly, then at least subconsciously.

1 more reply

C1sc0cat6y ago

The Problems with the TSB in the Uk come to mind and NatWest /RBS had a similar SNAFU a few years back.

tinus_hn6y ago

Catastrophe would be if they actually lost money. This is all indirect damage that is probably disclaimed in their terms of service.

No service guarantees 100% availability, it doesn’t exist.

LaserToy6y ago

It is not about scale, it is about the fact that people lost real money. If you can’t make it work you should not be in that business, and I don’t really care how hard they work.

I’m taking my account off their platform.

malux85OP6y ago

Is this your first day of trading or something?

People lose money in trading all the time, for hundreds of reasons and some of those reasons are infrastructure downtime.

If your risk profile doesn't reflect that, maybe you should take your money out of trading altogether.

anon1020106y ago

I carry reasonable investment balances - I’m not an active trader but in this space I expect availability. I’d never put my money on RH - and this has nothing to do w risk profile

1 more reply

frockington16y ago

I've been trading for years, would not keep a penny on that platform. They've effectively cut off all liquidity for their customers for at least 2 days during high market volatility. You are missing out on tax loss harvesting, buying dips etc.

LaserToy6y ago

I will take it off this platform for sure.

1 more reply

argonaut6y ago

Nearly every online brokerage has had an outage or outages in the past.

LaserToy6y ago

For a full day + morning and issued non apology? Also, I prefer to be with the group that is not in the “nearly all” one

dirtydroog6y ago

Exactly. When markets are volatile I imagine they find it difficult to manage risk and so just shut everyone out and blame it on IT.

1 more reply

yuppie_scum6y ago

Nothing was stopping the Robinhood customers from opening an eTrade or TD Ameritrade account or something and doing their trading out of that platform for the duration of the outage. Robinhood isn't really an institutional platform in my understanding anyway.

LaserToy6y ago

How can you do it if your equity is in Robinhood?

dirtydroog6y ago

If you used Scylla you'd have only needed 90 nodes. (Don't believe the instability rumours)

gshulegaard6y ago

I've was a primary contributor on a migration of time series data to Scylla. As an anecdote, I once emailed our business contact about tracking down why we appeared to have data inconsistencies between our new (Scylla backed) and old system. I thought the e-mail got lost since we never heard back...until 8 months later (long after we had de-prioritized the migration since our old system was "good enough") asking if we had tried the newly released version which fixed a data loss issue.

Blew. My. Mind. Not only because of the radio silence and then dropping back in out of the blue as if no time had passed, but also because they had a data loss issue.

So rechecked out my previous branch, upgraded Scylla versions and sure enough the data differences we were noticing before appeared to be resolved. I couldn't believe the amount of time I had spent combing through my code to see if I had a hard to detect bug somewhere...but nope, it was ScyllaDB (although I am sure there were plenty of other bugs...just they weren't the cause of this specific symptom).

I am actually a fan of ScyllaDB and what is trying to do. Performance was great (as advertised) and management was simple enough; but they are going to need to work pretty hard to convince me "instability" is just rumor after that experience not too many years ago.

dirtydroog6y ago

Well, we moved to it from Cassandra. It's yet to fall over and we're querying it maybe 200k times per second. No changes were needed from the client driver side of things too. YMMV.

ethbro6y ago

And this is why folks still run DB2.

j / k navigate · click thread line to collapse

0 comments

TheCondor6y ago

Does the duration of their downtime suggest a “1/1000” unmonitored oversight? Or is it more like a threshold that was meet and probably could/should have been observed?

mumblemumble6y ago

I think you may be focusing on the finger instead of the thing that it's pointing at.

tmpz226y ago

Personally I don't think Robinhood will ever release a full honest post-mortem and so we'll never know (and never be able to judge fairly).

eyegor6y ago

raiyu6y ago

No doubt there are many complex systems and they inevitably go down. Every provider has suffered meaningful outages.

I think the issue here isn’t so much that the system went down but the blog post.

luckylion6y ago

malux85OP6y ago

Can you give me a concrete example of a massive distributed system that has zero downtime?

Spoiler alert: It doesn't exist.

luckylion6y ago

1 more reply

unicornmama6y ago

https://en.m.wikipedia.org/wiki/High_availability

2 more replies

C1sc0cat6y ago

Dialcom (Telecom Gold) in the UK was pretty close to 100% Almost survived the big storm of 87 - unfortunately the modems where on the UPS.

We built an entire new DC and had Tottenham Court Road dug up in case the Thames flooded.

In fact any big telecom will have down times for a switch (central office) measured in generations

1 more reply

frockington16y ago

Can you name a reputable brokerage that was down all of Monday and Tuesday this week?

Spoiler alert: it doesn't exist

1 more reply

Ntrails6y ago

> Or the trading platform thats running sub millisecond trades and downtime means 300,000 USD per minute.

That is a lot of half spreads...

2 more replies

techie1286y ago

Kudos, these are moderate sized systems you've built over your career. There are lot bigger and more mission critical systems in the world and you might build them one day.

kasey_junk6y ago

Hard to judge _worse_ in this context but while I was trading all of CME globex was down for 4 hours canceling active orders.

https://www.profit-loss.com/cme-hit-by-globex-outage/

I don’t remember them offering any apology or explanation at all.

That’s an exchange mind you where things like the global price of oil and s&p futures trade. Not a small boutique brokerage.

Further they have planned downtime every week & at that point still had planned daily downtime I think.

I think Robinhood screwed up. I think they should learn a hard lesson. But people thinking that trading is some high reliability industry haven’t spent any time in it.

The scary thing to me is are healthcare, aviation & food the same?

vsareto6y ago

Part of AWS's sell with elasticity is only spending what you need, but those industries have redundancies or unused capacity.

wonderwonder6y ago

Non technical people dont want a technical apology, they just want an 'our bad, working on it' which is what was provided. The company will be fine. Should they be is another question all together.

raiyu6y ago

High trust systems require just that, high trust. And once broken it's hard to re-establish.

1 more reply

C1sc0cat6y ago

The Problems with the TSB in the Uk come to mind and NatWest /RBS had a similar SNAFU a few years back.

tinus_hn6y ago

Catastrophe would be if they actually lost money. This is all indirect damage that is probably disclaimed in their terms of service.

No service guarantees 100% availability, it doesn’t exist.

LaserToy6y ago

It is not about scale, it is about the fact that people lost real money. If you can’t make it work you should not be in that business, and I don’t really care how hard they work.

I’m taking my account off their platform.

malux85OP6y ago

Is this your first day of trading or something?

People lose money in trading all the time, for hundreds of reasons and some of those reasons are infrastructure downtime.

If your risk profile doesn't reflect that, maybe you should take your money out of trading altogether.

anon1020106y ago

I carry reasonable investment balances - I’m not an active trader but in this space I expect availability. I’d never put my money on RH - and this has nothing to do w risk profile

1 more reply

frockington16y ago

LaserToy6y ago

I will take it off this platform for sure.

1 more reply

argonaut6y ago

Nearly every online brokerage has had an outage or outages in the past.

LaserToy6y ago

For a full day + morning and issued non apology? Also, I prefer to be with the group that is not in the “nearly all” one

dirtydroog6y ago

Exactly. When markets are volatile I imagine they find it difficult to manage risk and so just shut everyone out and blame it on IT.

1 more reply

yuppie_scum6y ago

LaserToy6y ago

How can you do it if your equity is in Robinhood?

dirtydroog6y ago

If you used Scylla you'd have only needed 90 nodes. (Don't believe the instability rumours)

gshulegaard6y ago

Blew. My. Mind. Not only because of the radio silence and then dropping back in out of the blue as if no time had passed, but also because they had a data loss issue.

dirtydroog6y ago

Well, we moved to it from Cassandra. It's yet to fall over and we're querying it maybe 200k times per second. No changes were needed from the client driver side of things too. YMMV.

ethbro6y ago

And this is why folks still run DB2.

j / k navigate · click thread line to collapse