I've been doing reliability for most of my career, and have always been able to hide behind, "We're not a bank, if we lose a few requests it doesn't matter". They can't do that. :)
One advantage they have is that the market closes, so they can do maintenance that takes the whole system down. When you're running a global consumer product, that's a lot harder to do without pushback.
So for most of us, our stress is around zero-downtime maintenance, and theirs is around never dropping a request while the system is live.
There are multiple layers of controls and manual interventions which, while absolutely painful, slow, expensive and shitstorm-conjuring, are ultimately the final authority on some failures.
For example, in payments, every single settlement or clearing anomaly is looked at by a real human and rectified/rebooked manually.
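To make the "anomaly goes to a human" step concrete, here is a minimal sketch of a reconciliation pass. Everything in it is an assumption for illustration (the `Anomaly` type, the `reconcile` function, keying by a trade id, comparing amounts only); real settlement systems compare far more fields. It only shows the shape of the idea: the code never auto-corrects, it just produces a review queue.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Anomaly:
    """One mismatch to be looked at by a person (hypothetical type)."""
    trade_id: str
    internal: Optional[Decimal]  # amount in our ledger, None if missing
    external: Optional[Decimal]  # amount in the clearing report, None if missing


def reconcile(internal: Dict[str, Decimal],
              external: Dict[str, Decimal]) -> List[Anomaly]:
    """Compare our ledger with the external report and list every mismatch.

    Nothing is fixed automatically: each anomaly is only queued for review.
    """
    anomalies = []
    for trade_id in sorted(internal.keys() | external.keys()):
        ours = internal.get(trade_id)
        theirs = external.get(trade_id)
        if ours != theirs:
            anomalies.append(Anomaly(trade_id, ours, theirs))
    return anomalies


ledger = {"t1": Decimal("100.00"), "t2": Decimal("50.00")}
report = {"t1": Decimal("100.00"), "t2": Decimal("49.99"), "t3": Decimal("10.00")}
review_queue = reconcile(ledger, report)
for item in review_queue:
    print(item)
```

Amounts are `Decimal` rather than `float` so that a one-cent difference is detected exactly.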
So, yeah, the stakes can be really high when you have a couple billion in memory on your server, but -- it's just a system.
And it will fail, and we plan for it to do so.
Makes things damn hard indeed, because you have to truly learn asynchronicity, CQRS and complex live migrations. (Incidentally, engineers who have worked on such systems tend to be over-represented in extreme HA businesses.)
I've always said that with infinite money we could get 100% uptime, but no one has infinite money. Trading firms are about as close as I can imagine to infinite money though.
Isn't the plan more like 23/5, as is already the case for several markets?
I can't see the standard sessions moving from 9:30am/4pm weekdays to 24/7. I take it they'd still leave at least one hour off for technical reasons.
If I'm not mistaken, that's the reason several markets are 23/5 and not 24/5: that one hour of downtime is basically for servers/maintenance, right? (Maybe someone can chime in.)
P.S.: Technically there's already 24/7 trading, given that cryptocurrency exchanges are open 24/7 (I'm not sure, but I think that's the case), but I don't think those do anywhere near the volume of, say, options trading on equities during standard sessions (40 Gbit/s, with peaks over 70 Gbit/s, for the full options feed).
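For a sense of scale, here is a back-of-the-envelope conversion of those feed rates. The assumptions are mine: decimal units (1 GB = 8 Gbit, 1 TB = 1000 GB), a 6.5-hour session (9:30am to 4pm), and the rate being sustained for the whole session, which it almost certainly isn't, so read the session total as a rough upper bound.

```python
SESSION_SECONDS = int(6.5 * 3600)  # 9:30am to 4pm, assumed standard session


def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


def session_volume_tb(gbit_per_s: float, seconds: int = SESSION_SECONDS) -> float:
    """Data volume in TB if the rate were sustained for `seconds`."""
    return gbit_to_gbyte_per_s(gbit_per_s) * seconds / 1000


for rate in (40, 70):
    print(f"{rate} Gbit/s = {gbit_to_gbyte_per_s(rate):.2f} GB/s, "
          f"up to {session_volume_tb(rate):.1f} TB per session if sustained")
```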
I have heard similar talks from Shopify and such back in the day, about their own product, but always love listening to more.
The clickbait title of "billions of dollars a day" is nothing to praise.
It's fun, because a single lost or late packet is an issue that immediately shows up red in the monitoring.
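A minimal sketch of how that kind of check can work, assuming each packet carries a monotonically increasing sequence number (common for market data feeds, but an assumption here). The `find_gaps` name and the tuple format are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_gaps(seqs: Iterable[int], start: int = 1) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) ranges seen in arrival order.

    A late packet still produces a gap at the moment it was skipped, which is
    exactly when monitoring should go red. Duplicates and packets older than
    the expected number are ignored in this sketch.
    """
    gaps = []
    expected = start
    for seq in seqs:
        if seq > expected:
            # Everything from `expected` up to `seq - 1` has not arrived yet.
            gaps.append((expected, seq - 1))
        if seq >= expected:
            expected = seq + 1
    return gaps


arrivals = [1, 2, 3, 5, 6, 9]
print(find_gaps(arrivals))
```

A real handler would also request a retransmit or fail over to a redundant feed once a gap is seen; this only covers detection.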
I've been an SRE too, and the most it brought to the table is the concept of an error budget.
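For anyone who hasn't met the term: an error budget is just the unavailability your SLO allows over some window. A minimal sketch, with hypothetical function names and a 30-day window picked as an example:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


# 99.9% over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# After 21.6 minutes of downtime, about half of the budget is left.
print(round(budget_remaining(0.999, 21.6), 2))
```

An SLO of exactly 100% is rejected on purpose: it leaves no budget at all, so the ratio is undefined.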
I can only agree that "billions of dollars" in trades is not much.
I’ve met some exceptional people: top researchers from top universities from several fields, super well paid engineers working on products you probably use, some of the best hackers an advanced persistent threat actor could ask for; they’re just people.
I think if you get a collection of competent, thoughtful people together they would come up with similar solutions to the problems discussed in this talk.