undefined | Better HN

0 pointssurge7y ago0 comments

They had redundancy, the redundancy failed.

I mean truthfully when do you get to test your redundancy against a true disaster. It was a mess. WF is 20 companies rolled into one so the fact the disparate systems from 10 different banks works at all is kind of a miracle.

0 comments

hinkley7y ago

I can't recall who it was, but there was a big outage due to a data center running a monthly generator test and having a major customer go dark. The data center had 7 generators for 4 server rooms: 1 per room, a backup for each pair of rooms, and a backup for the backups. The primary and then both backups failed, so out went the lights.

You're pretty much damned if you do and damned if you don't. If you touch things that are working you could break them. If you don't touch things you never know what'll happen and you get fewer opportunities to learn. Move your servers around geographically and you might improve the odds that anything is working by reducing the odds that everything is working.

I don't think we're quite to a place yet where having servers down can be characterized as a non-event. Even if the customer can't see a behavioral difference, business units still tend to get quite anxious, and sometimes their theatrics put the whole process in jeopardy (not unlike trying to rescue a drowning man). It just hasn't been normalized yet.

computerex7y ago

Look at Netflix with chaos monkey and simian army. Netflix routinely does catastrophic destructive testing on their production systems, sometimes evacuating an AWS region entirely. To them, a server going down is a non-event, because they designed their systems with the premise that servers go down.

milesvp7y ago

Netflix is a great example of how to create robust systems. Keep in mind though, they have a very different risk profile than a large bank. no one is going to lose their life’s savings if netflix entire infrastructure crumbles. This might happen in a bank with a single server outtage. Don’t get me wrong, if I start a bank today, I use chaos monkey stategy. But if I take over a bank infrastructure, with cobol still representing a huge percentage of mission critical code that everyone is scared to touch... no chaos monkey. I might deliberately turn off a server for an hour to see what chaos ensues, but it’ll be after 3 months of analysis, longer if I begin to suspect the system does more than anyone can remember.

1 more reply

DanBC7y ago

If a customer can't watch a film they get unhappy.

If a customer can't pay a fine - can't use their bank account - they go to jail. https://www.telegraph.co.uk/finance/personalfinance/bank-acc...

These are pretty different outcomes.

pushtheenvelope7y ago

Facebook regularly takes down multiple data-centers at a time to stress test this. Its users rarely notice (which I think is the point).

shagie7y ago

There is a different level of criticality for "my post didn't go through, hit refresh" and "my transaction didn't go through - the restaurant said my card isn't working."

Would you honestly want to go to a bank and say "if we unplug this, we can find out what fails."

computerex7y ago

Facebook handles money too, though. Also I think the parent was making the point that Facebook builds software with resiliency in mind so when a failure does happen, the software deals with it gracefully.

1 more reply

jen729w7y ago

They're <10 years old. Their DCs better be cleaner than my kitchen.

I've worked for banks here in Australia. Everything is 30+ years old. It's a shambles.

zaphirplane7y ago

There is this company that had a grid power issue, batteries kicked in batteries running low, diesel generator show time, diesel generator doesn’t kick in. people scramble to shutdown before the batteries run out. So ITIL stuff happens and time to test it, guess what? diesel again doesn’t kick in (don’t recall the consequences thou)

ElijahLynn7y ago

Please expand ITIL, I don't know what it means.

zaphirplane7y ago

https://en.m.wikipedia.org/wiki/ITIL Article explains it all

mulmen7y ago

Then you’re an ITIL expert! Welcome to mid level IT management.

I’m serious.

vonmoltke7y ago

Bloomberg does it once a year. That said, I have yet to encounter a company more obsessed with business continuity. I don't doubt that their failover systems and testing of them are well beyond the typical.

surgeOP7y ago

I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.

Netflix designed their stuff from the ground up to fail over. Large monolith corporations who've inherited systems from other companies they've bought or merged with have challenges you won't see many places that have benefited from the 30 years of lessons that were taught at these companies.

vonmoltke7y ago

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.

No, it can't. Any loss of customer-facing functionality is a critical outage ("World Problem" in company terminology). There are a relatively small number of customers, but the terminal is critical to the operations of those who buy it. The terminal going down for eight hours would be a world-wide headline in the financial press.

A Tier 1 test that simulates loss of a datacenter takes a cluster one DC virtually offline. This puts an entire subset of services offline in that DC entirely. The test is coordinated with the teams who own the services to ensure their services fail over correctly. Any service disruption during the failover is a test failure. If it passes, the customers don't even know it happened. The goal is to be able to lose an entire DC and have the terminal customers not realize it until they hear about it on the news.

icelancer7y ago

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster.

Do you know what Bloomberg does? It powers equities trading markets around the world, 24/7. It isn't just news.

kamikaz1k7y ago

Well that's not true.

Chaos engineering and AWS weren't real things when they started building the company. And the system they have now doesn't resemble much of it was once.

Truth of the matter is they invested more in their infrastructure, but that's because their business plan required them to grow on the back of technological advances. Banks, it's seems, do not. Or maybe they do, and the some of these start up banks will usurp them.

1 more reply

closeparen7y ago

Business critical systems can’t afford not to test failover.

You can bail out of a test at the first sign of trouble. When a real outage hits, there’s no telling how long it will take to recover.

closeparen7y ago

We run that drill every Wednesday morning.

j / k navigate · click thread line to collapse

0 comments

hinkley7y ago

computerex7y ago

milesvp7y ago

1 more reply

DanBC7y ago

If a customer can't watch a film they get unhappy.

If a customer can't pay a fine - can't use their bank account - they go to jail. https://www.telegraph.co.uk/finance/personalfinance/bank-acc...

These are pretty different outcomes.

pushtheenvelope7y ago

Facebook regularly takes down multiple data-centers at a time to stress test this. Its users rarely notice (which I think is the point).

shagie7y ago

There is a different level of criticality for "my post didn't go through, hit refresh" and "my transaction didn't go through - the restaurant said my card isn't working."

Would you honestly want to go to a bank and say "if we unplug this, we can find out what fails."

computerex7y ago

1 more reply

jen729w7y ago

They're <10 years old. Their DCs better be cleaner than my kitchen.

I've worked for banks here in Australia. Everything is 30+ years old. It's a shambles.

zaphirplane7y ago

ElijahLynn7y ago

Please expand ITIL, I don't know what it means.

zaphirplane7y ago

https://en.m.wikipedia.org/wiki/ITIL Article explains it all

mulmen7y ago

Then you’re an ITIL expert! Welcome to mid level IT management.

I’m serious.

vonmoltke7y ago

surgeOP7y ago

vonmoltke7y ago

icelancer7y ago

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster.

Do you know what Bloomberg does? It powers equities trading markets around the world, 24/7. It isn't just news.

kamikaz1k7y ago

Well that's not true.

Chaos engineering and AWS weren't real things when they started building the company. And the system they have now doesn't resemble much of it was once.

1 more reply

closeparen7y ago

Business critical systems can’t afford not to test failover.

You can bail out of a test at the first sign of trouble. When a real outage hits, there’s no telling how long it will take to recover.

closeparen7y ago

We run that drill every Wednesday morning.

j / k navigate · click thread line to collapse