The largest of these calls I've seen had well into the hundreds of software engineers, managers, network engineers, etc.
Thanks for the answer, I have only ever worked with such a small team that we are all on a call every day.
I imagine it can get a little hectic in large group calls? On the engineering side, is there a command structure? Say the root cause was found and the RC team is rushing to fix it, but another team wants to mitigate in the meantime in a slightly risky way. Would their manager make a case with leadership? Would the proposed plan just be put out for general comment as a response to that main ticket?
Our major incident process generally had a “suit” call with non-technical executives and people who would be coordinating customer triage, outreach, etc. Then we would have a tech bridge where the key stakeholders did their thing.
We used the federal Incident Command System as a model. It’s a great reference point to use as inspiration.
Another team would then analyse the root cause from a company-wide perspective, assess the risks, costs and impact, and make any modifications (possibly redoing the temporary fix and fixing it properly).
A real issue: a call center's main telephony system and one of the management servers kept crashing, causing over 1400 call center people to stop working. The temporary fix was to reboot the servers every 4 hours, which caused minor pain, but the call staff were up and running.
After a whole stupid week of the engineers not being able to find the root cause, it was escalated extremely high, our team was brought in, and we found the root cause in seconds (literally). The servers were VMs and the engineers hadn't checked the physical ESX server they were hosted on. Another VM on the box caused the server to go unstable (ESX not configured correctly).
A BAU project was set up to audit, report on, and fix all the ESX servers in the company for other stupid config issues.
Usually the way it works is that we have multiple clearly identified and properly handed-off roles. There's an Incident Commander (IC), whose job is basically to oversee the whole situation, and there are various responders (including a primary one) whose job is to mitigate/fix the problems, usually relating to their own team/platform/infra (networking, security, virtualization clusters, capacity planning, logging, etc., depending on the outage). There's also sometimes a communications person (I forget the role name specifically) whose job is to keep people updated, both those internal to the outage (responders, etc.) and outsiders (public-facing comms, either to other internal teams affected by the outage or even to external customers).
Depending on the size of the outage, the IC may establish a specific "war room" channel (used to be an IRC chatroom, not sure what they use these days though) where most communication from the various interested parties will take place. The advantage of a chatroom is that it lets you maintain communication logs and timestamps (useful for postmortem and timeline purposes), and it helps when handing off to the next oncaller during a shift change (they can read the history of what happened).
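To make the log/timestamp point concrete, here's a rough sketch of what a war-room log buys you. This is not the tooling we actually used; `IncidentLog`, `LogEntry`, the role names, and the sample messages are all made up for illustration. It just assumes every message is stored with a timestamp, an author, and a role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class LogEntry:
    """One war-room message (hypothetical structure, for illustration only)."""
    timestamp: datetime
    author: str
    role: str      # e.g. "IC", "responder", "comms" (assumed role labels)
    message: str


@dataclass
class IncidentLog:
    """In-memory stand-in for a chat channel's history."""
    entries: list = field(default_factory=list)

    def post(self, author, role, message, timestamp=None):
        # Default to "now" in UTC so entries from different sites line up.
        ts = timestamp or datetime.now(timezone.utc)
        entry = LogEntry(ts, author, role, message)
        self.entries.append(entry)
        return entry

    def since(self, cutoff):
        # What an incoming oncaller would read at shift change.
        return [e for e in self.entries if e.timestamp >= cutoff]

    def timeline(self):
        # Chronological view, the raw material for a postmortem timeline.
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return "\n".join(
            f"{e.timestamp:%H:%M} [{e.role}] {e.author}: {e.message}"
            for e in ordered
        )


# Invented sample data with fixed timestamps.
t0 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
log = IncidentLog()
log.post("alice", "IC", "Declaring incident, opening war room", t0)
log.post("bob", "responder", "Network looks healthy, checking storage",
         t0 + timedelta(minutes=5))
log.post("carol", "comms", "Status page updated", t0 + timedelta(minutes=12))

print(log.timeline())
handoff = log.since(t0 + timedelta(minutes=5))  # entries for the next oncaller
```

The same stored history serves both uses mentioned above: `timeline()` for the postmortem and `since()` for catching up at handoff.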
> There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?
Most people will not really be doing much, but when you need to diagnose a problem, having a lot of brains with expertise in different domains helps, especially if those people are the ones who implemented a certain service that might be obscure to the other oncallers. Generally speaking, it wouldn't be unheard of to have 30-40 people in the same IRC channel brainstorming and coordinating a cross-team effort to mitigate a problem, but into the hundreds? I'm not so sure about that many.
Just my two cents. You can probably get more info by reading the Google SRE book https://sre.google/books/
I think the "real world" doesn't work like that. The way the real world works is that things are decoupled in a way that one system's failure doesn't bring the entire world down. So things can be solved in isolation by people that actually understand the system and/or systems are designed in a way that they are serviceable etc.
When the power fails in my neighbourhood, you don't get 100 engineers on a hotline; one van comes down, troubleshoots the problem, and fixes it. Like 3 technicians.
I know there are some exceptions like some power failures that cascaded or the global supply shortages. But those are design failures IMO. A computer system that goes down for this length of time and nobody can figure out why or recover, that seems like a total failure to me on multiple levels. We're just doing this wrong.
A lot of the time root cause is solved by a smaller number of people. But identifying root cause and mitigating impact during an event -- and then communicating specifics of that impact -- can fall to a much larger group.
If 1-3 people are actively solving the issue, they do so alone, and give periodic updates to the broader group through a manager or other communication liaison.
97 people to check/restart/monitor their team's system, because the Vital Component has never failed before so their graceful recovery code is untested or nonexistent.