- The Optus network (the second largest telco network in Australia) went down
- Optus executives initially couldn't coordinate anything because they were... all on Optus
- The network outage lasted ~14 hours
- The network outage affected triple-zero, the country's emergency number, because Optus was rebooting towers, which caused phones to keep connecting to them, so they couldn't fall back to another carrier to dial 000
- After the network was finally back up, Optus blamed "a 3rd party" for causing the outage
- The "3rd party" then turned out to be Singtel - Optus' parent company
- Singtel issued a statement to basically say Optus was wrong
- Optus then issued another statement saying the outage was caused by them using default configuration files on some of their Cisco routers
- The Australian Senate summoned the Optus CEO for a hearing
- Here we are, the CEO resigned
EDIT:
To add more context, just over a year ago (in Sept 2022), Optus, under the same CEO's leadership, had a massive data breach: https://en.wikipedia.org/wiki/2022_Optus_data_breach
https://en.wikipedia.org/wiki/2023_Optus_outage
And the damage went well beyond spotty emergency calls: for example, if you run a small business that relies on credit card payments, you were fucked if your terminals were on the Optus network. The situation was so bad that prepaid SIM cards for Telstra (the main competitor) were selling out in much of the country.
Causes: A Border Gateway Protocol (BGP) routing problem played a role in the outage. Public data from Cloudflare showed a spike in BGP route announcements from the Optus network around the time the outage occurred (over 940,000 announcements in an hour from a node that normally makes fewer than 3,000 announcements per hour), indicative of a BGP routing problem.
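To put those two numbers side by side, here's a quick back-of-the-envelope sketch. The function names and the 10x alert factor are my own assumptions, not anything Cloudflare or Optus uses, and since the baseline is quoted as an upper bound ("fewer than 3,000"), the real ratio is at least this large.

```python
def spike_ratio(observed: int, baseline: int) -> float:
    """Return how many times the observed hourly count exceeds the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return observed / baseline


def is_anomalous(observed: int, baseline: int, factor: float = 10.0) -> bool:
    """Flag an hour whose announcement count is at least `factor` times the baseline.

    The default factor of 10 is an arbitrary illustrative threshold.
    """
    return spike_ratio(observed, baseline) >= factor


# Figures quoted above: 940,000 announcements in an hour vs. a usual ceiling of 3,000.
ratio = spike_ratio(940_000, 3_000)
print(f"roughly {ratio:.0f}x the usual hourly volume")
print("anomalous:", is_anomalous(940_000, 3_000))
```

Even with a very forgiving threshold, a jump of more than 300x is not something you would mistake for normal churn.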
[snip] committee describes the outage as a gradual event triggered by loss of connectivity between neighbouring computer networks. The report suggests that approximately 90 edge provider routers disconnected as an automated protective measure against routing update overload. The failures occurred following a software upgrade at a North American Singtel exchange that caused one of the routers to disconnect. This, in turn, triggered Optus's routers to rapidly update its own routing tables which triggered the shutdown due to the pre-configured default threshold limits set by Cisco Systems being exceeded. The tabled report and Singtel stressed that the software upgrade was not the cause of the fault.
- The CEO told the Senate hearing that she now carried Telstra and Vodafone SIM cards (maybe practical, terrible optics)
- Transport for NSW is heavily reliant on Optus so public transport was heavily impacted (not entirely Optus' fault, but public outrage means someone had to be the scapegoat)
Would’ve been great to come out of an event like this looking at whether catastrophic fault conditions like this exist elsewhere in our national infra, but it feels like all that’ll happen is Optus gets the shit kicked out of them while other providers count their blessings
Voice calls should be prioritized over other network traffic. Data should be the same. And my phone should roam from AT&T to VZW (or even TMO).
[1] having said that, I've not looked too closely, and it sounds like (unsurprisingly) that might not exactly be the reality...
> "This unexpected overload of IP routing information occurred after a software upgrade at one of the Singtel internet exchanges (known as STiX) in North America, one of Optus’ international networks. During the upgrade, the Optus network received changes in routing information from an alternate Singtel peering router. These routing changes were propagated through multiple layers of our IP Core network. As a result, at around 4:05am (AEDT), the pre-set safety limits on a significant number of Optus network routers were exceeded. Although the software upgrade resulted in the change in routing information, it was not the cause of the incident."
> "It is now understood that the outage occurred due to approximately 90 PE routers automatically self-isolating in order to protect themselves from an overload of IP routing information. These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco)."
> "Several hypotheses and paths to restoration were explored over the period up to 10.30am."
And then in later statements:
> "Nokia is our managed services partner for our network, and they were involved from the very beginning in managing the incident and recovering the network; their staff are based in India in two locations"[2]
One of the key problems appears to be heavy reliance on outsourced Nokia staff in India, who seemingly would have been disconnected from Optus' systems in Australia. Within Australia, local Optus staff may have had Optus-provided mobile phones and so couldn't be reached while the mobile network was down. At a minimum, you'd like to think that on-call operational staff exist near all PE routers and have multiple means of communication, such as mobile phones with other carriers, satellite phones, and fixed Internet connectivity not provided by Optus.
It took 6.5 hours to diagnose the problem and a further 3.5 hours to get 98% of connectivity re-established. Resolving the problem once diagnosed required physical presence at 14 sites across Australia to reset 90 PE routers (as part of "100+ devices").
[1] https://www.aph.gov.au/DocumentStore.ashx?id=2ed95079-023d-4...
[2] https://www.itnews.com.au/news/optus-had-not-contemplated-an...
I've heard from multiple sources that Optus in Australia is somewhat of a cash cow for Singtel, slightly undercutting Telstra's exorbitant pricing whilst absolutely minimising support and administrative costs. The 2022 security breach[0] is potentially a symptom of that.
Interestingly, the article specifies this as the cause:
On Friday, Optus confirmed the outage was due to a configuration issue with more than 90 Cisco routers, which could not cope with changes to routing information supplied from Singtel Internet Exchange (STiX) after a routine software upgrade.
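For anyone wondering how a single limit can take out 90 routers at once, here's a toy model. To be clear about assumptions: the statements don't name the exact feature or the actual threshold, so `ToyRouter`, `route_limit`, the value 1000 and the prefixes below are all made up for illustration. It only shows the general shape of "every router has the same default limit, so one burst of routes trips all of them together, and they stay down until someone resets them".

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    name: str
    route_limit: int                      # hypothetical safety threshold
    routes: set = field(default_factory=set)
    isolated: bool = False

    def receive(self, prefixes):
        """Accept route updates; self-isolate if the table grows past the limit."""
        if self.isolated:
            return                        # an isolated router ignores all updates
        self.routes.update(prefixes)
        if len(self.routes) > self.route_limit:
            self.isolated = True
            self.routes.clear()

    def manual_reset(self):
        """Stand-in for an engineer resetting the device."""
        self.isolated = False


def flood(routers, prefixes):
    """Send the same updates to every router; return the names now isolated."""
    for router in routers:
        router.receive(prefixes)
    return [router.name for router in routers if router.isolated]


# 90 routers, all with the same (made-up) default limit.
routers = [ToyRouter(f"pe{i}", route_limit=1000) for i in range(90)]

normal = {f"10.0.{i}.0/24" for i in range(200)}                       # everyday table
burst = {f"172.16.{i // 256}.{i % 256}/32" for i in range(5000)}      # sudden flood

assert flood(routers, normal) == []       # normal load stays under the limit
down = flood(routers, burst)
print(f"{len(down)} of {len(routers)} routers self-isolated")
```

In this sketch nothing recovers on its own: each router needs `manual_reset()`, which lines up with the reports that restoring service meant resetting devices one site at a time.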
Regardless of the fault origins, this is a justified resignation.
Optus (buck stops with CEO) totally and utterly borked the PR handling of this from the get go.
They had multiple opportunities to get out in front, to shape the story, to make a statement, to answer a few obvious questions, etc.
I've rarely seen a large company like this just ditch the playbook for handling PR in the face of a screwup and sit things out for hours with no comment.
Was it eight hours before any kind of official response of substance?
WTF was going on internally, and where the hell was the PR dept?
I have to assume the CEO knew this was a final straw after the 2022 data breach, and this shit-show PR was essentially a free-kick since the writing was already on the wall.
Was probably also told: you know you're fired, but not until you've fronted the senate committee, we're not going to subject our new CEO to that as their first order of business.
Related: https://www.abc.net.au/news/2023-11-20/kelly-bayer-rosmarin-...
Ten hour outage? Don’t let the door hit you on the way out
The former seems like a much bigger problem in my mind but maybe that’s my bias showing
The half-day outage to 000 emergency calls and other life-critical things seems like more of a high-impact problem than the data loss.
Not saying the data loss is trivial, but the 000 outage definitely wasn't.
Optus's issues are long-running. My understanding is that they're largely driven by Singtel's desire to keep cutting everything to the bone.
I've got no idea about Rosmarin's effectiveness as a CEO.
Sure, it didn't help that she was a no-show for most of the outage, and when she did do interviews the responses were weak or not exactly confidence building that they knew what the problem was or how to fix it.
Even still, I don't see how throwing a new CEO at it is going to result in any meaningful change while Singtel are still the owners and there are no legislative changes to protect consumers and critical infrastructure.
Large parts of the business are driven entirely by short-term metrics, which can be fiddled with to hide problems. The use of contracting firms (both onshore and off) to manage/build/maintain what should, for a telco, be their core strengths only seems to make this worse.
Optus has been a basket case for many years, and was only made worse by appointing a number of ex-banking executives who don't understand the role Optus' products play in their customers' lives.