UK Air traffic control network crash

(news.sky.com)

33 points · woodylondon · 2y ago · 23 comments

16 comments · 5 top-level

woodylondonOP2y ago· 8 in thread

How is this even possible in this day and age? The redundancy for this must be crazy.

jacquesm2y ago

Because large systems are complex and have very complex, hard-to-detect failure modes.

What is more impressive is how rare these events are, given the complexity of the underlying systems. Redundancy is not without its own problems (source of truth, for instance).

mike_hearn2y ago

They aren't that rare. Last UK ATC crash was in 2014 I think:

https://www.ft.com/content/65544730-8216-11e4-b9d0-00144feab...

And that's just the UK. Airport systems crashing is a regular occurrence. Paris crashed due to still relying on a Windows 3.1 system:

https://www.zdnet.com/article/a-23-year-old-windows-3-1-syst...

The IT is just poor. Tech firms routinely change much more massive and complex systems at a far faster pace compared to the stagnation found in airline IT, and yet total failures are not more frequent.

somat2y ago

A lot of times large-scale outages like this happen because of the redundancy. The whole system is interlinked with automatic failover. Then it hits a corner case that was not engineered into the failure model, and you get a cascading system failure where each node automatically starts bringing down other nodes. Basically the lesson is: in highly interlinked systems you get highly interlinked failures.

And then after a lot of angry words and finger pointing this new failure gets added to the failure model.

My personal takeaway, after chasing the long tail of automatic failover on a few projects, is that quite often it is better to drop a few 9s from your service goal, decouple some of the systems, and accept that while some parts of the system may go down, they should not bring everything down with them.
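To make the cascade concrete, here is a toy sketch, not a model of any real ATC system. It assumes a pool of identical nodes with a fixed capacity, where automatic failover spreads a dead node's load evenly over the survivors. The function name `simulate`, the node names and the numbers are all made up for illustration.

```python
def simulate(loads, capacity, first_failure, failover=True):
    """Return the set of nodes that are down once the failure has settled.

    loads: dict of node name -> current load (hypothetical units)
    capacity: load above which a node falls over
    failover: if True, a dead node's load is shared among the survivors;
              if False, that load is simply dropped (decoupled design)
    """
    loads = dict(loads)          # work on a copy
    down = set()
    pending = [first_failure]    # nodes that are about to fail
    while pending:
        node = pending.pop()
        if node in down:
            continue             # already handled
        down.add(node)
        shed = loads.pop(node)
        survivors = list(loads)
        if not failover or not survivors:
            continue             # nothing is redistributed
        share = shed / len(survivors)
        for name in survivors:
            loads[name] += share
            if loads[name] > capacity:
                pending.append(name)   # overloaded by the failover itself
    return down


nodes = {"a": 80, "b": 80, "c": 80, "d": 80}
print(sorted(simulate(nodes, capacity=100, first_failure="a")))
print(sorted(simulate(nodes, capacity=100, first_failure="a", failover=False)))
```

With every node running at 80% of capacity, losing one node with failover enabled pushes the rest over the limit and the whole pool goes down. With failover disabled only the original node is lost, which is the "drop a few 9s and decouple" trade-off in miniature.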

darkclouds2y ago

The US air traffic control crash at the start of the year turned out to involve a decades-old system which had not been upgraded. I think the UK one is a little more recent, but as with a lot of software, people don't want to pay the real costs of software development.

ExoticPearTree2y ago

Complexity tends to hide problems. An over-engineered system is going to be less stable than a simple one where you know how things can break and how to bring them back online pretty quickly.

gumballindie2y ago

They agiled software to the point where it’s written by product managers and wannabe engineers that cargo cult “good practice” without understanding what it’s meant to do. These types of issues are common in British-made software.

terom2y ago

System redundancy rarely covers software faults.
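A minimal sketch of why that is, assuming a primary and a backup that run the same code. The parser `parse_plan`, the `CALLSIGN:WAYPOINT,...` format and the sample inputs are hypothetical and are not taken from any real flight-plan system.

```python
def parse_plan(text):
    """Hypothetical parser for 'CALLSIGN:WAYPOINT,WAYPOINT,...'."""
    callsign, route = text.split(":")
    waypoints = route.split(",")
    if any(w == "" for w in waypoints):
        # The corner case nobody planned for: an empty waypoint name.
        raise ValueError("empty waypoint in route")
    return callsign, waypoints


def process_with_failover(text, replicas):
    """Try each replica in order; return (replica name, result) or (None, None)."""
    for name, handler in replicas:
        try:
            return name, handler(text)
        except ValueError:
            continue  # fail over to the next replica
    return None, None


# Both replicas run identical software, so they share every bug.
replicas = [("primary", parse_plan), ("backup", parse_plan)]

print(process_with_failover("AB123:ALPHA,BRAVO", replicas))
print(process_with_failover("AB123:ALPHA,,BRAVO", replicas))
```

The good input is handled by the primary. The bad input takes out the primary and then the backup in exactly the same way, so the failover buys nothing. Redundancy of this kind protects against a box dying, not against a shared bug.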

defrost2y ago

You'd hope an air traffic control centre would have a big box of popsicle sticks and black pen markers to cover the fallover of power | backup power | digital systems.

OBFUSCATED2y ago· 2 in thread

Russian attack, similar to the one in Poland last week? https://www.bbc.com/news/world-europe-66630260.amp

omega32y ago

Unlikely, they’ve already arrested suspects following the attack

https://www.wprost.pl/amp/11364934/atak-na-polska-infrastruk...

darkclouds2y ago

No, British Security Services, protecting the economy by getting people and businesses to spend their money whether they like it or not.

gumballindie2y ago· 1 in thread

I am genuinely convinced that the UK is in collapse. The NHS is flattened, policing is meh, the infrastructure is pretty banged up, bills are through the roof, inflation and interest rates are mad high, and now this. Not to mention a noticeable drop in service quality throughout the private sector, with everyone blaming “staff shortage” and a government desperate to suppress the very same wages that are meant to attract more qualified staff. Seriously, this is madness.

thorin2y ago

Most of these are global issues though in the developed world, even healthcare to an extent due to ageing population. I'm not sure what this has to do with an air traffic control outage either?

It does seem that a lot of infrastructure that was put in place during a golden age cannot be adequately maintained, but AFAIK this seems to be the case in USA and Europe too.

woodylondonOP2y ago

According to unconfirmed reports, a French airline entered a flight plan incorrectly, resulting in some form of data corruption. https://www.spectator.co.uk/article/is-one-badly-filed-fligh...

NATS has reported that the system is now back up and running. Is it possible that the secondary system took over, with data syncing having to take place before the switchover?

My original post was about my interest in the technical aspects of redundancy and failover mechanisms in something as mission-critical as air traffic control.

I come from a background in broadcasting, where failover is critical and redundancy is built into any broadcast chain. We have multiple backups and jumping-off points to deal with any issues that arise. It's pretty rare that we would ever go to “black”.

pdx_flyer2y ago

Lots of US-bound traffic took very southerly routings today to avoid UK airspace. My guess is that this will continue until the outage is completely resolved.
