Software Engineering Lessons from Aviation (opens in new tab)

(riceo.me)

98 pointsriceo1007y ago55 comments

55 comments

34 comments · 10 top-level

jefffoster7y ago· 17 in thread

I think there's a lot to learn from the aviation industry. I did a talk at my companies internal conference on this (turned into words at https://medium.com/ingeniouslysimple/why-dont-planes-crash-1...).

For me it's the mindset that differs. Too often as software engineers we find a bug and just fix it. Aviation goes a step deeper and finds the environment that created the bug and stops that.

Unfortunately, the recent 737 MAX incidents seem to have changed this. From what I understand the reaction to the problems sounds more like what I'd expect a software business to do, rather than the airline industry!

ken7y ago

There are a handful of highly respected books that everyone knows software engineers should read, like "The Mythical Man-Month" and "Peopleware". Yet whenever I read one of these, I found I learned very little. Everything in them was obvious -- to those who are 'in the trenches'. What we need is a way to get managers to read these, and take them to heart. Even when my manager had a copy of the book sitting on their desk, they rarely had read it, and they absolutely never followed its advice.

(When pushed, they might say "That was a groundbreaking book, for its day, but the industry moved on." Now we've got open floor plans, and AGILE SCRUM, and free snacks ... and also no evidence these are an improvement to the software development process, but never mind.)

This aviation mindset you refer to is the same way. I can't tell you how many times this happened to me:

- User clicks a button, and it doesn't do what it says it should.

- A bug is filed, and assigned to me.

- I investigate, and find the problem. I start preparing a fix.

- Manager comes by to pester me. "Why isn't this button fixed? Shouldn't that have been a quick fix?" We played Planning Poker last week and everybody else who isn't working on it agreed it should only be a 1!

- See, we're computing this value incorrectly, and I grepped the codebase and it turns out we're also doing it wrong in 7 other places, which causes...

- "The customer wants this one button fixed. Don't worry about the others. Don't worry about testing, or cleaning up, or documenting why the mistake was made or how it should have been done. Those aren't on this milestone. Just fix this one button and move on. We need you working on the new features we promised our customers this month..."

Modern software development is a circus of improperly aligned incentives.

asn1parse7y ago

When they did the open space with free snacks I told them I was never coming into the office again. I pointed out that I got sick roughly 3 times a year from just being forced to spend time in that hazardous office. And I then I sealed it by telling them that they will get 2 extra hours of work out of me each day plus all the CO2 I wouldnt be pumping into the atmosphere each day when I drive there on the perilous freeways with all the other pissed off people on the road. And they bought it. This was 2012. Ffw to 2019, I still work at home for the same company. I never get sick, I dont have to deal with any of the toxic culture stuff at work, and I buy and consume my own snacks, the ones I want to eat. I agree, it's a circus and it's a lot easier to manage from a distance.

1 more reply

mixmastamyk7y ago

I know few of us want to become managers, but when you find a harmful one, it's time to take on that responsibility.

2 more replies

Zaak7y ago

> Don't worry about the others. Don't worry about testing, or cleaning up, or documenting why the mistake was made or how it should have been done.

Sounds like getting management to truly understand the concept of technical debt is the core problem.

0x4454427y ago

The blog post was good but it was just another variation highlighting the age old conundrum... fast, good and cheap, pick two.

Those that value quality are going to be swimming up stream in most organization that develop software because the bean counters always go straight to fast and cheap.

maxxxxx7y ago

I work in regulated industry and you really don’t want the level of scrutiny regulated processes have in other industries. Innovation would slow down to a crawl or pretty much stop. In a lot of industries you can make trade offs quality vs speed or innovation and be better off by not having perfect quality.

2 more replies

qznc7y ago

I work in automotive. In Europe there is the ASPICE standard which is actually a reasonable guideline for (commercial) software development (unit tests, code reviews, etc). Customers require you to follow it. Top management requires you to follow it. Projects still ignore it. Writing unit tests at the end of a project misses most of the point, for example.

qznc7y ago

This was a nice prompt for me to finish my ASPICE article: http://beza1e1.tuxen.de/aspice.html

ellius7y ago

After fixing a recent bug, I asked my client company what if any postmortem process they had. I informally noted about 8 factors that had driven the resolution time to ~8 hours from what probably could have been 1 or 2. Some of them were things we had no control over, but a good 4-5 were things in the application team's immediate control or within its orbit.

These are issues that will definitely recur in troubleshooting future bugs, and doing a proper postmortem could easily save 250+ man hours over the course of a year. What's more, fixing some of these issues would also aid in application development. So you're looking at immediate cost savings and improved development speed just by doing a napkin postmortem on a simple bug. I can't imagine how much more efficient an organization with an ingrained and professional postmortem culture would be.

qznc7y ago

For anybody into podcasts, I can recommend "Causality" https://engineered.network/causality/

John Chidgey digs into well known catastrophes, analyses what went wrong, and what was fixed afterwards. Not software related but promotes a safety mindset very well.

1 more reply

jasode7y ago

>as software engineers we find a bug and just fix it. [...] Unfortunately, the recent 737 MAX incidents seem to have changed this.

I think there's some nuance about MCAS that's lost in all the media reports. As far as I understand, the MCAS software didn't have a "bug" in the sense we programmers typically think of. (E.g. Mars Climate Orbiter's software programmed with incorrect units-of-measure.[0])

Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

In other words, the MCAS software actually did what Boeing managers specified it to do:

1) Did the software only read a _1_ AOA sensor with a single-point-of-failure instead of reading _2_ sensors? Yes, because that was what Boeing managers wanted the software to do. It was purposefully designed that way. If the software was changed to reconcile 2 sensors, it would then lead to a new "AOA DISAGREE" indicator[1] which would then raise doubts to the FAA that Boeing could just give pilots a simple iPad training orientation instead of expensive flight-sim training. Essentially, Boeing managers were trying to "hack" the FAA criteria for "single type rating".

2) Did software make adjustments of an aggressive and unsafe 2.5 degrees instead of a more gentle and recoverable 0.6 degrees? Yes, because Boeing designed it that way.

Somebody at Boeing specified the software design to be "1 sensor and 2.5 degrees" and apparently, that's what the programmers wrote.

I know we can play with semantics of "bug" vs "design" because they overlap but to me this seems to be a clear case of faulty "design". The distinction between design vs bug is important to let us fix the root cause.

The 737 MAX MCAS software issue isn't like the Mars Climate Orbiter or Therac-25 software bugs. The lessons from MCO and Therac-25 can't be applied to Boeing's MCAS because that unwanted behavior happens in a layer above the programming:

- MCO & Therac: design specifications are correct; software programming was incorrect

- Boeing 737MAX MCAS: design specifications incorrect; software programming was "correct" -- insofar as it matched the (flawed) design specifications

[0] https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_...

[1] yellow "AOA Disagree" text at the bottom of display: https://www.ainonline.com/sites/default/files/styles/ain30_f...

Animats7y ago

That's a different issue. Aircraft systems are classified as to degree of risk. This is from MIL-STD-882C.

- I Catastrophic - Death, and/or system loss, and/or severe environmental damage.

- II Critical - Severe injury, severe occupational illness, major system and/or environmental damage.

- III Marginal - Minor injury, and/or minor system damage, and/or environmental damage.

- IV Negligible - Less then minor injury, or less then minor system or environmental damage.

Now, face it, most webcrap and phone apps are at level IV. Few people in computing outside aerospace regularly work on Level I systems. (Except the self-driving car people, who are working at Level I and need to act like it.)

MCAS started as just an automatic trim system. Those have been around for decades, and they're usually level III systems. They usually have limited control authority, and they usually act rather slowly, on purpose. So auto trim systems don't have the heavy redundancy required of level I and II systems. Then the trim system got additional functionality, control authority, and speed to provide the MCAS capability. Now it could cause real trouble.

At that point, the auto trim system had become a level I system. A level I system requires redundancy in sensors, actuators, electronics, power, and data paths. Plus much more failure analysis. A full fly-by-wire system or a full authority engine control system will have all that.

So either MCAS needed to have more limited authority over trim, so it couldn't cause trim runaway, or it needed the safety features of a Level I system. Boeing did neither. Parts of the company seem to have thought the system didn't have as much authority as it did. ("Authority", in this context, means "how much can you change the setting".)

Management failure.

1 more reply

Kurtz797y ago

"If the software was changed to reconcile 2 sensors, it would then lead to a new "AOA Disagree" indicator which would raise doubts to the FAA that Boeing could just give pilots a simple iPad training orientation instead of expensive flight-sim training."

I always liked this quote from the "Mythical Man-Month": “Never go to sea with two chronometers, take one or three”.

https://blog.ipspace.net/2017/01/never-take-two-chronometers...

1 more reply

cmurf7y ago

>Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

OK but how do we know, how is it demonstrated, that this financial pressure condition has now been mitigated? What is the exact nature of the "fix"? And actually what are all of the closed door conversations, back then and now, about the various possible behaviors for this software routine? How is it they came up with that one? How is it they come up with the new one? And really, why is the first one wrong (aside from the fact there are a bunch of dead people, which is a consequence of the original error)?

And which parts of the design? There are many parts to it. Not all of them are as bad as others.

As a pilot I find it impossible to imagine a closed door room with engineers not computing, let alone not imagining, the potential for this particular failure mode. And if a pilot were present in that closed door session, I find it impossible they would not immediately be bothered by the potential for mistrim at low altitude that would result in too scary a probability of unrecoverability.

It makes me wonder if pilots were even involved at that level of the design and decision making for the feature.

1 more reply

kevin_b_er7y ago

Yes, everyone wants to blame the pilots or the engineers at Boeing.

This was not an engineering problem, it was a greed-created management intentional decision. It was management designed for failure because management changed the goal to put money over life in distinct ways.

qznc7y ago

There is no other useful meaning of "correct code" apart from "matches specification/design". There is no notion of correctness for design. The design may not be consistent with safety requirements for example.

1 more reply

jayd167y ago

Retrospectives are a common part of agile. Only slightly less common is skipping retrospectives.

cjbprime7y ago· 3 in thread

This was great!

> 1. Don’t kill yourself

> 2. Don’t kill anyone else

Could we reorder these, though? Every once in a while a plane will hit a house and kill its occupants (and the pilot, usually) and it's so awful. I think not killing others as a pilot is so much more important than not killing yourself.

X6S1x6Okd1st7y ago

That ordering reminds me of the first rule of search and rescue: don't create another victim.

If your job is to save a life and that life depends on you, you don't do anyone any favors if you die

jacobush7y ago

Well, if you kill yourself the plane may then go on killing others. This is double plus ungood.

https://en.wikipedia.org/wiki/Helios_Airways_Flight_522

maxxxxx7y ago

I think the idea is that if the pilot dies then most likely passengers or others will die too. Someone needs to control the plane.

billfruit7y ago· 3 in thread

Though article isn't about software development in the aviation industry, a few thoughts on that:

The industry is really slow to change its practices and tools. Like the use of C for most software, I do feel a more safer language out to be preferred.

Use of 1553 bus for inter device communication, the bus and protocol aren't general, it is very opinionated/rigid about the manner in which communication should happen. And the hardware parts for it are horrendously expensive compared to most ethernet, IP equipment. There is an aviation ethernet standard, but adoption of it has been slow.

magduf7y ago

>The industry is really slow to change its practices and tools. Like the use of C for most software, I do feel a more safer language out to be preferred.

And what language would that be, where it has absolute determinism (which rules out anything with GC)?

They tried using Ada years ago for avionics. The problem here is that no one knows Ada any more, and no one really wants to make a career out of it since it isn't used anywhere else.

So, in practice, C and (a narrow subset of) C++ get used. Maybe Rust would be a good choice in the future.

HeyLaughingBoy7y ago

it is very opinionated/rigid about the manner in which communication should happen

This could be a strong factor in its popularity. If things must happen in a certain order, then the behavior of the system becomes easier to verify. Ease of verification should never be understated in safety-critical systems.

billfruit7y ago

Yet the industry uses largely the C language, which isn't a model for safety or ease of verification.

2 more replies

starpilot7y ago· 1 in thread

Not killing yourself and a checklist (like we learned in driver's ed but apply informally at best) also apply to driving a car.

bdamm7y ago

Uh, no. In a car, if things go badly you pull off the road and work on a solution. If things to really badly, you have seatbelts, airbags, crumple zones, and a thick frame to help you out.

In an airplane, if things to badly, you keep flying until you land. If things go really badly, remember that everything is built to be light weight, and unless the crash is well controlled, everything will be destroyed and everyone will die. If your engine quits, your cabin ruptures, your instrumentation fails, you keep flying. And you need instruments; in poor visibility, your own sensory inputs are in fact faulty, and won't help you figure out which way is down.

Unlike in a car, where it's pretty obvious where the ground is, for example.

sn7y ago

Checklists and written procedures are very important. One of the earlier things I did when coming into my company was create a written procedure for software upgrades until we had time to automate it with ansible.

One thing I have not had very good discipline about is I want to use checklists both for code submitted for review and when I'm doing reviews. Lint checkers etc. can only go so far.

If anyone has published checklists for code reviews I'd be curious to see them. This one seems reasonable: https://www.liberty.edu/media/1414/%5B6401%5Dcode_review_che... though I'd add concurrency to the list.

myl7y ago

"...plenty of episodes of Mayday/Air Crash Investigation available on Youtube too. (Be warned though, all doomed flights take off from one of the busiest airports in the world .)" Great show. Comment is spot on, and don't forget "investigators were under extreme pressure".

shamino7y ago

Nathan Marz talks about this previously, with unique insights: http://nathanmarz.com/blog/how-becoming-a-pilot-made-me-a-be...

marcosdumay7y ago

I still hold my opinion that checklists are for hardware issues. One should not be filling them on software tasks. Instead, software is automated, automatically tested and automatically verified - routine checks are an anti-feature and inversely correlated to quality.

skookumchuck7y ago

The article talks about pilot procedures, not engineering procedures specific to aviation.

horacio_colbert7y ago

Thinking of aviation makes me remember the impact of doing things right.

j / k navigate · click thread line to collapse

55 comments

34 comments · 10 top-level

jefffoster7y ago· 17 in thread

For me it's the mindset that differs. Too often as software engineers we find a bug and just fix it. Aviation goes a step deeper and finds the environment that created the bug and stops that.

ken7y ago

This aviation mindset you refer to is the same way. I can't tell you how many times this happened to me:

- User clicks a button, and it doesn't do what it says it should.

- A bug is filed, and assigned to me.

- I investigate, and find the problem. I start preparing a fix.

- See, we're computing this value incorrectly, and I grepped the codebase and it turns out we're also doing it wrong in 7 other places, which causes...

Modern software development is a circus of improperly aligned incentives.

asn1parse7y ago

1 more reply

mixmastamyk7y ago

I know few of us want to become managers, but when you find a harmful one, it's time to take on that responsibility.

2 more replies

Zaak7y ago

> Don't worry about the others. Don't worry about testing, or cleaning up, or documenting why the mistake was made or how it should have been done.

Sounds like getting management to truly understand the concept of technical debt is the core problem.

0x4454427y ago

The blog post was good but it was just another variation highlighting the age old conundrum... fast, good and cheap, pick two.

Those that value quality are going to be swimming up stream in most organization that develop software because the bean counters always go straight to fast and cheap.

maxxxxx7y ago

2 more replies

qznc7y ago

This was a nice prompt for me to finish my ASPICE article: http://beza1e1.tuxen.de/aspice.html

ellius7y ago

qznc7y ago

For anybody into podcasts, I can recommend "Causality" https://engineered.network/causality/

John Chidgey digs into well known catastrophes, analyses what went wrong, and what was fixed afterwards. Not software related but promotes a safety mindset very well.

1 more reply

jasode7y ago

>as software engineers we find a bug and just fix it. [...] Unfortunately, the recent 737 MAX incidents seem to have changed this.

Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

In other words, the MCAS software actually did what Boeing managers specified it to do:

2) Did software make adjustments of an aggressive and unsafe 2.5 degrees instead of a more gentle and recoverable 0.6 degrees? Yes, because Boeing designed it that way.

Somebody at Boeing specified the software design to be "1 sensor and 2.5 degrees" and apparently, that's what the programmers wrote.

- MCO & Therac: design specifications are correct; software programming was incorrect

- Boeing 737MAX MCAS: design specifications incorrect; software programming was "correct" -- insofar as it matched the (flawed) design specifications

[0] https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_...

[1] yellow "AOA Disagree" text at the bottom of display: https://www.ainonline.com/sites/default/files/styles/ain30_f...

Animats7y ago

That's a different issue. Aircraft systems are classified as to degree of risk. This is from MIL-STD-882C.

- I Catastrophic - Death, and/or system loss, and/or severe environmental damage.

- II Critical - Severe injury, severe occupational illness, major system and/or environmental damage.

- III Marginal - Minor injury, and/or minor system damage, and/or environmental damage.

- IV Negligible - Less then minor injury, or less then minor system or environmental damage.

Management failure.

1 more reply

Kurtz797y ago

I always liked this quote from the "Mythical Man-Month": “Never go to sea with two chronometers, take one or three”.

https://blog.ipspace.net/2017/01/never-take-two-chronometers...

1 more reply

cmurf7y ago

>Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

And which parts of the design? There are many parts to it. Not all of them are as bad as others.

It makes me wonder if pilots were even involved at that level of the design and decision making for the feature.

1 more reply

kevin_b_er7y ago

Yes, everyone wants to blame the pilots or the engineers at Boeing.

qznc7y ago

1 more reply

jayd167y ago

Retrospectives are a common part of agile. Only slightly less common is skipping retrospectives.

cjbprime7y ago· 3 in thread

This was great!

> 1. Don’t kill yourself

> 2. Don’t kill anyone else

X6S1x6Okd1st7y ago

That ordering reminds me of the first rule of search and rescue: don't create another victim.

If your job is to save a life and that life depends on you, you don't do anyone any favors if you die

jacobush7y ago

Well, if you kill yourself the plane may then go on killing others. This is double plus ungood.

https://en.wikipedia.org/wiki/Helios_Airways_Flight_522

maxxxxx7y ago

I think the idea is that if the pilot dies then most likely passengers or others will die too. Someone needs to control the plane.

billfruit7y ago· 3 in thread

Though article isn't about software development in the aviation industry, a few thoughts on that:

The industry is really slow to change its practices and tools. Like the use of C for most software, I do feel a more safer language out to be preferred.

magduf7y ago

>The industry is really slow to change its practices and tools. Like the use of C for most software, I do feel a more safer language out to be preferred.

And what language would that be, where it has absolute determinism (which rules out anything with GC)?

They tried using Ada years ago for avionics. The problem here is that no one knows Ada any more, and no one really wants to make a career out of it since it isn't used anywhere else.

So, in practice, C and (a narrow subset of) C++ get used. Maybe Rust would be a good choice in the future.

HeyLaughingBoy7y ago

it is very opinionated/rigid about the manner in which communication should happen

billfruit7y ago

Yet the industry uses largely the C language, which isn't a model for safety or ease of verification.

2 more replies

starpilot7y ago· 1 in thread

Not killing yourself and a checklist (like we learned in driver's ed but apply informally at best) also apply to driving a car.

bdamm7y ago

Uh, no. In a car, if things go badly you pull off the road and work on a solution. If things to really badly, you have seatbelts, airbags, crumple zones, and a thick frame to help you out.

Unlike in a car, where it's pretty obvious where the ground is, for example.

sn7y ago

One thing I have not had very good discipline about is I want to use checklists both for code submitted for review and when I'm doing reviews. Lint checkers etc. can only go so far.

myl7y ago

shamino7y ago

Nathan Marz talks about this previously, with unique insights: http://nathanmarz.com/blog/how-becoming-a-pilot-made-me-a-be...

marcosdumay7y ago

skookumchuck7y ago

The article talks about pilot procedures, not engineering procedures specific to aviation.

horacio_colbert7y ago

Thinking of aviation makes me remember the impact of doing things right.

j / k navigate · click thread line to collapse