But for software at Boeing or Tesla, errors are far more critical when they happen, as we saw.
I would love to hear your suggestions and experience on preventing these costly mistakes from happening.
Even your example, "the worst that can happen when our software has a problem is someone won't get a meal on time or a meal is wasted," isn't really true. What if you ordered fish, or oysters, and they were left out too long and caused some kind of food poisoning (just as an example)?
There are many levels of thinking about this problem. Maybe you can have a sticker on the package that reacts to temperature to let someone know the meal isn't safe, etc. You still have to train the user to know what it is, and when it is safe.
So in this simple example, you have software, hardware, redundancy, and user training that all have to happen. Same for things like cars or planes. You're really trying to build a safety critical system, and many times (such as the Boeing example), it isn't just software or hardware that causes the problems, but issues arise at the intersection of both.
For Boeing, it was a lack of user training, lack of good UX, possibly airframe design issues making it prone to stall, hardware issues with the angle-of-attack sensors, not enough redundancy in the angle-of-attack sensors to operate properly, etc.
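To make the redundancy point concrete, here's a minimal sketch of 2-out-of-3 sensor voting, where one faulty reading gets outvoted instead of driving the automation. This is my own illustration (the `vote` function and tolerance are invented), not how any real flight computer works:

    # Hypothetical 2-out-of-3 voter for redundant sensor readings.
    # With three independent sensors, one faulty value can be detected
    # and excluded; with only one or two, you can't tell which is wrong.

    def vote(readings, tolerance=2.0):
        """Return the median reading if at least two sensors agree
        within `tolerance`, otherwise signal a fault."""
        if len(readings) != 3:
            raise ValueError("expected exactly three redundant readings")
        a, b, c = sorted(readings)
        if (b - a) <= tolerance or (c - b) <= tolerance:
            return b          # median agrees with at least one neighbor
        return None           # no two sensors agree -> disengage automation

    print(vote([4.8, 5.1, 5.0]))    # 5.0  (all healthy)
    print(vote([4.8, 5.1, 74.5]))   # 5.1  (one sensor stuck high is outvoted)
    print(vote([1.0, 40.0, 80.0]))  # None (no agreement, fail safe)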
You can never get to a 0% chance of failure. Most of the time you are just attacking the highest chances of failure, since when you get down to the level of faulty parts or mechanical fatigue, things always break.
Of course, each subsystem and integration should have good testing to find all these things, but it's sadly less of a science and more of an art IMHO. And I used to work on rocket software.
Many times, the answers are simpler than you think. Simplicity usually means better operation than overcomplicated error handling. Sometimes you just need to change the whole way you are thinking about the problem.
You were right with the counter-example that we do have redundancy in place that is 'hardware': meal-box stickers and production QA.
Another engineering best practice is to keep proper logs. The Toyota recalls from 2009 to 2011 were likely caused by software bugs, but they weren't able to find the root cause, ostensibly because not enough data were being logged.
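A hedged sketch of what I mean by proper logs: record enough context with each event that a post-incident investigation can reconstruct the state, not just the final error. The field names and events here are made up for illustration:

    # Minimal structured fault logging: record inputs and state alongside
    # the event, so a later investigation can reconstruct what happened.

    import json, logging, time

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_event(event, **context):
        logging.info(json.dumps({
            "t": time.time(),        # timestamp for correlating with other logs
            "event": event,
            **context,               # raw inputs, outputs, and mode at the time
        }))

    log_event("throttle_command",
              pedal_position=0.42, commanded_torque_nm=180, cruise_active=False)
    log_event("fault",
              code="SENSOR_DISAGREE", sensor_a=0.42, sensor_b=0.97)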
Driving using computer vision is heuristic by nature. That's a completely new can of worms altogether. Boeing's case involves more traditional design errors.
For cars, SAE Level 3 automation (where the expectation is that the human driver will respond appropriately to a request to intervene) is a dangerous fool's errand. Either humans drive the car with light assistance, or the automation must take care of everything without a human fallback. Unless the human is constantly driving, their response time degrades and they can't react and take the wheel when a fallback situation occurs. The middle ground of SAE Levels 2 and 3 is inherently dangerous because of human cognition and psychology.
Human-automation interaction is the critical issue that connects both cases.
Yes, all software is going to have bugs and bugs in critical software can cost real lives, but I think we focus too much on the negatives and ignore all of the lives that we've saved because of modern technology. People seem to prefer explainable patterns over random ones, even if the random ones are less common. For some reason, "the pilot must have been overworked" is more acceptable than "an unlikely condition wasn't tested for and the software got into an invalid path", which can look random from the outside.
My point here is that, while software failing is terrible and we should do everything we can to prevent it, we need to recognize that it's often a net benefit. As for practical ways to prevent it, here are a few thoughts:
- formal proofs of correctness
- extensive tests, both automated and manual (a small sketch follows this list)
- frozen compilers
- limited scope; the less code there is, the easier it is to make reliable
- high quality hardware (an unexpected bit flip is just as deadly as a software bug)
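For the testing bullet, here's a hedged sketch of one way to hunt for the "unlikely condition nobody tested": a property-based test (using the Python hypothesis library) that throws generated inputs at an invariant instead of a few hand-picked cases. The `clamp` helper is just a stand-in example, not from any real flight or vehicle code:

    # Property-based test: generate many inputs and check an invariant
    # that must always hold, rather than a handful of hand-written cases.

    from hypothesis import given, strategies as st

    def clamp(value, lo, hi):
        """Limit a command to its allowed range."""
        return max(lo, min(hi, value))

    @given(st.floats(allow_nan=False, allow_infinity=False),
           st.floats(allow_nan=False, allow_infinity=False),
           st.floats(allow_nan=False, allow_infinity=False))
    def test_clamp_stays_in_range(value, a, b):
        lo, hi = min(a, b), max(a, b)
        result = clamp(value, lo, hi)
        assert lo <= result <= hi   # invariant: output never leaves the range

    # run with: pytest -q  (hypothesis will try corner cases like -0.0 and 1e308)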
I don't write critical software like this, but I do read about it, such as NASA's design guidelines. However, we have to accept that there will be errors when going into a critical project, and do everything we can to prevent them.
In general, I think the direction is to cover any software development with formal proofs to detect any possibility of an unexpected system state.
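In that spirit, even short of full formal proofs, a brute-force reachability check over a small state machine catches "unexpected state" bugs. A toy sketch of that idea (my own illustration; real model checkers and proof assistants do this rigorously):

    # Toy reachability check: enumerate every state reachable from the
    # initial state and verify each one is in the expected set.

    from collections import deque

    STATES = {"IDLE", "HEATING", "READY", "FAULT"}
    TRANSITIONS = {
        ("IDLE", "start"):       "HEATING",
        ("HEATING", "done"):     "READY",
        ("HEATING", "overtemp"): "FAULT",
        ("READY", "reset"):      "IDLE",
        ("FAULT", "reset"):      "IDLE",
    }
    EVENTS = {"start", "done", "overtemp", "reset"}

    def reachable(initial="IDLE"):
        seen, queue = {initial}, deque([initial])
        while queue:
            state = queue.popleft()
            for event in EVENTS:
                nxt = TRANSITIONS.get((state, event), state)  # undefined event: stay put
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    assert reachable() <= STATES, "reached a state outside the expected set"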
The failure on the MAX was related to computers moving control surfaces (flaps, rudder, etc.) on the plane automatically based on sensor readings. Questions should be asked about what problem that was solving, and whether it was a requirement (aka a complaint) from pilots. Why was that feature necessary?
If you mean software development for mission-critical things that control movement, like aircraft, drones, factory robots, etc., I would assume these engineers use verified compilers/toolchains like the CompCert project to implement models they have already formally analyzed (http://symbolaris.com/course/fcps17.html). But I've never done mission-critical work, just dabbled in it to apply those methods to non-critical software.
Anyway, there are a few things anyone can do against human stupidity.
The most important thing is to have a decent backbone as an engineer. Do not take shortcuts in safety even if your stupid or arrogant boss tells you to do so.
Start with researching formal "quality management systems." At the very least that'll introduce you to using FMEAs, external standards, rigorous design reviews and testing, quality gates between development and production (if they're not a PITA, they're not working) and traceability for everything in production.
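As a concrete taste of the FMEA part: failure modes get scored for severity, occurrence, and detectability, and the product (the Risk Priority Number) tells you where to spend effort first. A tiny sketch with invented entries and ratings, purely for illustration:

    # Illustrative FMEA scoring: rank failure modes by Risk Priority Number
    # (severity x occurrence x detection, each rated 1-10). The entries and
    # ratings below are made up, not from a real analysis.

    failure_modes = [
        {"mode": "AoA sensor reads high", "sev": 9, "occ": 3, "det": 6},
        {"mode": "Meal left above 5 C",   "sev": 7, "occ": 4, "det": 3},
        {"mode": "Log buffer overflows",  "sev": 4, "occ": 5, "det": 8},
    ]

    for fm in failure_modes:
        fm["rpn"] = fm["sev"] * fm["occ"] * fm["det"]

    for fm in sorted(failure_modes, key=lambda f: f["rpn"], reverse=True):
        print(f'{fm["rpn"]:>4}  {fm["mode"]}')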
There are off-the-shelf learning materials for all of it.
If you do go down that road, hire someone with QMS experience to design your process and hand-hold your team through the transition. Otherwise you're likely to over-complicate it for a less effective result.
Here Boeing misclassified MCAS, giving it a lower risk rating than was later identified, which led to a more relaxed development process. I hope light will be shed on what happened at Boeing, because it looks like an intern did MCAS.
They’re developing a whole other project in isolation from yours and the only thing that needs to be agreed upon is the interface(s).