But for software at Boeing or Tesla, errors are far more critical when they happen, as we saw.
I would love to hear your suggestions and experience on preventing these costly mistakes from happening.
Even your example, "the worst that can happen when our software has a problem is someone won't get a meal on time or a meal is wasted," isn't really true. What if you ordered fish, or oysters, and they were left out too long and caused some kind of food poisoning (just as an example)?
There are many levels of thinking about this problem. Maybe you can have a sticker on the package that reacts to temperature to let someone know the meal isn't safe, etc. You still have to train the user to know what it is, and when it is safe.
So in this simple example, you have software, hardware, redundancy, and user training that all have to happen. Same for things like cars or planes. You're really trying to build a safety critical system, and many times (such as the Boeing example), it isn't just software or hardware that causes the problems, but issues arise at the intersection of both.
For Boeing, it was a lack of user training, lack of good UX, possibly airframe design issues making it prone to stall, hardware issues with the angle-of-attack sensors, not enough redundancy in the angle-of-attack sensors to operate properly, etc.
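To make the redundancy point concrete, here's a minimal sketch of 2-out-of-3 sensor voting, where one faulty reading gets outvoted instead of driving the automation. This is my own illustration (the `vote` function and tolerance are invented), not how any real flight computer works:

    # Hypothetical 2-out-of-3 voter for redundant sensor readings.
    # With three independent sensors, one faulty value can be detected
    # and excluded; with only one or two, you can't tell which is wrong.

    def vote(readings, tolerance=2.0):
        """Return the median reading if at least two sensors agree
        within `tolerance`, otherwise signal a fault."""
        if len(readings) != 3:
            raise ValueError("expected exactly three redundant readings")
        a, b, c = sorted(readings)
        if (b - a) <= tolerance or (c - b) <= tolerance:
            return b          # median agrees with at least one neighbor
        return None           # no two sensors agree -> disengage automation

    print(vote([4.8, 5.1, 5.0]))    # 5.0  (all healthy)
    print(vote([4.8, 5.1, 74.5]))   # 5.1  (one sensor stuck high is outvoted)
    print(vote([1.0, 40.0, 80.0]))  # None (no agreement, fail safe)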
You can never get to a 0% chance of failure. Most of the time you are just attacking the highest chances of failure, since when you get down to the level of faulty parts or mechanical fatigue, things always break.
Of course, each subsystem and integration should have good testing to find all these things, but it's sadly less of a science and more of an art IMHO. And I used to work on rocket software.
Many times, the answers are simpler than you think. Simplicity usually means better operation than overcomplicated error handling. Sometimes you just need to change the whole way you are thinking about the problem.
You were right with the counter-example that we do have redundancy in place that is 'hardware': meal-box stickers and production QA.
Another engineering best practice is to keep proper logs. The Toyota recalls from 2009 to 2011 were likely caused by software bugs, but they weren't able to find the root cause, ostensibly because not enough data were being logged.
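A hedged sketch of what I mean by proper logs: record enough context with each event that a post-incident investigation can reconstruct the state, not just the final error. The field names and events here are made up for illustration:

    # Minimal structured fault logging: record inputs and state alongside
    # the event, so a later investigation can reconstruct what happened.

    import json, logging, time

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_event(event, **context):
        logging.info(json.dumps({
            "t": time.time(),        # timestamp for correlating with other logs
            "event": event,
            **context,               # raw inputs, outputs, and mode at the time
        }))

    log_event("throttle_command",
              pedal_position=0.42, commanded_torque_nm=180, cruise_active=False)
    log_event("fault",
              code="SENSOR_DISAGREE", sensor_a=0.42, sensor_b=0.97)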
Driving using computer vision is heuristic by nature. That's a completely new can of worms altogether. Boeing's case involves more traditional design errors.
For cars, SAE Level 3 automation (where the expectation is that the human driver will respond appropriately to a request to intervene) is a dangerous fool's errand. Either humans drive the car with light assistance, or the automation must take care of everything without a human fallback. Unless the human is constantly driving, their response time degrades and they can't react and take the wheel when a fallback situation occurs. The middle ground of SAE Levels 2 and 3 is inherently dangerous because of human cognition and psychology.
Human-automation interaction is the critical issue that connects both cases.
Yes, all software is going to have bugs and bugs in critical software can cost real lives, but I think we focus too much on the negatives and ignore all of the lives that we've saved because of modern technology. People seem to prefer explainable patterns over random ones, even if the random ones are less common. For some reason, "the pilot must have been overworked" is more acceptable than "an unlikely condition wasn't tested for and the software got into an invalid path", which can look random from the outside.
My point here is that, while software failing is terrible and we should do everything we can to prevent it, we need to recognize that it's often a net benefit. As for practical ways to prevent it, here are a few thoughts:
- formal proofs of correctness
- extensive tests, both automated and manual (a small sketch follows this list)
- frozen compilers
- limited scope; the less code there is, the easier it is to make reliable
- high quality hardware (an unexpected bit flip is just as deadly as a software bug)
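For the testing bullet, here's a hedged sketch of one way to hunt for the "unlikely condition nobody tested": a property-based test (using the Python hypothesis library) that throws generated inputs at an invariant instead of a few hand-picked cases. The `clamp` helper is just a stand-in example, not from any real flight or vehicle code:

    # Property-based test: generate many inputs and check an invariant
    # that must always hold, rather than a handful of hand-written cases.

    from hypothesis import given, strategies as st

    def clamp(value, lo, hi):
        """Limit a command to its allowed range."""
        return max(lo, min(hi, value))

    @given(st.floats(allow_nan=False, allow_infinity=False),
           st.floats(allow_nan=False, allow_infinity=False),
           st.floats(allow_nan=False, allow_infinity=False))
    def test_clamp_stays_in_range(value, a, b):
        lo, hi = min(a, b), max(a, b)
        result = clamp(value, lo, hi)
        assert lo <= result <= hi   # invariant: output never leaves the range

    # run with: pytest -q  (hypothesis will try corner cases like -0.0 and 1e308)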
I don't write critical software like this, but I do read about it, such as NASA's design guidelines. However, we have to accept that there will be errors when going into a critical project, and do everything we can to prevent them.
In general, I think the direction is to cover any software development with formal proofs to detect any possibility of an unexpected system state.
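In that spirit, even short of full formal proofs, a brute-force reachability check over a small state machine catches "unexpected state" bugs. A toy sketch of that idea (my own illustration; real model checkers and proof assistants do this rigorously):

    # Toy reachability check: enumerate every state reachable from the
    # initial state and verify each one is in the expected set.

    from collections import deque

    STATES = {"IDLE", "HEATING", "READY", "FAULT"}
    TRANSITIONS = {
        ("IDLE", "start"):       "HEATING",
        ("HEATING", "done"):     "READY",
        ("HEATING", "overtemp"): "FAULT",
        ("READY", "reset"):      "IDLE",
        ("FAULT", "reset"):      "IDLE",
    }
    EVENTS = {"start", "done", "overtemp", "reset"}

    def reachable(initial="IDLE"):
        seen, queue = {initial}, deque([initial])
        while queue:
            state = queue.popleft()
            for event in EVENTS:
                nxt = TRANSITIONS.get((state, event), state)  # undefined event: stay put
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    assert reachable() <= STATES, "reached a state outside the expected set"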
The failure on the MAX was related to computers moving control surfaces (flaps, rudder, etc.) on the plane automatically based on sensor readings. Questions should be asked about what problem that was solving, and whether it was a requirement (aka a complaint) from pilots. Why was that feature necessary?
If you mean software development for mission-critical things that control movement, like aircraft, drones, factory robots, etc., I would assume these engineers use verified compilers/toolchains like the CompCert project to implement models they have already formally analyzed (http://symbolaris.com/course/fcps17.html). But I've never done mission-critical work, just dabbled in it to apply those methods to non-critical software.
Anyway, there are a few things anyone can do against human stupidity.
The most important thing is to have a decent backbone as an engineer. Do not take shortcuts in safety even if your stupid or arrogant boss tells you to do so.
Start with researching formal "quality management systems." At the very least that'll introduce you to using FMEAs, external standards, rigorous design reviews and testing, quality gates between development and production (if they're not a PITA, they're not working) and traceability for everything in production.
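As a concrete taste of the FMEA part: failure modes get scored for severity, occurrence, and detectability, and the product (the Risk Priority Number) tells you where to spend effort first. A tiny sketch with invented entries and ratings, purely for illustration:

    # Illustrative FMEA scoring: rank failure modes by Risk Priority Number
    # (severity x occurrence x detection, each rated 1-10). The entries and
    # ratings below are made up, not from a real analysis.

    failure_modes = [
        {"mode": "AoA sensor reads high", "sev": 9, "occ": 3, "det": 6},
        {"mode": "Meal left above 5 C",   "sev": 7, "occ": 4, "det": 3},
        {"mode": "Log buffer overflows",  "sev": 4, "occ": 5, "det": 8},
    ]

    for fm in failure_modes:
        fm["rpn"] = fm["sev"] * fm["occ"] * fm["det"]

    for fm in sorted(failure_modes, key=lambda f: f["rpn"], reverse=True):
        print(f'{fm["rpn"]:>4}  {fm["mode"]}')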
There are off-the-shelf learning materials for all of it.
If you do go down that road, hire someone with QMS experience to design your process and hand-hold your team through the transition. Otherwise you're likely to over-complicate it for a less effective result.
Here Boeing misclassified MCAS, giving it a lower risk rating than was later identified, which led to a more relaxed development process. I hope light will be shed on what happened at Boeing, because it looks like an intern did MCAS.
They’re developing a whole other project in isolation from yours and the only thing that needs to be agreed upon is the interface(s).