Behind the Lion Air Crash, a Trail of Decisions Kept Pilots in the Dark

(nytimes.com)

104 points · simonbr · 7y ago · 82 comments

31 comments · 11 top-level

del82 · 7y ago · 6 in thread

Based only on what I read in this article, I can kind of see Boeing's point? From what I understand, the procedure/checklist for an uncommanded nose down didn't change from the old to the new version, even with the addition of MCAS. So from the pilot's perspective, there is nothing that they should do differently in the new vs. the older 737s when this happens: follow the checklist, which will (eventually) cause you to flip the Stabilizer Trim Cutout Switches, and that will fix the problem.

So the interface didn't change, and the procedure's the same. Should Boeing and airlines update training every time they change something "under the hood", even when the procedure for pilots is the same? How about when they make software updates to already-flying models?

rocqua · 7y ago

One interface change was the effect of 'pulling hard back on the stick' in case of runaway stabilizers. That worked with the old system, but not with MCAS.

This seems to be exactly the interface change that led to the crash.

CydeWeys · 7y ago

To use a car analogy ...

You can always override cruise control by stomping hard on the brake (like to avoid an imminent crash). That's how it's always worked and you've gotten used to this, and done it on occasion when warranted.

Now imagine that the next generation of adaptive cruise-control/"auto-pilot"/whatever comes out, and stomping on the brakes no longer does anything. You have to first disable the cruise control by pressing a button on the steering wheel, and only then will inputs to the brake pedal do anything.

And then you don't tell drivers about this change.

You can totally see how, right in the lead-up to a fatal accident, a driver is going to be focused solely on stomping on that brake pedal in increasing panic, wondering right up to the moment of death why that's not doing anything. They won't consider the cruise control off button because it's not their most immediate need (braking is), and they've never needed to use it before.

twtw · 7y ago

For reference, here is the runaway stabilizer memory item (the "checklist") for 737:

1. Control column ............................. Hold firmly
2. Autopilot (if engaged) ..................... Disengage
3. IF the runaway stops:
   ------------------------ [done]
4. IF the runaway continues:
   STAB TRIM CUTOUT switches (both) ...... CUTOUT
   IF the runaway continues:
   Stabilizer trim wheel ............ Grasp and hold

EDIT:

It's #4 that's of interest here. People saying that the interface changed are saying that it's fine if pilots stop after #1, even when dealing with runaway stabilizer for 12 minutes.
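The branching in the memory item quoted above can be sketched in Python. This only illustrates the order of the steps as listed in this comment; it is not a Boeing procedure or flight software, and the function name `runaway_stabilizer_steps` and the `runaway_continues` callback are hypothetical.

```python
from typing import Callable, List


def runaway_stabilizer_steps(
    runaway_continues: Callable[[], bool],
    autopilot_engaged: bool,
) -> List[str]:
    """Return the actions taken, in order, following the quoted memory item.

    `runaway_continues` is a hypothetical callback that is asked after each
    stage whether the stabilizer is still running away.
    """
    actions = ["Control column: hold firmly"]
    if autopilot_engaged:
        actions.append("Autopilot: disengage")

    # Step 3: if the runaway has stopped, the memory item is complete.
    if not runaway_continues():
        return actions

    # Step 4: the runaway continues, so cut out electric stabilizer trim.
    actions.append("STAB TRIM CUTOUT switches (both): CUTOUT")
    if not runaway_continues():
        return actions

    # Still running away: physically arrest the trim wheel.
    actions.append("Stabilizer trim wheel: grasp and hold")
    return actions
```

The point of the sketch is that holding the column is only the first action; the sequence does not end there unless the runaway actually stops.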

del82 · 7y ago

That's fair, but what does the checklist say? I mean, say there are a thousand different malfunctions that could cause an uncommanded nose-down. This one happens to be #359, but the pilots don't know that, they just know that they're pointed at the ground. Maybe it used to be the case that the first item on the checklist (pull back on the stick) fixed problem #359, and now it's the second. But there are several hundred other malfunctions that might have caused the problem that also aren't solved by pulling back on the stick, so the next move is to go to the next item on the checklist, right?

dkarl · 7y ago

The analogy to code would be if a library had a documented API to do something, but some people were using undocumented behavior in another part of the API to do the same thing, and the undocumented behavior changed. The difference is the consequences and how you prepare for them. With a library, there are many steps at which the change could be noticed by users: issue trackers, mailing list discussions, prerelease builds, integration tests, and test environments. Plus for most software nobody will be killed if the application goes down.

I see Boeing's point, too, but to me it just means both sides are at fault. The pilots are at fault for not following the emergency checklist. Boeing is at fault for abusing rules to slip in a change based on the assumption that pilots never rely on their own understanding of the aircraft, which I'm sure they know to be false. Air safety is all about human factors, and that's a pretty obvious one.

argd678 · 7y ago

The interface isn’t opaque in airplanes like in software; everyone learns how the internals of the aircraft work in order to troubleshoot problems. So pilots are reasoning based on how they think it works under the hood.

fabian2k · 7y ago · 3 in thread

> In designing the 737 Max, Boeing decided to feed M.C.A.S. with data from only one of the two angle of attack sensors at a time, depending on which of two, redundant flight control computers — one on the captain’s side, one on the first officer’s side — happened to be active on that flight.

The one thing that seriously surprised me was that an automated system that is able to point the airplane towards the ground is intentionally fed by a single, non-redundant sensor.

Everything else I've read about various safety systems that limit or override the pilot has explanations about how redundant sensors are used. And how the system does switch itself off in case the sensors don't give consistent results.

rocqua · 7y ago

There is one more technical surprise in this article for me. Pulling hard back on the control column would override the stabilizer runaway in old versions, but not MCAS. That is a major interface difference between the old and new planes.

It sounds like the old flight manual stated one of two possible methods for dealing with runaway stabilizers. Because the second method (hard back on the control) wasn't in the manual, changes to that way weren't taken into account. Hence a non backwards compatible change slipped through.

CydeWeys · 7y ago

One hopes that in the fixed version of the software that goes out, they'll restore that backwards-compatible manual override. It seems like a flat-out mistake that MCAS, which solely takes input from a single non-redundant sensor, overrides manual inputs silently.

BlackFly · 7y ago

From the Semver perspective, change to an undocumented (and therefore not public) feature is not a major change. The point is that their safety documentation doesn't change from one system to the next.

Anyone who was "doing it by the book" was not pulling up on the stick.

Now, maybe Boeing was suggesting via side channels that there were alternate ways to solve certain problems and those side channels should qualify as public documentation... but it may have been intuition earned through experience overriding standard procedure.

eternalny1 · 7y ago · 3 in thread

Boeing should face some major fines for this, and additional regulation is going to be needed to make sure this doesn't happen in the future.

This all seems to come down to the fact they wanted to avoid having to retrain pilots ($$$), so these automation changes were kept in the dark.

The crew before them dealt with this same problem but they successfully cut out the trim system. They got lucky and they should have been more vocal in expressing the fault outside of just a post-flight note about it.

The fact that the 737 can auto-trim itself beyond manual elevator authority, due to a SINGLE faulty AOA sensor, is mind-boggling and scary.

extrapickles · 7y ago

From what I can tell, the previous crew did not get lucky, they just followed the checklist which would have solved the issue in this case.

Auto-trim beyond the elevator authority is not a problem as the pilots can take manual control of the trim by grabbing the trim wheel (it's in a very obvious spot on the 737).

The actual fix is hard as adding another alarm can get tricky from a UX perspective during an emergency. Probably the only “fix” is to reinforce the value of following the checklist.

kelnos · 7y ago

There already is another alarm: an optional "angle-of-attack disagree" indicator that Lion Air was apparently too cheap to install.

Now, that wouldn't have directly pointed to what was wrong, but it would have been pretty suggestive.

(I would suggest, though, that having an optional configuration that lacks robustness for a system that can automatically point the plane toward the ground is a really poor choice of options.)

twtw · 7y ago

> and additional regulation is going to be needed to make sure this doesn't happen in the future.

Given that the FAA made the decision that it was fine to not retrain the pilots, sounds like there's going to have to be someone to regulate the regulator.

Animats · 7y ago · 3 in thread

From the article: "In designing the 737 Max, Boeing decided to feed M.C.A.S. with data from only one of the two angle of attack sensors at a time, depending on which of two, redundant flight control computers — one on the captain’s side, one on the first officer’s side — happened to be active on that flight."

They created a single point of failure that way. Why?

danielvf · 7y ago

By having each redundant flight computer hooked to completely different sensors, in case of a bad sensor the crew can bypass not only the sensor, but also any computation done with that sensor.

It's not a single point of failure as we think of it - if it starts acting up, you can easily disable the automatic stabilizer system, per the procedures. 737 stabilizer runaways take several seconds to take effect, and are recoverable afterward. Later you can switch flight computers and then be using clean data, though you are supposed to leave the stabilizer system off for the remainder of the flight.

Animats · 7y ago

Using only one sensor at a time, there's no "sensors disagree" fault to tell the pilot there's a problem. Or to tell the flight control system it shouldn't be taking drastic action based on that sensor.

Airbus uses three angle of attack sensors and compares them. They've had at least one crash when two sensors failed in a consistent way.[1] The vulnerability of aircraft flight control systems to bad AOA data is well known.

[1] https://news.aviation-safety.net/2010/09/17/report-blocked-a...
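A minimal Python sketch of the kind of cross-check described above, assuming a simple "compare and vote" scheme. It is not Airbus's or Boeing's actual logic; the function name `vote_aoa` and the 5-degree `DISAGREE_LIMIT_DEG` threshold are hypothetical and chosen only for illustration.

```python
from statistics import median
from typing import Optional, Sequence

# Hypothetical disagreement threshold in degrees, for illustration only.
DISAGREE_LIMIT_DEG = 5.0


def vote_aoa(
    readings: Sequence[float],
    limit: float = DISAGREE_LIMIT_DEG,
) -> Optional[float]:
    """Return a trusted angle of attack, or None if no value can be trusted."""
    if len(readings) < 2:
        # A single sensor cannot be cross-checked, so it is never trusted here.
        return None
    if max(readings) - min(readings) <= limit:
        # All sensors agree within the limit.
        return median(readings)
    if len(readings) >= 3:
        # Outvote outliers: keep the readings close to the median.
        mid = median(readings)
        agreeing = [r for r in readings if abs(r - mid) <= limit]
        if len(agreeing) * 2 > len(readings):
            return median(agreeing)
    # Sensors disagree and no majority exists: raise a "disagree" condition.
    return None
```

With one reading the function refuses to act, and with two it can only detect a disagreement, not resolve it. With three, a single bad sensor is outvoted, but two sensors that fail in the same way will outvote the good one, which is the failure mode mentioned above.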

Someone1234 · 7y ago

Cost cutting. The article paints that picture quite clearly. They created the whole MCAS system to mitigate problems with the aircraft's design in a very short span of time, and needed the whole thing to be cheap.

Someone1234 · 7y ago · 2 in thread

Makes one wonder if the FAA is too close to Boeing. Not only did they green light this but they also put considerable pressure on EASA to do the same. The FAA's first priority should be safety, not Boeing's bottom line or their ability to more quickly deploy an aircraft update.

Pilots are pretty unhappy about this M.C.A.S. situation. They're literally expected to fly an aircraft without even being told how that aircraft functions. And while the checklist may eventually take care of this, that isn't a substitute for a professional pilot in the cockpit. Just the lack of training/simulated failures for this new system is highly irregular; pilots are used to and expect such training while transitioning to a new major aircraft version.

The biggest drivers here seem to be cost and Boeing's competitiveness, not safety. I think it might be time for the EASA to trust the FAA a little less, at least until they get their house back in order.

PedroBatista · 7y ago

The revolving door between the FAA and cushy positions at Boeing is not a secret. The shortcuts Boeing has been making with the blessing (or willful omission) of the FAA have been discussed, but with many interests in the middle, like strong competition from Airbus, geopolitics, internal politics, States outbidding each other to create more jobs, national ego, and straight up greed.

"Word on the street" is that Boeing has lost most of their safety reputation and most people just do their job and try to not get burned when the planes start to crash. I want to believe most are just blowing off steam, but I don't know...

dingaling · 7y ago

The 737 models of two generations ago ( 300 / 400 / 500 ) had several fatal accidents in the 1990s due to runaway rudders. That dented Boeing's reputation with users but not with the FAA.

danielvf · 7y ago · 2 in thread

There's something called the "Swiss Cheese Theory" of accidents. (https://en.wikipedia.org/wiki/Swiss_cheese_model)

In a mostly-robust system, different layers catch and defend against the errors of other layers in the system. For a major accident to occur, holes in multiple layers have to line up that day.

In this case we have four holes that lined up that day - a plane model with a possible rare software bug, an aircraft with a faulty sensor feeding bad information to the computers, an airline company with internal culture that continues to fly a specific aircraft that keeps trying to point at the ground, sometimes without even making an attempt at fixing the problem, and finally, on this fatal day, a crew that did not follow the proper procedures even after twenty-three nose down incidents during the flight.

Even without the MCAS system present, the last two holes seem like they would bring down an airliner eventually, from one cause or another.

Pilots are rightfully mad about not being told about the MCAS system. But it's just one of many systems on a 737 that can trim the stabilizer to the point that it can't be flown. That's why the procedure for any stabilizer problem is to disable automatic control of the stabilizer. The training and checklists that the accident pilots had covered this, and previous pilots flying the accident aircraft did this and then had uneventful flights.

FabHK · 7y ago

Still, to have a system on board that, with one sensor malfunctioning, repeatedly trims you down (unless you switch the cutout switch or physically arrest the trim-wheel), is pretty tough.

By the way - in small airplanes, you can overcome trim with elevator pressure. That's not necessarily the case on a passenger jet; and not only because it's much bigger, but because the trim works differently [1]. I wonder whether that played a role. I must admit that before I read [1], I had assumed that bad trim is something I can overpower, when push comes to shove.

[1] https://www.skybrary.aero/bookshelf/books/2627.pdf

danielvf · 7y ago

Yeah, when the trim is a giant screw changing the angle of the whole stabilizer, rather than just a little tab, it's a whole different ballgame.

There are plenty of single components on an airliner whose failure can cause a stabilizer trim runaway. Different airliners handle it differently. On a 737 you can cut out automatic control, and use wheels connected to the stabilizer jackscrew with metal cables. On other airliners, you can cut out automatic control, then switch to a second electric control system and use it manually. A 737 stabilizer runaway isn't an instant thing, and is a loud event in the cockpit.

danielvf · 7y ago · 1 in thread

Here's a video from a few years ago of two student pilots handling a trim runaway in a 737 simulator.

https://www.youtube.com/watch?v=3pPRuFHR1co&t=154

You'll notice that it's a loud, physical event, with a very simple solution.

This happened over twenty times in the accident flight, and the pilots never disabled the problematic automatic stabilizer system.

robocat · 7y ago

This is a great video showing the UI - it makes what is going on much clearer!

cmurf · 7y ago

Actual example: Normal takeoff in instrument meteorological conditions (no external visual references, flight by reference exclusive to instruments). The attitude indicator shows proper climb attitude, vertical speed and altimeter show positive rate of climb, airspeed indicator shows speed increasing above target speed. Pilot response? Probably nose up and/or power reduction; OK they do both. Airspeed indication continues to increase. Pilot noses up and powers down. Airspeed increases. Pilot noses up aggressively. Stall. Crash.

What happened? The pitot tube and drain were clogged. Static port was clear. This turned the airspeed indicator into an altimeter - it was incapable of showing correct airspeed from the moment of blockage.

The cause of the crash is pilot error. The pilot is expected to recognize from other instruments that the airspeed indicator is unreliable, and this is part of training for instrument rating.

If the MCAS in the Lion Air crash made a similar mistake, using a single data point to determine a stall condition, that is an error. It's functionally "pilot error" to have no means of determining if the angle of attack sensor is wrong, and no mechanism for disregarding its data. Further, the corrective action it took, had the flight condition actually been true, sounds excessive. If a human pilot did the exact same thing MCAS did, I expect the human would be blamed - it would be pilot error to so aggressively nose down that you've exchanged a level flight high speed stall (a rare event indeed) for a high speed dive. That is not a competent recovery, in particular because there's apparently no recognition of the danger of high speed dives, let alone recovery from them. It's probably a really good idea if your stall recovery does not result in a dive!
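The blocked-pitot scenario earlier in this comment can be worked through numerically. The sketch below is a rough illustration, assuming the International Standard Atmosphere below about 11 km and the incompressible approximation for airspeed (real indicators use a compressible calibration); the function names are hypothetical. With the pitot tube and drain blocked, the trapped total pressure stays fixed while static pressure falls in the climb, so the indicated value rises with altitude regardless of actual speed.

```python
import math

P0 = 101325.0  # ISA sea-level static pressure, Pa
RHO0 = 1.225   # ISA sea-level air density, kg/m^3


def static_pressure(alt_m: float) -> float:
    """ISA static pressure in Pa for altitudes in the troposphere."""
    return P0 * (1.0 - 2.25577e-5 * alt_m) ** 5.25588


def indicated_airspeed(total_pa: float, static_pa: float) -> float:
    """Indicated airspeed in m/s from pitot and static pressure.

    Uses the incompressible relation q = 0.5 * rho0 * v^2 (an approximation).
    """
    q = max(total_pa - static_pa, 0.0)
    return math.sqrt(2.0 * q / RHO0)


def blocked_pitot_reading(
    block_alt_m: float, ias_at_block_ms: float, alt_m: float
) -> float:
    """Indicated airspeed at alt_m after the pitot was blocked at block_alt_m.

    The pitot line keeps the total pressure it held at the moment of blockage;
    only the static port still follows the outside air.
    """
    trapped_total = static_pressure(block_alt_m) + 0.5 * RHO0 * ias_at_block_ms ** 2
    return indicated_airspeed(trapped_total, static_pressure(alt_m))


if __name__ == "__main__":
    for alt in (0, 250, 500, 1000):
        print(alt, round(blocked_pitot_reading(0.0, 80.0, alt), 1))
```

Under these assumptions the indicated value grows as the aircraft climbs, which is why pitching up to "slow down" makes the indication worse rather than better.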

coldcode · 7y ago

If it affects flight stability especially in an emergency, then pilots should be trained to understand what it affects. Period. Not doing so to save money or get more sales is beyond stupid. Watch Air Disasters to see what happens when highly trained pilots fail to do the right thing because they hadn't trained to deal with what went wrong because the problem was something different than what they knew. Flying is easy when things are working, pilot training is the difference between dealing with an emergency or being dead.

dsego · 7y ago

Mentour Pilot did a YouTube podcast about this air crash back in November. https://youtu.be/zfQW0upkVus

hlandau · 7y ago

When we first heard this crash was due to a change in computer-controlled stabilizer behaviour, my question was "why on earth did Boeing do this?". Perhaps I didn't read deep enough, but the summary explanation that it was to improve handling was a poor answer.

I guess what really bugged me about it is how un-Boeing-like this behaviour was; a computer overriding a pilot (even if there is a way for a pilot to override it in turn). It's fundamentally an Airbus-esque design.

As I read this article though, everything fell into place. As you read it you start to see, with utter clarity, exactly how this happened organizationally.

It's well known that Airbus uses software flight envelope protection to enable them to reduce the safety margin applied to the airframe, reducing weight. In other words, fuel efficiency is improved by making airframes less airworthy and compensating for it in software. I don't actually disagree with this as such; it's been demonstrated to be a sound approach, but historically Airbus's domain.

Essentially, it seems like what happened here is that Boeing finally felt the need to adopt similar techniques to compete with Airbus on fuel efficiency (though regarding engine size issues, not airframe safety margins, but still making a plane's airworthiness more caveated and fixing it in software). Essentially, we're witnessing the point at which Boeing feels its traditional user interface philosophy (do what the pilot says) is conflicting with market pressures.

If this were a new plane with a new type rating, this wouldn't be unreasonable. Trying to tack this on to an existing plane, and not only that, but doing everything in your power to minimise the amount of transition training, is OTOH extraordinarily egregious.

The problem with this change isn't so much that Boeing's reasoning for not telling pilots about it isn't logical. If anything, the problem is that their reasoning is utterly logical: the checklist will solve the problem anyway, no matter the cause. You can see how this decision must have percolated through different teams at Boeing, through regulators, via this unimpeachable-seeming logic. The market pressures involved (fuel efficiency and retraining costs) would have made it particularly hard to contest. It's a completely logical line of reasoning... yet here we are with fatalities.

I'm very interested to note, though, this new revelation (to me at least) that the yoke behaviour re: extreme deflection mitigating stabilizer runaway was removed in the MAX. So what was Boeing's justification for this change? Was it even mentioned? If not, what on earth were the regulator's justifications for allowing it to go unmentioned? I want to hear those justifications, since it seems impossible to justify. I was under the impression that compatibility of type ratings fundamentally revolved around an absence of differences in how two planes handle, and how they respond to the yoke.

I should add, the reliance on a single sensor is also remarkable; makes me wonder if this entire subsystem was really rushed and not given proper design review, which would make sense given the circumstances (panicking to get a product to market).
