My first day at work at big-laser-company. Manufacturing engineer for a laser (then) so complex, it required a PhD to solve problems to get units out the door. The product was a ring laser. What that means is that the laser beam travels around in a race track pattern inside the laser before getting out, not a back-and-forth bouncing between two mirrors. Now this laser could be tuned to any wavelength by suitable setups and machinations, and once there, would “scan” a small amount about this wavelength, enabling scientists to study tiny spectral features in atoms and molecules with great precision. I knew all this shit. I was a Berkeley-trained physicist that built precision lasers out of scrap metal for my thesis.

First day of work. I walk into the final test lab. The big laser was happily scanning away. The bright yellow needle-like output beam was permitted to hit the lab wall. As the laser scanned, the beam was MOVING on the wall. Whereupon, first day of work, I exclaimed the most obscene four words in manufacturing, for all to hear: “You can’t ship that!” (“Beam pointing instability” is detrimental to almost any laser application. It turns out that during scanning, an optical element was rotating, on a shaft, inside this laser. This mechanical motion caused beam motion.)

Well, I got an immediate reputation as a negative guy. (You can tell it’s deserved.) The solution was to retrofit 28 lasers in the field, mostly in Europe, with a component that cancelled the movement, on an expensive junket by a service guy, who was hailed as a “hero.”
This one question tends to separate out a very large fraction of companies that take unacceptable risks, and allows the ones that don't to be justifiably proud of their attitude towards risk. These are not trivial things either: medical devices, software used in medical diagnosis, machine control, and so on, where an error can quite literally cost someone their life or a good chunk of their healthy life-span. Companies where people cannot or won't speak up tend to have a lot of what's wrong swept under the carpet.
Kudos to you for speaking up, and irrespective of who got to be called a hero (that part isn't all that relevant to me) also kudos to your employer for acting on your input.
It's disappointing (and career limiting) to do the "right" engineering and lose because you didn't correctly gauge the risk tolerance of the market.
I don't think there's any one answer for this.
My observation is that people will pay a premium for demonstrable physical safety features but privacy & security in software do not win markets.
Meanwhile our major competitor cleverly placed their rotating element in such a position that the beam retraced itself through the rotating element, thereby substantially cancelling this effect.
Over a period of days, the error became increasingly, comically bad, until finally the system refused to boot.
A technician was called, and after hearing about the behaviour, the first request was that a photo of the laser light exit port be taken.
It was obvious why it wouldn’t boot: a mirror in the light path had fallen off.
The worst part was, the mirror had been held on by glue, and had been slowly slipping out of place. The hot climate was probably a factor.
They really should have had someone to say ‘you can’t ship that’ when the topic of glue to hold mirrors came up.
I wish I had pushed more strongly about it. We spent probably a full person-day of work every week on that.
I work in product at a hardware company and have a lot of domain experience which came from spending years in the (literal) field. There's been many times where I write a product spec and the engineers are incredulous. "Really? It gets THAT hot?" or "Do we really need to provide a bonding/grounding lug on the case?"
It's not uncommon to find engineering teams with deep domain experience in one area, but completely lacking in others. Ignoring domain experience, there should have been rigorous product testing during design that would have weeded out the glue issue.
I had a colleague who got called up when a Trident missile MIRV bus fell off a forklift, and he had to do simulations to tell the Navy if it was still good or needed to be brought back in for rework/recalibration. My understanding is that either the MIRV bus itself or its container has integral devices that record peak 3-axis acceleration for just such a scenario. I imagine they're as simple as a few precise weights on a few wires with precise failure strains, so you can bracket the peak acceleration by which wires broke and which survived.
On the one hand, it's great to have more accurate nukes, which allow lower yields, smaller stockpiles, and presumably smaller craters if everything goes sideways. On the other hand, "surgical" nukes make it more likely that one side will use them and gamble that the other side won't massively retaliate.
If it were ever used, that work would save lives.
A more correct and polite version of your advice: "It will cost us a lot more to ship this as-is, and fix it later, than it will to delay shipment and fix it now. Is it too late to do that? Did we over-commit to shipping now?"
It wasn't your responsibility to come up with that version. It was your manager's responsibility. It was also their responsibility to find the necessary decision-makers and involve them directly. I would argue that this sort of work is the only real way that "management" can provide value in the first place.
Somehow, socially, it's incredibly common for people to value the inverse of that job. People assume it is "good work" for a manager to successfully ignore unpopular concerns, and push through to the end, no matter how inefficient that makes the journey.
That works out in the case that the shipping date was over committed, such that a delay would cost more than fixing it later. Even so, that entire situation would be avoided by refusing to over-commit shipping dates in the first place. That's the same responsibility applied earlier in time, so a manager that behaves the way I have described could factor out the entire problem at its source.
This is what the average person should learn about management. Even if it's not their job personally, there is a lot of leverage behind the decision a worker makes about what management behaviors to socially favor, and what behaviors to socially reject. That leverage is multiplied at every level up the hierarchy, which means the opinion of someone in a management role is very significant, and the opinion of someone in an executive role is crucial.
It's really difficult to be explicit about opinions. You can't really put them in your resume, but at the same time, an opinion on management style may be an executive's primary value contribution!
Everyone wants to think of the cool ideas to make things work, but few people want to think of all the ways those ideas can break, fail, fail to be future-proof, be expensive, etc., whereas I relish it: what's more satisfying than helping make a proposed or existing solution even better?
But the same applies to stuff outside of work, too. I find I'm quite negative about stuff in the exact same way, and whilst it's fun to think "how could we fix this, how could we make it better", all people see is negativity, and social pressure has made me start to rethink this approach in life. It's better to keep your mouth shut and let the fire start than to open your mouth and be negative, as per your analogy.
Hell it even applies to traditionalism; "we should put out that fire" "but that fire's always been there, that's the way it's always been" "but it's a fire!!!" "yeah, well it was here before you and we like it. That fire walked uphill both ways through the snow to get to school".
You don't get points for preventing fires; you get them for putting them out. Unfortunately, some folks seem to conclude that lighting fires, just to put them out later, is a good and easy way to earn that "hero" reputation.
Normalization of Deviance (2015) - https://news.ycombinator.com/item?id=22144330 - Jan 2020 (43 comments)
Normalization of deviance in software: broken practices become standard (2015) - https://news.ycombinator.com/item?id=15835870 - Dec 2017 (27 comments)
How Completely Messed Up Practices Become Normal - https://news.ycombinator.com/item?id=10811822 - Dec 2015 (252 comments)
What We Can Learn From Aviation, Civil Engineering, Other Safety-critical Fields - https://news.ycombinator.com/item?id=10806063 - Dec 2015 (3 comments)
Let's look at how the first one of these, “pay attention to weak signals”, interacts with a single example: the “WTF WTF WTF” a new person gives off when they join the company. I kinda wonder if a company that prioritized not getting this reaction from new hires might find it is the most impactful thing they can do in terms of culture.

Once you get away from "should we use version control" and into actually difficult software engineering questions, it's not clear how to balance a fresh perspective vs. an experienced (normalized? tainted?) view. I wish the article went into this more.
Like, how does the new hire (or anyone else) know the difference between "learning the complexity of the new system" and "internalizing/normalizing the deviance of this culture"?
If a new hire can't checkout, build, and test the software on the first day, then there is likely something either wrong with the hire or the infrastructure. A sufficiently old and arcane software system might take weeks before a new hire can make even a simple change, but that shouldn't impact those three items.
That speaks to the caliber of programmers hired. If all they have seen is $TODAYS_HOT_JS_FRAMEWORK and they have written nothing but a web app using $TODAYS_HOT_JS_FRAMEWORK, they might not grasp the fundamentals that would make them realize that frameworks are just abstractions (and not that different from one another).
I don't think any software engineer would even ask that question, since the answer will almost always be "$TODAYS_HOT_JS_FRAMEWORK didn't exist when the project started, and it's not worth a re-write to port it over".
Now, that brings out a second important truth: a company can't attract and retain a wide range of different caliber employees. For instance, if a place still questions the usefulness of source control (perhaps because they consider git to be too complicated) there's no way they'll attract and retain top performers. So the culture will select people that agree that source control is a waste of time.
Having a healthy, balanced view comes from enough experiences. Ideally from working in a bunch of places and seeing enough things go sideways, and correctly understanding and identifying the causal chain that led to failures.
Funnily enough, it's sort of like training an AI - you essentially need a lot of correctly labelled data to learn. Junior engineers don't have enough data points, and unfortunately some "senior engineers" I've worked with took (in my opinion) the wrong lessons from their experiences. (E.g. the CTO who thinks version control is too complex.)
The interesting cases are when smart, experienced people disagree on what the best solution is. Should you keep your team small and smart or have a varied team with more mentorship and process? Is code review worth it in every case? What is the right amount of tests for your software? How often do we want to push to production?
When I was teaching programming my students would sometimes ask juicy questions. My favorites were the questions I could answer with "I'll tell you my opinion, but I've worked with people I look up to who think I'm wrong about this..."
> “The thing that's really insidious here is that [once a person buys] into the WTF idea… they can spread it elsewhere for the duration of their career… Once people get convinced that some deviation is normal, they often get really invested in the idea.”
> [H]ow does the new hire (or anyone else) know the difference between "learning the complexity of the new system" and "internalizing/normalizing the deviance of this culture"?
The article implies that the new hire should pay close attention to the things that are incentivized, and those that are not.
To change the culture, these people have to go. Firing them may not be feasible, but there are other options. Dethroning them in the form of a promotion or even just physically moving them can be effective. When people don't have to jump through their hoops anymore, they lose their organizational power.
I was briefly head of engineering at a company that had several "old culture bearers" that made change impossible. I was something like the 3rd or 4th engineering leader over the space of a year. Apparently the person after me was actually allowed to fire a few of these people and was able to turn things around.
People generally don't wake up in the morning and go into work motivated to make insane/shit things - context, tech debt and business realities all mount up and even the best of us can end up making choices that in isolation look crazy.
There are of course companies who are really bad and you may well be right, but so many times I have seen in my career a young new hire storm in and think everything is shit without paying heed to the context and historical pressures. The best thing you can do in many cases is spend the first ~six months at a new tech company trying to understand that context, and indeed I think more mature engineers generally do.
I'd say a company that has accomplished (2) has cut the workload in hiring employees by 30-50%, in the sense that every employee who has reaction (1), either internally or externally, is at risk of becoming disengaged or leaving soon. Not only that, but you are probably wasting your devs' time and could get dramatically more productivity out of them if you aren't WTFing them to death.
There should be no controversy at all that complete instructions for installing everything required for a dev to build the project and work on it should exist, and that it should be possible to complete this task in hours, not the weeks it frequently takes. And, no, "docker" is not an answer to this any more than "The F5 Key is a Build Process":
https://blog.codinghorror.com/the-f5-key-is-not-a-build-proc...
It is not "Docker" that solves the problem, it is the discipline of scripting the image build process into a dockerfile. If you know how to write a dockerfile you can write a bash script that runs in 20 seconds as opposed to having Docker spend 20 minutes downloading images and then crash because of a typo.
You are right that a company might have good reasons for doing things in an unobvious way, but most of the time when nobody at a company claims to understand what the company is doing except for the CEO and people aren't too sure about the CEO, it is the fault of the company lacking alignment, not a natural property of freshers.
In these systems it is found that they are almost always operating in (or transitioning between) failure modes. Often multiple operational failure modes are present simultaneously. It becomes very important to test the system in each of its failure modes and their combinations to maintain high uptime.
https://how.complexsystems.fail/ is an example, but there are many.
Human work, development, and maintenance is itself a system that interacts with these critical systems. Frankly, failure to fail causes failure (thus chaos monkey). The mythical man month is almost a sub category of these failures as are HR hiring processes and other BS. Being too successful and not having competition (or similarly sclerotic competition) can be as much of a hazard as "move fast, break things".
One of the more interesting things I've found is that a huge number (easily a majority) of instructors are recently trained flyers, because there is a pipeline to train them and they're cheaper than using experienced pilots (esp for multi-engine and more complex airplanes). They also know all the ins-outs of the training and rule books (with recent changes) so they know how to pass all the tests and how to teach that. Sooooo you have a bunch of inexperienced pilots teaching all the new pilots... there's likely a failure there, but it hasn't reared its head. We still have a lot of ex-military folks around who didn't learn that way.
Who do you want flying when things go bad? People who have spent many hours with things about to go bad (military, emergency/fire, sail plane pilots) who have experience dealing with it. Those people can also be fun/terrifying to fly with, because they will take risks.
The A220, A350, and A380 are all newer than 30 years old. (A321 is barely younger than that and A330 barely older.) Boeing has released the 777 and 787 in the last 30 years. The Cirrus SR20 and SR22 are newer than that, as is the SF50 jet. The Diamond DA40, DA42, and DA62 are newer. The Honda Jet is newer. Cessna has a handful of business jets newer than that. The Embraer Phenom 100 and 300 are newer than that. There are variants of the CRJ newer than that (-700, -900, -1000).
That's a lot of new civil aviation aircraft designs in the last 30 years.
Sounds like there's some similarities with everyone focusing on Leetcode interviews, and then one generation of that filtering and then mentoring the next, and repeat.
The companies don't know what that's costing them, until there's a problem that can't be ignored.
In the case of software engineering (poorly studied, relative to aviation) the company will generally never learn whether a non-Leetcode&promo-focused team could've avoided the problems in the first place, nor whether non-Leetcode experience could've handled a problem that happened anyway.
Maybe. Or maybe you're better off with freshly trained people who still remember exactly what to do in all the failure scenarios. Certainly I've generally felt safer with drivers who'd just passed their test than with people who've been driving for years, for example.
Risk homeostasis in action!
https://soaringeconomist.com/2019/10/30/experience-can-kill-...
When is it "Normalization of Deviance"? and when is it a "Efficiency Optimization"?
I mean, the difference is pretty clear after something has failed, but very murky before.
aka "Chesterton's Fence"
Otherwise, it's "Normalization of Deviance":
* The build is broken again? Force the submit.
* Test failing? That's a flaky test, push to prod.
* That alert always indicates that vendor X is having trouble, silence it.
Those are deviant behaviours: the system is warning you that something is broken. By accepting that the signal/alert is present but uninformative, we train people to ignore them.
vs...
* The build is always broken - Detect breakage cause and auto rollback, or loosely couple the build so breakages don't propagate.
* Low-value test always failing? Delete it/rewrite it.
* Alert always firing for vendor X? Slice vendor X out of that alert and give them their own threshold.
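A tiny sketch of what "slice vendor X out and give them their own threshold" could look like; the vendor names and numbers are made up, and a real alerting system would express this in its config rather than application code.

# Per-vendor error-rate thresholds instead of one global alert everyone ignores.
DEFAULT_ERROR_RATE_THRESHOLD = 0.01

PER_VENDOR_THRESHOLDS = {
    "vendor_x": 0.10,  # known-flaky vendor: only alert on unusually bad days
    "vendor_y": 0.01,
}

def should_alert(vendor: str, error_rate: float) -> bool:
    """Fire the alert only when this vendor exceeds its own threshold."""
    threshold = PER_VENDOR_THRESHOLDS.get(vendor, DEFAULT_ERROR_RATE_THRESHOLD)
    return error_rate > threshold

# A 5% error rate pages for most vendors, but is business as usual for vendor_x.
assert should_alert("vendor_y", 0.05)
assert not should_alert("vendor_x", 0.05)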
https://www.fastjetperformance.com/blog/how-i-almost-destroy...
> Everything that can go wrong will go right.
Murphy's Law then manifests from escaping disaster through repeated iterations of taking risks where most things play out well anyway.
I have to laugh at the "append z to the end" strat at Google, though. That's a good one.
- The Space Shuttle Challenger disaster in 1986 was caused by the normalization of deviance, where engineers became accustomed to problems with the O-ring seals and began to accept them as normal. This led to the eventual catastrophic failure of the shuttle's launch, killing all seven crew members.
- The 2008 financial crisis was caused in part by a normalization of deviance in the banking industry, where risky and complex financial instruments were routinely used without proper oversight or understanding of the potential risks. This led to a widespread collapse of the financial system and a global economic recession.
- The Volkswagen emissions scandal in 2015 was caused by a normalization of deviance in the automotive industry, where engineers and executives became accustomed to cheating emissions tests and misleading customers about the true environmental impact of their vehicles. This led to significant financial and reputational damage to the company.
- The Theranos scandal in 2018 was caused by a normalization of deviance in the healthcare industry, where the company's leaders became accustomed to misrepresenting the capabilities of their blood testing technology and misleading investors and customers about its accuracy. This led to significant legal and financial repercussions for the company and its executives. (ChatGPT)
You have to find the one that is broken in the way that is tolerable to you.
Arguably the closest we know to a panacea in terms of engineering culture and best practices is Google. And what are they now known for? An inability to ship anything meaningful anymore. Spinning around in circles launching and re-launching new chat apps.
These are not unrelated. High engineering standards are always in tension with product delivery. As a security engineer once told me, "the most secure system is the one that never gets launched into production."
So while Dan is right, and all the examples are right, and things like non-broken builds and a fast CI/CD pipeline are totally achievable, don't learn the WRONG lesson from this which is that when you arrive to a company and notice a bunch of WTFs, the first thing you must do is start fixing them in spite of any old timers who say "Actually that's not as bad as it seems". Sometimes they're wrong. USUALLY, they're right.
The tech industry tends to revolve around "I'm a super-rational robotic genius" thinking that can't accept the existence of its own irrational tendencies, to the point that it becomes ridiculous.
Reading it felt like a personal attack in many places. However, reading it forever changed how I think about things. It's a much more useful framing for everyone involved if you start with the question of "why did they think this was the right thing to do?" as opposed to "this person made a bad choice / mistake". My (extraordinary) impatience naturally predisposes me towards the latter, but the core argument of the book is that that's lazy -- you can hand wave away anything and everything with "operator error".
One company I worked at had no unit tests, no infrastructure as code, and no build server. This held strong for a while until enough developers implemented some unit tests, infrastructure as code (e.g. terraform), and a build server as skunkworks projects. Eventually management tolerated them, but never endorsed them. Some teams at the company still never embraced good practices because it wasn't forced on them.
I guess I've never worked at a company that valued unit tests across the whole of the engineering team. I introduced them and implemented them on my own team, but others ignored it.
Personal experience is that a build server normalizes deviance. "But it works on the build server," we used to say, as, with time, it became harder and harder to build locally. "Just fix your environment!" we used to say, when it was the build system that was actually at fault. "It's all so fragile, just copy what we've done before!" we then said, repeating the mistakes that made the build system so fragile.

Eventually, the build system moved into a Docker image, where the smells were contained. But I'm still trying to refactor the build system to a portable, modern alternative. If we hadn't had a build server, we'd have fixed these core issues earlier and wouldn't have built on such a bad foundation. Devs should be building systems that work locally: the heterogeneity forces better error handling, the limited resources force designing for better scalability, and most importantly, it prevents "but it works on the build server!".
This got me puzzled for a couple of minutes. Yeah, that “WTF, WTF” moment. Then I realized that our build “server” comprised 12 different platforms (luckily reduced to just 6 in the later years), so passing a build in production was a bit harder than building locally.
Many of the examples in the OP are probably closer to the former, but my general advice here is to keep lots of notes about what seems broken, and revisit in a month or two. Sometimes you gained context that explains why something is actually sensible. If it still seems crazy with context, you can now bubble up the feedback with confidence, and also having hopefully built some respect and trust from the team to make the message land better.
Say you have an integration test that relies on an unreliable system you do not control. Sure, you can mock it out for a unit test, but if you want to make sure you catch breaking API changes, you need to hit the actual system. And if it works after retrying it a few times, then so be it. No need to throw shade.
* Don't test it.

* Only do unit tests with the connection mocked out.

* Test against production.

* Try it a few times with a delay, and if it works then you know your code is good and you can move on with your deployment. Which is what flaky and pytest-retry do.
Maybe I'm missing something, but out of those 4 options retrying the test seems like the best one, with the big caveat that it is only viable if the test does indeed work after trying a few times. I really don't see any downside.
edit:
Maybe another option is to put the retry functionality directly in the client code, which would make your code more robust overall. But that is definitely more complex than using one of these libraries just for testing.
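For concreteness, a minimal sketch of both approaches with pytest; the test, the URL, and the retry numbers are made up, and the decorator shown is the one from the flaky package (pytest-retry and pytest-rerunfailures have their own equivalents).

import time
import urllib.request

from flaky import flaky  # pip install flaky


@flaky(max_runs=3, min_passes=1)  # rerun up to 3 times before calling it a failure
def test_vendor_api_is_reachable():
    # Deliberately hits the real, unreliable system to catch breaking API changes.
    with urllib.request.urlopen("https://api.example.com/health", timeout=5) as resp:
        assert resp.status == 200


def fetch_with_retry(url: str, attempts: int = 3, delay: float = 2.0) -> bytes:
    """The alternative from the edit above: retry in the client code itself,
    which makes production more robust, not just the test run."""
    last_error = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
        except OSError as exc:  # urllib errors and timeouts are OSError subclasses
            last_error = exc
            time.sleep(delay)
    raise last_error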
It's fascinating really... Complex systems are always in partial failure mode and that applies to collective optimization challenges. Organizations will always be stuck in local optima in most domains.
I have marginal control over who I manage. The product isn't saving the world, but it is allowing us to live reasonably and with a clear soul at the end of the sprint. The reason I say the "mercenary" bit is simple: weigh your dreams against blood and gold, and compromise.
https://www.aopa.org/news-and-media/all-news/2015/december/0...
As an obvious result, our society does an incredible amount of work maintaining that obfuscation.
---
I've heard estimates that 20% (1/5th) of all healthcare-related spending in the US is overhead from insurance determinations, paperwork, etc.; in other words, a 25% surcharge on top of the rest of healthcare expenditure (25%/125% = 1/5th) that does not exist in single-payer healthcare systems, like those used in Canada, Germany, and every other developed nation in the world.
What do we get from that extra spending? What substantive difference does that obfuscation provide?
The main difference I see is "explicit opportunity cost". Instead of deciding ahead of time that we will pay for any arbitrary healthcare need (as a single-payer program), the opportunity for each individual healthcare act is given a price, and groups of priced opportunities are provided by subscription-based insurance plans.
Every person has to find, apply for, and pay for an insurance plan that will meet their current and future healthcare needs.
Because that is explicit, there is leverage available to manipulate each opportunity cost, and even the opportunity of each person to have that opportunity provided to them.
So what does that leverage even look like, and who is using it, and for what purpose?
Politics. Instead of care being determined by your doctor, access to each type of care is explicitly made available (or unavailable) by your insurance plan. That's a huge attack surface for political motivation.
There is currently a dextroamphetamine (Adderall) shortage in the US. The other day, I went to my pharmacy to pick up my prescription for 30 generic Concerta (methylphenidate extended release, another stimulant medication used for ADHD), and learned that all they had left were 16 brand-name Concerta. I was lucky enough to have that covered by my insurance. Many different insurance plans would not have provided me that opportunity.
Why is there a shortage? Despite a significant increase in ADHD diagnosis last year, the DEA refused to raise the limit of Adderall that can be legally manufactured. Why? Because there is a long-standing political conflict between stimulant addiction prevention and ADHD treatment, and the DEA is positioned at one side of it.
That same political conflict is why some insurance companies have outright refused to include coverage for stimulant medications. Even without a nationwide shortage, some people have found themselves stuck in a position where the opportunity for medication is held just out of reach by the political decision of their insurance company, or the lack of access to insurance at all.
The same pattern can be found with practically every type of medical care that is politically controversial: contraceptives, abortions, hormones, etc. Even if you can't get a legislative ban, there is still leverage available to obfuscate opportunity itself.
When conservative politicians argue that a single-payer program would be "too socialist for America", the substantive difference they intend to preserve is the political leverage that is baked into the system we have; the political leverage that allows politics to restrict our medical care without a single vote.
---
That's just one example. This pattern is everywhere. The only answer is social objectivity. It's a hard problem.
p {
  line-height: 1.7;
  max-width: 60em;
  font-size: 1.2em;
  margin-left: 5em;
}
It pretty much fixes the default readability, which is essentially zero on this site otherwise.

javascript:(function(){ var bod = document.getElementsByTagName("body")[0]; bod.style.margin = "40px auto"; bod.style.maxWidth = "40vw"; bod.style.lineHeight = "1.6"; bod.style.fontSize = "18px"; bod.style.color = "#444"; bod.style.padding = "0 10px"; })();

And for a dark mode:

javascript:(function(){ var body = document.getElementsByTagName("body"); var html = document.getElementsByTagName("html"); var img = document.getElementsByTagName("img"); body[0].style.background = "#131313"; body[0].style.opacity = "1.0"; html[0].style.filter = "brightness(115%) contrast(95%) invert(1) hue-rotate(180deg)"; img[0].style.filter = "contrast(95%) invert(1)"; })();

https://support.mozilla.org/en-US/kb/firefox-reader-view-clu...
Once a (foreign) person was surprised at my dipping toast with Nutella in my latte. I was equally surprised by his surprise.
This is useful and fine. Someone wrote a test and it now hits a race condition or something and occasionally fails. Let’s assume we are very confident it is a problem with the test, not the product.
Choices:
* Spend a sprint trying to fix it right now regardless of priority.

* Turn it off and lose that coverage.

* Buy some time.

In this context it makes sense, as long as there is a procedure to address these in some sane timeframe.
Maybe that is an example of normalization of deviance. But I think if it is discusses and trade offs thought through it is an OK thing to do at times. Remember most development is not green field. You inherit a system when you start a job.