My first day at work at big-laser-company. Manufacturing engineer for a laser (then) so complex, it required a PhD to solve problems to get units out the door. The product was a ring laser. What that means is that the laser beam travels around in a race track pattern inside the laser before getting out, not a back-and-forth bouncing between two mirrors. Now this laser could be tuned to any wavelength by suitable setups and machinations, and once there, would “scan” a small amount about this wavelength, enabling scientists to study tiny spectral features in atoms and molecules with great precision. I knew all this shit. I was a Berkeley-trained physicist that built precision lasers out of scrap metal for my thesis.

First day of work. I walk into the final test lab. The big laser was happily scanning away. The bright yellow needle-like output beam was permitted to hit the lab wall. As the laser scanned, the beam was MOVING on the wall. Whereupon, first day of work, I exclaimed the most obscene four words in manufacturing, for all to hear: “You can’t ship that!” (“Beam pointing instability” is detrimental to almost any laser application. It turns out that during scanning, an optical element was rotating, on a shaft, inside this laser. This mechanical motion caused beam motion.)

Well, I got an immediate reputation as a negative guy. (You can tell it’s deserved.) The solution was to retrofit 28 lasers in the field, mostly in Europe, with a component that cancelled the movement, on an expensive junket by a service guy, who was hailed as a “hero.”
This one question tends to separate out a very large fraction of companies that take unacceptable risks, and allows the ones that don't to be justifiably proud of their attitude towards risk. These are not trivial things either: medical devices, software used in medical diagnosis, machine control, and so on, where an error can quite literally cost someone their life or a good chunk of their healthy life-span. Companies where people cannot or won't speak up tend to have a lot of what's wrong swept under the carpet.
Kudos to you for speaking up, and irrespective of who got to be called a hero (that part isn't all that relevant to me) also kudos to your employer for acting on your input.
It's disappointing (and career limiting) to do the "right" engineering and lose because you didn't correctly gauge the risk tolerance of the market.
I don't think there's any one answer for this.
My observation is that people will pay a premium for demonstrable physical safety features but privacy & security in software do not win markets.
Meanwhile our major competitor cleverly placed their rotating element in such a position that the beam retraced itself through the rotating element, thereby substantially cancelling this effect.
Over a period of days, the error became increasingly, comically bad, until finally the system refused to boot.
A technician was called, and after hearing about the behaviour, the first request was that a photo of the laser light exit port be taken.
It was obvious why it wouldn’t boot: a mirror in the light path had fallen off.
The worst part was, the mirror had been held on by glue, and had been slowly slipping out of place. The hot climate was probably a factor.
They really should have had someone to say ‘you can’t ship that’ when the topic of glue to hold mirrors came up.
I wish I had pushed more strongly about it. We spent probably a full person-day of work every week on that.
I work in product at a hardware company and have a lot of domain experience which came from spending years in the (literal) field. There's been many times where I write a product spec and the engineers are incredulous. "Really? It gets THAT hot?" or "Do we really need to provide a bonding/grounding lug on the case?"
It's not uncommon to find engineering teams with deep domain experience in one area, but completely lacking in others. Ignoring domain experience, there should have been rigorous product testing during design that would have weeded out the glue issue.
I had a colleague who got called up when a Trident missile MIRV bus fell off a forklift, and he had to do simulations to tell the Navy if it was still good or needed to be brought back in for rework/recalibration. My understanding is that either the MIRV bus itself or its container has integral devices that record peak 3-axis acceleration for just such a scenario. I imagine they're as simple as a few precise weights on a few wires with precise failure strains, so you can bracket the peak acceleration by which wires broke and which survived.
On the one hand, it's great to have more accurate nukes, which allow lower yields, smaller stockpiles, and presumably smaller craters if everything goes sideways. On the other hand, "surgical" nukes make it more likely that one side will use them and gamble that the other side won't massively retaliate.
If it were ever used, that work would save lives.
A more correct and polite version of your advice: "It will cost us a lot more to ship this as-is, and fix it later, than it will to delay shipment and fix it now. Is it too late to do that? Did we over-commit to shipping now?"
It wasn't your responsibility to come up with that version. It was your manager's responsibility. It was also their responsibility to find the necessary decision-makers and involve them directly. I would argue that this sort of work is the only real way that "management" can provide value in the first place.
Somehow, socially, it's incredibly common for people to value the inverse of that job. People assume it is "good work" for a manager to successfully ignore unpopular concerns, and push through to the end, no matter how inefficient that makes the journey.
That works out in the case that the shipping date was over committed, such that a delay would cost more than fixing it later. Even so, that entire situation would be avoided by refusing to over-commit shipping dates in the first place. That's the same responsibility applied earlier in time, so a manager that behaves the way I have described could factor out the entire problem at its source.
This is what the average person should learn about management. Even if it's not their job personally, there is a lot of leverage behind the decision a worker makes about what management behaviors to socially favor, and what behaviors to socially reject. That leverage is multiplied at every level up the hierarchy, which means the opinion of someone in a management role is very significant, and the opinion of someone in an executive role is crucial.
It's really difficult to be explicit about opinions. You can't really put them in your resume, but at the same time, an opinion on management style may be an executive's primary value contribution!
Everyone wants to think of the cool ideas to make things work, but few people want to think of all the ways those ideas can break, fail, fail to be future-proof, be expensive, etc., whereas I relish it: what's more satisfying than helping make a proposed or existing solution even better?
But the same applies to stuff outside of work, too. I find I'm quite negative about stuff in the exact same way, and whilst it's fun to think "how could we fix this, how could we make it better", all people see is negativity, and social pressure has made me start to rethink this approach in life. It's better to keep your mouth shut and let the fire start than to open your mouth and be negative, as per your analogy.
Hell it even applies to traditionalism; "we should put out that fire" "but that fire's always been there, that's the way it's always been" "but it's a fire!!!" "yeah, well it was here before you and we like it. That fire walked uphill both ways through the snow to get to school".
You don't get points for preventing fires; you get them for putting them out. Unfortunately, some folks seem to conclude that lighting fires, just to put them out later, is a good and easy way to earn that "hero" reputation.
Normalization of Deviance (2015) - https://news.ycombinator.com/item?id=22144330 - Jan 2020 (43 comments)
Normalization of deviance in software: broken practices become standard (2015) - https://news.ycombinator.com/item?id=15835870 - Dec 2017 (27 comments)
How Completely Messed Up Practices Become Normal - https://news.ycombinator.com/item?id=10811822 - Dec 2015 (252 comments)
What We Can Learn From Aviation, Civil Engineering, Other Safety-critical Fields - https://news.ycombinator.com/item?id=10806063 - Dec 2015 (3 comments)
Let's look at how the first one of these, “pay attention to weak signals”, interacts with a single example: the “WTF WTF WTF” a new person gives off when they join the company. I kinda wonder if a company that prioritized not getting this reaction from new hires might find it is the most impactful thing they can do in terms of culture.

Once you get away from "should we use version control" and into actually difficult software engineering questions, it's not clear how to balance a fresh perspective vs. an experienced (normalized? tainted?) view. I wish the article went into this more.
Like, how does the new hire (or anyone else) know the difference between "learning the complexity of the new system" and "internalizing/normalizing the deviance of this culture"?
If a new hire can't checkout, build, and test the software on the first day, then there is likely something either wrong with the hire or the infrastructure. A sufficiently old and arcane software system might take weeks before a new hire can make even a simple change, but that shouldn't impact those three items.
That speaks to the caliber of programmers hired. If all they have seen is $TODAYS_HOT_JS_FRAMEWORK and they have written nothing but a web app using $TODAYS_HOT_JS_FRAMEWORK, they might not grasp the fundamentals that would make them realize that frameworks are just abstractions (and not that different from one another).
I don't think any software engineer would even ask that question, since the answer will almost always be "$TODAYS_HOT_JS_FRAMEWORK didn't exist when the project started, and it's not worth a re-write to port it over".
Now, that brings out a second important truth: a company can't attract and retain a wide range of different caliber employees. For instance, if a place still questions the usefulness of source control (perhaps because they consider git to be too complicated) there's no way they'll attract and retain top performers. So the culture will select people that agree that source control is a waste of time.
Having a healthy, balanced view comes from enough experiences. Ideally from working in a bunch of places and seeing enough things go sideways, and correctly understanding and identifying the causal chain that led to failures.
Funnily enough, it's sort of like training an AI - you essentially need a lot of correctly labelled data to learn. Junior engineers don't have enough data points, and unfortunately some "senior engineers" I've worked with took (in my opinion) the wrong lessons from their experiences. (E.g. the CTO who thinks version control is too complex.)
The interesting cases are when smart, experienced people disagree on what the best solution is. Should you keep your team small and smart or have a varied team with more mentorship and process? Is code review worth it in every case? What is the right amount of tests for your software? How often do we want to push to production?
When I was teaching programming my students would sometimes ask juicy questions. My favorites were the questions I could answer with "I'll tell you my opinion, but I've worked with people I look up to who think I'm wrong about this..."
> “The thing that's really insidious here is that [once a person buys] into the WTF idea… they can spread it elsewhere for the duration of their career… Once people get convinced that some deviation is normal, they often get really invested in the idea.”
> [H]ow does the new hire (or anyone else) know the difference between "learning the complexity of the new system" and "internalizing/normalizing the deviance of this culture"?
The article implies that the new hire should pay close attention to the things that are incentivized, and those that are not.
To change the culture, these people have to go. Firing them may not be feasible, but there are other options. Dethroning them in the form of a promotion or even just physically moving them can be effective. When people don't have to jump through their hoops anymore, they lose their organizational power.
I was briefly head of engineering at a company that had several "old culture bearers" that made change impossible. I was something like the 3rd or 4th engineering leader over the space of a year. Apparently the person after me was actually allowed to fire a few of these people and was able to turn things around.
People generally don't wake up in the morning and go into work motivated to make insane/shit things - context, tech debt and business realities all mount up and even the best of us can end up making choices that in isolation look crazy.
There are of course companies who are really bad and you may well be right, but so many times I have seen in my career a young new hire storm in and think everything is shit without paying heed to the context and historical pressures. The best thing you can do in many cases is spend the first ~six months at a new tech company trying to understand that context, and indeed I think more mature engineers generally do.
I'd say a company that has accomplished (2) has cut the workload in hiring employees by 30-50%, in the sense that every employee who has reaction (1), either internally or externally, is at risk of becoming disengaged or leaving soon. Not only that, but you are probably wasting your devs' time and could get dramatically more productivity out of them if you aren't WTFing them to death.
There should be no controversy at all that complete instructions for installing everything required for a dev to build the project and work on it should exist, and that it should be possible to complete this task in hours, not the weeks it frequently takes. And, no, "docker" is not an answer to this any more than "The F5 Key is a Build Process":
https://blog.codinghorror.com/the-f5-key-is-not-a-build-proc...
It is not "Docker" that solves the problem, it is the discipline of scripting the image build process into a dockerfile. If you know how to write a dockerfile you can write a bash script that runs in 20 seconds as opposed to having Docker spend 20 minutes downloading images and then crash because of a typo.
You are right that a company might have good reasons for doing things in an unobvious way, but most of the time when nobody at a company claims to understand what the company is doing except for the CEO and people aren't too sure about the CEO, it is the fault of the company lacking alignment, not a natural property of freshers.
In these systems it is found that they are almost always operating in (or transitioning between) failure modes. Often multiple operational failure modes are present simultaneously. It becomes very important to test the system in each of its failure modes and their combinations to maintain high uptime.
https://how.complexsystems.fail/ is an example, but there are many.
Human work, development, and maintenance is itself a system that interacts with these critical systems. Frankly, failure to fail causes failure (thus chaos monkey). The mythical man month is almost a sub category of these failures as are HR hiring processes and other BS. Being too successful and not having competition (or similarly sclerotic competition) can be as much of a hazard as "move fast, break things".
One of the more interesting things I've found is that a huge number (easily a majority) of instructors are recently trained flyers, because there is a pipeline to train them and they're cheaper than using experienced pilots (esp for multi-engine and more complex airplanes). They also know all the ins-outs of the training and rule books (with recent changes) so they know how to pass all the tests and how to teach that. Sooooo you have a bunch of inexperienced pilots teaching all the new pilots... there's likely a failure there, but it hasn't reared its head. We still have a lot of ex-military folks around who didn't learn that way.
Who do you want flying when things go bad? People who have spent many hours with things about to go bad (military, emergency/fire, sail plane pilots) who have experience dealing with it. Those people can also be fun/terrifying to fly with, because they will take risks.
The A220, A350, and A380 are all newer than 30 years old. (A321 is barely younger than that and A330 barely older.) Boeing has released the 777 and 787 in the last 30 years. The Cirrus SR20 and SR22 are newer than that, as is the SF50 jet. The Diamond DA40, DA42, and DA62 are newer. The Honda Jet is newer. Cessna has a handful of business jets newer than that. The Embraer Phenom 100 and 300 are newer than that. There are variants of the CRJ newer than that (-700, -900, -1000).
That's a lot of new civil aviation aircraft designs in the last 30 years.
Sounds like there's some similarities with everyone focusing on Leetcode interviews, and then one generation of that filtering and then mentoring the next, and repeat.
The companies don't know what that's costing them, until there's a problem that can't be ignored.
In the case of software engineering (poorly studied, relative to aviation) the company will generally never learn whether a non-Leetcode&promo-focused team could've avoided the problems in the first place, nor whether non-Leetcode experience could've handled a problem that happened anyway.
Maybe. Or maybe you're better off with freshly trained people who still remember exactly what to do in all the failure scenarios. Certainly I've generally felt safer with drivers who'd just passed their test than with people who've been driving for years, for example.
Risk homeostasis in action!
https://soaringeconomist.com/2019/10/30/experience-can-kill-...
When is it "Normalization of Deviance"? and when is it a "Efficiency Optimization"?
I mean, the difference is pretty clear after something has failed, but very murky before.
aka "Chesterton's Fence"
Otherwise, it's "Normalization of Deviance":
* The build is broken again? Force the submit.
* Test failing? That's a flaky test, push to prod.
* That alert always indicates that vendor X is having trouble, silence it.
Those are deviant behaviours: the system is warning you that something is broken. By accepting that the signal/alert is present but uninformative, we train people to ignore them.
vs...
* The build is always broken - Detect breakage cause and auto rollback, or loosely couple the build so breakages don't propagate.
* Low-value test always failing? Delete it/rewrite it.
* Alert always firing for vendor X? Slice vendor X out of that alert and give them their own threshold.
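A tiny sketch of what "slice vendor X out and give them their own threshold" could look like; the vendor names and numbers are made up, and a real alerting system would express this in its config rather than application code.

# Per-vendor error-rate thresholds instead of one global alert everyone ignores.
DEFAULT_ERROR_RATE_THRESHOLD = 0.01

PER_VENDOR_THRESHOLDS = {
    "vendor_x": 0.10,  # known-flaky vendor: only alert on unusually bad days
    "vendor_y": 0.01,
}

def should_alert(vendor: str, error_rate: float) -> bool:
    """Fire the alert only when this vendor exceeds its own threshold."""
    threshold = PER_VENDOR_THRESHOLDS.get(vendor, DEFAULT_ERROR_RATE_THRESHOLD)
    return error_rate > threshold

# A 5% error rate pages for most vendors, but is business as usual for vendor_x.
assert should_alert("vendor_y", 0.05)
assert not should_alert("vendor_x", 0.05)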
https://www.fastjetperformance.com/blog/how-i-almost-destroy...
> Everything that can go wrong will go right.
Murphy's Law then manifests from escaping disaster through repeated iterations of taking risks where most things play out well anyway.
I have to laugh at the "append z to the end" strat at Google, though. That's a good one.
- The Space Shuttle Challenger disaster in 1986 was caused by the normalization of deviance, where engineers became accustomed to problems with the O-ring seals and began to accept them as normal. This led to the eventual catastrophic failure of the shuttle's launch, killing all seven crew members.
- The 2008 financial crisis was caused in part by a normalization of deviance in the banking industry, where risky and complex financial instruments were routinely used without proper oversight or understanding of the potential risks. This led to a widespread collapse of the financial system and a global economic recession.
- The Volkswagen emissions scandal in 2015 was caused by a normalization of deviance in the automotive industry, where engineers and executives became accustomed to cheating emissions tests and misleading customers about the true environmental impact of their vehicles. This led to significant financial and reputational damage to the company.
- The Theranos scandal in 2018 was caused by a normalization of deviance in the healthcare industry, where the company's leaders became accustomed to misrepresenting the capabilities of their blood testing technology and misleading investors and customers about its accuracy. This led to significant legal and financial repercussions for the company and its executives. (ChatGPT)
You have to find the one that is broken in the way that is tolerable to you.
Arguably the closest we know to a panacea in terms of engineering culture and best practices is Google. And what are they now known for? An inability to ship anything meaningful anymore. Spinning around in circles launching and re-launching new chat apps.
These are not unrelated. High engineering standards are always in tension with product delivery. As a security engineer once told me, "the most secure system is the one that never gets launched into production."
So while Dan is right, and all the examples are right, and things like non-broken builds and a fast CI/CD pipeline are totally achievable, don't learn the WRONG lesson from this which is that when you arrive to a company and notice a bunch of WTFs, the first thing you must do is start fixing them in spite of any old timers who say "Actually that's not as bad as it seems". Sometimes they're wrong. USUALLY, they're right.
The tech industry tends to revolve around "I'm a super-rational robotic genius" thinking that can't accept the existence of its own irrational tendencies, to the point that it becomes ridiculous.
Reading it felt like a personal attack in many places. However, reading it forever changed how I think about things. It's a much more useful framing for everyone involved if you start with the question of "why did they think this was the right thing to do?" as opposed to "this person made a bad choice / mistake". My (extraordinary) impatience naturally predisposes me towards the latter, but the core argument of the book is that that's lazy -- you can hand wave away anything and everything with "operator error".
One company I worked at had no unit tests, no infrastructure as code, and no build server. This held strong for a while until enough developers implemented some unit tests, infrastructure as code (e.g. terraform), and a build server as skunkworks projects. Eventually management tolerated them, but never endorsed them. Some teams at the company still never embraced good practices because it wasn't forced on them.
I guess I've never worked at a company that valued unit tests across the whole of the engineering team. I introduced them and implemented them on my own team, but others ignored it.
Personal experience is that a build server normalizes deviance. "But it works on the build server," we used to say, as, with time, it became harder and harder to build locally. "Just fix your environment!" we used to say, when it was the build system that was actually at fault. "It's all so fragile, just copy what we've done before!" we then said, repeating the mistakes that made the build system so fragile.

Eventually, the build system moved into a Docker image, where the smells were contained. But I'm still trying to refactor the build system to a portable, modern alternative. If we hadn't had a build server, we'd have fixed these core issues earlier and wouldn't have built on such a bad foundation. Devs should be building systems that work locally: the heterogeneity forces better error handling, the limited resources force designing for better scalability, and most importantly, it prevents "but it works on the build server!".
This got me puzzled for a couple of minutes. Yeah, that “WTF, WTF” moment. Then I realized that our build “server” comprised 12 different platforms (luckily reduced to just 6 in the later years), so passing a build in production was a bit harder than building locally.
Many of the examples in the OP are probably closer to the former, but my general advice here is to keep lots of notes about what seems broken, and revisit in a month or two. Sometimes you gained context that explains why something is actually sensible. If it still seems crazy with context, you can now bubble up the feedback with confidence, and also having hopefully built some respect and trust from the team to make the message land better.
Say you have an integration test that relies on an unreliable system you do not control. Sure, you can mock it out for a unit test, but if you want to make sure you catch breaking API changes, you need to hit the actual system. And if it works after retrying it a few times, then so be it. No need to throw shade.
* Don't test it.

* Only do unit tests with the connection mocked out.

* Test against production.

* Try it a few times with a delay, and if it works then you know your code is good and you can move on with your deployment. Which is what flaky and pytest-retry do.
Maybe I'm missing something, but out of those 4 options retrying the test seems like the best one, with the big caveat that it is only viable if the test does indeed work after trying a few times. I really don't see any downside.
edit:
Maybe another option is to put the retry functionality directly in the client code, which would make your code more robust overall. But that is definitely more complex than using one of these libraries just for testing.
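For concreteness, a minimal sketch of both approaches with pytest; the test, the URL, and the retry numbers are made up, and the decorator shown is the one from the flaky package (pytest-retry and pytest-rerunfailures have their own equivalents).

import time
import urllib.request

from flaky import flaky  # pip install flaky


@flaky(max_runs=3, min_passes=1)  # rerun up to 3 times before calling it a failure
def test_vendor_api_is_reachable():
    # Deliberately hits the real, unreliable system to catch breaking API changes.
    with urllib.request.urlopen("https://api.example.com/health", timeout=5) as resp:
        assert resp.status == 200


def fetch_with_retry(url: str, attempts: int = 3, delay: float = 2.0) -> bytes:
    """The alternative from the edit above: retry in the client code itself,
    which makes production more robust, not just the test run."""
    last_error = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
        except OSError as exc:  # urllib errors and timeouts are OSError subclasses
            last_error = exc
            time.sleep(delay)
    raise last_error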
It's fascinating really... Complex systems are always in partial failure mode and that applies to collective optimization challenges. Organizations will always be stuck in local optima in most domains.
I have marginal control over who I manage. The product isn't saving the world, but it is allowing us to live reasonably and with a clear soul at the end of the sprint. The reason I say the "mercenary" bit is simple: weigh your dreams against blood and gold, and compromise.
https://www.aopa.org/news-and-media/all-news/2015/december/0...
As an obvious result, our society does an incredible amount of work maintaining that obfuscation.
---
I've heard estimates that 20% (1/5th) of all healthcare-related spending in the US is overhead from insurance determinations, paperwork, etc.; in other words, a 25% surcharge on top of the rest of healthcare expenditure (25%/125% = 1/5th) that does not exist in single-payer healthcare systems, like those used in Canada, Germany, and every other developed nation in the world.
What do we get from that extra spending? What substantive difference does that obfuscation provide?
The main difference I see is "explicit opportunity cost". Instead of deciding ahead of time that we will pay for any arbitrary healthcare need (as a single-payer program), the opportunity for each individual healthcare act is given a price, and groups of priced opportunities are provided by subscription-based insurance plans.
Every person has to find, apply for, and pay for an insurance plan that will meet their current and future healthcare needs.
Because that is explicit, there is leverage available to manipulate each opportunity cost, and even the opportunity of each person to have that opportunity provided to them.
So what does that leverage even look like, and who is using it, and for what purpose?
Politics. Instead of care being determined by your doctor, access to each type of care is explicitly made available (or unavailable) by your insurance plan. That's a huge attack surface for political motivation.
There is currently a dextroamphetamine (Adderall) shortage in the US. The other day, I went to my pharmacy to pick up my prescription for 30 generic Concerta (methylphenidate extended release, another stimulant medication used for ADHD), and learned that all they had left were 16 brand-name Concerta. I was lucky enough to have that covered by my insurance. Many different insurance plans would not have provided me that opportunity.
Why is there a shortage? Despite a significant increase in ADHD diagnosis last year, the DEA refused to raise the limit of Adderall that can be legally manufactured. Why? Because there is a long-standing political conflict between stimulant addiction prevention and ADHD treatment, and the DEA is positioned at one side of it.
That same political conflict is why some insurance companies have outright refused to include coverage for stimulant medications. Even without a nationwide shortage, some people have found themselves stuck in a position where the opportunity for medication is held just out of reach by the political decision of their insurance company, or the lack of access to insurance at all.
The same pattern can be found with practically every type of medical care that is politically controversial: contraceptives, abortions, hormones, etc. Even if you can't get a legislative ban, there is still leverage available to obfuscate opportunity itself.
When conservative politicians argue that a single-payer program would be "too socialist for America", the substantive difference they intend to preserve is the political leverage that is baked into the system we have; the political leverage that allows politics to restrict our medical care without a single vote.
---
That's just one example. This pattern is everywhere. The only answer is social objectivity. It's a hard problem.
p {
  line-height: 1.7;
  max-width: 60em;
  font-size: 1.2em;
  margin-left: 5em;
}
It pretty much fixes the default readability, which is essentially zero on this site otherwise.

javascript:(function(){ var bod = document.getElementsByTagName("body")[0]; bod.style.margin = "40px auto"; bod.style.maxWidth = "40vw"; bod.style.lineHeight = "1.6"; bod.style.fontSize = "18px"; bod.style.color = "#444"; bod.style.padding = "0 10px"; })();

And for a dark mode:

javascript:(function(){ var body = document.getElementsByTagName("body"); var html = document.getElementsByTagName("html"); var img = document.getElementsByTagName("img"); body[0].style.background = "#131313"; body[0].style.opacity = "1.0"; html[0].style.filter = "brightness(115%) contrast(95%) invert(1) hue-rotate(180deg)"; img[0].style.filter = "contrast(95%) invert(1)"; })();

https://support.mozilla.org/en-US/kb/firefox-reader-view-clu...
Once a (foreign) person was surprised at my dipping toast with Nutella in my latte. I was equally surprised by his surprise.
This is useful and fine. Someone wrote a test and it now hits a race condition or something and occasionally fails. Let’s assume we are very confident it is a problem with the test, not the product.
Choices:
* Spend a sprint trying to fix it right now regardless of priority.

* Turn it off and lose that coverage.

* Buy some time.

In this context it makes sense, as long as there is a procedure to address these in some sane timeframe.
Maybe that is an example of normalization of deviance. But I think if it is discusses and trade offs thought through it is an OK thing to do at times. Remember most development is not green field. You inherit a system when you start a job.