As this attitude was adopted, things shifted too far (at least in the opinion of industry groups, and in my observation): people underperforming to the point of negligence weren't blamed, and the corrective actions meant to prevent recurrences of the problems they caused ended up cumbersome and expensive without really improving safety. (And in this industry, everything relates back to safety.)
In recent years, things have shifted back towards a more pragmatic middle ground. There are tools to assess whether a problem was organizational (and it still almost always is) or whether there was some element of personal negligence involved. This follows an industry-wide trend of trying to fix the real problems that affect safety and operations, not over-engineer cumbersome corrective actions.
Every problem is organizational, even those caused by individuals, because it's the organization's job to recognize and remove those individuals where appropriate.
I enjoy organizations where individuals' expectations are clearly defined, and I prefer there to be consequences for missing those expectations, because I feel it increases the reliability of the team.
Well (to play devil's advocate just a bit) - isn't the ultimate goal of a robust process, in its end state, one in which people can not just underperform, but be replaced completely?
Given the above discussion about a nuke plant, think of all the complexity inside it:

- monitors
- alerts & triggers
- valves
- compressors and other rotating equipment
- fire safety
- electrical systems
All of those have to be checked, tested, maintained, and fixed on a periodic basis over their lifetimes.
An incompetent person (or group) will eventually be the cause of something.
Maybe 100 years in the future we'll have self-operating nuke plants, but doubtful in my lifetime because of the incredible scale of complexity.
It's a nightmare, because there's no room left for experimentation anymore. Everyone just sticks to the template, afraid to do more than required, never deleting unused code, etc. An attitude like this never ever helps!
These are some of the excuses they put up.
And then they sit 10 years or more with that bad stuff in there, and build even uglier ways around it.
But the time comes to actually do something about it. And what was once a one-day job becomes "we will hire a consultancy firm to guide us".
No, really, I appreciate risk management, but when it cripples your ability to make decisions, innovate, or otherwise ACT on information that could help you be a more efficient team, and the development team becomes a room full of people doing nothing but maintenance for years on end, people leave and companies fail.
I just watched that very thing happen this year to my company for exactly that reason. Someone with the word "senior" in their job title was so risk averse that the market caught up, passed us and started eating our lunch.
God help them, because I can't do it anymore, and the writing on the wall says they'll be closing up shop this fall. I'm out the door for good at the end of the week.
Outsourcing of blame - as a Service. Where's my VC???
When it is all said and done, if you fucked up, you should get some shit for it. However, this should be good-natured: YOU should be laughing at it, and everyone else laughing WITH you.
The discussions about everybody's mistakes should be open, with emphasis on everybody, but leave mockery out of it. Inform everyone when they make mistakes without mocking them or attacking their egos, and keep it factual. Don't assume everyone is friends with everybody, nor that everybody is happy; it isn't true. The line between laughing at me and laughing with me is thin and oftentimes muddled.
Relaxed laughing at mistakes is a result of good teamwork, but you don't get to good work by demanding that people accept being laughed at or mocked.
Publicly, i.e. in front of the team? No. It serves no purpose other than to stroke some egos and narrow the "Overton window" of development discussion and experimentation.
If you're intending that you (and the rest of your team) should learn from your mistakes, then I fully agree.
But recognizing this person as important is independent of loading them with guilt and financial, career-relevant, or social sanctions. Those sanctions actually make access to the important knowledge difficult, because no one will want to admit errors and share how they happened and what could have been done to prevent them...
Over the weekend, firmware patches were applied, and the server rebooted. After reboot, everything worked fine, so the tech marked the change successful and went home.
Well, apparently the NICs would work just fine, but not all settings were applied until you opened the UI provided by the vendor. When you opened the UI, the final settings would be applied, and the NICs would reboot, just long enough to kill TCP connections.
That loss of TCP connection killed the parent system, and then all the child systems also died when the parent died.
So who would you even blame there? The guy who set the tripwire? The guy who tripped on the tripwire? The guy who designed a system that could be brought down by a momentary loss of connection?
I'm lucky that my boss wasn't the type to point fingers, because I was the guy who was there when it happened, and it sure got a lot of attention.
The UI part suggests that it was Windows, and if it was, it's not quite that the NICs were down "just long enough" to kill TCP connections, since you normally need quite a lot of downtime to terminate a typical TCP session.
In Windows, if a NIC goes down, all the TCP connections that use the NIC get closed immediately. (Or at least this was the case a few years ago. I had a similar system with similar drawbacks deployed back then, though it was an automated warehouse, not an assembly plant.)
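For what it's worth, the usual defense on the client side is to treat a dropped connection as a routine event and reconnect with backoff, rather than letting the parent process die with the TCP session. A minimal Java sketch; the host, port, and line-oriented protocol are made up for illustration, not taken from the incident above:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;

    public class ResilientClient {
        public static void main(String[] args) throws InterruptedException {
            long backoffMs = 500;
            while (true) {
                // Hypothetical endpoint; the real system's protocol is unknown.
                try (Socket socket = new Socket("parent.example", 9000);
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(socket.getInputStream()))) {
                    backoffMs = 500; // reset backoff once we are connected again
                    String line;
                    while ((line = in.readLine()) != null) {
                        System.out.println("received: " + line);
                    }
                } catch (IOException e) {
                    // A NIC bounce surfaces here as a closed/reset connection;
                    // log it and retry instead of taking the process down.
                    System.err.println("connection lost: " + e.getMessage());
                }
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 30_000); // capped exponential backoff
            }
        }
    }

Nothing fancy; the point is just that a momentary NIC bounce becomes a logged retry instead of a cascading outage.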
> So who would you even blame there?
The idiots who designed the system to run on a non-industrial-grade operating system. Windows was never a good choice for controlling industrial installations.
It is not about some small and well-defined set of "idiots"; it is essentially an industry-wide design mistake.
Yes, that seems more likely.
I think Windows can be a decent platform for light industrial applications - which this system in particular was. The problem is all of the partners and suppliers were either stuck in the past, or had weird ideas.
The parent system was *nix based, but there was a flaw in a communications protocol that led to the channel bouncing between two boxes, and eventually bringing down the parent system.
My lesson from that was that you can have flaws on any system, no matter how solid the OS.
Every time someone breaks something, we get harder to break.
If the tasks were difficult that would be one thing, but I'm talking about stuff like committing code to prod that was clearly never even executed once.
If you have those two things, then someone is already motivated to learn from what happened and will probably never make that mistake again (which describes the large majority of engineers in my experience).
Unfortunately both very demoralizing and very common.
I just don’t really get it. Even when I was a junior, if I overheard “this thing is broken,” I was the first to pop up and say “oh, I bet that was me, let me have a look.”
Imagine if Apple came out and said “yeah, that blank root password bug, it was all because of John Smith and his crap patch that caused this.”
Outsiders don’t have the same perspective as insiders. If Charlie’s commit message read “implementing the really difficult thing we talked about,” the team might be aware of mitigating factors that Alice won’t know about. But even without those mitigating factors, all you’ve done is badmouth your own devs to the public. Additionally, you are not considering whether Charlie is an otherwise stellar developer who has never had a bad patch before. Alice may incorrectly presume that he’s only being called out because this is a habit of his.
First, it's generally best to praise publicly, and criticize in private.
Second, saying "@Users's commit screwed-the-pooch" blames but, frankly, may not be the whole picture. It's entirely possible that the commit caused the issue, but everything was done by the book in which case it's really an organizational failure.
Personally, I sympathize with your argument. I have no personal problem with Torvalds-style correction. I used to work under an asshole who would routinely threaten to have me fired. Personally, I prefer the blowhards, because you can always tell where you stand. Still, not everyone is wired this way, and part of leadership is recognizing that and playing to various folks' strengths and weaknesses.
The proper thing here is to acknowledge that there is an issue, but not assign blame. Go to the person you think is responsible in private, and let them admit the mistake in public if they want. Assigning them blame publicly shows a huge lack of respect, even if it was their fault, while admitting blame freely shows modesty.
Plus, what if it's not Charlie's fault, and his commit simply revealed the problem? Perhaps the actual issue is in a little used function deep down in the codebase, and his commit is just the first one to actually exercise that area the right way? Maybe this whole thing comes around to being Jim's fault instead.
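That failure mode is easy to sketch. A contrived Java example (names borrowed from this thread; the code is entirely hypothetical): the bug has been sitting in Jim's old helper all along, and Charlie's change is merely the first caller to hit the edge case.

    import java.util.List;

    public class LatentBug {
        // Jim's years-old helper: nobody ever passed it an empty list before.
        static int max(List<Integer> xs) {
            int best = xs.get(0); // latent bug: throws IndexOutOfBoundsException on empty input
            for (int x : xs) {
                best = Math.max(best, x);
            }
            return best;
        }

        public static void main(String[] args) {
            // Charlie's new feature is simply the first code path that can
            // produce an empty list. git blame points at Charlie; the bug is Jim's.
            System.out.println(max(List.of()));
        }
    }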
The old IBM story is worth mentioning in relation to this: http://www.mbiconcepts.com/watson-sr-and-thoughtful-mistakes...
*Obviously with the caveat that some people are repeat offenders who are careless or just not good employees
As an industry we don't have a response to a truly neglectful mistake yet.
The backups were crap, and the only reason it survived was that someone had taken a server home to work from.
When all was said and done, they never really found who did it, they just made organisational changes to ensure it didn't happen again. No blame game.
To the old NPL team, sorry about that. Culture is important.
Preventing goals means the strategy needs to ensure good ball possession and staying on offense, to reduce the burden on the defense, which in turn reduces the burden on the goalkeeper, the last line of defense.
If the last line of defense fails, that's not an individual failure but a team failure, coach included, since the coach selects who gets to play, when, and in what role.
Same in software: bad management passes the burden to developers, bad development passes the burden to testers, bad testing passes the burden to release management.
This is my favorite interview question to ask candidates:
"What is your all time biggest screw up, and how did you come back from it" - I then tell them the story of me loosing several hundred thousand dollars and the funny things that happened around it to set the tone. If you have been in tech for any length of time you have one of these stories (if not a few). I have heard some great ones by simply asking and it gives great insight into a candidate (humor, stress response, the things you have seen).
Of course I write enough stupid bugs myself that I'm bound to think this way.
Also, focusing on the code itself, for me at least, easily leads to thoughts like “this function is crap! What idiot wrote this!?”. Finding out who broke it leads to thoughts like “I see John introduced this buggy function. I should go check with him, maybe he had a good reason.”
Though these categories may seem like they are oriented towards individuals' actions, they may be used to determine where the risk lies in systems (and in people's use thereof) and how measures can be taken to avoid the same problems being repeated.
Much of the time, the complexity of systems (using the term in the widest possible sense) is underestimated, and automated integrity checks are not used as religiously as they might be.
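As a trivial example of the kind of automated check that tends to get skipped: recompute a checksum and compare it against a stored reference value, so corruption gets caught by the system rather than by whoever trips over it later. A sketch in Java (the file path and expected digest are placeholders; HexFormat needs Java 17+):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    public class IntegrityCheck {
        // Recompute a file's SHA-256 and compare it to a known-good value.
        static boolean verify(Path file, String expectedHex) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(Files.readAllBytes(file));
            return HexFormat.of().formatHex(digest).equalsIgnoreCase(expectedHex);
        }

        public static void main(String[] args) throws Exception {
            // Placeholder inputs; in practice this would run on a schedule.
            System.out.println(verify(Path.of("data.bin"), "expected-digest-here"));
        }
    }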
There are some things that I consider basic competence standards, like not storing passwords in plain text in any system you're building. I wouldn't fire an intern for getting that wrong but I also wouldn't let an intern near a production authentication system without some oversight.
If someone is a security engineer with a responsibility to know these kinds of things as part of their job role and certification, then if they'd implemented passwords-in-clear to cut corners somewhere, even if it's to meet a really important deadline, I'd be extremely unhappy. Of course I'd establish the general pattern of what had gone wrong first, and if it was a superior being abusive to the security engineer to get the product launched on time I'd still be really unhappy but not at the engineer.
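For reference, the baseline in question is cheap to meet: salted PBKDF2 via the JDK's built-in SecretKeyFactory, sketched below. The iteration count and storage format are illustrative choices, not a prescription for any particular system.

    import java.security.SecureRandom;
    import java.util.Base64;
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;

    public class PasswordHashing {
        static final int ITERATIONS = 210_000; // illustrative; tune to your hardware
        static final int KEY_BITS = 256;

        // Returns "base64(salt):base64(hash)"; the password itself is never stored.
        static String hash(char[] password) throws Exception {
            byte[] salt = new byte[16];
            new SecureRandom().nextBytes(salt);
            byte[] derived = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS))
                    .getEncoded();
            return Base64.getEncoder().encodeToString(salt) + ":"
                    + Base64.getEncoder().encodeToString(derived);
        }
    }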
Occasionally one does follow the chain of causes back though and finds not the organisation's culture but an individual who really should have known better.
If unlucky dev #13 broke something because humans can no longer reason about the relevant part of the system, then it doesn't matter that #13 was the one who broke something. What really matters is that people get busy removing the sandtraps from their software.
However, many FLOSS projects run on the sheer joy and freedom that comes with maintaining a particular subsystem or area of the code. Most devs quickly come to understand the responsibilities associated with that. But in cases where that responsibility doesn't come naturally, "who broke it" becomes the focus. Addressing that issue will determine whether or not future breakages occur.
The best programmer versus the worst user, and every mix in between, will produce situations needing the kind of attention this article addresses.
I've been in this situation on both sides. "Of course it should be clear what this phrase means, how could they fuck this up?" ... and ... "I have no idea what this means; both choices could mean what I want, but either choice could land me on the wrong page of this bullshit 'choose my own adventure' that I'll have to repeat if I'm wrong".
I'm interested in finding out if I'm understanding this wrong, and/or hearing other thoughts.
If a programmer has a habit of sloppy code, or violates the team's standards in some ways, then a good leader will keep track of the fact that one person is responsible for a recurring pattern of mistakes.
I absolutely agree with Rachel By The Bay, that many bugs arise from the complexity of the situation, and it would be wrong to blame the person who just happens to trip over that bug. But a good leader should take action against anyone who repeatedly screws up, and who seems unwilling to improve.
I've written about this before. This is from "How To Destroy A Tech Startup In Three Easy Steps":
----------------------
Wednesday, July 15th, 2015
I got to work at 11:00 a.m. John announced that our demo had stopped working. Sipping my coffee, I logged into the server to find out what the problem was. I looked at the error log for the API app, but it seemed okay. Then I checked the error log for the NLP app.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1955)
    at Celolot.nlp.Extractor.fuckBitchesGetMoney(Extractor.java:87)
What the hell was this?
“FuckBitchesGetMoney”?
What kind of name is that for a function?
A computer programmer can name their functions anything, but there are some “best practices” regarding names, and this particular function name violated all of them.
I asked Sital why he had given this name to his function. He looked at me straight, shrugged, and stated that the name was from the 1995 song by The Notorious B.I.G., “Get Money.” I replied that rap lyrics were not part of our naming conventions. He promised that he would change it.
Coming from anyone else, I might have interpreted the function name as an act of angry rebellion, but Sital was too forthright for that. Apparently, he thought the name was funny and went with it because he wanted to add some humor to his code. Never did he stop to think it might be unprofessional.
I looked through his code and found several other functions that had inappropriate names. I sent him a list and asked him to change their names to something standard.
A week later the function was still there. FuckBitchesGetMoney. Yet I don’t think that any of this was a deliberate act of rebellion. He was just oddly forgetful and disorganized.
https://www.amazon.com/Destroy-Tech-Startup-Easy-Steps/dp/09...
Also, there are many more stories to be told now!
It's always the same f*ing people that break it though!
Sometimes that's just the people who change things the most and work the hardest. It's harder to break anything when you don't actually change anything.
* Do they have too much access to systems?
* Is there something really wrong with the deployment system?
* What training can be provided?
All of that is more constructive than your comment, as cathartic as it may be.
Of course there are other possibilities - the people breaking things may be doing the hard bits that no one else dares to.
"Bravado is no excuse for lack of preparation." - Leeroy Jenkins
In reality, if something breaks and you are stupid enough to mention it, then (a) you are considered an a-hole for blaming <responsible-person-for-topic> even if you didn't, and (b) you become responsible for fixing it.
So your main job is to somehow make your stuff work despite all the other stuff that doesn't work and all the other people that silently try to stop you. The less you criticize, the better. What you get in return is that if you fuck up, people will try to avoid blaming you as well. Also, if you don't succeed at making anything happen, you get a little arrogant smile from your manager and a mediocre feedback round. But otherwise nothing happens.
The only change to that pattern happens when you piss off your manager or your manager's manager. Then suddenly each and every activity you do will be scrutinized, and if there's a problem, it will be used against you. The best hope they have is that you go away by yourself.
I'd recommend you satisfy their hope maximally by running the hell away from that dumpster fire of bullshit office politics.