I do a lot of support work for Control Systems. It isn't unheard of to find a chunk of PLC code that treats some piece of physical equipment in a unique way that unintentionally creates problems. I like to parrot a line I heard elsewhere: "Every time Software is used to fix an [Electrical/Mechanical] problem, a Gremlin is born".
But often enough, when I find the root cause of a bug or some sort of programmed limitation, the client wants it removed. I always refuse until I can find out why that code exists. Nobody puts code in there for no reason, so I need to know why we have a timer or an override in the first place. Often the answer is that the problem it was solving no longer exists, and that's excellent. But for all the times that code was put there to prevent something from happening and the client has had a bunch of staff turnover, the original purpose is lost. Without documentation telling me why it was done that way, I'm very cautious about immediately undoing someone else's work.
I suppose the other aspect is knowing that I trust my coworkers. They don't (typically) do something for no good reason. If it is in there, it is there for a purpose, and I have to trust my coworkers to have done their due diligence in the first place. If that trust breaks down, then every decision becomes more difficult.
Sleep due to rate limiting from another service? Comment. Who's requiring it; the limits, if I know exactly what they are at the time (and a note that I don't and this is just an educated guess that seems to work, if not); what the system behavior might look like if they're exceeded (if I know). Using a database for something trivial the filesystem could plausibly do, but in fact cannot for very good reasons (say, your only ergonomic way to access the FS the way you need to, in that environment, results in resource exhaustion via runaway syscalls under load)? Comment. Workaround for a bug in some widely-used library that Ubuntu inexplicably refuses to fix in their LTS release? Comment. That kind of thing.
I have written so very many "yes, I know this sucks, but here's why..." comments.
I also do it when I write code that I know won't do well at larger scale, but can't be bothered to make it more scalable just then, and it doesn't need to be under current expectations (which, 99% of the time, ends up being fine indefinitely). But that may be more about protecting my ego. :-) "Yes, I know this is reading the whole file into memory, but since this is just a batch-job program with an infrequent and predictable invocation and this file is expected to be smallish... whatever. If you're running out of memory, maybe start debugging here. If you're turning this into something invoked on-demand, maybe rewrite this." At least they know I knew it'd break, LOL.
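For illustration, a sketch of what that kind of comment might look like attached to real code (hypothetical Python; the function name and file are made up):

```python
# Yes, I know this reads the whole file into memory. This is a batch
# job with one infrequent, predictable invocation, and the export file
# is expected to stay smallish.
#   - If you're running out of memory, maybe start debugging here.
#   - If you're turning this into something invoked on-demand, maybe
#     rewrite this to stream the file instead.
def load_export(path):
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()
```

The comment carries the real information; the code itself is deliberately the simplest thing that works under current expectations.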
I remember a former client tracking me down to ask about a bug that they had struggled to fix for months. There was a comment that I'd left 10 years earlier saying that while the logic was confusing, there was a good reason it was like that. Another developer had come along and commented out the line of code, leaving a comment saying that it was confusing!
It's absolutely imperative that the next guy knows what the fuck I'm doing by tampering with safety limits.
COMMENT WRITTEN: 2023-03-21
COMMENT LAST REVIEWED/VERIFIED AS STILL TRUE: 2023-05-04
WHY THIS CODE: This sucks but ...
It took several hours to figure it out, but the sleep was there in case a file had not finished downloading.
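Applying the dated-comment template above to that situation might look something like this (a hypothetical Python sketch; the poll interval and timeout are assumptions, not values from the original system):

```python
import os
import time

# COMMENT WRITTEN: 2023-03-21
# COMMENT LAST REVIEWED/VERIFIED AS STILL TRUE: 2023-05-04
# WHY THIS CODE: This sucks, but the upstream exporter sometimes hands
# us a path before the file has finished downloading. Waiting until the
# size stops changing is the only signal we have; the 2-second poll
# interval is an educated guess that has worked so far.
def wait_until_download_settles(path, poll_seconds=2, max_wait=300):
    deadline = time.monotonic() + max_wait
    last_size = -1
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        if size == last_size:
            return True  # size stable across one poll interval
        last_size = size
        time.sleep(poll_seconds)
    return False  # still growing after max_wait; caller decides what to do
```

A future maintainer who finds this sleep at least knows it is a workaround for upstream behavior, not a load-bearing mystery.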
Isn't that just the regular Chesterton's Fence argument though?
The one the article is specifically written to point out is not enough by itself, because you need to know what else has been built with the assumption that that code is there?
You're not wrong, but in the context of a PLC controlling a motor or gate it is far more segregated than the code you're probably thinking of. Having a timer override on a single gate's position limit sensor would have no effect on a separate sensor/gate/motor.
If the gate's function block had specific code built into it that affected all gates then what you're talking about would be more applicable.
> I don't know why this rung is needed but delete it and see what happens for yourself
Did not fuck around; did not find out.
Generally they're controlling industrial equipment of some sort, and making changes without a thorough understanding of what's happening now and how your change will affect the equipment and process is frowned upon.
At least some of this is cultural. EEs and MEs have historically viewed software less seriously than electrical and mechanical systems. As a result, engineering cultures dominated by EEs/MEs tend to produce shit code. Inexcusably incompetent software engineering remains common among ostensibly Professional Engineers.
I've basically found my niche in the industry as a Software Engineer, though I can't say I see myself staying in the industry much longer. The number of times I've gotten my hands on code published by my EE coworkers only to rewrite it to work 10x faster at half the size with fewer bugs? Yikes. HMI/PLC work is almost like working in C++ at times; there are so many potential pitfalls for people who don't really understand the entire system, but the mentality among EE/ME types in the industry is to treat the software as a second-class citizen.
Even the clients treat their IT/OT systems that way. A production mill has super strict service intervals with very defined procedures to make sure there is NO downtime to production. But get the very same management team to build a redundant SCADA server? Or even have them schedule regular reboots? God no.
Software, being less familiar, is not viewed as a fundamental architectural component because there often isn't sufficient understanding of the structure or nuance involved in building it. In my experiences software or firmware engineers tend to be isolated from the people who designed the physical systems, and a lot of meaning is lost between the two teams because the software side does not understand the limitations and principles of the hardware and the hardware team does not understand the capabilities and limitations of the software.
I hate PLC work for other reasons. I'm starting to look at going back to a more traditional software role. I'm a bit tired of the road work and find the compensation for the amount asked of you to be drastically underwhelming. This meme is very much relevant:
This so much. Depending on the git blame, I'll either remove it blindly or actually think about it way more.
Early in my career, I was confused by seemingly-crazy questions in the Hacker Test (https://www-users.york.ac.uk/~ss44/joke/hacker.htm) like...
> 0133 Ever fix a hardware problem in software?
> 0134 ... Vice versa?
But after spending years developing embedded systems, I don't even blink at such questions. Yes, of course I have committed such necessary evils!
I would like to think that if I sent out an email about git hygiene, you would support me against the people who don't understand why I get grumpy at them for commits that are fifty times as long as the commit message and mix four concerns, two of which aren't mentioned at all.
Git history is useless until you need it, and then it’s priceless.
I can’t always tell what I meant by a block of code I wrote two years ago, let alone what you meant by one you wrote five years ago.
One of my proudest commits had a 1:30 commit:message length ratio. The change may have only been ~3 lines, but boy was there a lot of knowledge represented there!
What the article is describing reminds me of the XKCD comic workflow: https://xkcd.com/1172/
A system exists external to the creators original purpose, and can take on purposes that were never intended but naturally evolve. It isn’t enough to say “well that is not in the spec”, because that doesn’t change reality.
After a year long foray into the world of PLC, I felt like I was programming in the dark ages.
I'm assuming it's a bit better at very big plants/operations, but still.
I'm definitely going to use this, and I think there's a more general statement: "Every time software is used to fix a problem in a lower layer (which may also be software), a gremlin is born."
I think I get the gist, but that sentence is missing some words.
Gizmo caca.
(...I just watched both Gremlins movies last weekend...)
Just did one of those this morning. Hmmm.
The zinc gutter had leaked for probably decades, and it destroyed part of the roof structure. The roof was held up by the wooden paneling that used to cover it on the inside (70s). So the wooden paneling was actually load bearing.
Actually I've found way more stuff in this house. For example at the end of the roof the tiles weren't wide enough and instead of buying extra roofing tiles, they decided to fill it with cement and pieces of ceramic flower pots.
If you ever see a house stripped down to the sticks for a rebuild, you will hopefully notice a few braces added. Not to keep the walls from falling down, but to keep them square and true until the walls are rebuilt.
(I've seen dozens of houses from ~1900-1920 in the process of full gut rebuilds, and none of them had diagonally installed sheathing.)
At first look it seemed like someone had backed into the garage door and mangled the hell out of it, but on closer inspection the roof is barely being held up by the tracks the door runs in and is pretty near giving up the ghost. I was just going to splice the ends of the rafters (like someone did on the other side who knows how many years ago... if it works, it works) and replace the garage door, but now it's looking like I'll need a whole new roof.
What really worries me is the dodgy wiring strung all across the basement which is a combination of newish wires, old cloth covered wires and liberal applications of electrical tape to splice it all together. Luckily none of the wires seem to be load bearing...
My "fix" held for about 11 years, but apparently it very slowly weakened, creating a small divot in the roof, which got bigger and bigger with each rain. But since I never go on the garage roof, I didn't notice.
Until during one heavy rain I got a surprise skylight!
So yeah, you probably want to fix that before you get a total collapse like I did.
Sounds like your guy had a similar experience.
Most problems with houses come back to managing water, air, or some other infiltration. But mostly it's water.
Software differs from all other means of production in that we can in fact test any change we make before realizing it in the world.
With good tests, I don't care what the intent was, or whether this feature has grown new uses or new users. I "fix" it and run the tests, and they indicate whether the fix is good.
With good tests, there's no need for software archeology, the grizzled old veteran who knows every crack, the new wunderkind who can model complex systems in her brain, the comprehensive requirements documentation, or the tentative deploy systems that force user sub-populations to act as lab rats.
Indeed, with good tests, I could randomly change the system and stop when I get improvements (exactly how Google reports AI "developed" improvements to sorting).
And yet, test developers are paid half or less, test departments are relatively small, QA is put on a fixed and limited schedule, and no tech hero ever rose up through QA. Because it's derivative and reactive?
Except when the tests verify what the code was designed to do, but other systems have grown dependencies on what the code actually does.
Or when you're removing unused code and its associated tests, but it turns out the code is still used.
Or when your change fails tests, but only because the tests are brittle so you fix the test to match the new situation. Except it turns out something had grown a dependency on the old behavior.
Tests are great, and for sufficiently self contained systems they can be all you need. In larger systems, though, sometimes you also need telemetry and/or staged rollouts.
Assuming you mean systems in terms of actually separate systems communicating via some type of messaging, isn't that where strong enforcement of a contract comes into play so that downstream doesn't have to care about what the code actually does as long as it produces a correct message?
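A minimal sketch of that kind of boundary enforcement (hypothetical Python; the field names are made up): downstream validates the published message shape, and the producer strips anything the contract doesn't mention so accidental extras can't become load bearing.

```python
# The published contract: these fields, these types, nothing else.
REQUIRED_FIELDS = {"order_id": str, "total_cents": int, "currency": str}

def enforce_contract(msg):
    """Validate a message against the contract and drop unknown keys,
    so consumers can't grow dependencies on accidental internals."""
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(msg.get(field), ftype):
            raise ValueError(f"contract violation: {field!r} must be {ftype.__name__}")
    return {k: msg[k] for k in REQUIRED_FIELDS}
```

In practice this is what schema registries and typed message formats buy you; the point is that the contract, not the implementation, is the interface.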
> Or when your change fails tests, but only because the tests are brittle so you fix the test to match the new situation. Except it turns out something had grown a dependency on the old behavior.
I think this supports the GP's point about tests being second-class and not receiving the same level of upkeep or care that the application code receives, and one can argue that you should try to avoid getting into a position where tests become so brittle that you end up with implicit dependencies on behavior that was never made explicit.
Your code was supposed to calculate the VAT for a list of purchases, but progressively becomes a way to also force update the VAT cache for each product category and will be called in originally unexpected contexts.
BTW this is the same for comments: they'll cover the original intent and side effects, but not what the method/class is used for or actually does way down the line. In a perfect world these comments are updated as the world around them changes, but in practice that's seldom done except if the internal code changes as well.
Having worked on very long-running projects, testing or well-funded QA doesn’t tend to save you from organizational issues.
What typically seems to happen is tests rot, as tests often seem to have a shelf life. Eventually some subset of the tests start to die - usually because of some combination of dependency issues, API expectations changes, security updates, account and credential expirations, machine endpoint and state updates, and so on - and because the results from the test no longer indicate correctness of the program, and the marginal business value of fixing one individual broken test is typically very low, they often either get shut off entirely, or are “forced to pass” even if they ought to be an error.
Repeat for a decade or two and there quickly start being “the tests we actually trust”, and “the tests we’re too busy to actually fix or clean up.”
Which tests are the good ones and bad ones quickly become tribal knowledge that gets lost with job and role changes, and at some point, the mass of “tests that are lying that they’re working” and “tests we no longer care to figure out if they’re telling the truth that they’re failing” itself becomes accidentally load-bearing.
Huh? Your first part seemed to be repeating TDD optimism, but then you switch to test departments. To make your claims consistent, I'd suggest you instead talk about tests being written by the programmers, kept with the code, and run automatically as part of the build process.
However, I don't think even TDD done right can replace good design and good practices. Notably, even very simple specifications can't be replaced by tests: if f(s) is simply specified to spit out a string concatenated with itself, there's no obvious test treating f as a black box that verifies that f is correct. Formal specifications matter, etc. You can spot-check, but if the situation is that one magic wrong value screws you in some way, then your tests won't show it.
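A toy illustration of that "one magic wrong value" point (hypothetical Python): this broken implementation passes perfectly reasonable spot checks, but black-box sampling is very unlikely to ever hit the one input where it violates the specification f(s) = s + s.

```python
def duplicate(s):
    # Buggy implementation of "return the string concatenated with
    # itself": one magic value is special-cased, say a leftover hack.
    if s == "admin":
        return "admin"
    return s + s

# Plausible spot checks, all of which pass:
assert duplicate("ab") == "abab"
assert duplicate("") == ""
assert duplicate("xyz") == "xyzxyz"

# ...yet the specification is still violated on one magic input:
assert duplicate("admin") != "admin" + "admin"
```

Property-based testing with random inputs narrows the gap but can't close it; only the specification itself pins down the behavior for every input.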
> there's no need for software archeology, the grizzled old veteran who knows every crack, the new wunderkind who can model complex systems in her brain, the comprehensive requirements documentation, or the tentative deploy systems that force user sub-populations to act as lab rats.
Wow, sure, all that stuff can be parodied but it's all a response to software being hard. And software is hard, sorry.
Software needs people with the suspicious minds of good testers, but security people make more money for the same skillset.
This is to say: I agree with the article, but it's much nicer to work at a place where you don't expect to make this particular discovery very often, hah
"You" might know it in "your" creations, but in my career I am much more often working and reworking in other people's creations.
I think the point of the article is not that you should avoid using decorative studs as load-bearing elements, but that you should be aware that others may have done so before you came along.
This is an even more conservative position than the default Chesterton's Fence reading, which is itself dismissed by a lot of people as pedantically restrictive.
For me, the parent article resonates. I have definitely had ceilings come crashing down on my head when I removed a piece of "ornamental" trim (programmatically speaking)
In a normal, real-life context, I can see why someone would feel that way.
In a software engineering context I think it's just a further emphasis that you ought to understand what something is doing before fiddling with it, and both the original intent and what it is currently doing are interesting information. Many times I've removed dead code, only to learn that not only was it alive (which wouldn't have been that surprising, it's easy to miss some little thing), but that it is very alive and actually a critical component of the system through some aspect I didn't even know existed, which means I badly screwed up my analysis.
The differences between the physical world and the software world greatly amplify the value of the Chesterton's Fence metric. In the physical world we can all use our eyes and clearly see what is going on, and while maybe there's still a hidden reason for the fence that doesn't leap to mind, it's still a 3-dimensional physical problem in the end. Software is so much more interconnected and has so much less regard for physical dimensions that it is much easier to miss relationships at a casual glance. Fortunately if we try, it's actually easier to create a complete understanding of what a given piece of code does, but it is something we have to try at... the "default view" we tend to end up with of software leaves a lot more places for things to hide on us. We must systematically examine the situation with intention. If you don't have time, desire, or ability to do that, the Chesterton's Fence heuristic is more important.
We're also more prone to drown under Chesterton's fences if we're not careful, though. I've been in code bases where everyone is terrified to touch everything because it seems like everything depends on everything and the slightest change breaks everything. We have to be careful not to overuse the heuristic too. Software engineering is hard.
Is it always lazy in the bad way, though? In software there's no sharp distinction between "built to carry weight" and "built to tack drywall onto." Whether a system is robust or dangerously unscalable depends on the context. You can always do the thought experiment, "What if our sales team doubled in size and then sold and onboarded customers as fast as they could until we had 100% of the market," and maybe using your database as a message queue is fine for that.
If it results in a sad dev team, then that's a case where it was a mistake. It's hard to maintain, or it's an operational nightmare. That isn't the inevitable result of using a (software) decorative stud as a (software) load-bearing element, though. There are a lot of systems happily and invisibly doing jobs they weren't designed for, saving months of work building a "proper" solution.
More often than not I've seen this happen because they, in fact, do not know.
You could conclude that the code is unnecessary and remove it, or you could conclude that some test cases need to be added to exercise it. How do you decide which is correct?
The problem is usually that well-thought-out and well-designed software was built for a moving target, and invariably things have changed over time. It's not necessarily a sign of lazy design; it's where the real world intersects with the nice neat pretend world we design for :)
there is no excuse for not owning and knowing the software you are supposed to be in control of.
Do you? Sometimes quick-one-time fixes becomes the center of important software.
Whenever I find myself doing this, I at least leave a comment typically worded along the lines of "the dev is too lazy, not enough time to do it right, or just has no clue what to do, so here you go..."
The analogy for the load bearing stud might be a hackathon project that never expected to see production. In reality, a lot of what we do is hack on something until it barely works, and move on to the next thing.
Today's SE has wandered far, far afield from original goals, tragically enough, but that was the original conception. One of the reasons for today's relatively toothless SE departments is the rise of finance into maintenance planning. Inventory depreciation is a cruel mistress, and "what gets spared" is rarely a SE judgement these days, at least in my experience. This has predictable results, but is partially offset by the exceptionally high bar for aerospace maintenance staff, who are generally pretty damn badass compared to, say, a washing machine repairman. Finance, naturally, would like to knock that bar down a few pegs, too.
> With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
> Except that over time it had become accidentally load bearing
> through other (ill conceived) changes to the structure this stud was now helping hold up the second floor of the house
Evidently, you couldn't easily tell why it was there. Moreover, I'm not persuaded that it accidentally became load bearing. It seems quite plausible that it deliberately became load bearing, for reasons which are ill conceived to you but not to the people who had them.
No, they could tell why it was there. It's just that knowing why it was there in the first place doesn't tell you what it's doing now.
The lab was full of smart people who were used to looking at things and making their own well-reasoned conclusions about whether it was OK to change something. This was a warning not to be too hasty about doing that!
IME, having robust alerting and monitoring tools, good rollback plans and procedures/automation should eliminate this fear entirely. If I was afraid to touch anything for fear of breaking it, I'd likely never get anything done.
Sure, but that all sounds like stuff that happens after you deploy/release… you really need to catch things sooner than that. Don’t make the user into the one who has to find the breakage, please. No matter how fast you roll back. Test your software thoroughly!
When I come in on a consulting basis I often have to help developers unwind the unintended effects of years of patches before we can safely make seemingly-simple changes. It's analysis-intensive, and like an archaeologist any artifacts I can dig up that you've left behind providing clues to what was in your head at the time can be helpful. In some cases where a set of functions is super critical we've made it part of the culture every time altered code is checked in to perform a walkthrough analysis to uncover any fresh unintended side-effects, ensure adherence to design requirements and discover and spotlight non-obvious relationships. The key is to turn the unintended or non-obvious into explicit. Sounds painful but in practice due to the high risk attached to getting that portion of code wrong the developers actually love that time for this is worked into the project schedules - it helps them build confidence the alteration is correct and safe.
I wish it were easier to impress the importance of this on inexperienced developers without them having to go through the hell of maintaining code in an environment where it was lacking.
It's a skill and art to keep the appropriate level of conciseness to avoid the documentation (no matter which form it takes) becoming too verbose or unwieldy to be helpful.
this is also why tests are so important. if you want to remove something, you have to think twice... once for the original code and once to fix the broken tests.
This is why I'm against squashing commits in git, that our other development teams don't understand: you're going through extra effort to make it more difficult in the future.
We're probably at a stage where doing this in an automated fashion might be reasonable. The context for why something works a particular way (say in some github repo) is the commit message and change detail, the PR, related issues, related chats in <chat tool of choice>, and then things temporally related to those. Feeding an LLM with the context and generating ADRs for an entire project automatically could be a feasible approach.
> I tried to cut it and it started to bind the saw. I stopped and thought, and then did not cut anymore until I had shored up the upstairs.
https://www.jefftk.com/p/accidentally-load-bearing#fb-231219...
In your circumstance, the board should have had some wiggle but prolly wouldn't have.
This is exactly how I found out that a stud was now load bearing.
Dependencies flow in and onto new components like water. This is a fundamental law.
Even secret interfaces, etc that you think no one could find or touch will be found and fondled and depended upon.
This is why the correct view is that all software is a liability and we should write as little of it as possible, just the minimal that is needed to get the dollars and nothing more.
A good example of a famous failure of this:
https://en.m.wikipedia.org/wiki/Hyatt_Regency_walkway_collap...
That's why it is a good practice to have DiRT or similar exercises. Break things on purpose so that people don't rely on them when they shouldn't.
One of the biggest challenges I find in refactoring/rearchitecting is getting people to separate why something was done from whether it still needs to exist.
Too many times someone has insisted that a weird piece of code needs to exist because we never would have shipped without it. They treat every piece of the structure as intrinsic, and can't conceive of any part of it being scaffolding, which can now be removed to replace with something more elegant and just as functional.
When you put a doorway through a loadbearing wall, contractors and smart people place a temporary jack to brace the wall until a lintel can be substituted for the missing vertical members. Some developers behave as if the jack is now a permanent fixture, and I have many theories why but don't know which ones are the real reasons.
Think of how the SCRAM at Pripyat/Chernobyl caused the reactor to go supercritical because of the graphite tips. The control rods should have reduced reactivity in the reactor, but the first section of the rod increased reactivity.
Or how a hard hat dropped from a height may cause injury.
On the other hand, if you can't figure out what something is for, sometimes the easiest way to find out is to remove it and see what breaks. Maybe that's not such a great idea for that load-bearing stud, but in software it's easy to undo.
Guess I'm thinking of being very aware of how the structure reacts to failure, and not necessarily the strength of all the individual parts.
> It is Hyrums law.
No, it's "Hyrum's" not "Hyrums": https://www.hyrumslaw.com/
Or, to put it another way, figure out what the consequences are before you decide whether you're willing to intend them or not.
https://www.reddit.com/r/ProgrammerHumor/comments/q9x1d2/ask...
I put in a new beam, with posts down to the basement, to replace the load-bearing wall the previous-previous owners had removed: https://www.jefftk.com/p/bathroom-construction-framing
https://www.youtube.com/watch?v=jfBOIbjbLv0
(or, jump to https://youtu.be/jfBOIbjbLv0?t=1560)
Both generalize to various forms of causal and failure analysis in systems design.
Really good, brief post.