
saulpw 2y ago

On the other hand, I've spent weeks with a team looking for a bug, and by the time we found something that appeared to fix it, we were way behind on everything else that really needed to get done. How long would it take to find the root cause? We tried. It wasn't worth weeks or months of effort, to anyone. This isn't JPL and human lives weren't on the line. We just needed it not to crash so we could all get on with the "real" task of shipping useful and profitable software.


tetha 2y ago

Yeah, that is why software engineering and system operations are hard.

For example, the article doesn't get to a root cause in an absolute sense. There is no definitive SEGFAULT in the OS causing the misbehavior. However, they narrow the crash down to a gif: if the gif is in, it crashes, and if the gif is out, it doesn't. If the gif is loaded some other way, it crashes too. At that level, to me, that would be enough, because we're just users of the browser's rendering there.

Finding a solid cause that demonstrates and reproduces the problem, and building a workaround on it at a boundary you're unwilling to cross, can be fine. If the boundary is within the company, it absolutely is fine, as long as you escalate beyond that boundary.

However, I have enough teams who are like "Oh, we set all values to 25 one by one, and when we got to flum-value at 25 it stopped crashing. Fixed." Why 25? Who knows. Why flum? Who knows. Maybe the other value that changed at the same time fixed it? Who knows. Do we use 26 once it starts crashing again? Fuck knows. Maybe 24 is better?

We have no explanation for 25, so why would 25 be a good fix?
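To make the difference concrete, here is a minimal sketch of checking each setting in isolation rather than piling changes on top of each other. Everything in it is hypothetical: the `isolate_settings` helper, the setting names, and the toy `toy_crashes` rule stand in for whatever your real reproduction step is.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def isolate_settings(
    baseline: Config,
    candidate: Config,
    crashes: Callable[[Config], bool],
) -> List[str]:
    """Return the settings that, changed alone from the crashing baseline,
    stop the crash. Each trial starts from a fresh copy of the baseline,
    so an earlier change cannot take credit for a later one."""
    if not crashes(baseline):
        raise ValueError("baseline does not reproduce the crash")
    fixes = []
    for key, value in candidate.items():
        trial = dict(baseline)   # fresh copy: only one setting differs
        trial[key] = value
        if not crashes(trial):
            fixes.append(key)
    return fixes


# Hypothetical stand-in for "run the thing and see if it falls over".
def toy_crashes(cfg: Config) -> bool:
    return cfg["buffer"] < 16


baseline = {"flum": 10, "buffer": 8, "retries": 3}
candidate = {"flum": 25, "buffer": 25, "retries": 25}
print(isolate_settings(baseline, candidate, toy_crashes))  # ['buffer']
```

In this toy setup, flum at 25 does nothing on its own. That still doesn't explain why 25, but at least you know which knob to go and ask about.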
