So ideally you have some kind of monitoring that reports/shows how many services are alive (and where they live in the cluster), how many errors they generate, and so on. Then, based on some thresholds, you can take a misbehaving instance out of circulation and let it cool down. If certain kinds of errors occur, or errors occur at a certain frequency, the system can notify a site reliability engineer (or equivalent) to check it out. They can then decide whether the service should be permanently removed, log an internal support ticket for the developers or product teams, and so forth.
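To make the "take it out of circulation and let it cool down" part concrete, here is a minimal sketch of a threshold-based breaker. It's just an illustration, not any particular product's API: the class name, the threshold, and the cool-down handling are all assumptions; real systems (and libraries like Resilience4j) add half-open states, sliding windows, and thread safety.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: remove an instance from rotation after too many
// errors, and let it back in once a cool-down period has passed.
class CircuitBreaker {
    enum State { CLOSED, OPEN }  // OPEN = out of circulation, cooling down

    private final int errorThreshold;
    private final Duration cooldown;
    private int errorCount = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    CircuitBreaker(int errorThreshold, Duration cooldown) {
        this.errorThreshold = errorThreshold;
        this.cooldown = cooldown;
    }

    // Called by the monitoring side whenever this instance reports an error.
    void recordError() {
        errorCount++;
        if (errorCount >= errorThreshold) {
            state = State.OPEN;       // take out of circulation
            openedAt = Instant.now(); // start the cool-down clock
            // this is also where you would notify an SRE / open a ticket
        }
    }

    // The router/load balancer asks this before sending traffic here.
    boolean allowTraffic() {
        if (state == State.OPEN
                && Duration.between(openedAt, Instant.now()).compareTo(cooldown) >= 0) {
            state = State.CLOSED;     // cooled down, let it back in
            errorCount = 0;
        }
        return state == State.CLOSED;
    }
}
```

The same idea scales up: the "breaker" can live in a sidecar, a service mesh, or the load balancer itself, but the threshold-plus-cool-down logic is the core of it.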
Production issues are a part of life. You need visibility into issues and their severity. Every company and tech stack is different, and so are their SLAs and uptime promises.
Ads not rendering in an app might be less severe than a pump failure at a fuel station, so they get different kinds of monitoring and different reaction times to faults. Obviously, hospitals, banks, and airlines/aircraft manufacturers have very different requirements and infrastructure from, say, a system that manages all school libraries for a state/province.
There are too many products and approaches to mention here, if a list is what you were looking for. I have one or two favorite approaches and a handful of tools for this kind of thing, half of which are homemade, so not something you can google. But you can google the topic and see a few different approaches: "microservices monitoring java" or "microservices monitoring best practice" or something along those lines will get you on a path. Try to find 5 different approaches, reflect on what each one is missing or how it might help you, and then ponder what you would like to see from a reporting system with hundreds or thousands of services.
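As a starting point for that pondering, here is a toy sketch of the kind of rollup a fleet-wide report needs: out of hundreds of instances, how many per service are alive, and which instances are erroring the most. The `ServiceStatus` record and its fields are made up for the example; your monitoring stack would feed in its own shape of data.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical per-instance status, as collected by your monitoring.
record ServiceStatus(String name, String host, boolean alive, long errorsLastHour) {}

class FleetReport {
    // How many live instances does each service have right now?
    static Map<String, Long> aliveCountByService(List<ServiceStatus> fleet) {
        return fleet.stream()
                .filter(ServiceStatus::alive)
                .collect(Collectors.groupingBy(ServiceStatus::name,
                                               Collectors.counting()));
    }

    // The top-n noisiest instances: the ones an SRE should look at first.
    static List<ServiceStatus> noisiest(List<ServiceStatus> fleet, int n) {
        return fleet.stream()
                .sorted((a, b) -> Long.compare(b.errorsLastHour(), a.errorsLastHour()))
                .limit(n)
                .toList();
    }
}
```

Real dashboards add time windows, severity buckets, and links to runbooks, but at their core they are aggregations like these over a stream of per-instance status reports.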
And then obviously the best lessons will come from production itself.
Good luck!