This is also one of my pet peeves. It's easier than ever to collect this data and analyse it. Unfortunately, most of our clients are doing neither, or they are collecting the logs but carefully ignoring them.
I've lost count of the number of monitoring systems I've opened up just to see a wall of red tapering off to orange after scrolling a couple of screens further down.
At times like this I like to point out that "Red is the bad colour". I generally get a wide-eyed uncomprehending look followed by any one of a litany of excuses:
- I thought it was the other team's responsibility
- It's not in my job description
- I just look after the infrastructure
- I just look after the software
- I'm just a manager, I'm not technical
- I'm just a tech, it's management's responsibility
Unfortunately, as a consultant I can't force anyone to do anything, and I'm fairly certain that the reports I write, peppered with fun phrases such as "catastrophic risk of data corruption" and "criminally negligent", are printed out only so that they can be used as a convenient place to scribble some notes before being thrown in the paper recycling bin.
Remember the "HealthCare.gov" fiasco in 2013? [1] Something like 1% of interested users managed to get through to the site, which cost $200M to develop. I remember Obama brought in a bunch of top engineers from various large IT firms to help out, and the guy from Google gave an amazing talk a couple of months later about what he found.
The takeaway for me was the Google engineer's opinion that the root cause of the failure was simply this: nobody was responsible for the overall outcome. The work was siloed, and every group, contractor, and vendor was responsible only for its own individual "stove-pipe". Individually, each component was all "green lights", but in aggregate the system was terrible.
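The usual antidote to that failure mode is to monitor the outcome rather than the components: a synthetic transaction that walks the same end-to-end path a real user does. A minimal sketch in Python; the URLs, the journey steps, and the five-second budget are all invented placeholders:

```python
#!/usr/bin/env python3
"""Synthetic end-to-end probe: walk the user journey and time the whole thing.

One green/red result for the aggregate path, no matter how many
individually "green" components sit behind it.
"""
import time
import urllib.request

STEPS = [  # a real probe would log in, submit forms, etc.
    "https://example.com/",
    "https://example.com/signup",
    "https://example.com/plans",
]
BUDGET_SECONDS = 5.0  # arbitrary illustrative end-to-end budget

start = time.monotonic()
try:
    for url in STEPS:
        # urlopen raises on 4xx/5xx, so reaching here means the step worked
        with urllib.request.urlopen(url, timeout=10):
            pass
    elapsed = time.monotonic() - start
    status = "OK" if elapsed <= BUDGET_SECONDS else "SLOW"
    print(f"{status}: journey took {elapsed:.2f}s")
except Exception as exc:
    print(f"FAIL: {exc}")
```

If a check like that had existed and someone had owned it, a wall of green component dashboards couldn't have hidden a 1% success rate.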
I see this a lot with over-engineered "n-tier" applications: a hundred brand-new servers that are slow as molasses with just ten UAT users, let alone under production load. The excuses are unbelievable, and nobody pays attention to the simple unalterable fact that this is TEN SERVERS PER USER and it's STILL SLOW!
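Part of the answer is usually latency rather than capacity, and it compounds with every tier you cross. A back-of-envelope sketch, with every number invented purely for illustration:

```python
# Back-of-envelope: why piling on hardware doesn't fix a chatty n-tier app.
# All figures below are made-up assumptions, not measurements.
hop_cost_ms = 1.5    # assumed firewall/load-balancer cost per tier crossing
tiers_crossed = 5    # e.g. web -> app -> service bus -> service -> database
calls_per_page = 40  # assumed serialized backend calls to render one page

per_call_ms = tiers_crossed * hop_cost_ms  # 7.5 ms per backend call
page_ms = calls_per_page * per_call_ms     # 300 ms per page
print(f"{per_call_ms:.1f} ms per call, {page_ms / 1000:.2f} s per page")
# None of that time is spent doing actual work, and adding more servers
# adds hops and idle capacity, not speed.
```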
People ignore the latency costs of firewalls, as one example. Nobody knows about VMware's "latency sensitivity" tuning option, which is a turbo button for load balancers and service-bus VMs. I've seen many environments where ACPI deep-sleep states are left on, so 80% of the CPU cores are off and the other 20% are running at 1 GHz! Then they buy more servers, which reduces the average load further, and they end up with even more CPU cores powered off permanently.
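If you want to see whether a Linux box is doing this to you, the stock sysfs interfaces will tell you. A rough sketch; the paths are the standard kernel ones, but state names vary by hardware and the summary format is my own:

```python
#!/usr/bin/env python3
"""Report where cores spend their idle time, and their current clocks (Linux)."""
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

# Aggregate idle residency per C-state across all cores.
totals = {}
for state in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpuidle/state*"):
    name = read(os.path.join(state, "name"))
    totals[name] = totals.get(name, 0) + int(read(os.path.join(state, "time")))

idle_us = sum(totals.values()) or 1
for name, usec in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:>6}: {100.0 * usec / idle_us:5.1f}% of idle time")

# Current clock per core, in MHz -- downclocked cores sort to the bottom.
clocks = sorted(
    int(read(f)) // 1000
    for f in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
)
if clocks:
    print(f"core clocks: min {clocks[0]} MHz / max {clocks[-1]} MHz")
```

If most of the idle time lands in the deepest C-state and half the cores report ~1000 MHz, you've found the feedback loop: more servers, lower average load, more cores asleep.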
It would be hilarious if it weren't your money they were wasting...
[1] https://en.wikipedia.org/wiki/HealthCare.gov#Issues_during_l...