This is also one of my pet peeves. It's easier than ever to collect this data and analyse it. Unfortunately, most of our clients are doing neither, or they are collecting the logs but carefully ignoring them.
I've lost count of the number of monitoring systems I've opened up just to see a wall of red tapering off to orange after scrolling a couple of screens further down.
At times like this I like to point out that "Red is the bad colour". I generally get a wide-eyed uncomprehending look followed by any one of a litany of excuses:
- I thought it was the other team's responsibility
- It's not in my job description
- I just look after the infrastructure
- I just look after the software
- I'm just a manager, I'm not technical
- I'm just a tech, it's management's responsibility
Unfortunately, as a consultant I can't force anyone to do anything, and I'm fairly certain that the reports I write, peppered with fun phrases such as "catastrophic risk of data corruption" and "criminally negligent", are printed out only so that they can be used as a convenient place to scribble some notes before being thrown in the paper recycling bin.
Remember the "HealthCare.gov" fiasco in 2013? [1] Something like 1% of interested users managed to get through to the site, which cost $200M to develop. I remember Obama brought in a bunch of top engineers from various large IT firms to help out, and the guy from Google gave an amazing talk a couple of months later about what he found.
The takeaway for me was the Google engineer's opinion that the root cause of the failure was simply this: nobody was responsible for the overall outcome. The work was siloed, and every group, contractor, and vendor was responsible only for its own individual "stove-pipe". Individually, each component was all "green lights", but in aggregate the system was terrible.
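The usual antidote to that failure mode is to monitor the outcome rather than the components: a synthetic transaction that walks the same end-to-end path a real user does. A minimal sketch in Python; the URLs, the journey steps, and the five-second budget are all invented placeholders:

```python
#!/usr/bin/env python3
"""Synthetic end-to-end probe: walk the user journey and time the whole thing.

One green/red result for the aggregate path, no matter how many
individually "green" components sit behind it.
"""
import time
import urllib.request

STEPS = [  # a real probe would log in, submit forms, etc.
    "https://example.com/",
    "https://example.com/signup",
    "https://example.com/plans",
]
BUDGET_SECONDS = 5.0  # arbitrary illustrative end-to-end budget

start = time.monotonic()
try:
    for url in STEPS:
        # urlopen raises on 4xx/5xx, so reaching here means the step worked
        with urllib.request.urlopen(url, timeout=10):
            pass
    elapsed = time.monotonic() - start
    status = "OK" if elapsed <= BUDGET_SECONDS else "SLOW"
    print(f"{status}: journey took {elapsed:.2f}s")
except Exception as exc:
    print(f"FAIL: {exc}")
```

If a check like that had existed and someone had owned it, a wall of green component dashboards couldn't have hidden a 1% success rate.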
I see this a lot with over-engineered "n-tier" applications: a hundred brand-new servers that are slow as molasses with just ten UAT users, let alone under production load. The excuses are unbelievable, and nobody pays attention to the simple unalterable fact that this is TEN SERVERS PER USER and it's STILL SLOW!
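Part of the answer is usually latency rather than capacity, and it compounds with every tier you cross. A back-of-envelope sketch, with every number invented purely for illustration:

```python
# Back-of-envelope: why piling on hardware doesn't fix a chatty n-tier app.
# All figures below are made-up assumptions, not measurements.
hop_cost_ms = 1.5    # assumed firewall/load-balancer cost per tier crossing
tiers_crossed = 5    # e.g. web -> app -> service bus -> service -> database
calls_per_page = 40  # assumed serialized backend calls to render one page

per_call_ms = tiers_crossed * hop_cost_ms  # 7.5 ms per backend call
page_ms = calls_per_page * per_call_ms     # 300 ms per page
print(f"{per_call_ms:.1f} ms per call, {page_ms / 1000:.2f} s per page")
# None of that time is spent doing actual work, and adding more servers
# adds hops and idle capacity, not speed.
```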
People ignore the latency costs of firewalls, as one example. Nobody knows about VMware's "latency sensitivity" tuning option, which is a turbo button for load balancers and service-bus VMs. I've seen many environments where ACPI deep-sleep states are left on, so 80% of the CPU cores are off and the other 20% are running at 1 GHz! Then they buy more servers, which reduces the average load further, and they end up with even more CPU cores powered off permanently.
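If you want to see whether a Linux box is doing this to you, the stock sysfs interfaces will tell you. A rough sketch; the paths are the standard kernel ones, but state names vary by hardware and the summary format is my own:

```python
#!/usr/bin/env python3
"""Report where cores spend their idle time, and their current clocks (Linux)."""
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

# Aggregate idle residency per C-state across all cores.
totals = {}
for state in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpuidle/state*"):
    name = read(os.path.join(state, "name"))
    totals[name] = totals.get(name, 0) + int(read(os.path.join(state, "time")))

idle_us = sum(totals.values()) or 1
for name, usec in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:>6}: {100.0 * usec / idle_us:5.1f}% of idle time")

# Current clock per core, in MHz -- downclocked cores sort to the bottom.
clocks = sorted(
    int(read(f)) // 1000
    for f in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
)
if clocks:
    print(f"core clocks: min {clocks[0]} MHz / max {clocks[-1]} MHz")
```

If most of the idle time lands in the deepest C-state and half the cores report ~1000 MHz, you've found the feedback loop: more servers, lower average load, more cores asleep.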
It would be hilarious if it weren't your money they were wasting...
[1] https://en.wikipedia.org/wiki/HealthCare.gov#Issues_during_l...