I assume this is one of those things that changes dramatically based on where you live—for me (western US), this statement seems almost comically exaggerated.
Many insurance companies offer lower premiums if you install a car alarm. So I guess they work at least a little, otherwise they wouldn't lower their premiums.
It may not actually stop a thief, but it may get a thief to choose a car that doesn't have an alarm, or maybe it is just a correlation, but there is at least something.
Still, I think they should be made illegal. They are a nuisance; there are already laws against making excessive noise, and car alarms should be included. And if they create an arms race by getting thieves to prefer cars without alarms, that's even more reason to ban them.
I can think of another reason.
They run their computations and say the insurance can be priced at $100. But hey, what if we just increase it to $105, and then offer a $5 discount to people who have car alarms? We get extra money (average of >100) and people think they are saving money. Who knows, they might even be getting some type of kickback from the car alarm industry for promoting them.
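To make the arithmetic concrete, here is a tiny sketch. The $105 price and $5 discount come from the comment above; the share of customers with an alarm is a made-up number, and `average_premium` is just an illustrative name.

```python
def average_premium(base: float, discount: float, alarm_share: float) -> float:
    """Average premium collected per customer.

    base: list price after the markup (105 in the comment above).
    discount: amount knocked off for customers with a car alarm (5).
    alarm_share: fraction of customers who have an alarm (assumed, 0..1).
    """
    return base - discount * alarm_share


# Hypothetical split: 40% of customers have an alarm.
print(average_premium(105, 5, 0.4))  # 103.0
```

The average only drops back to $100 if every single customer claims the discount.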
Maybe I'm making shit up- but I've grown to hate insurance companies so much that it also makes perfect sense.
There was one particularly annoying car alarm that was popular in the 1990s that had a sequence of "police/ambulance siren" sounds.
Bieww Bieww Bieww Bieww, oooo eeee oooo eeee, rrr rrr rrr rrr, ee oo ee oo, booooooooup booooooooup booooooooup booooooooup, oooooooo eeeeeeee oooooooo eeeeeeee <repeat>
Or something like that.
No SLO in jeopardy, or no immediate measure that needs to be taken? Don't page the oncall; send a low-priority ticket for the service owner to investigate the next business day.
Steps need to be taken, but they're mechanical in nature or otherwise don't give the SRE an opportunity to exercise their brain in an interesting fashion? Replace the alert with an automated handler that only pages the oncall if it encounters an exception.
No playbook, or the playbook consists of useless non-actionable items like, "This alert means the service is running out of frobs"? Write a playbook that explains what the oncall is expected to do when the service needs frobs.
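The three rules above could be sketched as a routing function. This is only an illustration: the `Alert` fields and `Route` names are invented here, and the first check follows the "or" wording literally.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the oncall"
    TICKET = "low-priority ticket for the next business day"
    AUTOMATE = "automated handler, page only on exception"
    WRITE_PLAYBOOK = "page, then write an actionable playbook"


@dataclass
class Alert:
    # All four fields are hypothetical labels a team would assign per alert.
    slo_in_jeopardy: bool
    needs_immediate_action: bool
    steps_are_mechanical: bool
    has_actionable_playbook: bool


def triage(alert: Alert) -> Route:
    # Rule 1: nothing urgent, so the service owner gets a ticket instead of a page.
    if not alert.slo_in_jeopardy or not alert.needs_immediate_action:
        return Route.TICKET
    # Rule 2: mechanical steps belong in automation, not in a human's night.
    if alert.steps_are_mechanical:
        return Route.AUTOMATE
    # Rule 3: a page without an actionable playbook is a gap to fix.
    if not alert.has_actionable_playbook:
        return Route.WRITE_PLAYBOOK
    return Route.PAGE
```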
Edit: A dead reply asks if I've ever experienced a novel incident. Of course. Say, for instance, a "This should never happen" error-level log is suddenly happening like crazy, for the first time ever. In that case, you page the oncall, they do their best to debug it, see if they can reach the SWE service owners, read through the code to see if it could be an indicator that SLOs are being violated (e.g., user data corruption) or might be violated soon, and then write a stub playbook to be fleshed out the next business day, probably alongside a code change to handle this situation without spamming the logs so much.
For example, the web application crashing was logged as a DEBUG statement, but starting was logged at an ERROR level. This was clearly done at some point because DEBUG generated far too much log info w/millions of active users, but some Engineer wanted to know that the app started. Gross.
I solved for this by doing a couple things. The first was to define standards for log levels, ability to correlate log statements with each other for a given request, and to define the level of context a "proper" log level should provide.
For example, FATAL = there's no way anything can work properly. These are pretty rare, but incorrect configuration values were a common culprit. ERROR indicates something, possibly transient, going wrong. An occasional one is not a big deal and can wait until later, but a rapid accumulation could mean something more serious is going on. INFO contained information about the state of the system, such as general measures of activity and other signals to indicate the system is working as expected. Most of our metrics capture was instrumented based off these statements.
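A minimal sketch of what such a standard might look like with Python's standard `logging` module. The logger name, the `request_id` field, and the messages are assumptions for illustration, not the setup described above. Python names its top level CRITICAL (`logging.FATAL` is an alias for it).

```python
import io
import logging


def make_logger(stream) -> logging.Logger:
    """Logger whose every line carries a level and a request ID."""
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)  # DEBUG is dropped in production
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def for_request(logger: logging.Logger, request_id: str) -> logging.LoggerAdapter:
    """Bind a request ID so statements for one request can be correlated."""
    return logging.LoggerAdapter(logger, {"request_id": request_id})


buf = io.StringIO()
log = for_request(make_logger(buf), "req-42")
log.debug("cache miss for key=abc")                          # suppressed
log.info("handled request in 12 ms")                         # system state
log.error("upstream timeout, will retry")                    # possibly transient
log.critical("invalid configuration, cannot serve traffic")  # FATAL
print(buf.getvalue())
```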
In terms of the messages, we rapidly evolved the quality of the messages. For something like the aforementioned configuration error, the system initially just spat out an "Unexpected error" and a module name. The first improvement then stated something like "invalid configuration value" and finally we ended up on a message that stated the value was incorrect, identified which configuration value was wrong, and had a code that referenced documentation and escalation owner.
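The final form of that message could look something like the sketch below. The error code `CFG-001`, the owner `platform-team`, and the `port` setting are all invented for the example.

```python
class ConfigError(Exception):
    """Configuration error that says what is wrong, where, and who to ask."""

    def __init__(self, key: str, value: object, expected: str, code: str, owner: str):
        self.key, self.code, self.owner = key, code, owner
        super().__init__(
            f"[{code}] invalid configuration value {key}={value!r}: "
            f"expected {expected}. See docs for {code}; escalate to {owner}."
        )


def load_port(config: dict) -> int:
    """Validate one (hypothetical) setting and fail with full context."""
    value = config.get("port")
    if not isinstance(value, int) or not 1 <= value <= 65535:
        raise ConfigError("port", value, "an integer in 1..65535",
                          code="CFG-001", owner="platform-team")
    return value
```

Compared with "Unexpected error", the oncall now knows which key to fix, what a valid value looks like, and where to escalate.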
When all was said and done, we'd reduced our downtime from hours per year to less than 5 minutes, eliminated over 95% of our pages, and reduced escalations to Engineering from several days per week to a level where it was hard to remember the last one.
As the head of Engineering, I had to fight an uphill battle against the product & sales team for almost a year to make all of this happen, but I was fully vindicated when we were acquired and our operational maturity was lauded during the due diligence process.
> No SLO in jeopardy, or no immediate measure that needs to be taken?
A little contradictory here. Maybe your "or" should've been an "and". But anyways:
I can think of several scenarios in which no SLO is in jeopardy but you should still get paged. All of them boil down to "Yes it is important for the business, our users, our brand" but still not worth having an SLO around because it's either 1) hard to measure or 2) too difficult to implement.
In an ideal world with infinite engineering resources on your team, you could have and _maintain_ an SLO for every part of the business that your system affects. In the real world, trade-offs need to be made and certain key SLOs prioritized.
Could you describe one of the scenarios you're envisioning?
It's quite correct to send a "success" message when a batch process is completed successfully, but it's quite wrong to send that message to a human. It should be sent to a machine that should translate a missing success message into an error message/alert for humans to respond to.
For example, I have a set of nightly backup jobs. The last step of each backup process is to send a success message to my monitoring system. I only get a "Missing Backup" alert when the monitoring system detects that it didn't receive the success message it expected for a particular backup.
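A rough sketch of that pattern (sometimes called a heartbeat or dead man's switch). The class, the job names, and the 26-hour tolerance are assumptions for illustration, not the actual monitoring system described.

```python
from datetime import datetime, timedelta
from typing import Optional


class SuccessMonitor:
    """Turns missing success messages into alerts."""

    def __init__(self, max_age: timedelta):
        self.max_age = max_age
        self.last_success: dict[str, Optional[datetime]] = {}

    def expect(self, job: str) -> None:
        # Register the job so that one which never reports is also caught.
        self.last_success.setdefault(job, None)

    def record_success(self, job: str, at: datetime) -> None:
        # Called by the last step of each backup job.
        self.last_success[job] = at

    def missing(self, now: datetime) -> list[str]:
        # Jobs with no success message, or one older than max_age.
        return sorted(
            job for job, at in self.last_success.items()
            if at is None or now - at > self.max_age
        )
```

A human only hears about the jobs returned by `missing()`; the success messages themselves never reach anyone's inbox.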
My old boss didn't seem to understand the concept that people don't generally notice missing messages. Or he was too lazy/incompetent to use a monitoring system that could translate gaps in successes into errors.
Of course, it's a different problem if those notifications have a mix of actionable and non-actionable messages (e.g. both success and error messages). Then it's a signal/noise problem.
> It’s a trap.
> In the long run, false positives can — and will often — hurt you more than false negatives. Let’s learn about the base rate fallacy.
Not sure about anyone else, but speaking of alarms, this style of writing trips my "self-promoting snake-oil Internet bullshitter" alarm. It's like nails on a damn chalkboard, and if you're writing like this, you've already lost me; however, maybe I ought not be pointing that out, since signals are nice to have.
Incidentally, I wasn't sure which way the author was gonna go with the core analogy. My smoke alarms have false-alarmed probably 10x as much as my car alarm, even counting times one of us has hit the alarm button on the fob by accident. I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a freezer, as I have with a smoke alarm.
(If I were writing like the author I suppose that last part would have read:
"I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a chest freezer.
I have, with a smoke alarm."
Except also I'd have found a way to use "we" and "you" a bunch.)
A trope of this style is “{interesting half story} but more on that later”.
I don’t think it is a big deal and I don’t see much self promotion here other than vanilla blogging, i.e. sounds like this person is knowledgeable let’s check their bio.
There's been quite a bit of research done, generally easy to find if you look, that talks about the difference and tests them, but the short summary:
- Ionization type sensors detect the products of fast flaming combustion and "things cooking in the kitchen." Your oven, if a bit dirty, will reliably trip an ionization type. They are quick on the draw for this. The downside is that they're very, very poor at detecting the sort of slow, smoking, smoldering combustion that is associated with house fires that kill people in the middle of the night.
- The photoelectric type is very good at detecting smoke in the air - but it isn't nearly as prone to false triggers on ovens, a burner burning some spills off, etc.
They've been A/B tested in a wide variety of conditions, and in some cases, the ionization type is a bit quicker. In other cases, the ionization type is slower, by time ranges north of half an hour - I've seen some test reports where there was a 45 minute gap, while the photoelectric type was going off, before the ionization type fired!
In general, "rapid fires during the day" are somewhat destructive to property, but rarely kill people. If your kitchen catches on fire while you're cooking, it may burn the house down, but generally people are able to get out.
The fires that kill people are "slow starting fires during the night" - the sort that smolder for potentially hours, often slowly filling the house with toxic smoke, before actually bursting into open flames. On this sort of fire, the photoelectric type will fire long, long before the ionization type - in some cases, they get around to alarming quite literally "after the occupants are dead from the smoke."
Using smoke alarms as a way to talk about monitoring systems is nice, but in terms of actual smoke detectors, get at least a few photoelectric sorts in the main areas of your home.
Do not get the "combined sensor" sort, since these tend to be and-gated and the worst of both worlds.
Edited to add some resources:
A presentation on the matter from a while back by one of the experts in this field: https://wahigroup.com/Resources/Documents/Ion%20vs%20Photo%2...
Another paper: https://www.semanticscholar.org/paper/Detection-of-Smoke-%3A...
> Full-scale fire tests are carried out to study the effectiveness of the various types of smoke detectors to provide an early warning of a fire. Both optical smoke detectors and ionization smoke detectors have been used. Alarm times are related to human tenability limits for toxic effects, visibility loss and heat stress. During smouldering fires it is only the optical detectors that provide satisfactory safety. With flaming fires the ionization detectors react before the optical ones. If a fire were started by a glowing cigarette, optical detectors are generally recommended. If not, the response time with these two types of detectors are so close that it is only in extreme cases that this difference between optical and ionization detectors would be critical in saving lives.
Dual sensors are not and-gated. While nobody will admit what algorithm they use, they detect most fires, unlike the single-sensor types.
As for dual sensors and gating... do you actually trust your life to "nobody will admit what algorithm they use"?
My house has all the smoke detectors wired together (they're on an AC circuit, with battery backup, with a signal line running between them all), so I have some photoelectric and some ionization, depending on where in the house they are.
Edit: in large scale fire alarm systems there also are rules about combinations of triggered sensors that cause immediate escalation (if there is smoke and elevated temperature in two adjacent zones, it probably is not a false alarm and such things, often it even takes into account the failure modes of the physical alarm loop wiring). This is an interesting idea for IT monitoring: page someone only when multiple metrics indicate an issue.
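Translated to IT monitoring, the idea might look like the sketch below: page only when several signals agree. The metric names, thresholds, and the "at least two" rule are invented for the example.

```python
def breached(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Names of the signals currently above their threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )


def should_page(metrics: dict[str, float], thresholds: dict[str, float],
                min_signals: int = 2) -> bool:
    """Escalate to a human only when enough signals agree."""
    return len(breached(metrics, thresholds)) >= min_signals


# Hypothetical thresholds for one service.
THRESHOLDS = {"error_rate": 0.01, "p99_latency_ms": 500.0, "queue_depth": 1000.0}

print(should_page({"error_rate": 0.002, "p99_latency_ms": 900.0}, THRESHOLDS))  # False
print(should_page({"error_rate": 0.05, "p99_latency_ms": 900.0}, THRESHOLDS))   # True
```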
Where I was expecting the author to go:
- Clearly was talking about residential smoke detectors, not commercial. That could have been explicit.
- Smoke detectors do have a high false-positive rate but almost always at the right time. A home smoke alarm going off while I'm cooking is quite different to a smoke alarm going off when I'm sleeping. To the author's point, there are very few false positives while I'm sleeping so when they happen, I'm getting up.
Speaking of the commercial context, I wonder what sort of businesses would get a lot of false alarms and how that varies across industries.
I would actually prefer if rules mandated they could only have large capacitors and just NOT CARE if the power goes out.
Next would be to require a sensor pick up an area that's IR hot AND smoke to go off. I'm sick of bathroom steam sometimes setting them off too.
Finally, ONLY FOR EMERGENCY would the loud and annoying cry be allowed. Tests, low battery, anything not indicating a clear and immediate threat to life should be a low noise, low light, indication. Maybe a 2 second low-quality sound clip that says 'bat' at a soft voice volume with a strobe at the end of the voice (when a human would be looking for the noise). Fog/Steam/etc, E.G. possible fire without detected heat but at a weak detection level, could also use the 'info' level of alert, not the DANGER level.
We have a similar message about setting monitoring thresholds in our documentation [2] because users have to explicitly specify a downtime timeout before they’re alerted about their website / API endpoint / cron job being down. The timeout / "grace period" is necessary because in many cases a failure is some transient network glitch which will fix itself before a human is alerted.
If you make the timeout too short, you’ll get lots of false positive alerts, and as the article says, your on-call engineers will be overwhelmed or just start ignoring the alerts.
If you make the timeout too long, it just takes that many minutes of downtime longer before you find out about it.
It may sound counterintuitive, but the latter is usually preferable. :)
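A minimal sketch of the grace-period logic, assuming check results arrive as (up/down, timestamp in seconds) pairs. The class name and the 180-second value are illustrative only, not how any particular product implements it.

```python
from typing import Optional


class DowntimeAlerter:
    """Alerts only after a check has been failing for a full grace period."""

    def __init__(self, grace_s: float):
        self.grace_s = grace_s
        self.down_since: Optional[float] = None
        self.alerted = False

    def observe(self, up: bool, now: float) -> bool:
        """Record one check result; return True if an alert should fire now."""
        if up:
            # Recovery (including a transient glitch) resets everything.
            self.down_since = None
            self.alerted = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.grace_s:
            self.alerted = True  # fire once per outage
            return True
        return False


alerter = DowntimeAlerter(grace_s=180)
print(alerter.observe(False, 0))    # False: just went down
print(alerter.observe(True, 60))    # False: glitch fixed itself
print(alerter.observe(False, 120))  # False: new outage starts
print(alerter.observe(False, 300))  # True: down for 180 s straight
```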
[1] https://heiioncall.com/blog/kubernetes-liveness-probes-and-c...
Kubernetes can only do so much for you here. Liveness probes are designed to restart categorically broken software; for example, a combination of two requests causes no further requests to be handled. Maybe that's rare enough that a simple restart is an improvement over a replica that times out all requests directed at it. (You can fortunately see this behavior in real-world scenarios. You can also architect your application to self-check, of course, but the common "if path == '/healthz' { response.WriteHeaders(200) }" isn't this.) Readiness probes can shed load, but only by loading the other replicas by taking this replica's endpoints out of the service until things calm down. If the system as a whole doesn't have enough capacity, then picking one replica and saying "you can rest for 5 minutes" is just going to cause the other replicas to become overloaded and for the whole system to eventually fail.
There are other techniques here that work better.
Rate limiting is very common inside Big Tech; when a calling service induces too much load, it's told to simply go away via a fast path. That can prevent the thundering herd by allowing a % of requests to make progress, while other requests are rejected. Some progress is made while the system is degraded, and if there is spare capacity and a buffer, eventually the buffer is drained. (This post is too long to rant about buffering in distributed systems and what backpressure is, but if a buffer size of 1 can become full, then a buffer of any size can become full. So buffering is rarely a solution, but often the cause of outages.)
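The comment doesn't say which algorithm is used; a token bucket is one common way to do this kind of fast-path rejection. A sketch, with made-up rate and capacity and the clock passed in explicitly so it is easy to test:

```python
class TokenBucket:
    """Admit up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # cheap rejection: the caller is told to go away


bucket = TokenBucket(rate=1.0, capacity=2.0)
print([bucket.allow(0.0) for _ in range(3)])  # [True, True, False]
print(bucket.allow(1.0))                      # True
```

Rejected requests cost almost nothing, so the requests that are admitted can still make progress while the system is degraded.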
Circuit breaking is also common, where when a significant fraction of requests end with 5xx (usually a timeout), the load balancer just fast-paths a 5xx response for that replica's share of requests. This actually reduces load on the system, allowing it to process some requests instead of becoming a fleet of replicas in CrashLoopBackoff.
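A simplified sketch of a circuit breaker for one replica. Real load balancers use their own algorithms; the window size, failure threshold, and cooldown here are assumptions.

```python
from collections import deque


class CircuitBreaker:
    """Fast-fails requests when too many recent ones have failed."""

    def __init__(self, window: int = 10, failure_threshold: float = 0.5,
                 cooldown_s: float = 30.0):
        self.results: deque = deque(maxlen=window)  # True = success
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.open_until = 0.0

    def allow(self, now: float) -> bool:
        """False while the breaker is open: answer 5xx without calling the replica."""
        return now >= self.open_until

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of a request that was actually sent."""
        self.results.append(ok)
        if len(self.results) == self.results.maxlen:
            failure_fraction = self.results.count(False) / len(self.results)
            if failure_fraction >= self.failure_threshold:
                self.open_until = now + self.cooldown_s
                self.results.clear()  # start fresh after the cooldown


cb = CircuitBreaker(window=4, failure_threshold=0.5, cooldown_s=30.0)
for t, ok in enumerate([True, False, True, False]):
    cb.record(ok, now=float(t))
print(cb.allow(10.0))  # False: open until t=33
print(cb.allow(33.0))  # True: cooldown over, try the replica again
```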
CPU limits are another complicating factor, but not much of one. Every piece of software runs with a CPU limit; only a finite number of CPUs can fit in your data center, or the Universe for that matter. A common problem that people run into is multithreaded software that doesn't understand that it's CPU limited. This does not cause failures, but typically induces a weird tail latency. CPU limits are enforced at discrete intervals; every 100ms, you're allowed to use 1 CPU. But you're also allowed to use 10 CPUs for 10ms, and sit idle for 90ms. (The system will enforce this; you may want to do work on 10 CPUs, but you're going to sleep after that first 10ms burst.) Usually, your system can be architected with CPU limits in mind; for example, by setting something like GOMAXPROCS to the CPU limit instead of the number of physical CPUs, avoiding the ability to consume the time allotted before the accounting interval ends. But these mistakes very rarely lead to cascading failure, just very confusing 99.9%-ile latency numbers when the system is under load and a request spans that forced-idle interval.
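The burst-then-idle arithmetic can be worked through with a small sketch. It assumes fully CPU-bound threads and enough physical cores to run them all in parallel; the function name is made up.

```python
def throttle_profile(cpu_limit: float, period_ms: float, busy_threads: int) -> tuple[float, float]:
    """Return (run_ms, idle_ms) of wall-clock time per accounting period."""
    quota_ms = cpu_limit * period_ms  # CPU-time allowed per period
    # N threads running in parallel burn the quota N times faster.
    run_ms = min(float(period_ms), quota_ms / busy_threads)
    return run_ms, period_ms - run_ms


print(throttle_profile(1, 100, 10))  # (10.0, 90.0): 10 ms burst, 90 ms forced idle
print(throttle_profile(1, 100, 1))   # (100.0, 0.0): one thread never gets throttled
```

Capping the number of runnable threads at the CPU limit (the GOMAXPROCS suggestion) corresponds to the second case.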
Anyway, I have laid all of this foundation so I can get to my rant. There are a lot of "Kubernetes best practices" out there, and two that I have run into are that all applications must have a liveness probe, and that all applications must run at a Guaranteed QoS (and have cpu request == cpu limit != 0). These are interesting things to think about, but not a guaranteed way to enhance reliability (or lower cost). Your workload might be burstable, in which case a Burstable QoS might be exactly what you need; you trade reliability (a guarantee that all containers will be able to use a certain amount of CPU) for efficiency (you can dip into foobar service's CPU shares when barbaz needs to do a rare high-CPU activity). Liveness probes can be good too, where you have a single-threaded event loop that can get wedged accidentally, and restarting is the only way out. But, neither practice can be blindly applied to every workload that can be run in a container.
The article is about finding the appropriate sensitivity of alerts on some signal in order to maximize the predictive value.
But you should care more about the quality of the signals you are monitoring than about the sensitivity of your thresholds.
The article mentions load-average as an example signal, but to me, that's a poor signal to monitor. Instead, if your SLO is defined for error rate, alert on error rate.
Alerts on your SLO will have a high predictive value for predicting violations of your SLO, by definition. The tunable parameter here is the time window, not the threshold. E.g. if your error budget is defined for a 30d window, you may want alerts at the SLO threshold for 24h and 1h windows.
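A rough sketch of the multi-window idea: the threshold stays at the SLO's error budget, and only the window length varies. The event format, function names, and the 99% target are assumptions for the example.

```python
def error_rate(events: list[tuple[float, bool]], now: float, window_s: float) -> float:
    """Fraction of failed requests among (timestamp, ok) events in the window."""
    recent = [ok for ts, ok in events if now - window_s < ts <= now]
    if not recent:
        return 0.0
    return recent.count(False) / len(recent)


def violated_windows(events: list[tuple[float, bool]], now: float, slo_target: float,
                     windows_s: tuple[float, ...] = (3600, 86400)) -> list[float]:
    """Windows whose error rate exceeds the SLO's allowed error rate."""
    budget = 1.0 - slo_target
    return [w for w in windows_s if error_rate(events, now, w) > budget]


# Hypothetical traffic: a quiet day, then 5 errors in 100 requests in the last hour.
now = 86400.0
events = [(40000.0, True)] * 9900 + [(86000.0, True)] * 95 + [(86000.0, False)] * 5
print(violated_windows(events, now, slo_target=0.99))  # [3600]
```

Here the 1h window trips (5% errors) while the 24h window does not (0.05%), which tells the oncall the problem is recent and sharp rather than a slow burn.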
Alert on causes, not symptoms.
This is so true. Case in point: Growatt inverters have - like every other inverter - a maximum voltage on the grid connection at which they will shut down. They're pretty trigger happy about this and fail to take into account the resistance of the feed wire of the inverter to the (much lower impedance) grid hookup. As a result even on cabling sized properly for the interconnect they tend to falsely trigger well before the point where they should. The only way to avoid this problem is to either hack into the inverter somehow (which I've so far failed to do) or to use oversized cables (which isn't always an option).
The sensitivity is fantastic, the quality of the signal is hopeless. Obviously they err on the side of caution but the margin is so ridiculously large that you end up losing a lot of usable power for no reason at all. At the very least it should allow a resistance for the interconnect to be specified so that it can take into account the voltage drop across that wire, which at 10A is appreciable for even short runs of fairly beefy cable.
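The compensation being asked for is just Ohm's law. The sketch below is a simplification (purely resistive wire, no reactance) and not how any inverter firmware works; the 253 V limit, 0.3 ohm resistance, and readings are made-up numbers.

```python
def estimated_grid_voltage(measured_v: float, current_a: float,
                           feed_resistance_ohm: float) -> float:
    """Voltage at the grid hookup: inverter reading minus the drop across the feed wire."""
    return measured_v - current_a * feed_resistance_ohm


def should_trip(measured_v: float, current_a: float, limit_v: float,
                feed_resistance_ohm: float = 0.0) -> bool:
    """Trip on the estimated grid voltage instead of the raw terminal voltage."""
    return estimated_grid_voltage(measured_v, current_a, feed_resistance_ohm) > limit_v


# Inverter reads 254 V at its terminals while exporting 10 A.
print(should_trip(254.0, 10.0, limit_v=253.0))                           # True
print(should_trip(254.0, 10.0, limit_v=253.0, feed_resistance_ohm=0.3))  # False
```

With the wire's 3 V drop accounted for, the grid side is at about 251 V and there is no reason to shut down.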
You obey the smoke alarm because the cost of ignoring the alarm when it is a true positive is potentially infinite (you die). You ignore the car alarm because (1) most likely it is a false positive but also (2) most likely it is somebody else's car.
Witness companies like Rivian triggering car alarms on aggressive behavior detected from ML on video. Don't even need to touch the car.
In any case, not all signals are the same. Most systems have a lot of interacting components, and what turns out to be dangerous is usually a combination of factors; in the end, what determines whether something was a real problem is whether the system is doing what it should. You can set thresholds by guesswork, but you must check them against whether the system actually works.
And they should be actionable too, at least in the case of alerts, as opposed to slow-day notifications or metrics that give context to perceived problems and could take the guesswork out of the thresholds.
The end advice is right: you want to build the smoke detector, not the car alarm. But … getting that done, now that's the trick. If the org has car alarms, that's the same org whose PMs will not see the "impact" of that ticket to get the monitoring made ship shape, and that ticket will be backlog-icebox-graveyard'ed.
I've had to get a number of "technical" non-technical roles to try to see that, no, the monitoring software cannot automatically generate¹ metrics around your application. Yes, you have to actually instrument the code to add those!
Then that gets combined with systems that are just … not the best? at their job: Datadog has so many rough corners around statistics, where graphs will alias, graphs will change shape depending on zoom, units are a PITA or nowhere to be seen, etc. Sumo has a god-awful UI (literally tabs inside tabs, and I can't copy the URL?!) and barely understands structured logging. Splunk is marginally better. Pagerduty permits only the simplest handling of events. Don't limit me to a handful of tailored rules: rules are logic, logic is a function: give me WASM. And I want usable business-hours only alerts².
Self-hosted systems are perpetually met with "that's not our core focus", but nobody ever seems to convert the cost of managed monitoring/alert systems into "number of FTE that could be hired to maintain a self-hosted system".
(Oddly, the example in the article is a car alarm. Load avg. is, IMO, a useless metric. Better to measure CPU consumption and IOPS consumption, separately, or probably better, more derivative stats around the things doing the IOPS/CPU.)
¹yes, most systems come with some collectors to get system-level stuff like CPU usage, etc. I mean metrics specific to your application.
²PD claims to support them, but in practice, they don't work: alerts received off hours don't alert, true … but they never alert once business hours resume, either! If you're in an org trying to dig itself out of a mess, you need them to not die in the low-prio pile.
(Ugh. Give me ACL systems in these systems that don't suck: PD locks the routing rules behind like "Admin", and security doesn't want to grant the rank and file "Admin", and so 80% of my devs have no idea how the system works because they're not allowed to see how the system works! Give me the ability to do a WMA business-days-only line for diurnal patterns! The list just goes on and on…)
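For what it's worth, the business-hours behavior wanted in footnote ² is not much logic. A sketch, assuming Monday to Friday, 09:00 to 17:00, naive local datetimes, and no holidays (all of which are assumptions, and none of which is any vendor's API):

```python
from datetime import datetime, time, timedelta

OPEN, CLOSE = time(9, 0), time(17, 0)  # assumed business hours


def delivery_time(received: datetime) -> datetime:
    """When a low-priority alert should actually notify someone."""
    is_weekday = received.weekday() < 5
    if is_weekday and OPEN <= received.time() < CLOSE:
        return received  # in hours: alert right away
    day = received.date()
    if not is_weekday or received.time() >= CLOSE:
        day += timedelta(days=1)  # today's window is already over
    while day.weekday() >= 5:
        day += timedelta(days=1)  # skip the weekend
    # Deferred, not dropped: fire when business hours resume.
    return datetime.combine(day, OPEN)


print(delivery_time(datetime(2024, 1, 5, 18, 0)))  # Friday evening -> 2024-01-08 09:00:00
```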