Car alarms and smoke alarms: tradeoff between sensitivity and specificity (2012)

(blog.danslimmon.com)

134 points · lngarner · 3y ago · 78 comments

49 comments · 14 top-level

sammalloy · 3y ago · 10 in thread

The entire car alarm industry is a scam, promoted by Republican congressman Darrell Issa. It has seriously disrupted our lives in every way imaginable and has drowned out the beauty of nature. I can’t think of a single car that has been protected by a car alarm since they were invented. They are useless and should be banned for the health and safety of mitigating noise pollution.

BalinKing · 3y ago

> It has seriously disrupted our lives in every way imaginable

I assume this is one of those things that changes dramatically based on where you live—for me (western US), this statement seems almost comically exaggerated.

dahfizz · 3y ago

Yeah, I can't remember the last time I heard a car alarm.

GuB-42 · 3y ago

> I can’t think of a single car that has been protected by a car alarm since they were invented.

Many insurance companies offer lower premiums if you install a car alarm. So I guess they work at least a little, otherwise they wouldn't lower their premiums.

It may not actually stop a thief, but it may get a thief to choose a car that doesn't have an alarm, or maybe it is just a correlation, but there is at least something.

Still, I think they should be made illegal. They are a nuisance, there are already laws against making excessive noise, and car alarms should be included. And if they create an arms race, by getting thieves to prefer cars without alarms, that's even more reason to ban them.

eatsyourtacos · 3y ago

>Many insurance companies offer lower premiums if you install a car alarm. So I guess they work at least a little, otherwise they wouldn't lower their premiums.

I can think of another reason.

They run their computations and say the insurance can be priced at $100. But hey, what if we just increase it to $105, and then offer a $5 discount to people who have car alarms? We get extra money (average of >100) and people think they are saving money. Who knows, they might even be getting some type of kickback from the car alarm industry for promoting them.

Maybe I'm making shit up, but I've grown to hate insurance companies so much that it also makes perfect sense.

eutectic · 3y ago

If you change the requirements for an insurance policy you attract a different risk pool. Just like health insurance companies offering gym membership to find healthier customers.

kloch · 3y ago

This was true in the early 1990s but I rarely hear a car alarm these days.

There was one particularly annoying car alarm that was popular in the 1990s that had a sequence of "police/ambulance siren" sounds.

Bieww Bieww Bieww Bieww, oooo eeee oooo eeee, rrr rrr rrr rrr, ee oo ee oo, booooooooup booooooooup booooooooup booooooooup, oooooooo eeeeeeee oooooooo eeeeeeee <repeat>

Or something like that.

jimbob45 · 3y ago

For poor people whose ability to live depends on having a car, car alarms must be at least sort of useful to know if your car is being stolen at night. I’m sure they’re just a noisy inconvenience to the wealthy though.

asdff · 3y ago

Usually when they are going off in my neighborhood the car is streetparked and that's how they end up getting triggered (sometimes a loud motorcycle can even do it). So the owner could be like well over a 10 minute walk away and out of earshot anyhow. As a result most of the time when I hear a car alarm it rings until the thing shuts itself off automatically to save the car battery.

izacus · 3y ago

Does the alarm ever prevent theft?

smeagull · 3y ago

When I hear a car alarm, I look away and hope the annoying car gets stolen as fast as possible.

raldi · 3y ago · 5 in thread

When the oncall gets paged, an SLO should be in jeopardy in a way that requires immediate measures to be taken by a well-trained human as described in actionable terms in a linked playbook.

No SLO in jeopardy, or no immediate measure that needs to be taken? Don't page the oncall; send a low-priority ticket for the service owner to investigate the next business day.

Steps need to be taken, but they're mechanical in nature or otherwise don't give the SRE an opportunity to exercise their brain in an interesting fashion? Replace the alert with an automated handler that only pages the oncall if it encounters an exception.

No playbook, or the playbook consists of useless non-actionable items like, "This alert means the service is running out of frobs"? Write a playbook that explains what the oncall is expected to do when the service needs frobs.

Edit: A dead reply asks if I've ever experienced a novel incident. Of course. Say, for instance, a "This should never happen" error-level log is suddenly happening like crazy, for the first time ever. In that case, you page the oncall, they do their best to debug it, see if they can reach the SWE service owners, read through the code to see if it could be an indicator that SLOs are being violated (e.g., user data corruption) or might be violated soon, and then write a stub playbook to be fleshed out the next business day, probably alongside a code change to handle this situation without spamming the logs so much.
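
The triage rules above can be sketched as a small routing function. This is only an illustration of the ordering of the checks; the `Alert` fields and `Route` names are hypothetical and not taken from any particular paging tool.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the oncall"
    TICKET = "low-priority ticket for the service owner"
    AUTOMATE = "automated handler that pages only on an exception"
    WRITE_PLAYBOOK = "write an actionable playbook for this alert"


@dataclass
class Alert:
    # Hypothetical fields, chosen only for this illustration.
    slo_in_jeopardy: bool
    needs_immediate_action: bool
    action_is_mechanical: bool
    has_actionable_playbook: bool


def route(alert: Alert) -> Route:
    """Apply the rules from the comment in order."""
    # No SLO in jeopardy, or nothing that must be done right now: ticket.
    if not (alert.slo_in_jeopardy and alert.needs_immediate_action):
        return Route.TICKET
    # Mechanical steps should be done by a machine, not a human.
    if alert.action_is_mechanical:
        return Route.AUTOMATE
    # A page without an actionable playbook needs a playbook written.
    if not alert.has_actionable_playbook:
        return Route.WRITE_PLAYBOOK
    return Route.PAGE
```

The novel-incident case from the edit would fall into the `WRITE_PLAYBOOK` branch here: the page still goes out, and the stub playbook is the follow-up.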

fatnoah · 3y ago

In a previous life as a full-stack Engineer at a startup, this was my white whale. The state of logging, monitoring, and alerting was such that signal quality was low, and only indirect observations of the system were possible since the logging was borderline useless. The result was multiple pages per night, with each one resulting in a scavenger hunt because signal was so low that it was nigh impossible to even identify what playbook to run.

For example, the web application crashing was logged as a DEBUG statement, but starting was logged at an ERROR level. This was clearly done at some point because DEBUG generated far too much log info w/millions of active users, but some Engineer wanted to know that the app started. Gross.

I solved for this by doing a couple things. The first was to define standards for log levels, ability to correlate log statements with each other for a given request, and to define the level of context a "proper" log level should provide.

For example, FATAL = there's no way anything can work properly. These are pretty rare, but incorrect configuration values were a common culprit. ERROR indicates something, possibly transient, going wrong. Every now and then it's not a big deal and can wait until later, but a rapid accumulation could mean something more serious is going on. INFO contained information about the state of the system, such as general measures of activity and other signals to indicate the system is working as expected. Most of our metrics capture was instrumented based off these statements.

In terms of the messages, we rapidly evolved the quality of the messages. For something like the aforementioned configuration error, the system initially just spat out an "Unexpected error" and a module name. The first improvement then stated something like "invalid configuration value" and finally we ended up on a message that stated the value was incorrect, identified which configuration value was wrong, and had a code that referenced documentation and escalation owner.

When all was said and done, we'd reduced our downtime from hours per year to less than 5 minutes, eliminated over 95% of our pages, and reduced escalations to Engineering from several days per week to a level where it was hard to remember the last one.

As the head of Engineering, I had to fight an uphill battle against the product & sales team for almost a year to make all of this happen, but I was fully vindicated when we were acquired and our operational maturity was lauded during the due diligence process.

peteradio · 3y ago

You know all that work was worth it when you get a good lauding.

dgunay · 3y ago

Going through something like this as a SWE at a startup. Lots of noise in our alerts and logging, so alert fatigue is a real problem. Do you have any advice on navigating this scenario (esp. negotiating with product to get monitoring and ops in a usable state)?

okdood64 · 3y ago

> When the oncall gets paged, an SLO should be in jeopardy in a way that requires immediate measures

> No SLO in jeopardy, or no immediate measure that needs to be taken?

A little contradictory here. Maybe your "or" should've been an "and". But anyways:

I can think of several scenarios in which an SLO is not in jeopardy but you should still get paged. All of them boil down to "Yes, it is important for the business, our users, our brand" but still not worth having an SLO around because it's either 1) hard to measure or 2) too difficult to implement.

In an ideal world with infinite engineering resources on your team, you could have and _maintain_ an SLO for every part of the business that your system affects. In the real world, trade-offs need to be made and certain key SLOs prioritized.

raldi · 3y ago

The first one is "p and q"; the negation of that is "(not p) or (not q)".
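
This is De Morgan's law. A brute-force check over every truth assignment, with `p` standing for "an SLO is in jeopardy" and `q` for "immediate measures are required":

```python
from itertools import product


def negation_matches_de_morgan() -> bool:
    """Check that not (p and q) equals (not p) or (not q) for all p, q."""
    return all(
        (not (p and q)) == ((not p) or (not q))
        for p, q in product([False, True], repeat=2)
    )


print(negation_matches_de_morgan())  # prints True
```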

Could you describe one of the scenarios you're envisioning?

rtkwe · 3y ago · 5 in thread

It's a constant pain of mine to try to get people to stop having business-as-usual or "successfully completed $PROCESS" emails come out of our batch processes on our teams at work. They absolutely drown my inbox, so I'm forced to filter them, and then the actual failures get buried in the unchecked "batch spam" folders.

justin_oaks · 3y ago

I had a boss who had an inbox with literally hundreds of thousands of unread emails. A good chunk of those emails were "success" messages from batch processes.

It's quite correct to send a "success" message when a batch process is completed successfully, but it's quite wrong to send that message to a human. It should be sent to a machine that should translate a missing success message into an error message/alert for humans to respond to.

For example, I have a set of nightly backup jobs. The last step of each backup process is to send a success message to my monitoring system. I only get a "Missing Backup" alert when the monitoring system detects that it didn't receive the success message it expected for a particular backup.
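
A minimal sketch of that pattern, assuming the monitoring system simply remembers the last success time per job. The class and method names are hypothetical; a real setup would use whatever the monitoring system provides.

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Turns a missing success message into an alert."""

    def __init__(self, max_silence: timedelta):
        # How long a job may go without reporting success.
        self.max_silence = max_silence
        self.last_success: dict[str, datetime] = {}

    def record_success(self, job: str, when: datetime) -> None:
        # Called by the last step of each backup job.
        self.last_success[job] = when

    def missing(self, now: datetime) -> list[str]:
        # Jobs whose last success is older than the allowed silence.
        return sorted(
            job
            for job, seen in self.last_success.items()
            if now - seen > self.max_silence
        )
```

Humans only ever see the output of `missing()`; the success messages themselves never reach an inbox.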

My old boss didn't seem to understand the concept that people don't generally notice missing messages. Or he was too lazy/incompetent to use a monitoring system that could translate gaps in successes into errors.

rtkwe · 3y ago

Even that is utterly unnecessary, because we use ControlM for basically all of the batch work in my area that I know of, and there's already automation that opens an Incident on a job failure that can flow into the whole on-call system! If a job or cycle is critical and needs to finish by a certain time, you can set up messages to go out at that time and everything.

hevans66 · 3y ago

My pet peeve is these $PROCESS notifications that go to slack channels. I worked at a company that had an #engineering_humans slack channel because we got chased out of #engineering by bots.

justin_oaks · 3y ago

I'm fine if they go to THEIR OWN slack channel. Then I can mute or leave that channel.

Of course, it's a different problem if those notifications have a mix of actionable and non-actionable messages (e.g. both success and error messages). Then it's a signal/noise problem.

WrtCdEvrydy · 3y ago

The ones that push my buttons are the alarms that have no docs attached, so when they go off at 2AM, they just get muted until someone comes in and complains at 6AM.

tra3 · 3y ago · 4 in thread

I need to sit down and go through the math again, I got lost in the middle somewhere. All I know is our alerts are way too noisy now to the point where they are useless.
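
For anyone else retracing it, the core of the math is Bayes' rule applied to an alert: what fraction of the alerts that fire correspond to a real problem. A short sketch follows; the numbers in the usage note are made up for illustration and are not the article's.

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, base_rate: float
) -> float:
    """Probability that something is really wrong, given that the alert fired.

    sensitivity: P(alert fires | problem)
    specificity: P(alert silent | no problem)
    base_rate:   P(problem) at any given check
    """
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)
```

With a check that is 99% sensitive and 99% specific, and a problem present in 0.1% of checks, `positive_predictive_value(0.99, 0.99, 0.001)` comes out at roughly 0.09: about nine in ten alerts are false, even though the check itself looks excellent.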

hevans66 · 3y ago

Yes! This. This has happened at at least two previous companies I have worked at. Everybody sets up thresholds on every possible Datadog metric and alerts become useless. That's part of the ethos of monitoring at my current company. We only set up alerts through https://heiioncall.com/ that we are convinced you absolutely need to look at right now. Anything that is not that gets shoved to a slack channel (that I have long since muted).

nh23423fefe · 3y ago

I dunno, the article doesn't seem to want me to understand. It's just another "here's a random stats calculation you can't perform in your head, isn't English a bad way to describe this calculation?!?!? Your intuition sucks when I don't explain myself....."

sparrish · 3y ago

Alert fatigue... it's common when alerts are non-actionable and it causes a lot of downtime.

bluGill · 3y ago

We write up stories to fix them, and upper management tracks progress on completion so they are not buried in the backlog.

yamtaddle · 3y ago · 4 in thread

> When presented with this tradeoff, the path of least resistance is to say “Let’s just keep the threshold lower. We’d rather get woken up when there’s nothing broken than sleep through a real problem.” And I can sympathize with that attitude. Undetected outages are embarrassing and harmful to your reputation. Surely it’s preferable to deal with a few late-night fire drills.

> It’s a trap.

> In the long run, false positives can — and will often — hurt you more than false negatives. Let’s learn about the base rate fallacy.

Not sure about anyone else, but speaking of alarms, this style of writing trips my "self-promoting snake-oil Internet bullshitter" alarm. It's like nails on a damn chalkboard, and if you're writing like this, you've already lost me; however, maybe I ought not be pointing that out, since signals are nice to have.

Incidentally, I wasn't sure which way the author was gonna go with the core analogy. My smoke alarms have false-alarmed probably 10x as much as my car alarm, even counting times one of us has hit the alarm button on the fob by accident. I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a freezer, as I have with a smoke alarm.

(If I were writing like the author I suppose that last part would have read:

"I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a chest freezer.

I have, with a smoke alarm."

Except also I'd have found a way to use "we" and "you" a bunch.)

quickthrower2 · 3y ago

I see a lot of this style of writing in articles submitted on HN. I think they are just trying to make the writing more lively, not trying to BS.

A trope of this style is “{interesting half story} but more on that later”.

I don’t think it is a big deal and I don’t see much self promotion here other than vanilla blogging, i.e. sounds like this person is knowledgeable let’s check their bio.

burnished · 3y ago

I'm not sure what you are responding to in the quoted text, but after reading the article I think I can assure you that the author isn't selling you anything more salacious than you would find in a more interesting introduction to probability and statistics lecture.

raldi · 3y ago

What do you mean by "this style of writing"? What aspects of the quote do you object to?

jacquesm · 3y ago

At a guess the bit 'let's learn about the base rate fallacy'.

Syonyk · 3y ago · 3 in thread

Now, if you're annoyed by the false positive rate on your actual smoke alarms, go replace the one nearest your kitchen with a photoelectric type, not the standard ionization type that's cheaper, the default style installed, and ought to be illegal in homes (IMO).

There's been quite a bit of research done, generally easy to find if you look, that talks about the difference and tests them, but the short summary:

- Ionization type sensors detect the products of fast flaming combustion and "things cooking in the kitchen." Your oven, if a bit dirty, will reliably trip an ionization type. They are quick on the draw for this. The downside is that they're very, very poor at detecting the sort of slow, smoking, smoldering combustion that is associated with house fires that kill people in the middle of the night.

- The photoelectric type is very good at detecting smoke in the air - but it isn't nearly as prone to false triggers on ovens, a burner burning some spills off, etc.

They've been A/B tested in a wide variety of conditions, and in some cases, the ionization type is a bit quicker. In other cases, the ionization type is slower, by time ranges north of half an hour - I've seen some test reports where there was a 45 minute gap, while the photoelectric type was going off, before the ionization type fired!

In general, "rapid fires during the day" are somewhat destructive to property, but rarely kill people. If your kitchen catches on fire while you're cooking, it may burn the house down, but generally people are able to get out.

The fires that kill people are "slow starting fires during the night" - the sort that smolder for potentially hours, often slowly filling the house with toxic smoke, before actually bursting into open flames. On this sort of fire, the photoelectric type will fire long, long before the ionization type - in some cases, they get around to alarming quite literally "after the occupants are dead from the smoke."

Using smoke alarms as a way to talk about monitoring systems is nice, but in terms of actual smoke detectors, get at least a few photoelectric sorts in the main areas of your home.

Do not get the "combined sensor" sort, since these tend to be and-gated and the worst of both worlds.

Edited to add some resources:

A presentation on the matter from a while back by one of the experts in this field: https://wahigroup.com/Resources/Documents/Ion%20vs%20Photo%2...

Another paper: https://www.semanticscholar.org/paper/Detection-of-Smoke-%3A...

> Full-scale fire tests are carried out to study the effectiveness of the various types of smoke detectors to provide an early warning of a fire. Both optical smoke detectors and ionization smoke detectors have been used. Alarm times are related to human tenability limits for toxic effects, visibility loss and heat stress. During smouldering fires it is only the optical detectors that provide satisfactory safety. With flaming fires the ionization detectors react before the optical ones. If a fire were started by a glowing cigarette, optical detectors are generally recommended. If not, the response time with these two types of detectors are so close that it is only in extreme cases that this difference between optical and ionization detectors would be critical in saving lives.

bluGill · 3y ago

The law requires you to have both types, for good reason. Either alone will detect less than half of all house fires.

Dual sensors are not and-gated. While nobody will admit what algorithm they use, they detect most fires, unlike the single-sensor type.

Syonyk · 3y ago

Where does the law require both types? I'm not aware of any housing codes specifically requiring photoelectric types, and any house I've looked at, including mine, came with purely ionization types. Though it's been a few years, and it may have changed recently - this is less of a niche concern lately.

As for dual sensors and gating... do you actually trust your life to "nobody will admit what algorithm they use"?

My house has all the smoke detectors wired together (they're on an AC circuit, with battery backup, with a signal line running between them all), so I have some photoelectric and some ionization, depending on where in the house they are.

riceart · 3y ago

Lol “the law” .. what law? Maybe in some dumb ass jurisdiction - but you’re a bit full of yourself if you think where you happen to live is “the law”.

dfox · 3y ago · 2 in thread

Smoke (or fire in general) alarms are not a good example of a thing with high specificity. You perceive it that way, but what you see is the result of somebody getting paged about it and then checking (preferably physically, but also through e.g. CCTV) whether there really is an emergency situation and canceling the alarm before its escalation timeout. Apparently, for a typical commercial building, false fire alarms are more or less a weekly occurrence.

Edit: in large-scale fire alarm systems there are also rules about combinations of triggered sensors that cause immediate escalation (for example, if there is smoke and elevated temperature in two adjacent zones, it probably is not a false alarm; often the rules even take into account the failure modes of the physical alarm loop wiring). This is an interesting idea for IT monitoring: page someone only when multiple metrics indicate an issue.
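
A rough sketch of what that could look like for IT monitoring, assuming each rule is a set of signals that must all be firing at once. The signal names in the usage note are invented for illustration.

```python
def should_page(firing: set[str], rules: list[set[str]]) -> bool:
    """Page only if every signal in at least one rule is firing.

    Empty rules are ignored so that they cannot page unconditionally.
    """
    return any(rule and rule <= firing for rule in rules)
```

For example, with `rules = [{"error_rate_high", "latency_high"}, {"disk_full"}]`, a lone `latency_high` does not page, but `latency_high` together with `error_rate_high` does, and so does `disk_full` on its own.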

tobyjsullivan · 3y ago

It was an interesting example and maybe deserved a few more caveats to actually serve the point. After all, we've all heard a fire alarm of some sort in the past year (if not the past month) but how many were actual fires? (Technically the author said smoke which helps but not really.)

Where I was expecting the author to go:

- Clearly was talking about residential smoke detectors, not commercial. That could have been explicit.

- Smoke detectors do have a high false-positive rate but almost always at the right time. A home smoke alarm going off while I'm cooking is quite different to a smoke alarm going off when I'm sleeping. To the author's point, there are very few false positives while I'm sleeping so when they happen, I'm getting up.

Speaking of the commercial context, I wonder what sort of businesses would get a lot of false alarms and how that varies across industries.

mjevans · 3y ago

I have been _plagued_ by smoke alarms that treat a low battery as a sign of a fire. To the point that I am trained by their crying wolf that it is _always_ a false alarm. Particularly when I'm asleep and they've gone off.

I would actually prefer if rules mandated they could only have large capacitors and just NOT CARE if the power goes out.

Next would be to require a sensor pick up an area that's IR hot AND smoke to go off. I'm sick of bathroom steam sometimes setting them off too.

Finally, ONLY FOR EMERGENCY would the loud and annoying cry be allowed. Tests, low battery, anything not indicating a clear and immediate threat to life should be a low noise, low light, indication. Maybe a 2 second low-quality sound clip that says 'bat' at a soft voice volume with a strobe at the end of the voice (when a human would be looking for the noise). Fog/Steam/etc, E.G. possible fire without detected heat but at a weak detection level, could also use the 'info' level of alert, not the DANGER level.

compumike · 3y ago · 1 in thread

I do like how the author presents the case for how damaging false positives can be in SRE monitoring. But, FYI, it can get worse if these monitors are hooked to self-actuating feedback loops! I recently wrote about a production incident on the Heii On-Call blog, in the context of witnessing how Kubernetes liveness probes and CPU limits worked together to create a self-reinforcing CrashLoopBackOff [1], partially because the liveness probe thresholds (timeoutSeconds and failureThreshold fields) were too aggressive.

We have a similar message about setting monitoring thresholds in our documentation [2] because users have to explicitly specify a downtime timeout before they’re alerted about their website / API endpoint / cron job being down. The timeout / "grace period" is necessary because in many cases a failure is some transient network glitch which will fix itself before a human is alerted.

If you make the timeout too short, you’ll get lots of false positive alerts, and as the article says, your on-call engineers will be overwhelmed or just start ignoring the alerts.

If you make the timeout too long, it just takes that many minutes of downtime longer before you find out about it.

It may sound counterintuitive, but the latter is usually preferable. :)
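
The tradeoff can be sketched with a toy evaluator over a series of check results. This is an illustration under assumed inputs (timestamps in seconds, one boolean per check), not how any particular monitoring product implements it.

```python
def downtime_alerts(
    checks: list[tuple[float, bool]], grace_period: float
) -> list[float]:
    """Return the timestamps at which a human would be alerted.

    checks: (timestamp_seconds, is_up) pairs, sorted by time.
    An alert fires once per outage, and only after the endpoint has
    been continuously down for at least grace_period seconds.
    """
    alerts = []
    down_since = None
    alerted = False
    for ts, is_up in checks:
        if is_up:
            # Recovery resets the outage and re-arms the alert.
            down_since, alerted = None, False
            continue
        if down_since is None:
            down_since = ts
        if not alerted and ts - down_since >= grace_period:
            alerts.append(ts)
            alerted = True
    return alerts
```

With a zero grace period, a single failed check that recovers by the next one still alerts somebody; with a two-minute grace period it is silently absorbed, at the cost of hearing about a real outage two minutes later.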

[1] https://heiioncall.com/blog/kubernetes-liveness-probes-and-c...

[2] https://heiioncall.com/docs

jrockway · 3y ago

So this is a pretty common cascading failure scenario. Even ignoring CPU limits, if your service gets slow when it's over capacity, this will almost always happen. Latency increases to the point where liveness probes fail, causing the size of the fleet to decrease because of liveness-induced restarts, causing the other replicas to experience more load, causing them to become slow enough to fail liveness probes, and soon enough, everything is dead.

Kubernetes can only do so much for you here. Liveness probes are designed to restart categorically broken software; for example, a combination of two requests causes no further requests to be handled. Maybe that's rare enough that a simple restart is an improvement over a replica that times out all requests directed at it. (You can fortunately see this behavior in real-world scenarios. You can also architect your application to self-check, of course, but the common "if path == '/healthz' { response.WriteHeaders(200) }" isn't this.) Readiness probes can shed load, but only by loading the other replicas by taking this replica's endpoints out of the service until things calm down. If the system as a whole doesn't have enough capacity, then picking one replica and saying "you can rest for 5 minutes" is just going to cause the other replicas to become overloaded and for the whole system to eventually fail.

There are other techniques here that work better.

Rate limiting is very common inside Big Tech; when a calling service induces too much load, it's told to simply go away via a fast path. That can prevent the thundering herd by allowing a % of requests to make progress, while other requests are rejected. Some progress is made while the system is degraded, and if there is spare capacity and a buffer, eventually the buffer is drained. (This post is too long to rant about buffering in distributed systems and what backpressure is, but if a buffer size of 1 can become full, then a buffer of any size can become full. So buffering is rarely a solution, but often the cause of outages.)
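
One common way to implement that fast-path rejection is a token bucket. This is a minimal single-threaded sketch with the clock passed in explicitly; the comment does not say which algorithm is used in practice, so the choice of a token bucket is an assumption.

```python
class TokenBucket:
    """Admits up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Fast-path rejection: the caller is told to go away.
        return False
```

Rejected requests cost almost nothing to refuse, which is what lets the admitted fraction keep making progress.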

Circuit breaking is also common, where when a significant fraction of requests end with 5xx (usually a timeout), the load balancer just fast-paths a 5xx response for that replica's share of requests. This actually reduces load on the system, allowing it to process some requests instead of becoming a fleet of replicas in CrashLoopBackOff.
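
A simplified sketch of a circuit breaker for one replica follows. The window size, failure ratio, and cooldown are arbitrary illustrative parameters, and real load balancers differ in the details (for example in how they probe a replica before closing the circuit again).

```python
from collections import deque


class CircuitBreaker:
    """Fast-fails requests after too many recent failures."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.5,
                 cooldown: float = 30.0):
        self.results = deque(maxlen=window)  # True = ok, False = 5xx
        self.max_failure_ratio = max_failure_ratio
        self.cooldown = cooldown
        self.opened_at = None  # time the circuit opened, or None if closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False  # caller should return a 5xx immediately

    def record(self, ok: bool, now: float) -> None:
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        failure_ratio = self.results.count(False) / len(self.results)
        if window_full and failure_ratio >= self.max_failure_ratio:
            self.opened_at = now
```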

CPU limits are another complicating factor, but not much of one. Every piece of software runs with a CPU limit; only a finite number of CPUs can fit in your data center, or the Universe for that matter. A common problem that people run into is multithreaded software that doesn't understand that it's CPU limited. This does not cause failures, but typically induces a weird tail latency. CPU limits are enforced at discrete intervals; every 100ms, you're allowed to use 1 CPU. But you're also allowed to use 10 CPUs for 10ms, and sit idle for 90ms. (The system will enforce this; you may want to do work on 10 CPUs, but you're going to sleep after that first 10ms burst.) Usually, your system can be architected with CPU limits in mind; for example, by setting something like GOMAXPROCS to the CPU limit instead of the number of physical CPUs, avoiding the ability to consume the time allotted before the accounting interval ends. But these mistakes very rarely lead to cascading failure, just very confusing 99.9%-ile latency numbers when under load, when a request spans that forced-idle interval.
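
The accounting is simple arithmetic: a limit of 1 CPU with a 100ms period is a budget of 100ms of CPU time per period, and the more threads run in parallel, the faster that budget is spent. A sketch, with a hypothetical function name and simplified to ignore scheduling overhead:

```python
def throttle_profile(
    cpu_limit: float, period_ms: float, busy_threads: int
) -> tuple[float, float]:
    """Return (run_ms, throttled_ms) of wall-clock time per period.

    cpu_limit:    CPUs allowed, e.g. 1.0
    period_ms:    accounting interval, e.g. 100.0
    busy_threads: threads running in parallel, each on its own CPU
    """
    quota_ms = cpu_limit * period_ms  # CPU-time budget per period
    # The budget drains busy_threads times faster than wall-clock time.
    run_ms = min(period_ms, quota_ms / busy_threads)
    return run_ms, period_ms - run_ms
```

`throttle_profile(1.0, 100.0, 10)` gives `(10.0, 90.0)`: ten busy threads run for 10ms and are then forced idle for 90ms, while a single thread under the same limit is never throttled.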

Anyway, I have laid all of this foundation so I can get to my rant. There are a lot of "Kubernetes best practices" out there, and two that I have run into are that all applications must have a liveness probe, and that all applications must run at a Guaranteed QoS (and have cpu request == cpu limit != 0). These are interesting things to think about, but not a guaranteed way to enhance reliability (or lower cost). Your workload might be burstable, in which case a Burstable QoS might be exactly what you need; you trade reliability (a guarantee that all containers will be able to use a certain amount of CPU) for efficiency (you can dip into foobar service's CPU shares when barbaz needs to do a rare high-CPU activity). Liveness probes can be good too, where you have a single-threaded event loop that can get wedged accidentally, and restarting is the only way out. But, neither practice can be blindly applied to every workload that can be run in a container.

cbarrick · 3y ago · 1 in thread

I think this article is missing the forest for the trees.

The article is about finding the appropriate sensitivity of alerts on some signal in order to maximize the predictive value.

But you should care more about the quality of the signals you are monitoring than about the sensitivity of your thresholds.

The article mentions load-average as an example signal, but to me, that's a poor signal to monitor. Instead, if your SLO is defined for error rate, alert on error rate.

Alerts on your SLO will have a high predictive value for predicting violations of your SLO, by definition. The tunable parameter here is the time window, not the threshold. E.g. if your error budget is defined for a 30d window, you may want alerts at the SLO threshold for 24h and 1h windows.
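
As a sketch of that idea, assuming error and request counts are available per lookback window (the window labels and counts in the usage note are illustrative):

```python
def violated_windows(
    counts: dict[str, tuple[int, int]], allowed_error_rate: float
) -> list[str]:
    """Return the windows whose error rate exceeds the SLO threshold.

    counts: window label -> (errors, total_requests) over that window
    allowed_error_rate: e.g. 0.001 for a 99.9% success SLO
    """
    violated = []
    for label, (errors, total) in counts.items():
        rate = errors / total if total else 0.0
        if rate > allowed_error_rate:
            violated.append(label)
    return violated
```

With `counts = {"1h": (30, 10_000), "24h": (120, 240_000)}` and an allowed error rate of 0.001, only the 1h window is in violation: the threshold is the same for both windows, and the window length is what decides how quickly and how noisily the alert reacts.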

Alert on symptoms, not causes.

jacquesm · 3y ago

> But you should care more about the quality of the signals you are monitoring than about the sensitivity of your thresholds.

This is so true. Case in point: Growatt inverters have - like every other inverter - a maximum voltage on the grid connection at which they will shut down. They're pretty trigger happy about this and fail to take into account the resistance of the feed wire of the inverter to the (much lower impedance) grid hookup. As a result even on cabling sized properly for the interconnect they tend to falsely trigger well before the point where they should. The only way to avoid this problem is to either hack into the inverter somehow (which I've so far failed to do) or to use oversized cables (which isn't always an option).

The sensitivity is fantastic, the quality of the signal is hopeless. Obviously they err on the side of caution, but the margin is so ridiculously large that you end up losing a lot of usable power for no reason at all. At the least, it should allow a resistance for the interconnect to be specified, so that it can take into account the voltage drop across that wire, which at 10A is appreciable for even short runs of fairly beefy cable.

mertd · 3y ago

The post is somewhat incomplete without also discussing the cost of the wrong decision.

You obey the smoke alarm because the cost of ignoring the alarm when it is a true positive is potentially infinite (you die). You ignore the car alarm because (1) most likely it is a false positive but also (2) most likely it is somebody else's car.

bdamm · 3y ago

Both of these are exactly the kind of problem where our AI future is going to deliver cost effective modern alternatives. Primitive sensors wake up more sophisticated analyzers and use deep sensors (including video) to determine if there is a real problem.

Witness companies like Rivian triggering car alarms on aggressive behavior detected from ML on video. Don't even need to touch the car.

gmuslera · 3y ago

Some complementary reading could be My Philosophy on Alerting ( https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa... ) and https://how.complexsystems.fail/

In any case, not all signals are the same. Most systems have a lot of interacting components, and what turns out to be dangerous is usually a combination of factors, but in the end, what determines whether something was dangerous or not is whether the system is doing what it should. You can set thresholds by guesswork, but you must check them against whether the system actually works.

And alerts should be actionable too, at least the ones that page, as opposed to slow-day notifications or metrics that give context to perceived problems and could take the guesswork out of the thresholds.

yafbum · 3y ago

I'd like to know more about the chip designer who, perhaps unwittingly, created the alarm-filled soundscape of most American cities https://youtu.be/tmCnleSBAIg. Would love to know more about the composition process that went into it.

deathanatos3y ago

Nothing in the article is wrong, per se, but it all seems awfully disconnected from the realities I see in monitoring and alerting?

The end advice is right: you want to build the smoke detector, not the car alarm. But … getting that done, now that's the trick. If the org has car alarms, that's the same org whose PMs will not see the "impact" of that ticket to get the monitoring made ship shape, and that ticket will be backlog-icebox-graveyard'ed.

I've had to get a number of "technical" non-technical roles to try to see that, no, the monitoring software can not automatically generate¹ metrics around your application. Yes, you have to actually instrument the code to add those!

Then that gets combined with systems that are just … not the best? at their job: Datadog has so many rough corners around statistics, where graphs will alias, graphs will change shape depending on zoom, units are a PITA or nowhere to be seen, etc. Sumo has a god-awful UI — literally tabs inside tabs, and I can't copy the URL?! — and barely understands structured logging. Splunk is marginally better. Pagerduty permits only the simplest handling of events — don't limit me to a handful of tailored rules: rules are logic, logic is a function: give me WASM. And I want usable business-hours only alerts².

Self-hosted systems are perpetually met with "that's not our core focus", but nobody ever seems to convert the cost of managed monitoring/alert systems into "number of FTE that could be hired to maintain a self-hosted system".

(Oddly, the example in the article is a car alarm. Load avg. is, IMO, a useless metric. Better to measure CPU consumption and IOPS consumption, separately, or probably better, more derivative stats around the things doing the IOPS/CPU.)

¹yes, most systems come with some collectors to get system-level stuff like CPU usage, etc. I mean metrics specific to your application.

²PD claims to support them, but in practice, they don't work: alerts received off hours don't alert, true … but they never alert once business hours resume, either! If you're in an org trying to dig itself out of a mess, you need them to not die in the low-prio pile.

(Ugh. Give me ACL systems in these systems that don't suck: PD locks the routing rules behind like "Admin", and security doesn't want to grant the rank and file "Admin", and so 80% of my devs have no idea how the system works because they're not allowed to see how the system works! Give me the ability to do a WMA business-days-only line for diurnal patterns! The list just goes on and on…)
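The behavior asked for in footnote ² (hold an off-hours low-priority alert, then actually deliver it when business hours resume) can be sketched in a few lines. This is a standalone illustration, not PagerDuty's API or configuration; the function name and the 09:00 to 17:00, Monday to Friday window are assumptions.

```python
from datetime import datetime, time, timedelta

# Assumed business window: weekdays, 09:00 to 17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def next_delivery(raised_at: datetime) -> datetime:
    """Return when a low-priority alert raised at `raised_at` should notify someone."""
    is_weekday = raised_at.weekday() < 5  # Monday is 0, Sunday is 6
    if is_weekday and BUSINESS_START <= raised_at.time() < BUSINESS_END:
        return raised_at  # inside business hours: deliver immediately

    # Otherwise defer to the start of the next business day.
    day = raised_at.date()
    if not is_weekday or raised_at.time() >= BUSINESS_END:
        day += timedelta(days=1)
    while day.weekday() >= 5:  # skip Saturday and Sunday
        day += timedelta(days=1)
    return datetime.combine(day, BUSINESS_START)


# An alert raised Friday evening is held until Monday morning rather than dropped.
print(next_delivery(datetime(2023, 1, 6, 18, 0)))
```

The key design choice is that the function always returns a delivery time, so a deferred alert cannot silently disappear into the low-priority pile.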
