Leveraging AI for efficient incident response

(engineering.fb.com)

110 points · Amaresh · 1y ago · 55 comments

45 comments · 16 top-level

LASR · 1y ago · 10 in thread

We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.

One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.

These playbooks are very carefully written and maintained by people. The current generation of models are pretty much post-human in following them, performing reasoning and suggesting mitigations.

We tried indexing just a bunch of incident Slack channels and the result was not great. But with explicit documentation, it works well.

Kind of proves what we already know: garbage in, garbage out. But also, other functions (e.g. PM, Design) have tried automating their own workflows, and it doesn't work as well.

nevon · 1y ago

I'm really curious to hear more about what kind of thing is covered in your playbooks. I've often heard and read about the value of playbooks, but I've yet to see it bear fruit in practice. My main work these past few years has been in platform engineering, and so I've also been involved in quite a few incidents over that time, and the only standardized action I can think of that has been relevant over that time is comparing SLIs between application versions and rolling back to a previous version if the newer version is failing. Beyond that, it's always been some new failure mode where the resolution wouldn't have been documented because it's never happened before.

On the investigation side of things I can definitely see how an AI driven troubleshooting process could be valuable. Lots of developers are lacking debugging skills, so an AI driven process that looks at the relevant metrics and logs and can reason around what the next line of inquiry should be could definitely speed things up.

twunde · 1y ago

Playbooks that I've found value in:

- Generic application version SLI comparison. The automated version of this is automated rollbacks (Harness supports this out of the box, but you can certainly find other competitors or build your own)

- Database performance debugging

- Disaster recovery (bad db delete/update, hardware failure, region failure)

In general, playbooks are useful for either common occurrences that happen frequently (e.g. every week we need to run a script to fix something in the app) or things that happen rarely but need a plan when they do (e.g. disaster recovery).
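For illustration, the SLI-comparison playbook could be sketched roughly like this. It is only a sketch under stated assumptions: the `VersionStats` shape, the error-rate SLI, and both thresholds are hypothetical, and it does not reflect how Harness or any other tool implements automated rollbacks.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    """Request counts for one application version (hypothetical shape)."""
    version: str
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Guard against division by zero for versions with no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(old: VersionStats, new: VersionStats,
                     max_increase: float = 0.01,
                     min_requests: int = 500) -> bool:
    """Return True when the new version's error rate exceeds the old one's
    by more than max_increase, and the new version has seen enough traffic
    for the comparison to mean something. Thresholds are assumptions."""
    if new.requests < min_requests:
        return False
    return new.error_rate - old.error_rate > max_increase


old = VersionStats("v41", requests=10_000, errors=20)   # 0.2% errors
new = VersionStats("v42", requests=2_000, errors=90)    # 4.5% errors
print(should_roll_back(old, new))  # True
```

A real version would compare more than one SLI (latency, saturation) and would probably want a statistical test rather than a fixed threshold.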

jononor · 1y ago

Expert systems redux? Being able to provide the expertise in the form of plain written English (or another language) will at least make it much more feasible to build them up. And it can also meaningfully be consumed by a human.

If it works well for incident response, then there are many use cases that are similar - basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to have documentation on the particular system. Say debugging of a building HVAC system.

nyrikki · 1y ago

Why won't it hit the same limits of the frame problem or the qualification problem?

Expert systems failed in part because of their inability to learn. HVAC is ladder logic, which I honestly haven't spent much time in, while LLMs are inductive.

It will be a useful tool, but expert systems had a very restricted solution space.

SoftTalker · 1y ago

I have found it rare that an organization has incident "playbooks that are very carefully written and maintained".

If you already have those, how much can an AI add? Or conversely, not surprising that it does well when it's given a pre-digested feed of all the answers in advance.

wredue · 1y ago

Meanwhile, we’ve tried AI products just for assigning incidents and are forced to turn them off because of how shitty of a job they do.

vvram · 1y ago

That's great to hear. What is your current tool chain in the effort? Do you have a structure for Playbooks and KBs you would recommend?

stenlix · 1y ago

Curious if you explored any external tools before building in-house? Looking to do something similar at my company

bamboozled · 1y ago

What does AI add to your playbooks?

snovv_crash · 1y ago

I'm guessing the being awake and fresh at 3am within a few seconds of the incident occurring part.

donavanm · 1y ago · 7 in thread

I'm really interested in the implied restriction/focus on “code changes.”

IME a very very large number of impacting incidents aren't strictly tied to “a” code change, if any at all. It _feels_ like there's an implied solution to tying running version back to deployment rev, to deployment artifacts, and vcs.

Boundary conditions and state changes in the distributed system were the biggest bugbear I ran into at AWS. Then below that were all of the “infra” style failures like network faults, latency, API quota exhaustion, etc. And for all the cloudformation/cdk/terraform in the world it's nontrivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.

vjeux · 1y ago

From my experience, the vast majority of reliability issues at Meta come from 3 areas:

- Code changes

- Configuration changes (this includes the equivalent of server topology changes like cloudformation, quota changes)

- Experimentation rollout changes

There have been issues that are external (like user behavior change for new year / world cup final, physical connection between datacenters being severed…) but they tend to be a lot less frequent.

All 3 big buckets are tied to a single trackable change with an ID, so this leads to the ability to do that kind of automated root cause analysis at scale.

Now, Meta is mostly a closed loop where all the infra and product is controlled as one entity so those results may not be applicable outside.
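As a rough illustration of why a trackable change ID matters: once every code, config, and experiment change is recorded with an ID and a timestamp, even a naive first pass can shortlist suspects by time. The sketch below is an assumption-heavy toy, not Meta's system (which this comment does not describe); the `Change` record, the ID formats, the 6-hour window, and the recency ordering are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    change_id: str        # hypothetical ID format
    kind: str             # "code", "config", or "experiment"
    landed_at: datetime


def candidate_changes(changes, incident_start,
                      window=timedelta(hours=6), top_n=5):
    """Return up to top_n changes that landed within `window` before the
    incident started, most recent first."""
    earliest = incident_start - window
    in_window = [c for c in changes
                 if earliest <= c.landed_at <= incident_start]
    return sorted(in_window, key=lambda c: c.landed_at, reverse=True)[:top_n]


incident = datetime(2024, 8, 1, 12, 0)
changes = [
    Change("D1001", "code", datetime(2024, 8, 1, 11, 40)),
    Change("C2002", "config", datetime(2024, 8, 1, 9, 15)),
    Change("E3003", "experiment", datetime(2024, 7, 31, 22, 0)),  # outside window
    Change("D1004", "code", datetime(2024, 8, 1, 12, 30)),        # after incident
]
for c in candidate_changes(changes, incident):
    print(c.change_id, c.kind)
# D1001 code
# C2002 config
```

Pure time proximity is a weak signal on its own; it only narrows the search space that a smarter ranker (or a human) then works through.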

donavanm · 1y ago

Interesting. It sounds like “all” service state management (admin config, infra, topology) is discoverable/legible for Meta. I think that contrasts with AWS, where there is a strong DevTools org, but many services and integrations are more of an API-centric service-to-service model with distributed state, which is much harder to observe. Every cloud provider I know of also has an (externally opaque) division between “native” cloud-service-built-on-cloud-infra and (typically older) “foundational” services that are much closer to “bare metal” with their own bespoke provisioning and management. E.g. EC2 has great visibility inside of their placement and launch flows, but it'll never look like/interop with the cfn & cloudtrail that ~280 other “native” services use.

Definitely agree that the bulk of “impact” is back to changes introduced in the SDLC. Even for major incidents infrastructure is probably down to 10-20% of causes in a good org. My view in GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services it was mostly edge/infra side, and I focused the last few years specifically on major incident management.

I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.

vitus · 1y ago

> this leads to the ability to do that kind of automated root cause analysis at scale.

I'm curious how well that works in the situation where your config change or experiment rollout results in a time bomb (e.g. triggered by task restart after software rollout), speaking as someone who just came off an oncall shift where that was one of our more notable outages.

Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.

> There have been issues that are external (like ... physical connection between datacenters being severed…) but they tend to be a lot less frequent.

That's interesting to hear, because my experience at Google is that we'll see a peering metro being fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or with a capacity shortfall happen much much more frequently.

(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)

re-thc · 1y ago

> IME a very very large number of impacting incidents aren't strictly tied to “a” code change, if any at all

Usually this implies there are bigger problems. If something keeps breaking without any change (config / code) then it was likely always broken and just ignored.

So when companies do have most of the low hanging fruit resolved it's the changes that break things.

I've seen places where everything is duct-taped together BUT it still only breaks on code changes. Everyone learns to avoid stressing anything fragile.

donavanm · 1y ago

See other child reply upthread, lots of service-to-service style interactions that look more like distributed state than a CR. And my view was across an org scope where even “infrequent” quickly accumulated. AWS is on the order of 50,000 SDEs, running 300 public services (plus a multiple more internal), and each team/microservice with 50 independent deployment targets.

UK-AL · 1y ago

At my place 90% of them are 3rd parties going down, and you can't do much other than leave. But the new 3rd parties are just as bad. All you can do is gracefully handle failure.

lmeyerov · 1y ago

Interestingly, with the move to IaC, diagnosing at the level of code change makes increasing sense. It's impressive to see their results given that perspective. Not obvious!

Separately, we have been curious about extending louie.ai to work not just with logs/DBs, but to go in the reverse direction ('shift right'): talk directly to a live OSAgent like an EDR or OSQuery, whether on a live system or a cloud image copy. If of interest to any teams, would love to chat.

nyellin · 1y ago · 3 in thread

We've open sourced something with similar goals that you can use today: https://github.com/robusta-dev/holmesgpt/

We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.

What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.

If anyone is curious, I did a webinar with PagerDuty on this recently.

BodyCulture · 1y ago

https://news.ycombinator.com/item?id=41327430

BodyCulture · 1y ago

Can we see the recording of this webinar somewhere?

nyellin · 1y ago

Here you go: https://www.youtube.com/live/Jml1hk6I5Wo?si=YbjJKRkO4yf0bOlx

And thanks for submitting!

AeZ1E · 1y ago · 2 in thread

nice to see meta investing in AI investigation tools! but 42% accuracy doesn't sound too impressive to me... maybe there's still some fine-tuning needed for better results? glad to hear about the progress though!

Kirth · 1y ago

Really, a tool where, in 42% of incident responses, the on-call engineers are greeted by a pointer that likely lets them resolve the incident almost immediately and move on, rather than spending potentially hours figuring out which component they need to address and how, isn't impressive to you?

chaoz__ · 1y ago

It depends on whether it's generating 58% of answers that lead on-call engineers down the wrong path. Honestly, it's more of a question -- I did not read the article deeply.

ketzo · 1y ago · 2 in thread

This is really cool. My optimistic take on GenAI, at least with regard to software engineering, is that it seems like we're gonna have a lot of the boring / tedious parts of our jobs get a lot easier!

benreesman · 1y ago

Claude 3.5 Sonnet still can’t cut me a diff summary based on the patch that I’m generally willing to hand in as my own work and it’s by far the best API-mediated, investor-subsidized one.

Forget the diff, I don’t want my name on the natural language summary.

viraptor · 1y ago

You mean it doesn't understand the change you've made based on the diff?

minkles · 1y ago · 2 in thread

I'm going to point out the obvious problem here: 42% RC identification is shit.

That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.

Of course you can't say anything about an ML model being bad that you are promoting for your business.

donavanm · 1y ago

No. You're missing the UX forest for the pedantry trees here. I've worked on a team that did similar change detection with little to no ML magic. It matters that it's presented as a hint (“top five suggested”) and not THE ANSWER. In addition it's VERY common to do things like present confidence or weight to the user. And that's why there's a huge need for explainability.

And this is just part of the diagnosis process. The system should still be providing breadcrumbs or short cuts for the user to test the suggested hypothesis.

Which is why any responsible system like this will include feedback loops and evaluation of false positive/negative outcomes and tune for sensitivity & specificity over time.
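To make the sensitivity/specificity point concrete, here is a minimal sketch of the bookkeeping such a feedback loop needs. The counts are made up, and treating each (incident, candidate change) pair as one labeled example judged against the postmortem is an assumption for illustration, not something the article specifies.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """tp/fp/tn/fn are counts of (incident, candidate change) pairs,
    judged against the root cause recorded in the postmortem."""
    # Share of real root causes that the system flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Share of innocent changes that the system left alone.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up counts from a hypothetical review period.
sens, spec = sensitivity_specificity(tp=42, fp=100, tn=900, fn=58)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
# sensitivity=0.42 specificity=0.90
```

Tracking both numbers over time shows whether tuning is trading missed causes for fewer false leads, or the other way around.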

minkles · 1y ago

No I'm not. It's crap.

I have about 30 years' experience in both hard engineering (electronics) and software engineering, particularly in failure analysis and reliability engineering. Most people are lazy and get led astray by false information. This is a very dangerous thing. You need a proper conceptualisation framework like a KT problem analysis to eliminate incorrect causes, keep people thinking rationally, and get your MTTR down to something reasonable.

pants2 · 1y ago · 1 in thread

> The biggest lever to achieving 42% accuracy was fine-tuning a Llama 2 (7B) model

42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!

teleforce · 1y ago

Yes, very interesting potential. It looks like the accuracy can be increased considerably, because Llama 3.1 with 405B parameters performs very similarly to the latest GPT-4o.

BurningFrog · 1y ago · 1 in thread

PSA:

9 times out of 10, you can and should write "using" instead of "leveraging".

fire_lake · 1y ago

Given how AI can automate and scale bad decisions, isn’t leveraging the right word here?

coding123 · 1y ago · 1 in thread

AI 1: This user is suspicious, lock account

User: Ahh, got locked out, contact support and wait

AI 2: The user is not suspicious, unlock account

User: Great, thank you

AI 1: This account is suspicious, lock account

ElevenLathe · 1y ago

Luckily I subscribe to my own consumer AI service to automate all this for me. To paraphrase The Simpsons: "AI: the cause of and solution to all life's problems."

mafribe · 1y ago

The paper goes out of its way not to compare the 42% figure with anything. Is "42% within the top 5 suggestions" good or bad?

How would an experienced engineer score on the same task?
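For reference, a "within the top 5" figure is a top-k accuracy, and one baseline it could be compared against is random guessing over the candidate changes. The sketch below uses made-up incidents, and the figure of 500 candidate changes per incident is purely hypothetical; the article gives no such number.

```python
def top_k_accuracy(ranked_suggestions, true_causes, k=5):
    """Fraction of incidents whose true cause appears in the first k
    suggestions. Both lists are aligned by incident."""
    if not true_causes:
        raise ValueError("need at least one incident")
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_suggestions, true_causes))
    return hits / len(true_causes)


def random_top_k_baseline(n_candidates, k=5):
    """Chance that k uniformly random picks (without replacement) include
    the single true cause among n_candidates."""
    return min(k, n_candidates) / n_candidates


# Made-up data: 4 incidents, each with a ranked list of change IDs.
suggestions = [
    ["a", "b", "c", "d", "e", "f"],
    ["g", "h", "i", "j", "k"],
    ["l", "m", "n", "o", "p"],
    ["q", "r", "s", "t", "u"],
]
truth = ["c", "z", "l", "f"]
print(top_k_accuracy(suggestions, truth))   # 0.5
print(random_top_k_baseline(500))           # 0.01
```

The human baseline asked about here would need the same measurement: give engineers the same incidents and candidate lists, and score their first five guesses the same way.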

TheBengaluruGuy · 1y ago

Interesting. Just a few weeks back, I was reading about their previous work https://atscaleconference.com/the-evolution-of-aiops-at-meta... -- didn't realise there's more work!

Also, some more research in a similar space by other enterprises:

Microsoft: https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...

Salesforce: https://blog.salesforceairesearch.com/pyrca/

Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- https://docs.drdroid.io/docs/doctor-droid-aiops-platform

MOARDONGZPLZ · 1y ago

I would love if they leveraged AI to detect AI on the regular Facebook feed. I visit occasionally and it’s just a wasteland of unbelievable AI content with tens of thousands of bot (I assume…) likes. Makes me sick to my stomach and I can’t even browse.

aray07 · 1y ago

I do think AI will automate a lot of the grunt work involved with incidents and make the life of on-call engineers better.

We are currently working on this at: https://github.com/opslane/opslane

We are starting by tackling adding enrichment to your alerts.

benreesman · 1y ago

Way back in the day on FB Ads we trained a GBDT on a bunch of features extracted from the diff that had been (post-hoc) identified as the cause of a SEV.

Unlike a modern LLM (or most any non-trivial NN), a GBDT’s feature importance is defensibly rigorous.

After floating the results to a few folks up the chain we burned it and forgot where.

_pdp_ · 1y ago

I would be more interested to understand how they deal with injection attacks. Any alert where the attacker controls some part of the text that ends up in the model could be used either to evade it or, worse, to hack it. Slack had an issue like that recently.

devneelpatel · 1y ago

This is exactly what we do at OneUptime.com. We show you AI-generated possible incident remediation based on your data + telemetry + code. All of this is 100% open-source.
