One of the main reasons this works well is that we feed the models our incident playbooks and response knowledge bases.
These playbooks are very carefully written and maintained by people. The current generation of models is pretty much post-human at following them, performing reasoning, and suggesting mitigations.
We tried indexing just a bunch of incident Slack channels and the result was not great. But with explicit documentation, it works well.
Kind of proves what we already know: garbage in, garbage out. But also, other functions (e.g. PM, Design) have tried automating their own workflows, and it doesn't work as well.
On the investigation side of things I can definitely see how an AI-driven troubleshooting process could be valuable. Lots of developers lack debugging skills, so an AI-driven process that looks at the relevant metrics and logs and can reason about what the next line of inquiry should be could definitely speed things up.
In general, playbooks are useful for either things that happen frequently (e.g. every week we need to run a script to fix something in the app) or things that happen rarely but need a plan when they do (e.g. disaster recovery).
If it works well for incident response, then there are many similar use cases: basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to have documentation on the particular system. Say, debugging a building HVAC system.
Expert systems failed in part because of their inability to learn. HVAC is ladder logic, which I honestly haven't spent much time in, whereas LLMs are inductive.
It will be a useful tool, but expert systems had a very restricted solution space.
If you already have those, how much can an AI add? Or conversely, not surprising that it does well when it's given a pre-digested feed of all the answers in advance.
IME a very, very large number of impacting incidents aren't strictly tied to “a” code change, if any at all. It _feels_ like there's an implied solution in tying the running version back to the deployment rev, to deployment artifacts, and to VCS.
Boundary conditions and state changes in the distributed system were the biggest bugbear I ran into at AWS. Then below that were all of the “infra” style failures like network faults, latency, API quota exhaustion, etc. And for all the CloudFormation/CDK/Terraform in the world, it's nontrivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.
- Code changes
- Configuration changes (this includes the equivalent of server topology changes, like CloudFormation and quota changes)
- Experimentation rollout changes
There have been issues that are external (like user behavior change for new year / world cup final, physical connection between datacenters being severed…) but they tend to be a lot less frequent.
All three big buckets are tied to a single trackable change with an ID, which makes it possible to do this kind of automated root cause analysis at scale.
Now, Meta is mostly a closed loop where all the infra and product is controlled as one entity so those results may not be applicable outside.
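To make the "single trackable change with an ID" point concrete, here is a minimal sketch of the first step such automation might take: given an incident start time, list the changes from all three buckets that landed shortly before it. The `Change` type, the `candidate_changes` helper, the IDs, and the 6-hour lookback are all made up for illustration; this is not how Meta's system is described as working.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A hypothetical change record; every bucket shares the same shape."""
    change_id: str
    kind: str  # "code", "config", or "experiment"
    landed_at: datetime


def candidate_changes(changes, incident_start, lookback=timedelta(hours=6)):
    """Return changes that landed within `lookback` before the incident,
    most recent first. The window size is an arbitrary assumption."""
    window_start = incident_start - lookback
    in_window = [
        c for c in changes
        if window_start <= c.landed_at <= incident_start
    ]
    return sorted(in_window, key=lambda c: c.landed_at, reverse=True)


incident_start = datetime(2024, 8, 1, 12, 0)
changes = [
    Change("D1001", "code", datetime(2024, 8, 1, 11, 30)),
    Change("C2002", "config", datetime(2024, 8, 1, 9, 0)),
    Change("E3003", "experiment", datetime(2024, 7, 30, 8, 0)),  # too old
    Change("D1004", "code", datetime(2024, 8, 1, 12, 30)),       # after incident
]

suspects = candidate_changes(changes, incident_start)
print([c.change_id for c in suspects])  # ['D1001', 'C2002']
```

A fixed lookback window is a simplification: a change that only takes effect on a later restart would fall outside it, and a real ranking step would need more signal than recency alone.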
Definitely agree that the bulk of “impact” goes back to changes introduced in the SDLC. Even for major incidents, infrastructure is probably down to 10-20% of causes in a good org. My view in GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services, it was mostly edge/infra side, and I focused the last few years specifically on major incident management.
I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.
I'm curious how well that works in the situation where your config change or experiment rollout results in a time bomb (e.g. triggered by task restart after software rollout), speaking as someone who just came off an oncall shift where that was one of our more notable outages.
Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.
> There have been issues that are external (like ... physical connection between datacenters being severed…) but they tend to be a lot less frequent.
That's interesting to hear, because my experience at Google is that we'll see a peering metro being fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or with a capacity shortfall happen much, much more frequently.
(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)
Usually this implies there are bigger problems. If something keeps breaking without any change (config / code) then it was likely always broken and just ignored.
So when companies do have most of the low-hanging fruit resolved, it's the changes that break things.
I've seen places where everything is duct-taped together but it still only breaks on code changes. Everyone learns to avoid stressing anything fragile.
Separately, we have been curious about extending louie.ai to work not just with logs/DBs, but to go in the reverse direction ('shift right'): talk directly to a live OS agent like an EDR or OSQuery, whether on a live system or a cloud image copy. If of interest to any teams, would love to chat.
We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.
What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.
If anyone is curious, I did a webinar with PagerDuty on this recently.
And thanks for submitting!
Forget the diff, I don’t want my name on the natural language summary.
That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.
Of course you can't say anything about an ML model being bad that you are promoting for your business.
And this is just part of the diagnosis process. The system should still be providing breadcrumbs or shortcuts for the user to test the suggested hypothesis.
Which is why any responsible system like this will include feedback loops and evaluation of false positive/negative outcomes and tune for sensitivity & specificity over time.
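As a rough illustration of what such a feedback loop would measure, here is a minimal sketch that computes sensitivity and specificity from responder feedback. The `(flagged, was_cause)` record format, the function name, and the sample data are assumptions made up for this example, not anything from the Meta post.

```python
def sensitivity_specificity(outcomes):
    """Compute (sensitivity, specificity) from (flagged, was_cause) pairs.

    flagged:   the system suggested this change as a root cause
    was_cause: responders later confirmed it really was the cause
    Returns None for a rate whose denominator is zero.
    """
    tp = fp = tn = fn = 0
    for flagged, was_cause in outcomes:
        if flagged and was_cause:
            tp += 1  # correctly flagged
        elif flagged and not was_cause:
            fp += 1  # false alarm
        elif not flagged and was_cause:
            fn += 1  # missed root cause
        else:
            tn += 1  # correctly ignored

    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical feedback: 3 true positives, 1 miss, 4 true negatives, 1 false alarm.
feedback = (
    [(True, True)] * 3
    + [(False, True)] * 1
    + [(False, False)] * 4
    + [(True, False)] * 1
)

sens, spec = sensitivity_specificity(feedback)
print(sens, spec)  # 0.75 0.8
```

Tracking both numbers over time shows whether tuning is trading missed causes for false alarms or the other way around.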
I have about 30 years' experience in both hard engineering (electronics) and software engineering, particularly in failure analysis and reliability engineering. Most people are lazy and get led astray by false information. This is a very dangerous thing. You need a proper conceptualisation framework like a KT problem analysis to eliminate incorrect causes, keep people thinking rationally, and get your MTTR down to something reasonable.
42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!
9 times out of 10, you can and should write "using" instead of "leveraging".
User: Ahh, got locked out, contact support and wait
AI 2: The user is not suspicious, unlock account
User: Great, thank you
AI 1: This account is suspicious, lock account
How would an experienced engineer score on the same task?
Also, some more research in a similar space from other enterprises:
Microsoft: https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...
Salesforce: https://blog.salesforceairesearch.com/pyrca/
Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- https://docs.drdroid.io/docs/doctor-droid-aiops-platform
We are currently working on this at: https://github.com/opslane/opslane
We are starting by tackling adding enrichment to your alerts.
Unlike a modern LLM (or most any non-trivial NN), a GBDT’s feature importance is defensibly rigorous.
After floating the results to a few folks up the chain we burned it and forgot where.