We use functional testing for Datadog and other cloud dependencies so you can know how they are functioning in real time.
Anyone from Metrist able to explain this?
In more detail, there are three possible reasons:
1.) We use functional testing, so we simply show which aspects of the platform are working and which are not. Because of how SLAs define "outages", vendors like Datadog might not disclose or categorize certain dysfunctions as outages, so those won't appear on their status page. In other words, some outages might be considered "minor" and left off the status page.
2.) Status pages are manual; Metrist is automatic. Datadog might not have updated its page, or might not even be fully aware of the outage. Our tests just show the objective data as it's happening.
3.) Everyone experiences outages differently. The data in the demo is Metrist's experience with Datadog, which can differ slightly from other people's (another reason why status pages can be vague). That's why we have an orchestrator that lets people set up personalized monitoring, so they can know in real time exactly how a vendor is affecting them and whether an outage is relevant to them.
Does that answer your question? LMK if I can follow up with more info. :)
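For anyone wondering what "functional testing" means in practice, here is a minimal sketch of the general idea. It is not Metrist's actual implementation: `functional_check`, the `step` callable, and the `slow_after_s` threshold are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable


def functional_check(step: Callable[[], None], slow_after_s: float = 2.0) -> str:
    """Run one real operation against a dependency and classify the outcome.

    `step` is a hypothetical callable that exercises the vendor the way a
    customer would (for example: submit a metric, then read it back) and
    raises an exception on failure.
    """
    start = time.monotonic()
    try:
        step()
    except Exception:
        return "down"          # the operation itself failed
    elapsed = time.monotonic() - start
    # The operation worked, but slowly enough that users would notice.
    return "degraded" if elapsed > slow_after_s else "up"
```

The result reflects only what this one caller experienced, which is why it can disagree with a vendor's status page in either direction.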
This bugs me to no end. I don't want to name names, but I had a devops service that was returning an odd error implying I was doing something wrong. The status page said everything was good. After several hours I emailed them, only to be told it was actually down, they were aware, and they were working on it. It eventually got fixed, they emailed back, and all was well. The status page never did show any downtime.
One follow-up: there are instances where Datadog reports an outage but Metrist says it's green.
Is that because the functional tests are still working but some other part of Datadog was reported as down?
The AWS team has a hard challenge of reporting availability and deciding when a system is not green across dozens of API use cases per service, hundreds of services, hundreds of data centers, dozens of availability zones, and millions of clients.
Metrist has no visibility into a service's internal SLAs, SLOs, and SLIs. [1]
[1] https://cloud.google.com/blog/products/devops-sre/sre-fundam...
Here are some examples where the SaaS says they are down/degraded, but Metrist thinks they're up:
https://app.metrist.io/demo/jira
https://app.metrist.io/demo/circleci
Here is another where Metrist thinks the service is down, but the service reports itself as up:
In my opinion, what you want from statuspage depends on your product and how your customers are using it. Automation strikes me as plausible if your product is a well-defined set of APIs but it’s borderline impossible for a complex web app with many functions which each may or may not be critical. SLO alerts can work here but presuppose you have good SLIs. And capturing good SLIs for a complex product is a whole lot of work that small companies won’t prioritize. All of this is to say, there are downsides to automation that have nothing to do with the biz-side bullshit you’ve correctly noted.
Preventing alert flapping is hardly "very highly polished", it's the base level for having even internal monitoring. And if it works for you internally, you could definitely make some automatic information available.
It's not rocket science we're talking about. Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.
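To illustrate the flapping point, here is a minimal sketch of one common approach: only change the public status after several consecutive checks disagree with it. The class name, the `threshold` default, and the status labels are assumptions for illustration, not any particular vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class DebouncedStatus:
    """Publish a status change only after `threshold` consecutive checks agree."""
    threshold: int = 3       # hypothetical value; tune per monitor
    public_ok: bool = True   # what the status page currently shows
    _streak: int = 0         # consecutive checks that disagree with public_ok

    def observe(self, check_ok: bool) -> bool:
        """Feed one check result and return the status to display."""
        if check_ok == self.public_ok:
            self._streak = 0             # agreement cancels any pending change
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.public_ok = check_ok
                self._streak = 0
        return self.public_ok


def label(ok: bool) -> str:
    """Map the boolean to the two-state wording a simple status page needs."""
    return "Everything OK" if ok else "Something seems to be not OK"


status = DebouncedStatus(threshold=3)
results = [True, False, True, False, False, False, True, True, True]
shown = [status.observe(r) for r in results]
# A single failed check is ignored; three failures in a row flip the page,
# and three successes in a row flip it back.
```

The trade-off is delay: with a threshold of 3, the page lags a real outage by three check intervals.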
For it to work, you have to know exactly which monitors could go into exactly which states (or ranges of values, etc.) such that it constitutes a user-facing outage that should be displayed publicly.
So you have to have automation around: How many users are affected? HOW are they affected? Is our homepage down but APIs still work? Is 1 API having issues but the other 100 just fine? Is there just some elevated latency but everything still works fine, or is it so elevated that we would call it an incident? Is it an internal tool that's broken? Is this a real alert or just an incidental alert due to a planned failover or a deploy and the metric loss is not indicative of user experience? Etc, etc.
It's truly an impossibly hard problem.
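To make those questions concrete, here is a toy sketch of what such automation has to encode. Every field name and threshold below is an assumption made up for illustration. The code is the easy part; the hard part is choosing values like these correctly for every monitor and keeping them accurate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MonitorState:
    name: str
    user_facing: bool          # False for internal tools
    failing: bool              # hard failures (errors, timeouts)
    latency_ratio: float       # current latency divided by normal baseline
    affected_users: float      # fraction of users affected, 0.0 to 1.0
    during_planned_work: bool  # fired during a known deploy or failover


# Hypothetical thresholds; picking these per monitor is the real problem.
MIN_AFFECTED = 0.05        # ignore issues touching under 5% of users
LATENCY_INCIDENT = 3.0     # latency this far above baseline is an incident


def public_status(monitors: Iterable[MonitorState]) -> str:
    """Decide what the public status page should say."""
    # Internal tools and alerts caused by planned work are not shown.
    relevant = [m for m in monitors
                if m.user_facing and not m.during_planned_work]
    impacting = [m for m in relevant
                 if m.affected_users >= MIN_AFFECTED
                 and (m.failing or m.latency_ratio >= LATENCY_INCIDENT)]
    if not impacting:
        return "operational"
    if len(impacting) == len(relevant):
        return "major outage"
    return "partial outage"


homepage = MonitorState("homepage", True, True, 1.0, 0.40, False)
api = MonitorState("api", True, False, 1.2, 0.0, False)
admin = MonitorState("internal-admin", False, True, 1.0, 1.0, False)
print(public_status([homepage, api, admin]))  # prints "partial outage"
```

Even this toy version assumes someone already knows the fraction of affected users and whether an alert overlaps planned work, which is exactly the information that tends to be missing.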