Show HN: A Datadog for Datadog

(app.metrist.io)

25 points · lngarner · 3y ago · 21 comments

Get updates on when Datadog is recovering in real-time (aka about 25 minutes faster than the status page) with the free Metrist.io app and Slack integration.

We use functional testing for Datadog and other cloud dependencies so you can know how they are functioning in real time.

21 comments

16 comments · 4 top-level

vinayan3 · 3y ago · 6 in thread

Surprised to see how often Datadog and Metrist disagree about whether Datadog is down.

Anyone from Metrist able to explain this?

lngarner (OP) · 3y ago

Hi! Thanks for asking. Basically, status pages get updated manually, and people decide whether and when an outage is sufficiently bad to warrant a status page update. We monitor actual functionality and will capture smaller glitches that either escape human attention altogether or never get escalated to the point where the status page is updated.

In more detail, there are three reasons:

1.) We use functional testing, so we simply show which aspects of the platform are working and which are not. Because of how SLAs define "outages", vendors like Datadog might not disclose or categorize certain dysfunctions as outages, so those never appear on the status page. In other words, more "minor" outages may be left off.

2.) Status pages are manual; Metrist is automatic. Datadog might not have updated its page yet, or might not even be fully aware of the outage. Our tests just show the objective data as it happens.

3.) Everyone experiences outages differently. The data in the demo is Metrist's own experience with Datadog and can differ slightly from other people's (another reason why status pages can be vague). That's why we have an orchestrator that lets people set up personalized monitoring, so they can know exactly how a vendor is affecting them in real time and whether an outage is relevant to them.

Does that answer your question? LMK if I can follow up with more info. :)
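
As a rough illustration of the functional-testing idea described in this comment, here is a minimal sketch. It is not Metrist's implementation: `run_check`, `CheckResult`, the `slow_after` threshold, and the three status labels are all hypothetical names chosen for the example. The probe is any callable that exercises a real feature of the vendor (for example, submitting a metric and reading it back) and raises on failure.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    status: str     # "up", "degraded", or "down" (labels are an assumption)
    seconds: float  # how long the probe took


def run_check(
    name: str,
    probe: Callable[[], None],
    slow_after: float = 2.0,  # hypothetical latency threshold, in seconds
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Run one functional probe and classify the outcome."""
    start = clock()
    try:
        probe()  # exercise the vendor feature; any exception counts as a failure
    except Exception:
        return CheckResult(name, "down", clock() - start)
    elapsed = clock() - start
    status = "degraded" if elapsed > slow_after else "up"
    return CheckResult(name, status, elapsed)
```

Because the check records what actually happened to a real request, it can flag a glitch regardless of whether anyone has decided to update the status page.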

TrueGeek · 3y ago

> Status pages get updated manually

This bugs me to no end. I don't want to name names but I had a devops service that was returning an odd error implying I was doing something wrong. Status page said everything was good. After several hours I emailed to be told it was actually down, they were aware, and were working on it. It eventually gets fixed, they email back, and all is well. The status page never did show any downtime.

vinayan3 · 3y ago

Thanks for responding and providing details.

One follow-up: there are instances where Datadog reports an outage but Metrist shows green.

Is that because the functional tests are still working but some other part of Datadog was reported as down?

ozten · 3y ago

My guess would be that Metrist made one or more API calls that failed within a time-slice (hopefully more than one failure). They then mark the entire day orange or red and compare it to AWS's green. Which is true, for the entire day their status symbol was probably green.

The AWS team has a hard challenge of reporting availability and deciding when a system is not green across dozens of API use cases per service, hundreds of services, hundreds of data centers, dozens of availability zones, and millions of clients.

Metrist has no visibility into a service's internal SLAs, SLOs, and SLIs. [1]

[1] https://cloud.google.com/blog/products/devops-sre/sre-fundam...
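
To make the guess in this comment concrete, here is a minimal sketch of how individual check results could be rolled up into a single colour for a day. This is an assumption about the mechanism, not a description of how Metrist actually aggregates; `day_color` and the `red_at` threshold are hypothetical.

```python
def day_color(results: list[bool], red_at: float = 0.05) -> str:
    """Roll a day's check results (True = passed) into one colour.

    Any failure at all turns the day orange; a failure rate at or above
    `red_at` (a made-up threshold) turns it red.
    """
    if not results:
        return "unknown"
    failures = results.count(False)
    if failures == 0:
        return "green"
    return "red" if failures / len(results) >= red_at else "orange"
```

Under a rollup like this, a single failed call colours the whole day, which is one way a probe-based view and a vendor's status page could disagree about the same day.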

capableweb · 3y ago

Metrist seems to consistently rate "downtime" differently from the various services, for better or worse.

Here are some examples where the SaaS says they are down/degraded, but Metrist thinks they're up:

https://app.metrist.io/demo/jira

https://app.metrist.io/demo/circleci

Here is another where Metrist thinks the service is down, but the service reports itself as up:

https://app.metrist.io/demo/newrelic

lngarner (OP) · 3y ago

Thanks for pointing that out! Since status pages are updated manually, we monitor actual functionality. We often see that pages functionally recover long before the status pages update that everything is in working order. Again, because it's manual and status pages are often more for marketing than development purposes. And also we're in "Show HN" and may not be 100% perfect ;) but we stick to the above explanation :)

bloodyplonker22 · 3y ago · 4 in thread

I have always wished that companies would tie their status pages to their monitoring systems, but alas, they don't due to PR concerns and overall disingenuousness.

eep_social · 3y ago

To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

In my opinion, what you want from statuspage depends on your product and how your customers are using it. Automation strikes me as plausible if your product is a well-defined set of APIs but it’s borderline impossible for a complex web app with many functions which each may or may not be critical. SLO alerts can work here but presuppose you have good SLIs. And capturing good SLIs for a complex product is a whole lot of work that small companies won’t prioritize. All of this is to say, there are downsides to automation that have nothing to do with the biz-side bullshit you’ve correctly noted.

capableweb · 3y ago

> To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

Preventing alert flapping is hardly "very highly polished", it's the base level for having even internal monitoring. And if it works for you internally, you could definitely make some automatic information available.

It's not rocket science we're talking about. Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.
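
One common way to keep a flapping monitor from spamming updates is to require several consecutive contradicting observations before the public state changes. The sketch below assumes that approach; the class name `DebouncedStatus` and the `fail_after` / `recover_after` defaults are hypothetical, and the two messages are the ones suggested in this comment.

```python
class DebouncedStatus:
    """Public status that only flips after a run of contradicting checks."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5):
        self.fail_after = fail_after        # failures in a row before going "not OK"
        self.recover_after = recover_after  # successes in a row before going "OK"
        self.healthy = True
        self._streak = 0  # consecutive observations that contradict `healthy`

    def observe(self, ok: bool) -> bool:
        """Feed one check result; return the (possibly updated) public state."""
        if ok == self.healthy:
            self._streak = 0  # observation agrees with current state: reset
            return self.healthy
        self._streak += 1
        needed = self.recover_after if ok else self.fail_after
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy

    def message(self) -> str:
        return "Everything OK" if self.healthy else "Something seems to be not OK"
```

With `fail_after=3`, a monitor that alternates between pass and fail never changes the published message; only three failures in a row do.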

rco8786 · 3y ago

I don't have any PR concerns or disingenuousness; the reality is that this is an exceedingly hard task to actually pull off, with little or no payoff in the end.

For it to work, you have to know exactly which monitors could go into exactly which states (or range of values, etc.) such that it constitutes a user-facing outage that should be displayed publicly.

So you have to have automation around: How many users are affected? HOW are they affected? Is our homepage down but APIs still work? Is 1 API having issues but the other 100 just fine? Is there just some elevated latency but everything still works fine, or is it so elevated that we would call it an incident? Is it an internal tool that's broken? Is this a real alert or just an incidental alert due to a planned failover or a deploy and the metric loss is not indicative of user experience? Etc, etc.

It's truly an impossibly hard problem.
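
The questions listed in this comment can be read as inputs to a publishing rule. The sketch below only illustrates that shape; the `Alert` fields, `should_publish`, and the `min_affected` threshold are hypothetical, and it assumes every input is already known and accurate, which is exactly the part the comment argues is hard.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    component: str
    user_facing: bool            # False for internal tools
    affected_fraction: float     # share of users affected, 0.0 to 1.0
    during_planned_change: bool  # deploy or planned failover in progress


def should_publish(alert: Alert, min_affected: float = 0.01) -> bool:
    """Decide whether an alert would be shown on a public status page."""
    if not alert.user_facing:
        return False  # broken internal tool: not a public incident
    if alert.during_planned_change:
        return False  # metric loss from a deploy, not user experience
    return alert.affected_fraction >= min_affected
```

The rule itself is a few lines; producing trustworthy values for `affected_fraction` and `during_planned_change` for every monitor is where the work described above goes.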

lngarner (OP) · 3y ago

Alas INDEED

tra3 · 3y ago · 2 in thread

Now we just need Datadog to monitor Metrist. Who monitors the monitors?

jonatron · 3y ago

Monitor monitors monitor monitors.

lngarner (OP) · 3y ago

Haha for real! Monitorception

ydnaclementine · 3y ago

Thoughts and prayers to the engineers
