(It's something I'm really keen on myself. We use jsonnet internally to version-control our dashboards and load them into config maps in our Kubernetes clusters.)
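For anyone wondering what the "dashboards in config maps" step can look like, here is a minimal sketch in Python rather than jsonnet (a swapped-in language, not the setup described above). It wraps a dashboard definition in a ConfigMap manifest that can be checked into version control. The function name, the `monitoring` namespace and the `grafana_dashboard` label are assumptions for illustration; the label follows a convention used by some dashboard-loading sidecars, so adjust it to whatever your deployment watches for.

```python
import json


def dashboard_configmap(name: str, dashboard: dict, namespace: str = "monitoring") -> dict:
    """Wrap a Grafana dashboard dict in a Kubernetes ConfigMap manifest.

    The namespace default and the label below are assumptions, not
    something the comment above specifies.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Hypothetical label; match it to your own loader's selector.
            "labels": {"grafana_dashboard": "1"},
        },
        # sort_keys keeps the serialized JSON stable, so diffs in
        # version control only show real changes.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2, sort_keys=True)},
    }


if __name__ == "__main__":
    manifest = dashboard_configmap("api-latency", {"title": "API latency", "panels": []})
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed manifest can be committed and applied like any other Kubernetes resource.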
Congrats on the round!
We currently use Grafana Loki + Grafana and it's working amazingly well. We've tested load of over 600 logs/second (1-core VPS, 2 GB RAM) without any issues.
I got it wrong with Cortex (scalable Prometheus) and have been paying the price - poor adoption, smaller community and less mindshare. We're 100% committed to making sure Loki and Grafana are super easy to self-host, and are even putting time and effort into making Cortex and Metrictank easier too.
It's also frustrating that it's effectively impossible with the Cloudwatch data source to select resources to alert on using their tags (e.g. select the elb with app=foo and env=prod). The dimension_values() function lets you filter based on tags, but instead of being usable directly in a panel's query it requires using a variable as an intermediary, which then disables alerting.
Currently there's no way to access dashboards unless you have an account on Grafana. There's no way to share dashboards with the public.
If not, maybe a distribution of values in the given time instead of just statistics? I'd be happy with a sorted ranking of any kind.
Example: You have a graph and want to see a value, so you touch inside it. Instead of the value you get "Add annotation" without the value showing.
Haven't found a way to disable this. Grafana v6.4.2. Is this known?
That being said, Grafana will always treat other datasources as first class citizens, like we already do with Influx, MySQL, CloudWatch, Stackdriver, Elastic etc. This is our "un-vendor" approach.
Load significantly dropped when I shut off Carbon/Graphite after moving to InfluxDB.
We did not use grafana for debugging instantaneous, fast events, but it was fantastic for things like monitoring temperature, current modes, and running statistics.
I was introduced to grafana in early 2014. I was a bit sceptical as I was using graphitus to make dashboards. However I soon converted.
I maintained a very large Graphite cluster at the Financial Times (I think it was about 1 million active metrics, but it might have been 0.5 million, I forget). The only sane way to manage the front end was using Grafana. Simple OAuth2 integration meant that I could avoid the nightmare of trying to get AD access, and it also meant one-click SSO.
Grafana was one of those tools that was self-evidently the best in class, so it was widely adopted. Within two years, virtually every team screen had Grafana on it. Non-programmers used it, and even set alerts. How many other "devops" tools can boast that level of universality?
Either way, keep up the good work, and best of luck.
I can't imagine what it'd be like (as a stakeholder) using a Grafana instance that, in total, has >500k metrics. I would assume many of those are deprecated, do not provide any value, or do not spur any action by stakeholders.
And if you have microservices, you want to track how well each client-server pair is doing, on both sides of the equation, which means tracking error codes, success/fail rates, etc.
Finance wants its own metrics to measure capacity versus utilization to prove to the CFO the spending is appropriately constrained.
Devs want to prove their system works and works quickly, so you'll have a variety of metrics revolving around subcomponent usage, and performance timing. Maybe even cache rates.
Not all of these metrics will spark action by stakeholders. Some will be retained 'just in case', since you can't retroactively collect data. When perf drops in a canary because GC pauses are increasing, you definitely want to be able to see both performance metrics over time and GC metrics.
There were at the time about 200 dashboards. They were controlled and curated by their own teams. It was pretty much the only shared tool that worked well. The only thing that I encouraged was tagging, but even then, they mostly did it themselves to make finding things easier.
There were about 80 active products, most had _a_ dashboard.
The crucial thing was that it doesn't cost much to record those metrics. This means that post-incident we can easily put an alert in, or prove x affects y because z.
Limiting the number of metrics recorded is frankly silly. Enforcing rules about quality and location, certainly; it's something I spend a reasonable amount of time on.
For example, the front end was a microservice. Each HTTP call of each microservice was graphed, which allowed quick and simple diagnostics for general performance. Most of the time it's not needed, but when you _do_ need it, it's critical to have context.
Congratulations! :)
I hope this doesn’t happen to grafana.
There are many people in the design/UI/UX space specialising in (often numeric) information design. They tend to be as fluent as any programmer in statistics, if not more.
To this day, programmers tend to conspiratorially suggest to designers to read Edward Tufte, even though his are the first books they make you read in any information design class, and have been since the early 80s.
Variables to abstract out some, a bit of "repeat" to loop over something, and you get pretty drop-downs that you can combine to show nice graphs.
Then you think "I'll add it to a playlist". and you do so.
Then you think "my kiosk can't scroll this much for all, let's have one screen each for the apps" and you do.
And then you realize you cannot use variables from playlists, and you cannot template screens.
So you make eight copies of your screen, one for each variable configuration.
And you edit each copy of your screen to set the variables, and save it.
And then you realize that there was a typo in one panel.
So you go in and edit in eight different screens to fix that typo.
Then you realize that it doesn't look good on the TN panel, so you need to change a few colours to get better contrast.
So you do that on eight different copies, by means of clicking in every panel, navigating through the point-n-click and then pressing.
But you realized that you learned this, so you're fast, and use the keyboard. Except then the change doesn't take.
Because Grafana requires you to click in another field after you've edited, or your change doesn't hold if you press "Escape" or another key to navigate back.
And that's how I learned how Grafana is best of breed in GUI dashboard tools. Sort of how a pug is best of breed in a dog competition.
At one point I used a script that ran before grafana in the docker container, and that script ran a query on AWS to populate the dashboard.
However doing it quickly or efficiently is not splunk's strong suit.
Grafana allows you to quickly graph data from source x, compare to source y and then build a dashboard.
Splunk can also do all those things, but much slower.
We had a huge Splunk cluster (100 GB a day), and it's a great complement to Grafana.
Typically you use Grafana to alert you when things are going wrong. It would point out which system was going wrong and at what time, then you'd use Splunk to get the logs to figure out the cause.
Lots of options - Watch this space!
Similarly, if the service is updated, it also generates a dashboard and updates the previous one (this is easy because the update is an UPSERT). You can do interesting things like modularize pieces of the dashboard and update those modules independently. They can then get pulled into the Jinja template during the update.
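A rough sketch of the template-then-upsert flow described above, under some loud assumptions: it uses Python's stdlib `string.Template` as a stand-in for Jinja, the panel modules and the `svc-<name>` uid scheme are made up for illustration, and it only builds the request body rather than sending it. The `{"dashboard": ..., "overwrite": true}` shape is what Grafana's dashboard HTTP API (`POST /api/dashboards/db`) accepts; keeping the uid stable is what makes the call behave like an UPSERT.

```python
import json
from string import Template

# Hypothetical, independently maintained panel "modules". Each one is a
# small JSON template that can be updated on its own.
PANEL_MODULES = {
    "requests": '{"title": "$service request rate", "type": "graph"}',
    "errors": '{"title": "$service error rate", "type": "graph"}',
}


def render_dashboard(service: str, modules: list) -> dict:
    """Pull the named modules into one dashboard for a service.

    Assumes the service name contains no characters that need JSON escaping.
    """
    panels = []
    for index, module in enumerate(modules, start=1):
        panel = json.loads(Template(PANEL_MODULES[module]).substitute(service=service))
        panel["id"] = index  # panel ids must be unique within a dashboard
        panels.append(panel)
    return {
        # A stable uid means re-running this updates the same dashboard.
        "uid": f"svc-{service}",
        "title": f"{service} overview",
        "panels": panels,
    }


def upsert_payload(dashboard: dict) -> dict:
    """Request body for POST /api/dashboards/db; overwrite makes it an upsert."""
    return {"dashboard": dashboard, "overwrite": True}


if __name__ == "__main__":
    body = upsert_payload(render_dashboard("checkout", ["requests", "errors"]))
    print(json.dumps(body, indent=2))
```

Re-rendering after a module changes and posting the same uid again replaces the previous version instead of creating a duplicate.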
"That goes hand-in-hand with pushing forward with our vision of building an open, composable observability platform that brings together the three pillars of observability – logs, metrics, and traces – in a single experience, with Grafana at the center."
*note: I know that to some degree this is possible with current grafana, but if you read through the issues folks have with doing data viz outside of time series, you'll catch my meaning.
Also, it didn't support image uploads; you'd need to host them somewhere else if you wanted them to show up in a panel. Rather inconvenient.
Congrats to the team. Well-deserved.
I've been wondering why there are not more Elastic-based or Grafana-based hosted solutions.