embedding-shape · 1mo ago · 0 points

FWIW, I have no CS degree and barely attended school at all, and I found Grafana + Prometheus + Loki fairly easy to set up, at least compared to what we used before those tools were available. Maybe it's because I used NixOS for the setup, but besides learning some new domain-specific things I didn't already know, I don't recall hitting any particular bumps or roadblocks. I also went the 100% self-hosted route (spread across two hosts at home).

What exactly were you struggling with when it came to the setup? Just a ton of new concepts to learn which took time, or something specific to Grafana/Prometheus/Loki?

9 comments · 3 top-level

dijit · 1mo ago · 4 in thread

"Getting it running" is the easy part.

"Getting it ready for production" is a different game.

I've fallen on my sword many times trying to explain that Prometheus fails every measure of production readiness; in fact, Google themselves replaced Borgmon (the system that inspired Prometheus) with Monarch because "tiny unreliable time series databases everywhere" was, in fact, not the successful and reliable deployment strategy that they had claimed.

But, it is very easy to set up. Just don't go looking for failure modes, because they're everywhere and every single one of them is catastrophic.

denysvitali · 1mo ago

There are ways to scale Prometheus (look at Thanos), but none of the solutions is really bug free.

See this PR for example (https://github.com/prometheus/prometheus/pull/18364) - this used to impact a production deployment I worked on. Prometheus, Thanos and even OpenTelemetry are full of those kinds of problems - but at the same time it's the best we have, and we should be grateful they're free and open source.

I'd still choose an open source stack (and contribute to it) rather than go for a proprietary solution - we've all seen what happens with DataDog & co.

Please don't take my words lightly: I worked with the rest of my team on a large-scale observability platform, and scalability should not be underestimated. At the same time, DataDog / Splunk prices are simply unjustified. It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).

otterley · 1mo ago

> It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).

Can you show the math here? This is a very bold claim, and I’m super curious. A shared Google Sheet would work well.

embedding-shape (OP) · 1mo ago

Well, I am running the stack in production right now, but everyone has a different understanding of what that actually means...

Do you have concrete examples of these catastrophic failures? I personally haven't experienced any during these years, but I'm doing very boring and typical stuff, so it wouldn't surprise me if there were still hard edges.

dijit · 1mo ago

There's a difficult distinction here, you're right.

Technically, even a single server running LAMP as root but taking frontend traffic meets the definition of "in production", but I think we all recognise that it's not the right idea.

What I'm referring to is: should the disk start to have issues, what does Prometheus do? If the scrapers start to stall due to connection timeouts, what does Prometheus do? If you are doing linear interpolation of data and you have massive gaps because you're polling opportunistically, what does Prometheus do?
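
To make the gap concern concrete, here is a minimal sketch in plain Python. It is not Prometheus code and does not claim to reproduce how Prometheus evaluates queries; the 15 s scrape interval, the sample values, and the helper names `find_gaps` and `interpolate` are all assumptions made up for illustration. It only shows that naive linear interpolation across missed scrapes reports a flat line, whatever actually happened in between.

```python
from bisect import bisect_right


def find_gaps(timestamps, scrape_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are further
    apart than tolerance * scrape_interval (i.e. scrapes were missed)."""
    limit = scrape_interval * tolerance
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > limit]


def interpolate(samples, t):
    """Linearly interpolate a value at time t from (timestamp, value)
    samples sorted by timestamp."""
    times = [ts for ts, _ in samples]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the sampled range")
    i = bisect_right(times, t)
    if i == len(times):  # t is exactly the last timestamp
        return samples[-1][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical series: 15 s scrape interval, scrapes at t=45..90 never arrived.
samples = [(0, 10.0), (15, 10.0), (30, 10.0), (105, 10.0), (120, 10.0)]

gaps = find_gaps([ts for ts, _ in samples], scrape_interval=15)
estimate = interpolate(samples, 60)

print(gaps)      # the 75 s hole between t=30 and t=105
print(estimate)  # 10.0, regardless of what the real value was at t=60
```

If the real value spiked during the hole, the interpolated series shows no trace of it, which is why flagging gaps explicitly matters more than drawing a line through them.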

I'm all about boring technology, but Prometheus assumes too much of the happy path. It assumes that a single node is enough for time series data that is used for alerting.

Which, it is: at very small scale and with best effort reliability.

It's not acceptable as soon as lost data could be critically important in diagnosing major issues in billing systems, in actually billing users, or in inferring issues that need to be correlated across multiple systems.

SOLAR_FIELDS · 1mo ago · 2 in thread

FWIW, if you come flying in saying you used NixOS to set something up you’re not what we would call a “casual user”

embedding-shape (OP) · 1mo ago

Why not? It's hardly unheard of for managing infrastructure. If we were talking about desktop environments, then maybe. And to be fair, I never said I was a casual user, just that I didn't find Prometheus particularly difficult to manage in a production environment.

SOLAR_FIELDS · 1mo ago

The implication is that by virtue of using NixOS, you're already a self-selected power user. The people who would find setting this up in production difficult and the people who would use NixOS have very little overlap, if any, on that Venn diagram.

dusanstanojevic · 1mo ago

Hi, I'm the creator of Traceway.

I created Traceway because I looked at that stack and decided I wasn't going to add 7 more services, all of which could fail and which I would then have to maintain as well. Here is the list: Grafana, OTel Collector (to forward metrics), Prometheus, Loki, Tempo, Mimir, K8s.

This is not maintainable in production unless you have a person to manage it. My app had about 500-1000 req/sec; this sounds like a lot, but it's extremely light from an observability perspective. Why would I add 7 more points of failure, and 7 more services to monitor for proper resource allocation, for something like this? To add insult to injury, I would have to keep building my own SLOs, since they wouldn't be tracked automatically by default, and I would have to keep paying for Sentry because issue tracking is quite lacking in Grafana. Oh, almost forgot: I would also have to get an alerting provider or pay for that (maybe I'm wrong, it was 6 months ago).

Anyhow, Traceway is a 60 MB Go binary. It works with ClickHouse or SQLite, and the data is stored on S3 when not in use. That means you can host it with SQLite on a $2 server, or even a free tier, and have it working for your side projects, or you can host it with managed ClickHouse and get auto-scalability at the DB level.

The goal is to provide full observability, plus tools for developers to fix issues directly. What we have so far: alerts, notifications, SSO (Google & GitHub), integrations, metrics, preconfigured SLOs, distributed tracing, RUM/session recordings (JS & Flutter).

Almost forgot: you'd need a symbolicator as well, or your FE/mobile exception stack traces will be messed up in Grafana. I don't even know which tool they have for that, but it's always a new service to host and maintain...
