What’s the advantage here beyond keeping things lightweight? Feels like this could hit limitations as complexity grows.
- It's more vertically integrated. For example, networking across nodes is built in, so you get it out of the box.
- It supports stateful workloads out of the box with no fuss. Running DBs, taking snapshots, deletion protection, etc. is very simple with LXD.
- LXD supports running Docker inside containers, which means in the future we will be enabling Docker containers. Each LXD container can be treated as a 'pod' that can run multiple Docker containers inside. But it's just a simple system container you can treat like a VM.
- Working with GPUs is very simple and straightforward. This is going to be key as we start to enable AI workloads.
- LXD doesn't require a master node, which means I can use every instance I provision to run my workload. It also supports redundancy as I grow my cluster because it handles distribution through Raft, which means the overhead is much lower than K8s.
- Overall, LXD feels like a batteries-included container hypervisor.
This solution doesn't replace things like Prometheus. In fact, LXD has native support for Prometheus. We would also be able to extend the solution to push data to a Prometheus instance or expose a /metrics endpoint for Prometheus to consume.
For our MVP we just chose Elastic, but it will be easy to extend to support Prometheus as well. We're shipping data in the OpenTelemetry format. OpenTelemetry is a specification, and when we ship data we try to keep it as close to what OpenTelemetry does as possible. Elastic's observability supports this out of the box.
All this solution does is query the underlying infrastructure metrics and ship them to a destination. The only scaling it needs to handle is shipping the data and handling back-pressure in case the destination cannot handle the load. Broadway does this out of the box.
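To make the back-pressure point concrete, here is a minimal Python sketch of the idea. It is not how Broadway is implemented and not Uplink's code: Broadway builds on GenStage's demand-driven model, while this sketch swaps in a simpler bounded queue whose producer blocks when the buffer is full. `run_pipeline`, `send_batch`, `max_buffered` and `batch_size` are hypothetical names.

```python
import queue
import threading


def run_pipeline(samples, send_batch, max_buffered=100, batch_size=10):
    """Ship samples through a bounded buffer.

    The producer blocks on put() when the buffer is full, so a slow
    destination slows down collection instead of growing memory.
    """
    buffer = queue.Queue(maxsize=max_buffered)
    done = object()  # sentinel marking the end of the stream

    def consumer():
        batch = []
        while True:
            item = buffer.get()
            if item is done:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                send_batch(batch)  # hypothetical shipper, e.g. a bulk HTTP request
                batch = []
        if batch:
            send_batch(batch)  # flush the final partial batch

    worker = threading.Thread(target=consumer)
    worker.start()
    for sample in samples:
        buffer.put(sample)  # blocks while the buffer is full: back-pressure
    buffer.put(done)
    worker.join()
```

If `send_batch` is slow, `buffer.put` simply waits, so at most `max_buffered` samples sit in memory no matter how fast metrics are collected.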
We (Ops type people) have a developed system for gathering metrics: the Prometheus stack. Instead of integrating with that system, OpsMaru decided it doesn't work and went with their own custom system. You are showing code where you build all the CPU metrics that a PromQL query easily does, and your code assumes 15 second scrapes, so if we need higher resolution temporarily, well, sucks to be your customer. Also, if you did Remote Write, you could Remote Write back to a customer if they wanted it. Hell, you could have written a system so we don't need to run Prometheus locally, since you would scrape everything and send it back to us.
Also, you are already running "my company code", which might be emitting Prometheus metrics, so I'm probably running Prometheus already to monitor my own code. However, if I wanted to keep an eye on OpsMaru Uplink, I can't, because OpsMaru Uplink doesn't appear to have a metrics endpoint I can monitor. Maybe your customers are too small to have Ops people, but if they did, they are now blind.
I want a blog article explaining all the options tested and what pitfalls you ran into before you settled on this.
This isn't a custom system at all. We're simply removing the need to install / configure / manage another external package by implementing a data shipper in Uplink using Elixir Broadway. The end goal is still that Ops / SREs can use their existing favorite monitoring pipeline, whether that's the Grafana / Prometheus / Loki stack or the Elastic / OpenSearch stack. There are several advantages: it means fewer things to install / maintain / patch / secure, as mentioned in the post. We believe doing things this way leads to a more robust / secure system in the long term.
As for the 15 second scrapes, we can tune that and provide it as an option for customers as well. These are things we can improve. For now, for the MVP, we're shipping data to the Elastic stack, and certain decisions were made to simplify and reduce the amount of work needed to get the product to an MVP.
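For what it's worth, the per-interval CPU calculation is small, and the interval can be a parameter rather than a fixed 15 seconds. A rough Python sketch, assuming the source exposes a cumulative CPU-seconds counter (similar in spirit to what PromQL's `rate()` does over a counter). `cpu_utilization` and its parameters are hypothetical names, not Uplink's code:

```python
def cpu_utilization(prev_seconds, curr_seconds, interval_seconds, cores=1):
    """Fraction of available CPU time used between two counter samples.

    prev_seconds / curr_seconds are cumulative CPU seconds read at the
    start and end of the scrape interval.
    """
    if interval_seconds <= 0 or cores <= 0:
        raise ValueError("interval_seconds and cores must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Counter went backwards (e.g. the instance restarted): assume it
        # restarted from zero and count only what accumulated since.
        delta = curr_seconds
    return delta / (interval_seconds * cores)
```

With a 15 second interval, going from 100.0 to 107.5 CPU seconds on one core gives 0.5. Passing a shorter `interval_seconds` is all it takes to get higher resolution.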
We can provide the /metrics endpoint in the future, it's just a matter of time and priorities.
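As a rough idea of what such an endpoint would return, here is a minimal Python sketch that renders samples in the Prometheus text exposition format. The function and metric names are made up for illustration, and label-value escaping and the `# HELP` / `# TYPE` lines are left out:

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text exposition lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric names, not ones Uplink exposes today.
    print(render_metrics([
        ("uplink_up", {}, 1),
        ("uplink_instance_cpu_seconds_total", {"instance": "web-1"}, 107.5),
    ]), end="")
```

Serving that string at `/metrics` with a `text/plain` content type is what a Prometheus scrape job expects.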
There are reasons why we're shipping data into Elastic that will be clearer once things mature a little more. There are things Elastic can do that we need at a base level for our internal product plans, and there will be a follow-up post about this later.
We'll publish more blog posts giving more detail on the decisions we've made and why we made them. Always happy to read feedback.
If I'm a customer and say "Hey, my applications are emitting Prometheus metrics, how do I scrape them?", what is your recommendation with the platform?
OVN is very heavy and requires a lot of management when it comes to provisioning and maintenance. So for now we didn't want to go there just yet.
We're sticking with LXD. It's been receiving a lot of updates from Canonical, and the team is responsive on the forum and has been a pleasure to work with.
Once we have some breathing room, we definitely want to explore Incus and see what networking options are out there.
Maybe we'll just adopt WireGuard and make that work out of the box with Incus in a future iteration.