Agent-Less System Monitoring with Elixir Broadway

(opsmaru.com)

76 points · zacksiri · 1y ago · 16 comments

ic4l · 1y ago

LXD seems like an unusual choice when Kubernetes already has cAdvisor and strong monitoring integrations. Avoiding extra agents is nice, but does this really scale better than existing solutions like Prometheus and OpenTelemetry?

What’s the advantage here beyond keeping things lightweight? Feels like this could hit limitations as complexity grows.

zacksiri (OP) · 1y ago

I chose LXD for several reasons. There is much less overhead cost when it comes to managing an LXD cluster:

- It's more vertically integrated. For example, networking across nodes is built in; you get it out of the box.

- It supports stateful workloads out of the box with no fuss. Running DBs, taking snapshots, deletion protection, etc. is very simple with LXD.

- LXD supports running Docker inside containers, which means in the future we will be enabling Docker containers. Each LXD container can be treated as a 'pod' that can run multiple Docker containers inside. But it's just a simple system container you can treat like a VM.

- Working with GPUs is very simple and straightforward. This is going to be key as we start to enable AI workloads.

- LXD doesn't require a master node, which means I can use every instance I provision to run my workload. It also supports redundancy as I grow my cluster because it handles distribution through Raft. In terms of overhead, that makes it much lower than K8s.

- Overall, LXD feels like a batteries-included container hypervisor.

This solution doesn't replace things like Prometheus. In fact, LXD has native support for Prometheus. We would also be able to extend the solution to push data to a Prometheus instance or expose a /metrics endpoint for Prometheus to consume.

For our MVP we just chose Elastic, but it will be easy to extend it to support Prometheus as well. We're shipping data using the OpenTelemetry format. OpenTelemetry is a specification; when we ship data we try to keep it as close as possible to what OpenTelemetry does. Elastic's observability supports this out of the box.

All this solution does is query the underlying infrastructure metrics and ship them to a destination. The only scaling it needs to handle is shipping the data and handling back-pressure in case the destination cannot handle the load. Broadway does this out of the box.
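To illustrate the back-pressure idea in the paragraph above, here is a minimal Python sketch. It is not the Uplink or Broadway implementation (that is Elixir); it only shows the general technique with a bounded queue, where the producer blocks once the destination falls behind. The names `run_pipeline`, `ship`, and `max_pending` are hypothetical.

```python
import queue
import threading


def run_pipeline(samples, ship, max_pending=2):
    """Send samples to `ship` through a bounded queue.

    When `ship` is slow, the queue fills up and `put()` blocks,
    which throttles the producer instead of buffering without limit.
    """
    pending = queue.Queue(maxsize=max_pending)  # bounded buffer
    shipped = []
    done = object()  # sentinel telling the consumer to stop

    def consumer():
        while True:
            item = pending.get()
            if item is done:
                break
            shipped.append(ship(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for sample in samples:
        pending.put(sample)  # blocks while the queue is full (back-pressure)
    pending.put(done)
    worker.join()
    return shipped
```

With a single consumer the samples are shipped in the order they were produced; a real pipeline would also need batching, retries, and failure handling, which this sketch leaves out.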

mdaniel · 1y ago

> For our MVP we just chose Elastic

Honest question: why Elastic over OpenSearch?

ic4l · 1y ago

Great answer, thanks.

stackskipton · 1y ago

SRE (or whatever they are calling Ops these days) here. More than anything, this blog left me thinking "Please hire an Ops Principal". That has nothing to do with Elixir.

We (Ops-type people) have a well-developed system for gathering metrics: the Prometheus stack. Instead of integrating with that system, OpsMaru decided it doesn't work and went with their own custom system. You are showing code where you build all the CPU metrics that a PromQL query easily gives you, and your code assumes 15-second scrapes, so if we need higher resolution temporarily, well, sucks to be your customer. Also, if you did Remote Write, you could Remote Write back to a customer if they wanted it. Hell, you could have written a system so we don't need to run Prometheus locally, since you would scrape everything and send it back to us.
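The point about a hard-coded 15-second interval can be made concrete with a small Python sketch. It is not the code from the blog post; it only shows the general idea of deriving CPU utilization from two cumulative counter samples and their actual timestamps, which is roughly what a PromQL `rate()` over a CPU-seconds counter gives you. `CpuSample` and `cpu_utilization` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    timestamp: float    # seconds, when the sample was taken
    cpu_seconds: float  # cumulative CPU time used (monotonic counter)


def cpu_utilization(prev: CpuSample, curr: CpuSample, cores: int = 1) -> float:
    """Fraction of available CPU used between two samples.

    The elapsed time comes from the timestamps, so the result stays
    correct whether the scrape interval is 15 s, 5 s, or irregular.
    """
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    used = curr.cpu_seconds - prev.cpu_seconds
    if used < 0:
        # Counter went backwards (e.g. the instance restarted):
        # count only what accumulated since the reset.
        used = curr.cpu_seconds
    return used / (elapsed * cores)
```

Because nothing in the calculation depends on a fixed interval, the scrape frequency can be raised temporarily without changing this code.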

Also, you are already running "my company code", which might be emitting Prometheus metrics, so I'm probably running Prometheus already to monitor my own code. However, if I wanted to keep an eye on OpsMaru Uplink, I can't, because OpsMaru Uplink doesn't appear to have a metrics endpoint I can monitor. Maybe your customers are too small to have Ops people, but if they did, they are now blind.

I want a blog article explaining all the options you tested and what pitfalls you ran into that made you settle on this.

zacksiri (OP) · 1y ago

Thank you for your feedback. You have a valid point about the /metrics endpoint. We're planning on providing a /metrics endpoint in the future. I mentioned this previously as well in another reply.

This isn't a custom system at all. We're simply removing the need to install / configure / manage another external package by implementing a data shipper in Uplink using Elixir Broadway. The end goal is that Ops / SREs can still use their favorite existing monitoring pipeline, whether that's Grafana / Prometheus / Loki or the Elastic / OpenSearch stack. There are several advantages: it means fewer things to install / maintain / patch / secure, as mentioned in the post. We believe doing things this way leads to a more robust / secure system in the long term.

As for the 15-second scrapes, we can tune that and provide it as an option for customers as well. These are things we can improve. For now, for the MVP, we're shipping data to the Elastic stack; certain decisions were made to simplify and reduce the amount of work needed to get the product to an MVP.

We can provide the /metrics endpoint in the future, it's just a matter of time and priorities.

There are reasons why we're shipping data into Elastic that will be clearer once things mature a little more. There are things Elastic can do that we need at a base level for our internal product plans; there will be a follow-up post about this later.

We will provide more blog posts giving you more details on the decisions we've made and why we made them. Always happy to read feedback.

stackskipton · 1y ago

I await the blog article. I just think throwing out the Prometheus stack was a terrible idea. If you want to store metrics in Elastic, which I've done and which always ends in tears, that is fine. My concern is not keeping it Prometheus-compatible until the last second.

If I'm a customer and say "Hey, my applications are emitting Prometheus metrics, how do I scrape them?", what is your recommendation with the platform?

cpursley · 1y ago

This is a pretty neat product. Seems like the purpose is to allow deploying and selling of self-hosted instances?

zacksiri (OP) · 1y ago

Thank you! Yes! That's the main focus moving forward. Parts of it are still being built out. Essentially, we want to enable an App Store-like experience for web applications. Open source developers should be able to monetize their applications by selling instances of their app to people who are non-technical but need their apps. Developers get paid via Stripe Connect.

cpursley · 1y ago

Neato. You might hear from me in the near future. I'm working on a self-hosted CRM/Email Marketing/Drip type of app with Phoenix LiveView. I'm assuming you handle Elixir as a first-class citizen, based on your blog post. I also have some hybrid astrojs apps I'm considering productizing.

zacksiri (OP) · 1y ago

Hey there! Founder of Opsmaru here. I didn't expect the post to get to the first page after it didn't get upvoted when I first posted it. Happy to answer any questions about the product and this post!

lydericlandry · 1y ago

Do you also support Incus (the LXD fork)?

zacksiri (OP) · 1y ago

We were planning on supporting Incus. However, Incus dropped support for fan networking in favor of OVN for cross-node networking.

OVN is very heavy and requires a lot of management when it comes to provisioning and maintenance. So for now we didn't want to go there just yet.

We're sticking with LXD. It's been receiving a lot of updates from Canonical, and the team is responsive on the forum and has been a pleasure to work with.

Once we have some breathing room, we definitely want to explore Incus and see what networking options are out there.

Maybe we'll just adopt WireGuard and make that work out of the box with Incus in a future iteration.
