Ask HN: How do you monitor your systemd services?

127 pointswh33zle2y ago60 comments

I am using systemd on my machine and try to configure most things through it. For example, I have a backup job that is triggered by a timer. I want to know when that job fails so I can investigate and fix it. Over time, I've had multiple solutions for this:

Send a notifcation via notify-send

Add `systemctl --failed` to my shell startup script

Send myself emails

None of these are quite ideal. Notifications are disruptive of the current workflow and ephemeral, meaning I might forget about it if I don't deal with it immediately. Similarly, reading `systemctl --failed` on every new terminal is also disruptive but at least it makes me not forget about it. Both of these are also not really applicable to server systems. Sending myself emails feels a bit wrong but has so far been the best solution.

How are other people solving this? I did some research and I am surprised that there isn't a more rounded solution. I'd expect that pretty much every Linux user must run into this problem.

60 comments

gjulianm2y ago

Short answer: Prometheus + Grafana + Alertmanager. prometheus_node_exporter has an option to export SystemD service status and you can alert on failed services, and you can use Alertmanager to configure multiple types of alarms, including repeats so you don't forget.

Long answer: Whenever I've started to add alerting and monitoring to a system, I end up wanting to add more things each time, so I find it valuable to start from the beginning with an extensible system. For me, Prometheus has been the best option: easy to configure, lightweight, doesn't even need to run in the host, and can monitor multiple systems. You just have to configure which exporters you want it to pull data from. In this case, prometheus_node_exporter has a massive amount of stats about a system (including SystemD), and there are default alarms and dashboards out there that will help you create basic monitoring in a minute.

You can choose to use Grafana for visualization, and then either the integrated Grafana alerting or use the Prometheus alerting + Prometheus Alertmanager. I think in the latest versions Grafana Alerting includes basically an embedded AlertManager so it should have the same features.

Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications. I find it the best option tbh.

derefr2y ago

> Short answer: Prometheus + Grafana + Alertmanager.

Or, a higher-level recommendation, appropriate for most SMBs: sign up for Grafana Cloud's managed prometheus+grafana (or any equivalent external managed monitoring stack), and then follow their setup instructions to install their grafana-agent monitoring agent package (which sticks together node_exporter, several other optional exporters enable-able with config stanzas (e.g. redis_exporter, postgresql_exporter, etc.), and a log multiplexer for their Loki logging service [which is to logs-based metrics as Prometheus is to regular time-series metrics.])

Why use a managed service? Because, unless your IT department is large enough to have its own softball team, the stability/fault-tolerance of the "monitoring and alerting infra" itself is going to be rather low-priority for the company compared to other things you're being asked to manage; and also will be something you rarely need to touch... until it breaks. Which it will.

You really want some other IT department whose whole job is to just make sure your monitoring and alerting stay up, doing this as their product IT focus rather than their operations IT focus.

(You also want your alerting to be running on separate infra from the thing it watches, for the same reason that your status page should be on a separate domain and infra from the system it reports the status of. Having some other company own it is an easy way to achieve this.)

> Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications.

Again, higher-level rec appropriate for SMBs: sign up for PagerDuty, and configure it as the alert "Notification Channel" in Grafana Cloud. If you're an "ops team of one", their free plan will work fine for you.

Why is this better than Telegram messages? Because the PagerDuty app does "critical alerts" — i.e. its notifications pierce your phone's silent/do-not-disturb settings (and you can configure them to be really shrill and annoying.) You don't want people to be able to call you at 2AM — but you do want to be woken up if all your servers are on fire.

---

Also: if you're on a cloud provider like AWS/GCP/etc, it can be tempting to rely on their home-grown metrics + logging + alerting systems. Which works, right up until you grow enough to want to move to a hybrid "elastic load via cloud instances; base load via dedicated hardware leasing" architecture. At which point you suddenly have instances your "home cloud" refuses to allow you to install its monitoring agent on. Better to avoid this problem from the start, if you can sense you'll ever be going that way. (But if you know your systems aren't "scaling forever" and you'll stay in the cloud, the home-grown cloud monitoring + alerting systems are fine for what they do.)

rudasn2y ago

So about 3 years ago we had a bunch of on prem servers shutting down around March/April. We had even more servers that weren't shutting down so we had to "move fast" before they all had issues.

I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).

The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.

No luck. After a week I had nothing to show for.

So I turned to Netdata. A one liner on each server and we had super sexy and fast dashboard for each server. No birds eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

No luck. By the end of week 2 I still had nothing, but a bunch of servers shutting down during peak hours.

Week 3 I said fuck it I'll do the stupidest thing and write my own stack. A bunch of shell scripts, deployed via ansible, capturing any metric I could think of, managed by systemd, posting to a $5/month server running a single nodejs service that would do in memory (only) averages, medians etc, and trigger alerts (email, sms, Slack maybe soon) when things get yellow or red.

By week 4 we had monitoring for all servers and for any metric we really needed.

Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.

I really wanted to use something else, but I just couldn't :(

andrewm48942y ago

> So I turned to Netdata. A one liner on each server and we had super sexy and fast dashboard for each server. No birds eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).

I work in Netdata on ML. Just wanted to mention that as of last release a parent node will show all children in the agent dashboard so if doing again as of today a parent netdata might have got you the birds eye view as a starting point https://github.com/netdata/netdata/releases/tag/v1.41.0#v141...

(of course we also have Netdata Cloud which would have probably worked too but maybe was not as built out 3 years ago as is now - but don't want to go into sales mode and get blasted :) )

1 more reply

mike_hearn2y ago

So .... why were the servers shutting down, and what metric did your own system capture that the others didn't which let you determine that?

1 more reply

droopyEyelids2y ago

If you have a one-off server running nodejs, you've definitely got maintenance

1 more reply

ownagefool2y ago

Alternative view point.

Observability is hella expensive. Orgs should consider TCO when making such decisions. Paying a few hundred thousands more for the skills to self run could literally chop tens of millions off vendor bills.

happymellon2y ago

But then you aren't taking into account server and storage costs of self managed monitoring.

Unless it's Datadog. That's expensive.

1 more reply

scrapheap2y ago

I can definitely vouch for Prometheus for monitoring - once you get your head round the basics then it's so easy to extend it to monitor other things.

gregmac2y ago

I don't monitor services at that level at all, because it means basically nothing. More acutely: the the lack of a notification doesn't tell mean everything is "ok".

I tend to monitor the actual service. If it's a web server, have something checking that a specific URL is working (tip: use something specific, not /). Likewise any other network service is pretty easy to monitor.

For backups, check the date on the most recent file in the backup target location. If that date is older than "x", something is broken. This can apply to most other types of backend apps too -- everything has some kind of output.

It's when these checks fail that you can investigate deeper and start diagnosing systemd or whatever. It's also possible there's a bigger problem -- like DNS got messed up, or the hardware died -- and checking the final outcome will catch all this.

Basically explicitly checking systemd is a lot of extra work for no real added benefit. If your systemd service is failing often enough that knowing that is the problem immediately (at the alert level) IMHO you'd be better off to spend the time fixing the service definition so it doesn't fail.

tetha2y ago

As a counterpoint - we do both, but these two things have different goals.

Checks from the point of view of an end user are the gold standard if the service is functioning and functioning well enough. I very much agree with this. For example, with the case of postgres, something like sharp increases or decreases in query throughput or query durations is something to alert on, because this will negatively impact the applications depending on it.

However, we have incrementally implemented additional checks and dependencies between checks to speed up troubleshooting complex systems during an emergency. Instead of on-call having to, e.g., check postgres, check patroni, check consul, check consul server cluster, go back, check network, check certificates... zabbix can already compile this into a statement like "postgres is down, but that is caused by patroni not reaching the DCS, but that's caused by the consul client being down.. however, the service is running and the certificates are fine and the consul-server cluster is also fine".

tehalex2y ago

If you are ok with a Saas and if it's just scheduled jobs that you are monitoring, there are a number of monitoring tools where you tell when job completes (with a http request) and a missing ping (after a grace period) means that it failed.

I think https://deadmanssnitch.com/ may have been the original service for this.

https://healthchecks.io/ has a fairly generous free tier that I use now.

There are others that do the same thing Sentry, Uptime Robot, ...

choffee2y ago

https://cronitor.io/ is another option here that works for me. You can set up rules like "It should run once a day and return after at least this amount of time and also return a number greater than 1" Then just use come curl calls to your scripts at start and end and you are good to go.

hjuutilainen2y ago

Another happy healthchecks.io user here. You can also run it on your own infra: https://healthchecks.io/docs/self_hosted/

sickill2y ago

+100 for https://healthchecks.io/

codegeek2y ago

You can also try https://cronhub.io (I run it)

chasil2y ago

> "I have a backup job that is triggered by a timer. I want to know when that job fails so I can investigate and fix it."

This is really more in the realm of a shell script.

You could do this verbosely:

  #!/bin/sh

  /path/to/my/backup_job

  if [ $? -ne 0 ]
  then /path/to/my/failure_alert
  fi

...or, you could do this tersely:

  #!/bin/sh

  /path/to/my/backup_job || /path/to/my/failure_alert

The wrapper script would go into your timer unit. I like dash.

justin_oaks2y ago

That might be a good first step, but certainly isn't sufficiently robust.

What happens when when the /path/to/my/failure_alert script fails?

What happens when your backup job returns success but didn't generate any output?

What happens when you turn off the systemd timer for a while and forget to turn it back on?

What happens when the server stops running, has a full disk, or has a networking issue?

Ultimately, some of the other answers are better. You should have a separate system monitoring this. And that separate system should track every time a backup happens, either by checking the backup exists at the target location (good), or checking that the backup system sent a "Yes, I did a backup" message (ok, but not as good).

I use Telegraf for data collection, InfluxDB (v1) as a time series database, and Grafana (v7) for graphing and alerts. I'm using an older version of InfluxDB and Grafana because they just work and keep on working. Many other tools will work just as well as these do. I'm just giving them as an example.

Such a system may seem like overkill to just keep track of a few things, but you need something that'll tell you when you get no data. So at a minimum you'll want something on a separate server and you'll want it to send alerts when an expected event doesn't happen.

chasil2y ago

The original poster asked for simple detection of non-zero exit status.

What you speak of is far, far beyond the original question.

I am quite pleased with the reaction to my post, and I do not feel the need to compare technical merit.

Perhaps you would be happier with JCL?

In any case, enjoy your tooling.

javajosh2y ago

That's great but isn't the real question about what goes in /path/to/my/failure_alert?

chasil2y ago

The original poster hinted that "notifications" and email were options.

For SMS text message notifications, I use an AWK script to send SMTP to an email-SMS gateway. I try to keep these under the 160 character limit, only sent in extraordinary situations (high server room temp, decoy port triggering on the firewall hinting an intrusion, etc). I don't want this blowing up my phone.

For email, I have a MIME pack script that allows me to send a message with an arbitrary number of base64-encoded attachments.

Does that cover what might be in a failure alert script?

PhilipRoman2y ago

I was building an elaborate job monitoring system, but then I realized that what I really need is monitoring the actual end to end functionality.

For example, instead of monitoring my Minecraft server process that OpenRC spawns, I have a dedicated monitoring server that actually queries the server for version, number of players, etc. Same for websites, etc. Think of it as periodically running an integration test on a live system.

This way I get much more confidence that the service is doing what it should.

I'm not a big fan of over complicated monitoring systems - I simply have a script that builds a HTML status page with enough information to know when something goes wrong.

MiguelHudnandez2y ago

Broad alerts are really good to have. Narrow metrics are great to have once something goes wrong. When a server does go down, what did CPU, memory, disk IO look like? Did the request count climb quickly before the outage? Having those other metrics help for speedy troubleshooting -- Is it a software problem that got out of control or did some piece of hardware die or get throttled?

I'm of the opinion that having charts and graphs to rely on can focus troubleshooting resources more quickly onto the most actionable areas.

arjvik2y ago

I love https://ntfy.sh/ for my services running on headless servers - it lets me ping my phone with messages of varying urgency, and even duplicate the notifications to email for particularly information-dense messages.

2bluesc2y ago

I use the `OnFailure` property to trigger a service that emails me for failed services like backups which are run as system timers + service.

I also use `failure-monitor` which is Python service that monitors `journald`.

Files on Github for those interested:

https://github.com/kylemanna/systemd-utils

mcpherrinm2y ago

I run the Prometheus node_exporter on my servers. That has a systemd collector for the state of services.

That reports the state of all systemd services to a central Prometheus and alertmanager cluster, which has various alert rules.

pphysch2y ago

I recommend this approach, as you can also deploy blackbox_exporter to get an "outside" perspective on service status.

scrapheap2y ago

You can also restrict the list of systemd services that node_exporter provides metrics for - that way you only store metrics for those services you care about.

kelnos2y ago

> Notifications are [...] ephemeral, meaning I might forget about it if I don't deal with it immediately.

If you do like the notification method aside from this issue, try passing "--urgency=critical" or "--expire-time=0" to notify-send. Either (or both) of those should make the notifications stay popped up, assuming your notification daemon is doing something reasonable with those hints.

(Disclosure: I'm the author of xfce4-notifyd, which does behave in this way; other daemons may do other things.)

bravetraveler2y ago

This thread is one of those cases where you read something and realize you've been completely missing something. I don't monitor these as much as I should

Servers/services? Definitely - take your pick. Timers/jobs, particularly those on my system? Nothing!

With the right directives laid out ('Wants/Requires/Before/After'), they can be pretty robust/easily forgotten.

I've been lucky in this regard; I check 'systemctl list-timers' just to be sure - but they always run

mxuribe2y ago

@wh33zle For work, well, i have to follow already-established convention (some that others have noted). but for personal machines, i have not rolled out too many comprehensive monitoring solutions or platforms. Rather, i add focus on moitoring specific jobs/tasks, and as such leverage cron to run the job, and use basic, old school sorts of bash scripts to assess success or failure. I'm statrting to look into leveragin more systemd as you noted.

Now, specific to alerting, well, i have rolled out my own solution...Caution: self-promotion coming next...

I stopped relying on email being sent from servers since i've had too many annoyances, constraints in my history. Also, nowadays email is a medium that is slow for me...that is, i treat it like its non-time-sensitive3 messaging (for the majortiy of the time). So, for system alert-style messagings, I use my own little python script that sends messages into a dedicated matrix room. Since, i'm always on matrix, its a place where i can quickly see a new system alert messaage (matrix clients like Element allow you to adjust visibility - i think they call it noise level - of which messages are given higher or lower priority for the client vieew, etc.). And, those messages tend to be ephemeral, since they're just alerts, and such messages do not pollute my email inbox. There are plenty of options in this space of course. Mine is not the only one, but i also wanted to learn how to make apps for matrix ecosystem, etc. Here's a link to my little notification app/script that leverages the matrix network chat ecosystem: https://github.com/mxuribe/howler

veyh2y ago

Uptime-Kuma [1] with ntfy [2]. Most of my services expose HTTP so I just have Uptime-Kuma monitor that. But if you have something that is not exposed to the public you can still use a "push" type monitor, and in a cron job on your server(s), send heartbeat to it when everything is working.

[1] https://github.com/louislam/uptime-kuma

[2] https://ntfy.sh/

Phelinofist2y ago

I use Nagios, easy, lean and gets the job done

g4zj2y ago

I occasionally try other monitoring solutions, but always end up returning to Nagios. It's just what I need it to be and very little extra.

mike_hearn2y ago

Sadly there are lots of basic must-have tasks that Linux distros simply do not support out of the box. It's not so much an OS as a kit for making operating systems. Backup is another.

Here's how I set up email monitoring of systemd services, for anyone who wants it:

https://gist.github.com/mikehearn/f1db694f24eaa05c753e5a7598...

It consists of three parts. Firstly a shell script that will email the unit status colorized to your preferred email address. Secondly, a service file that tells systemd how to call it, and finally, an OnFailure line in each service that you want to monitor. You can use systemd's support for overlays to add this to existing services you didn't write yourself.

You also have to make sure that your server can actually send mail to you. Installing default-mta will get you an SMTP relay that's secure out of the box but your email service will consider it spam. If you use gmail it's typically sufficient to just create a filter that ensures emails from your server are never marked as spam.

nurettin2y ago

I have a carefully designed alert service for every project, checking various aspects of the system. It periodically checks heartbeats from various systems to make sure everything is in order. It sends alerts to UI via websocket, and to slack channels and makes calls to twilio numbers if things do not self-recover in time. I only check if the alert system is running via cron.

eternityforest2y ago

For monitoring and alerts I look to how industrial SCADA does it.

Unfortunately I have no code to share, because... I'm a dev, rather than a sysadmin, and I do backups and such at home with the GUI, and I don't work 9m anything microservicy, so I've only done monitoring of features within one monolithic application.

My preferred way to monitor a backup task would just be to use a backup tool that had it's own monitoring built in, or integrations with a popular monitor solution. I've done DIY backup scripts, it always seems so simple that you might as well just write a few lines... But it's also so common of a use case that there's lots of really nice options.

I've done the systemd --failed thing on every new terminal, and probably should go back to doing so, but it doesn't do much if you're not logging in regularly. Although it does help when you're logging in to see what went wrong.

But the general idea when I have actually implemented monitoring, is that you have state machine alerts. They go from normal, to tripped, to active.

If you acknowledge it, it becomes acknowledged, if it bad condition goes away, it becomes cleared, and returns to normal when acknowledged(Or instantly, if auto-ack is selected).

Every alert has a trip condition, which can be any function on one or more "Tag points"(Think observable variables with lots of extra features).

A tripped alert only becomes active if it remains tripped for N seconds, to filter out irrelevant things caused by normal dropped packets and such, while still logging them.

While an alert is active, it shows in the list on the server's admin page, and can periodically make a noise or do some reminder. Eventually I'd like to find some kind of MQTT dashboard solution that shows everything in one place, and sends messages to an app, but I haven't needed anything like that yet.

Under the hood the model is fairly complex but you don't have to think about it much to use it.

INTPenis2y ago

Only send alerts from the end user perspective. In your case the end user would most likely go into the backups and list them. So I would have a job that lists the backups every day and if something is missing it alerts.

See the difference here is that you don't monitor the systemd backup job, you monitor the backup backend instead. Because systemd can be configured to retry a job, the end result is in the backend.

And in other cases I do have monitoring for individual services, but I only send alerts if the end user experiences an issue. So a web server process/systemd unit is being monitored, but the alert is on a different monitor that checks if the website returns 200, or if it contains a keyword indicating it works.

speedyapoc2y ago

I push telemetry to Amazon CloudWatch (my infrastructure is on AWS) and then setup alarms accordingly. If I'm concerned about a service failing or becoming unresponsive, it's easy to create an alarm based on the existence or non-existence of data.

kiririn2y ago

`0 * * * * journalctl --since="61 minutes ago" --priority=warning --quiet`

In crontab piped to a bunch of grep -v for the things I want to ignore

So basically the email approach, just have to be religious about marking unread if not immediately actioned

e12e2y ago

What kind of warning (and up) messages do want to keep in your journal - but ignore in your alerts? Ie why adjust your grep patterns - not what you log - to increase signal to noise ratio?

jiehong2y ago

I’d go with sending myself a push notification on the phone through a dedicated service file and then call it with this in my unit file:

OnFailure=send-push-notification.service

Perhaps via a WhatsApp notification or any other instant message [0] or any other service such as matrix as said in another comment.

[0] https://developers.facebook.com/docs/whatsapp/cloud-api/get-...

SoftTalker2y ago

Run the job via cron, if it fails you'll get an email sent according to your system's alias file. You can also grep the logs for failure if you think you didn't get the email.

m30472y ago

In general this evolves to a SIEM-like solution in IT or gets added to the tag menagerie in OT.

If you're focused on "notifications are bad" note that notifications are push, and pull solutions are possible. Tail logs (or journalctl) and post significant events to Redis (https://github.com/m3047/rkvdns_examples/tree/main/totalizer...) for example.

OJFord2y ago

Everyone seems to be talking about production services & headless servers, but my impression is that you meant on the desktop?

I wrote a little script that puts a failed service count in waybar, and throws up a dismissable swaynag message with buttons to 'toggle details', and reset or restart the failed system/user units.

It's a bit noisy at the moment - but I think that's probably just a helpful indication of units I need to sort out/make a bit more robust anyway.

dig12y ago

This combo does the job for me: grafana + riemann + influxdb and collectd as the main agent. collectd bundles many plugins so you can watch logs, monitor running processes or have something custom [1]. This setup is very light to start with and can scale well (up until you hit influxdb limits :D).

[1] https://github.com/mbachry/collectd-systemd

javajosh2y ago

I think this is a great question. Consider https://blog.wesleyac.com/posts/how-i-run-my-servers. His unit files do not mention monitoring.

HankB992y ago

I've been using Checkmk (raw - e.g. free) to monitor stuff in my home lab (mostly for other things.) It has notified me of some failed Systemd services.

dsr_2y ago

We don't use systemd, so we haven't had issues with it.

Things get deployed by the automatic deployment system. If they go in cron, they are supervised by a program called errorwatch which does all the things that you want in a one-shot supervisor: logging, error codes, time bounds, checking for right output, checking for wrong output. If they are daemonic, they get /etc/init.d/ start/stop scripts that have been tested.

If they have a habit of dying and we can't afford that and we can't fix it, we run them from daemontools instead of init.d.

cyfex2y ago

> Sending myself emails feels a bit wrong but has so far been the best solution.

Why does email feel wrong? I find it a pretty viable solution.

kazinator2y ago

I have an /etc/inittab entry with

  stmd:2345:respawn:/bin/systemd

renewiltord2y ago

We just fire off a Slack message. It does the trick.

johnea2y ago

I generally just call up goggle and ask them how my system is doing...

j / k navigate · click thread line to collapse

60 comments

gjulianm2y ago

Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications. I find it the best option tbh.

derefr2y ago

> Short answer: Prometheus + Grafana + Alertmanager.

You really want some other IT department whose whole job is to just make sure your monitoring and alerting stay up, doing this as their product IT focus rather than their operations IT focus.

> Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications.

---

rudasn2y ago

So about 3 years ago we had a bunch of on prem servers shutting down around March/April. We had even more servers that weren't shutting down so we had to "move fast" before they all had issues.

The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.

No luck. After a week I had nothing to show for.

No luck. By the end of week 2 I still had nothing, but a bunch of servers shutting down during peak hours.

By week 4 we had monitoring for all servers and for any metric we really needed.

Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.

I really wanted to use something else, but I just couldn't :(

andrewm48942y ago

(of course we also have Netdata Cloud which would have probably worked too but maybe was not as built out 3 years ago as is now - but don't want to go into sales mode and get blasted :) )

1 more reply

mike_hearn2y ago

So .... why were the servers shutting down, and what metric did your own system capture that the others didn't which let you determine that?

1 more reply

droopyEyelids2y ago

If you have a one-off server running nodejs, you've definitely got maintenance

1 more reply

ownagefool2y ago

Alternative view point.

happymellon2y ago

But then you aren't taking into account server and storage costs of self managed monitoring.

Unless it's Datadog. That's expensive.

1 more reply

scrapheap2y ago

I can definitely vouch for Prometheus for monitoring - once you get your head round the basics then it's so easy to extend it to monitor other things.

gregmac2y ago

I don't monitor services at that level at all, because it means basically nothing. More acutely: the the lack of a notification doesn't tell mean everything is "ok".

tetha2y ago

As a counterpoint - we do both, but these two things have different goals.

tehalex2y ago

I think https://deadmanssnitch.com/ may have been the original service for this.

https://healthchecks.io/ has a fairly generous free tier that I use now.

There are others that do the same thing Sentry, Uptime Robot, ...

choffee2y ago

hjuutilainen2y ago

Another happy healthchecks.io user here. You can also run it on your own infra: https://healthchecks.io/docs/self_hosted/

sickill2y ago

+100 for https://healthchecks.io/

codegeek2y ago

You can also try https://cronhub.io (I run it)

chasil2y ago

> "I have a backup job that is triggered by a timer. I want to know when that job fails so I can investigate and fix it."

This is really more in the realm of a shell script.

You could do this verbosely:

  #!/bin/sh

  /path/to/my/backup_job

  if [ $? -ne 0 ]
  then /path/to/my/failure_alert
  fi

...or, you could do this tersely:

  #!/bin/sh

  /path/to/my/backup_job || /path/to/my/failure_alert

The wrapper script would go into your timer unit. I like dash.

justin_oaks2y ago

That might be a good first step, but certainly isn't sufficiently robust.

What happens when when the /path/to/my/failure_alert script fails?

What happens when your backup job returns success but didn't generate any output?

What happens when you turn off the systemd timer for a while and forget to turn it back on?

What happens when the server stops running, has a full disk, or has a networking issue?

chasil2y ago

The original poster asked for simple detection of non-zero exit status.

What you speak of is far, far beyond the original question.

I am quite pleased with the reaction to my post, and I do not feel the need to compare technical merit.

Perhaps you would be happier with JCL?

In any case, enjoy your tooling.

javajosh2y ago

That's great but isn't the real question about what goes in /path/to/my/failure_alert?

chasil2y ago

The original poster hinted that "notifications" and email were options.

For email, I have a MIME pack script that allows me to send a message with an arbitrary number of base64-encoded attachments.

Does that cover what might be in a failure alert script?

PhilipRoman2y ago

I was building an elaborate job monitoring system, but then I realized that what I really need is monitoring the actual end to end functionality.

This way I get much more confidence that the service is doing what it should.

I'm not a big fan of over complicated monitoring systems - I simply have a script that builds a HTML status page with enough information to know when something goes wrong.

MiguelHudnandez2y ago

I'm of the opinion that having charts and graphs to rely on can focus troubleshooting resources more quickly onto the most actionable areas.

arjvik2y ago

2bluesc2y ago

I use the `OnFailure` property to trigger a service that emails me for failed services like backups which are run as system timers + service.

I also use `failure-monitor` which is Python service that monitors `journald`.

Files on Github for those interested:

https://github.com/kylemanna/systemd-utils

mcpherrinm2y ago

I run the Prometheus node_exporter on my servers. That has a systemd collector for the state of services.

That reports the state of all systemd services to a central Prometheus and alertmanager cluster, which has various alert rules.

pphysch2y ago

I recommend this approach, as you can also deploy blackbox_exporter to get an "outside" perspective on service status.

scrapheap2y ago

You can also restrict the list of systemd services that node_exporter provides metrics for - that way you only store metrics for those services you care about.

kelnos2y ago

> Notifications are [...] ephemeral, meaning I might forget about it if I don't deal with it immediately.

(Disclosure: I'm the author of xfce4-notifyd, which does behave in this way; other daemons may do other things.)

bravetraveler2y ago

This thread is one of those cases where you read something and realize you've been completely missing something. I don't monitor these as much as I should

Servers/services? Definitely - take your pick. Timers/jobs, particularly those on my system? Nothing!

With the right directives laid out ('Wants/Requires/Before/After'), they can be pretty robust/easily forgotten.

I've been lucky in this regard; I check 'systemctl list-timers' just to be sure - but they always run

mxuribe2y ago

Now, specific to alerting, well, i have rolled out my own solution...Caution: self-promotion coming next...

veyh2y ago

[1] https://github.com/louislam/uptime-kuma

[2] https://ntfy.sh/

Phelinofist2y ago

I use Nagios, easy, lean and gets the job done

g4zj2y ago

I occasionally try other monitoring solutions, but always end up returning to Nagios. It's just what I need it to be and very little extra.

mike_hearn2y ago

Sadly there are lots of basic must-have tasks that Linux distros simply do not support out of the box. It's not so much an OS as a kit for making operating systems. Backup is another.

Here's how I set up email monitoring of systemd services, for anyone who wants it:

https://gist.github.com/mikehearn/f1db694f24eaa05c753e5a7598...

nurettin2y ago

eternityforest2y ago

For monitoring and alerts I look to how industrial SCADA does it.

But the general idea when I have actually implemented monitoring, is that you have state machine alerts. They go from normal, to tripped, to active.

If you acknowledge it, it becomes acknowledged, if it bad condition goes away, it becomes cleared, and returns to normal when acknowledged(Or instantly, if auto-ack is selected).

Every alert has a trip condition, which can be any function on one or more "Tag points"(Think observable variables with lots of extra features).

A tripped alert only becomes active if it remains tripped for N seconds, to filter out irrelevant things caused by normal dropped packets and such, while still logging them.

Under the hood the model is fairly complex but you don't have to think about it much to use it.

INTPenis2y ago

See the difference here is that you don't monitor the systemd backup job, you monitor the backup backend instead. Because systemd can be configured to retry a job, the end result is in the backend.

speedyapoc2y ago

kiririn2y ago

`0 * * * * journalctl --since="61 minutes ago" --priority=warning --quiet`

In crontab piped to a bunch of grep -v for the things I want to ignore

So basically the email approach, just have to be religious about marking unread if not immediately actioned

e12e2y ago

What kind of warning (and up) messages do want to keep in your journal - but ignore in your alerts? Ie why adjust your grep patterns - not what you log - to increase signal to noise ratio?

jiehong2y ago

I’d go with sending myself a push notification on the phone through a dedicated service file and then call it with this in my unit file:

OnFailure=send-push-notification.service

Perhaps via a WhatsApp notification or any other instant message [0] or any other service such as matrix as said in another comment.

[0] https://developers.facebook.com/docs/whatsapp/cloud-api/get-...

SoftTalker2y ago

Run the job via cron, if it fails you'll get an email sent according to your system's alias file. You can also grep the logs for failure if you think you didn't get the email.

m30472y ago

In general this evolves to a SIEM-like solution in IT or gets added to the tag menagerie in OT.

OJFord2y ago

Everyone seems to be talking about production services & headless servers, but my impression is that you meant on the desktop?

I wrote a little script that puts a failed service count in waybar, and throws up a dismissable swaynag message with buttons to 'toggle details', and reset or restart the failed system/user units.

It's a bit noisy at the moment - but I think that's probably just a helpful indication of units I need to sort out/make a bit more robust anyway.

dig12y ago

[1] https://github.com/mbachry/collectd-systemd

javajosh2y ago

I think this is a great question. Consider https://blog.wesleyac.com/posts/how-i-run-my-servers. His unit files do not mention monitoring.

HankB992y ago

I've been using Checkmk (raw - e.g. free) to monitor stuff in my home lab (mostly for other things.) It has notified me of some failed Systemd services.

dsr_2y ago

We don't use systemd, so we haven't had issues with it.

If they have a habit of dying and we can't afford that and we can't fix it, we run them from daemontools instead of init.d.

cyfex2y ago

> Sending myself emails feels a bit wrong but has so far been the best solution.

Why does email feel wrong? I find it a pretty viable solution.

kazinator2y ago

I have an /etc/inittab entry with

  stmd:2345:respawn:/bin/systemd

renewiltord2y ago

We just fire off a Slack message. It does the trick.

johnea2y ago

I generally just call up goggle and ask them how my system is doing...

j / k navigate · click thread line to collapse