The other day, reading this Lobste.rs thread, I came across healthchecks.io (or hchk.io), a service for monitoring cron jobs.
The service is pretty simple: you specify the details of your cron job (schedule, grace time, etc.) and you are given a unique URL, which your job then pings once it’s finished, letting healthchecks.io know the job ran.
You can also ping URL/start first and then URL to track how long your job took to finish, or ping URL/fail to signal a failure.
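To make the ping flow concrete, here is a minimal Python sketch of a wrapper that pings URL/start, runs a job, and then pings either URL or URL/fail. It assumes a ping is nothing more than an HTTP GET; the names run_with_pings and http_ping are made up for illustration and are not part of any healthchecks.io client.

```python
import urllib.request
from typing import Callable


def http_ping(url: str) -> None:
    """Send a plain GET request (assumed to count as a ping); ignore the body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def run_with_pings(check_url: str,
                   job: Callable[[], None],
                   ping: Callable[[str], None] = http_ping) -> bool:
    """Ping URL/start, run the job, then ping URL on success or URL/fail on error.

    Returns True if the job finished without raising, False otherwise.
    """
    ping(check_url + "/start")
    try:
        job()
    except Exception:
        # The job raised: report the failure instead of the normal "done" ping.
        ping(check_url + "/fail")
        return False
    ping(check_url)
    return True
```

The ping function is passed in as a parameter so the flow can be exercised without touching the network, for example by passing a list's append method and inspecting which URLs were requested.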
This is all pretty easy logic. There’s nothing special about the service, and as I monitor my services with Prometheus, I thought I could hack together a similar solution with Pushgateway.
What is Pushgateway?
First, a little background on how Prometheus gathers metrics.
Prometheus collects metrics by pulling: it actively reaches out to endpoints that expose metrics in the Prometheus format.
Every endpoint therefore has to be up and running whenever Prometheus decides to scrape it.
But how do I make Prometheus get metrics about my batch jobs, if they run only for a few minutes?
That’s where Pushgateway comes in. Pushgateway is a service that accepts metrics pushed to it, holds on to them and exposes them for Prometheus to scrape, something like voicemail for metrics.
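As a rough sketch of what "pushing" means here: Pushgateway accepts an HTTP POST (or PUT) to /metrics/job/<job name>, with the metrics in the Prometheus text format as the request body. The Python below builds such a request using only the standard library. The gateway address, job name and metric name in the usage note are assumptions for illustration.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build a POST request that pushes `body` to Pushgateway under `job`.

    `body` must be in the Prometheus text format and end with a newline.
    This sketch assumes simple job names (no slashes).
    """
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


def push(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

With a Pushgateway listening on its default port, a call would look like push(build_push_request("http://localhost:9091", "nightly_backup", "nightly_backup_ran 1\n")), where nightly_backup and nightly_backup_ran are hypothetical names.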
Why replace healthchecks.io?
Less cost
healthchecks.io has three pricing tiers (as of August 2019):
- Hobbyist: $0/mo for 20 jobs and 3 team members
- Business: $16/mo for 100 jobs, 10 team members and 50 SMS & WhatsApp monthly alerts
- Business Plus: $64/mo for 1000 jobs, unlimited team members and 500 SMS & WhatsApp monthly alerts
If you stay within the Hobbyist tier, you’ll be fine, but at $16/mo for 100 jobs, I’d rather write my own alternative.
With Prometheus, you get unlimited jobs, unlimited team members and unlimited alerts through whatever channels you choose. The price is running a simple stack of three services and a slightly less “turn-key” feel.
More features
- Pushgateway supports labeling metrics, so you can label your pings with a stage label and let Pushgateway know the status of each of your job’s stages.
- Pushgateway supports pushing many metrics at once, letting you export more insight about your job every time you push.
- Prometheus lets you better tune which rules trigger an alert, who in the team gets notified, etc.
- With this setup, your jobs report status over the LAN, letting you restrict internet access to your jobs, which is a nice security plus.
- This setup keeps your IP address private (not like it’s sensitive information if you are running internet services anyway, but it’s nice to have).
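To illustrate the stage label and multi-metric pushes mentioned above, here is a small Python sketch. In Pushgateway, labels can be appended to the push path as /<label name>/<label value> pairs after the job name, and one request body can carry several metrics, one per line. The helper names and the metric names in the comments are hypothetical, and the sketch assumes label values without slashes.

```python
from typing import Mapping
from urllib.parse import quote


def grouping_path(job: str, labels: Mapping[str, str]) -> str:
    """Build the push path /metrics/job/<job>/<label>/<value>/... for a group."""
    parts = ["metrics", "job", quote(job, safe="")]
    for name, value in labels.items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/" + "/".join(parts)


def render_metrics(metrics: Mapping[str, float]) -> str:
    """Render several metrics as one text-format body, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


# Example with made-up names: one push for the "upload" stage of a backup job,
# carrying two metrics at once.
# path = grouping_path("backup", {"stage": "upload"})
# body = render_metrics({"backup_files_total": 42, "backup_bytes_total": 1024})
```

Because the labels in the path form a separate group, pushing for stage "upload" does not overwrite what was pushed earlier for another stage of the same job.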
Note: the only downside to this setup is Alertmanager’s subset of supported alert channels, but it’s not like it’s too hard to write your own client.
How to replace healthchecks.io?
Install the Prometheus stack
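- Install and run Prometheus, Pushgateway and Alertmanager.
- Connect Prometheus to Alertmanager: edit prometheus.yml and list the Alertmanager address under the alerting section.
- Make Prometheus scrape Pushgateway: edit prometheus.yml and add Pushgateway as a target under scrape_configs (the Pushgateway README recommends setting honor_labels: true for this scrape job).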
Make jobs push metrics to Pushgateway
Once you have all that interconnected, you can go ahead and push metrics to Pushgateway, for Prometheus to scrape:
- Monitoring job execution:
- Monitoring job success or failure:
- Measuring job execution time:
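The three cases above can be sketched in Python as follows. This is not necessarily how the original commands looked; it is one possible way to do it with the standard library. The metric names (job_last_run_timestamp_seconds, job_last_success, job_duration_seconds) and the function names are assumptions made up for this example.

```python
import time
import urllib.request
from typing import Callable, Dict


def run_and_measure(job: Callable[[], None],
                    clock: Callable[[], float] = time.time) -> Dict[str, float]:
    """Run the job and return metrics on execution, success and duration."""
    started = clock()
    try:
        job()
        success = 1
    except Exception:
        success = 0
    finished = clock()
    return {
        # Hypothetical metric names, chosen for this sketch only.
        "job_last_run_timestamp_seconds": finished,  # the job executed
        "job_last_success": success,                 # 1 = success, 0 = failure
        "job_duration_seconds": finished - started,  # execution time
    }


def to_exposition(metrics: Dict[str, float]) -> str:
    """Render the metrics in the Prometheus text format, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


def push_metrics(gateway: str, job_name: str, metrics: Dict[str, float]) -> int:
    """POST the metrics to Pushgateway under /metrics/job/<job_name>."""
    url = f"{gateway.rstrip('/')}/metrics/job/{job_name}"
    data = to_exposition(metrics).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A job script would then end with something like push_metrics("http://localhost:9091", "nightly_backup", run_and_measure(do_backup)), where the address, the job name and do_backup are placeholders.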
See Pushgateway’s README.md for more information on how to push metrics.
Configure alerting upon job failure
- Reference a rules file (rules.yml here) from the rule_files section of prometheus.yml, and define the alerting rule for failed jobs in rules.yml.
- Finally, configure an Alertmanager alert receiver and start getting notified whenever your jobs fail.
Note: check out my previous post to integrate Alertmanager with Amazon SES for email alerts without worrying about setting up SMTP servers.