Understanding Chronos: Keeping track of Cron tasks in a systematic
Editor Note: This post originally appeared on the Sitewards blog
Editor Note: This post originally appeared on the Sitewards blog
Cron is basically ubiquitous as a service available on Linux machines, having it’s roots in Unix in the 1980s
It is a scheduled task runner; something that will execute a command periodically based on a specification defined in a file called the “crontab” file. It is used by core Linux services, third party software or ad-hoc scripts developed by the system administrator.
However, in running cron in production there are a few things that leave it to be desired:
It seems to have a hard dependency on “mailing” logs, rather than storing them
There is no tracking of success over time
There is no locking mechanism preventing overlapping jobs
There are various systems that do not suffer these inadequacies (my favourite being Kubernetes CronJobs), but given it’s prevalence it’s worth spending a little time trying to address these shortcomings. So, here is what I am currently recommending:
* * * * * systemd-cat --identifier=magento_system_cron -- flock --timeout="900" /run/lock/magento_system_cron.lock /usr/bin/php /var/www/html/bin/magento cron:run; echo "magento_system_cron_exit_code $?" > /var/metrics/prometheus/magento_system_cron.prom
But, there’s quite some lessons going on there. So, let’s unpack it a little bit.
The task itself
The first one is super simple — somewhere in there is the task we actually want to run! It is below:
/usr/bin/php /var/www/html/bin/magento cron:run
That’s it. It’s the standard Magento crontask, as documented by the devdocs. This would work left in the crontab with simply * * * * *
, however it is often hard to debug what has gone wrong with this, or other tasks.
Addressing #1: Storing logs
There are several ways to store logs:
systemd-cat (my favourite)
redirecting it to a file, and appending
redirecting it to a file and rotating (with logrotate or something similar)
Newer versions of Linux (Debian 9, Ubuntu 16.04) ship with a new init system called “systemd”. Systemd replaces a bunch of Linux services such cron, init, syslog and so on. While this has caused some controversy, it does provide something called “systemd-journald”. This operates in compliment with, or replacement of, syslog. It is much easier to navigate, and provides thing such as filtering, search, different views and so on.
Additionally, this system provides a helper tool called “systemd-cat”. From the man page:
systemd-cat — Connect a pipeline or program’s output with the journal
So, pretty straight forward — it forwards stdout and stderr to the systemd-journal, where it can later be queried. It is implemented in the above cron task as follows:
systemd-cat --identifier=magento_system_cron --
Let’s break down what each component does.
systemd-cat
- Explained already--identifier=magento_system_cron
- Lets us query the systemd-journal with the commandjournalctl --identifier="magento_system_cron"
to just view logs associated with this task--
- Treat everything else as a positional parameter
So, by prefixing your command with the systemd-cat
syntax you can easily log your results! A trivial example:
$ systemd-cat --identifier="my_really_cool_identifier" echo "hi"
$ journalctl --identifier="my_really_cool_identifier"
-- Logs begin at Fri 2017-11-03 10:11:56 CET, end at Fri 2017-11-03 17:30:24 CET. --
Nov 03 17:29:43 __HOSTNAME_REDACTED__ my_really_cool_identifier[29113]: hi
Checkout the systemd-journald
man pages for more information
A warning: by default, the journal is often not persistent through reboots. You will need to make it so.
Addressing #2: Tracking success over time
It is no secret that I am a fan of the Prometheus TSDB. It allows picking up and reading data from basically anything, and provides an opinionated approach to monitoring that I have grown to respect over time. With the assistance of an exporter called the “node exporter”, we can expose our cron job health to Prometheus and track it over time.
First, the node exporter must be configured to read metrics out of a specific folder (in this case, /var/metrics/prometheus/
). How to do this is beyond the scope of this article, but checkout the GitHub page for more guidance here.
Once this configured it is super easy to keep track of the success of our jobs. We can rely on the exit code!
All processes as they exit are required to give some indication as to whether or not they exited successfully, called the “exit code” (man exit 2
, man exit 1
). These have been standardized to a limited extent in sysexits.h, but for our purpose a simple rule can be considered:
“non zero exit code means something went wrong”
So, we need to capture our exit code and store it for Prometheus to pick up. This turns out to be pretty easy — bash stores the exit code in a variable called $?
(man bash
), which we can store for Prometheus to scrape. In our cron task:
${COMMAND}; echo "magento_system_cron_exit_code $?" > /var/metrics/prometheus/magento_system_cron.prom
The format of this output is the Prometheus exposition format. It’s pretty easy to try this one:
$ false # This just exists unsuccessfully
$ echo $?
1
We open sourced our ansible role to set up the node exporter on GitHub. Take a look!
Addressing #3: Adding a locking mechanism
Firstly, let me just push that it should be the responsibility of the application being invoked to implement locking. However, many don’t (looking at you, Magento!) so it’s often easier to polyfill this.
A “lock” ensures that a resource (such as a file) cannot be accessed by multiple processes at once. It prevents multiple jobs from running simultaneously, which can stack up and kill a machine. To implement this, the tool flock
is used (man flock 2
). In the example above:
flock --timeout="900" /run/lock/magento_system_cron.lock ${COMMAND}
Let’s break it down:
flock
- Explained already--timeout="900
- Waits up to 900 seconds before giving up trying to get a lock/run/lock/magento_system_cron.lock
- The file to try and lock. It is created if it doesn't exist${COMMAND}
- The command you want to run
This prevents multiple versions of the command from running at once. They’ll be started for sure, however they’ll simply wait until the “lock” file is free, and then execute. This is pretty light in terms of resource consumption.
We still do not want an infinite number of locked processes stacking up and exhausting the process table or other weirdness. So, we give it a limit of “900 seconds” (15 minutes) so that if the lock still exists 15 minutes later, the job simply expires and exists with a non zero exit code.
Combined with our Prometheus tracking above, we can see if jobs are being skipped.
This is pretty easy to test:
# This just runs "sleep 300" locked by "/run/lock/bork.lock". & puts it into the background so we can do other things
$ flock --timeout=300 /run/lock/bork.lock sleep 300 &
# The process ID for the job
[1] 1060
$ flock --timeout=300 /run/lock/bork.lock sleep 300 &
[2] 1084
# $$ is a variable containing the shell's process ID (`man bash`)
$ pstree $$
bash─┬─flock───sleep
├─flock
└─pstree
We can see in the above, the first process is running as expected. The second one is blocked by the first — when the lock expires, it will run.
We can artificially release the lock by killing the first process —
$ kill 1060
[1]- Terminated flock --timeout=300 /run/lock/bork.lock sleep 300
$ pstree $$
bash─┬─flock───sleep
└─pstree
We can see now that the lock is free, the second sleep is running.
In Conclusion
Cron is not such a perfect tool. But it’s everywhere, and learning to use it effectively can save you lots of hours debugging opaque jobs that are otherwise super difficult.
Still reading? Awesome! I’m glad you enjoyed the post. If you think this sounds like technology that can benefit you, chat to Sitewards. We’re specialists in thinking about, developing, and delivering quality e-commerce software!
Think I’m wrong? Great! Hit me up on Twitter (@andrewhowdencom). I’d love to talk with you about this further ❤