Don't Fear the Reaper: Detecting Failed Delayed Job Workers

Update: The plugin discussed in this post has been packaged into the delayed_job_heartbeat_plugin gem.

Previously we've blogged about how to write Delayed Job plugins and how to aggregate jobs into job groups. In this post we'll explore how to proactively detect failed Delayed Job workers so their jobs can be retried in a timely manner. This is useful if a worker crashes, is automatically restarted by your platform provider, or is shut down by auto-scaling infrastructure.

The Delayed Job Lock Model

Let's start off with a brief introduction to how Delayed Job implements locking. We'll only consider the Active Record Delayed Job backend but the Mongoid backend uses a similar scheme. Jobs are stored in a delayed_jobs table that includes a YAML encoding of the object that will do the job's work, the time the job was locked, the name of the worker that locked the job, and some additional metadata like the number of job attempts. When a worker picks up a job, it sets the job's locked_at to the current time and sets the job's locked_by to the worker's name (which should be unique across your pool of workers). Jobs are eligible to be picked up by a worker if they are not locked by another worker or they've been locked by another worker for more than max_run_time seconds.
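The eligibility rule above can be sketched in plain Ruby. This is an in-memory illustration only, not Delayed Job's actual query: the Job struct and the eligible? and lock! helpers are hypothetical names introduced here.

```ruby
# Hypothetical in-memory stand-in for a row in the delayed_jobs table.
Job = Struct.new(:id, :locked_at, :locked_by, keyword_init: true)

# A worker may pick up a job if no other worker holds its lock, or if
# another worker's lock is older than max_run_time seconds.
def eligible?(job, worker_name:, now:, max_run_time:)
  return true if job.locked_by.nil? || job.locked_at.nil?
  return true if job.locked_by == worker_name

  job.locked_at < now - max_run_time
end

# Mimics what a worker does when it picks up a job.
def lock!(job, worker_name:, now:)
  job.locked_at = now
  job.locked_by = worker_name
  job
end
```

Note that a job locked by a crashed worker only becomes eligible again once the full max_run_time has elapsed, which is the delay the rest of this post works around.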

At this point you might be wondering why Delayed Job's max_run_time setting isn't sufficient to unlock jobs that have been locked by failed workers. Well, it is, as long as you don't mind waiting max_run_time seconds for those jobs to be unlocked. That doesn't work for us since we have max_run_time set pretty high to accommodate some long-running bulk import and export jobs.

The Delayed Job Heartbeat Plugin Overview

There are two parts to how we'll unlock jobs for failed workers:

A Delayed Job plugin that runs on each worker and periodically updates a database table with heartbeat information
A reaper process that periodically unlocks jobs locked by workers that haven't updated their heartbeat recently

Now let's dive into some details on each of these components.

The Delayed Job Heartbeat Plugin

Our heartbeat plugin will consist of a few classes:

Delayed::Heartbeat::WorkerModel - A persistent model with the name and last heartbeat timestamp of each worker. In the future this could be extended to include additional information about workers, like the version of the source code they're running. (Note that we chose the name WorkerModel rather than just Worker to avoid confusion with Delayed Job's Worker class.)
Delayed::Heartbeat::WorkerHeartbeat - Asynchronously updates a worker's heartbeat timestamp.
Delayed::Heartbeat::Plugin - A Delayed Job plugin that plugs into the worker's lifecycle to start and stop the WorkerHeartbeat.

First we'll need a database migration to create the table for our worker models:

Next let's create the WorkerModel class that provides methods for updating the worker's heartbeat and some ActiveRecord scopes that we'll need for the reaper:

So far this has all been standard Rails stuff. Now things get a little more interesting with the WorkerHeartbeat that uses a background thread to periodically update the worker's heartbeat:

Some of the code for telling the heartbeat thread to shut down might look a bit funky, but it's just using the self-pipe trick to perform an interruptible sleep.
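A minimal sketch of the self-pipe trick in plain Ruby follows. It is not the plugin's code: the HeartbeatSketch class and its methods are hypothetical, and the block passed to it stands in for the database update that the real WorkerHeartbeat would perform.

```ruby
# Hypothetical sketch: runs a block every interval_seconds on a background
# thread until stop is called.
class HeartbeatSketch
  def initialize(interval_seconds:, &on_beat)
    @interval_seconds = interval_seconds
    @on_beat = on_beat
    # Self-pipe: the thread waits on the read end; stop writes to the write end.
    @stop_reader, @stop_writer = IO.pipe
    @thread = Thread.new { run }
  end

  def alive?
    @thread.alive?
  end

  # Wakes the heartbeat thread and waits for it to exit. Call once.
  def stop
    @stop_writer.write('.')
    @thread.join
  ensure
    @stop_reader.close
    @stop_writer.close
  end

  private

  def run
    loop do
      @on_beat.call
      break if sleep_interruptibly(@interval_seconds)
    end
  end

  # Sleeps for up to the given number of seconds. Returns true if the sleep
  # was cut short because stop made the pipe readable.
  def sleep_interruptibly(seconds)
    readable, = IO.select([@stop_reader], nil, nil, seconds)
    !readable.nil?
  end
end
```

The point of the pipe is that IO.select returns as soon as a byte is written, so stopping the worker does not have to wait out the remainder of a heartbeat interval the way a plain sleep would.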

We've done all the heavy lifting required to implement the plugin. Now let's plug into the worker's lifecycle to start/stop the heartbeat:

Finally let's register the plugin with Delayed Job in an appropriate initializer:

The Reaper

Now that we have a Delayed Job plugin that periodically updates a worker's heartbeat, we can unlock jobs that have been locked by workers that we haven't heard from recently:
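The reaping logic can be illustrated with an in-memory sketch. The WorkerRecord and JobRecord structs, the unlock_orphaned_jobs method, and the timeout_seconds parameter are all hypothetical names; the real reaper would run the equivalent updates against the database. Removing the dead workers' records after unlocking their jobs is an assumption made here so that each worker is only reaped once.

```ruby
# Hypothetical in-memory stand-ins for rows in the worker and delayed_jobs tables.
WorkerRecord = Struct.new(:name, :last_heartbeat_at, keyword_init: true)
JobRecord = Struct.new(:id, :locked_at, :locked_by, keyword_init: true)

# Unlocks jobs held by workers whose last heartbeat is older than
# timeout_seconds. Returns the names of the workers that were reaped.
def unlock_orphaned_jobs(workers, jobs, now:, timeout_seconds:)
  dead_names = workers
    .select { |worker| worker.last_heartbeat_at < now - timeout_seconds }
    .map(&:name)

  jobs.each do |job|
    next unless dead_names.include?(job.locked_by)

    # Clearing the lock makes the job eligible for another worker right away.
    job.locked_at = nil
    job.locked_by = nil
  end

  # Assumption: forget dead workers so they are not reaped again next pass.
  workers.reject! { |worker| dead_names.include?(worker.name) }
  dead_names
end
```

The timeout should be comfortably larger than the heartbeat interval so that a worker that is merely slow to write a heartbeat is not mistaken for a failed one.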

We're running on Heroku, so we've configured a clockwork process to periodically unlock orphaned jobs:

You should be able to do something similar with your favorite scheduler.

Finally update your application.rb with the appropriate configuration for the heartbeat plugin and reaper process:

That's it! We can now unlock orphaned jobs in a few minutes rather than waiting for a job's maximum runtime to elapse.
