Salsify is a multi-tenant SaaS product information exchange platform hosted on Heroku. We use delayed job extensively for long-running tasks such as imports, exports, and search indexing. One of the features we've grown to love about delayed job is its extensibility via plugins (see our other posts). Recently, an increasing number of job workers have been exceeding their dyno memory quota**, and consequently suffering serious performance degradation. While we are always working to improve the performance of our code, we thought it would be nice to have a mechanism for euthanizing workers that are unrecoverably exhausted. Enter the DelayedJob::YouthInAsiaPlugin.
A Ruby process performing a delayed job needs more memory than is available and begins paging to swap space, which degrades the performance of the process. We explicitly choose not to kill the job while it is executing, because it will generally finish within the configurable job timeout and, due to the nature of the issue, the problem would often recur if we retried the job. Once the offending job finishes executing, we want to prevent the worker from picking up additional work, as we typically see continued performance issues with a worker after it enters this state. Note: Systems with fast disks probably won't see nearly the performance impact we see running on Heroku, where swapping to disk consumes nearly 100% of the CPU.
Kill the process with a Delayed::Job plugin!
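As a rough illustration of the idea, here is a minimal sketch of such a plugin. It is not the actual Salsify implementation: the names `MemoryGuard` and `MemoryGuardPlugin`, the 512 MB default, and the use of `ps` to read the resident set size are all assumptions made for the example. It relies on the delayed job lifecycle hook `after(:perform)` and on `Delayed::Worker#stop`, which lets the worker finish the current job and then leave its work loop. The plugin wiring is guarded so the file still loads when the gem is absent.

```ruby
# Hypothetical helper: decides whether this process is over its memory budget.
module MemoryGuard
  # Assumption: the 1X dyno quota mentioned in the footnote.
  DEFAULT_LIMIT_MB = 512

  # Resident set size of the process in MB, or nil if it cannot be read.
  # Note: RSS excludes swapped-out pages, so it is only a proxy for total usage.
  def self.rss_mb(pid = Process.pid)
    kb = Integer(`ps -o rss= -p #{pid}`.strip)
    kb / 1024.0
  rescue StandardError
    nil
  end

  # True when the measured memory is known and above the limit.
  def self.over_limit?(limit_mb: DEFAULT_LIMIT_MB, current_mb: rss_mb)
    !current_mb.nil? && current_mb > limit_mb
  end
end

# Only define the plugin when delayed_job is loaded.
if defined?(Delayed::Plugin)
  class MemoryGuardPlugin < Delayed::Plugin
    callbacks do |lifecycle|
      # Runs after each job completes, so the job itself is never interrupted.
      lifecycle.after(:perform) do |worker, _job|
        # Ask the worker to stop picking up further jobs.
        worker.stop if MemoryGuard.over_limit?
      end
    end
  end

  Delayed::Worker.plugins << MemoryGuardPlugin
end
```

Checking after the job rather than during it matches the reasoning above: the offending job is allowed to finish, and only then is the worker retired. This sketch assumes the platform starts a fresh worker process once the old one exits.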
At Salsify, a particular pain point for maintaining reasonable memory performance is handling arbitrarily large templated Excel files, which are used to generate item setup sheets for retailers like Amazon and Walmart. Our systems are often strained when unzipping .xlsx files that contain surprises like hidden forms/sheets/formatting, large binary video files, or even complex ActiveX executables. Using the delayed job plugin framework, we are able to quickly recover from these rare occurrences and ensure we don’t suffer prolonged memory/performance problems due to unexpected crazy input (really, Excel is amazing). If you have any advice on how we might have solved this differently, we'd love to hear it!
** Dyno memory quota: 512 MB on 1X dynos, 1024 MB on 2X dynos.