Sidekiq is the most popular background job processing framework for Ruby on Rails. It has many brilliant features; one of them is its automatic retry of failed jobs, which accounts for intermittent problems.
The retry system is automatic: by default, Sidekiq retries a job 25 times before putting it on the Dead Job Queue. The retry delay grows exponentially; by the 25th retry, a job will have spent about three weeks in the Retry Queue!
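For reference, the retry budget can be tuned per job class. A minimal sketch, assuming a recent Sidekiq version (older versions use `Sidekiq::Worker` instead of `Sidekiq::Job`):

```ruby
require 'sidekiq'

class ImportantJob
  include Sidekiq::Job        # Sidekiq::Worker on versions before 6.3

  # Override the default of 25 retries; retry: false disables retries entirely.
  sidekiq_options retry: 5

  def perform(record_id)
    # ... the actual work ...
  end
end
```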
Of course, nearly everybody has an alert system for failed jobs. But the Sidekiq retry logic works well and most errors are transient, so people grow complacent and start ignoring the messages about failed jobs.
This works well until it doesn’t. That was the point when I started looking into ways to properly monitor the Sidekiq Retry Queue.
I had the following questions:
- How to alert on jobs that have failed too many times for comfort?
- How to alert if a deluge of jobs fail?
- How to make sure the alerts we send are actionable?
- How to check if the alerting system is operational?
I took some time during Christmas and wrote a single-file Ruby app called sidekiq_retry_alert (https://github.com/imreFitos/sidekiq_retry_alert). The app queries a Sidekiq server’s Retry Queue and alerts a Slack channel when a single job keeps failing repeatedly; if it finds a lot of failing jobs, it tallies them up into easily readable Slack messages.
This is how it looks in Slack:
PRODUCTION ALARM: 2 NameOfTheImportantJobs on the Important queue have failed X+ times
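The underlying idea is straightforward. Here is a rough sketch of the approach (not the actual app’s code), using Sidekiq’s `Sidekiq::RetrySet` API, with an assumed Slack incoming-webhook URL and an assumed retry threshold:

```ruby
require 'sidekiq/api'
require 'net/http'
require 'json'
require 'uri'

# Assumes this script can reach the same Redis that Sidekiq uses
# (e.g. via the REDIS_URL environment variable).
SLACK_WEBHOOK   = URI('https://hooks.slack.com/services/EXAMPLE') # hypothetical
RETRY_THRESHOLD = 5                                               # hypothetical

# Tally jobs in the Retry Queue that have already failed too many times.
counts = Hash.new(0)
Sidekiq::RetrySet.new.each do |job|
  next unless job['retry_count'].to_i >= RETRY_THRESHOLD
  counts[[job.queue, job.klass]] += 1
end

# One Slack message per queue/job-class combination.
counts.each do |(queue, klass), n|
  text = "PRODUCTION ALARM: #{n} #{klass} on the #{queue} queue " \
         "have failed #{RETRY_THRESHOLD}+ times"
  Net::HTTP.post(SLACK_WEBHOOK, { text: text }.to_json,
                 'Content-Type' => 'application/json')
end
```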
The app remembers the previous state of the queue, so you only get messages when the queue’s state changes.
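One simple way to get that behaviour (a sketch, not necessarily how the app does it) is to persist the previous tallies to a local JSON file and only send messages when they differ:

```ruby
require 'sidekiq/api'
require 'json'

STATE_FILE = '/var/tmp/sidekiq_retry_state.json' # hypothetical location

# Tally the Retry Queue the same way as above, keyed by "queue/class".
current = Hash.new(0)
Sidekiq::RetrySet.new.each { |job| current["#{job.queue}/#{job.klass}"] += 1 }

previous = File.exist?(STATE_FILE) ? JSON.parse(File.read(STATE_FILE)) : {}

if current != previous
  # ... build and send the Slack messages here ...
  File.write(STATE_FILE, JSON.generate(current))
end
```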
To check if the alerting system works, I wrote a second script that simply sends a daily report to the Slack channel. If you don’t see the daily report, chances are your alert system has stopped working.
This is how the daily report looks in Slack:
Daily report on production sidekiq retries:
ImportantQueue: 2 NameOfTheImportantJobs are retried
I recommend running both scripts from cron.
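Something along these lines in the crontab should do; the paths, script names and schedule below are illustrative only:

```
# Check the Retry Queue every five minutes, send the daily report at 9am.
*/5 * * * * /usr/bin/ruby /opt/sidekiq_retry_alert/sidekiq_retry_alert.rb
0 9 * * *   /usr/bin/ruby /opt/sidekiq_retry_alert/sidekiq_daily_report.rb
```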
I hope this helps!
imre