In Sidekiq without super_fetch (a paid feature), any jobs in progress when a worker process crashes are lost forever. If a worker merely raises an exception, the job is put back on the queue and retried, but a hard crash means the job is lost.
Again, no problem paying for Pro, but I would prefer a little more transparency on how big a gap that is.
Here's Resque literally using `lpop`, which is destructive and will lose jobs if the worker dies mid-job:
https://github.com/resque/resque/blob/7623b8dfbdd0a07eb04b19...
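To make the failure mode concrete, here's a minimal sketch of why a bare `LPOP` is destructive. The names are illustrative, not Resque's actual internals; a plain array stands in for the Redis list so the logic is runnable as-is.

```ruby
# A Redis list holding queued jobs (simulated with an array).
queue = ["job-A"]

# LPOP: the job is removed from Redis and now exists ONLY in this
# process's memory.
job = queue.shift

# If the process is OOM-killed right here, the job is gone for good --
# nothing in Redis remembers it was ever dequeued.
```

The exception-retry path works because the rescuing code can re-enqueue the job; a crash skips that code entirely.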
Great point, and thanks for chiming in. I wonder if containerization has made this more painful (due to cgroup memory limits and OOM kills). The comments here are basically some people saying it's never been a problem for them and some people saying they encounter it a lot (in containerized environments) and have had to add mitigations.
Either way, my observation is that a lot of people who aren't paying for Sidekiq Pro probably should be. I hope you can agree with that.
This doesn't happen at a high rate, but it happens more than zero times per week for us. We pay for Sidekiq Pro and have superfetch enabled so we are protected. If we didn't do so we'd need to create some additional infra to detect jobs that were never properly run and re-run them.
[1] https://gitlab.com/gitlab-org/ruby/gems/sidekiq-reliable-fet...
[2] https://redis.io/commands/rpoplpush/#pattern-reliable-queue
I'm still confused about what you're saying though. You're saying that the language of "enhanced reliability" doesn't reflect losing 2 jobs out of roughly 350 million (50 million/week over 7 weeks, from your other comment)?
And that if you didn't pay for the service, you'd have to add some checks to make up for this?
That all seems incredibly reasonable to me.
It’s hard to get this right though. No matter where the line gets drawn, free users will complain that they don’t get everything for free.
Over the past week there were 2 jobs that would have been lost if not for superfetch.
It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.
Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment, and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint) the container will be killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover them, albeit with really loose guarantees around "when".
50,000,000 * 7 = 350,000,000
2 / 350,000,000 = 0.000000005714286
1 - (2 / 350,000,000) = 0.999999994285714 ≈ 99.9999994%
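The arithmetic above, spelled out (the 50M/week and 2-lost figures come from the comments upthread):

```ruby
jobs_per_week = 50_000_000
weeks         = 7
lost          = 2

total      = jobs_per_week * weeks   # 350,000,000 jobs
loss_rate  = lost.fdiv(total)        # ~5.7e-9 jobs lost per job run
durability = (1 - loss_rate) * 100   # ~99.9999994% of jobs survive
```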
> It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.
If your system isn't resilient to 2 in 350,000,000 jobs failing I think there is something wrong with your system.
I have in the past monitored how many jobs were lost and, although it was a small percentage, it was still a recurring thing.
OOM kills are particularly pernicious, as they can get into a vicious retry-kill-retry loop. The individual job causing the OOM isn't that important (we identify it, log it, and no-op it); the real problem is the blast radius on the other Sidekiq threads in the same process (we run up to 20 threads on some of our workers). You want to be able to recover and re-run any jobs that are innocent victims of a misbehaving one.
No thanks.