Show HN: Highlander – Stop Overlapping Python Cron Jobs

(github.com)

28 points · ccannon · 11y ago · 43 comments

40 comments · 10 top-level

fideloper · 11y ago · 10 in thread

Does anyone use flock? I came across it recently and believe it serves the same purpose. It's very useful for cron tasks:

man page: http://linux.die.net/man/1/flock

Example: https://ma.ttias.be/prevent-cronjobs-from-overlapping-in-lin...

gerad · 11y ago

Yeah we use flock for this all the time. So much so that I'm surprised this is news.

ccannon (OP) · 11y ago

This is a pure python solution.

chubot · 11y ago

I use flock, and flock is better -- what if the machine reboots during the cron job? Then you will be left with a file, and the cron job will never restart without human intervention.

With flock, you're using kernel state, and everything works as expected. Everything is cleared when you reboot, no matter if it's during the cron job or otherwise.
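
A minimal sketch of the kernel-lock approach described above, done from Python with the standard-library `fcntl` module instead of the `flock(1)` command. The helper name `run_exclusive` and the lock file path are assumptions made for illustration, and this is not Highlander's code. Unix only.

```python
import fcntl
import os


def run_exclusive(lock_path, job):
    """Run job() only if no other process holds the lock on lock_path.

    Returns True if job() ran, False if the run was skipped.
    """
    # Open (or create) the lock file; its contents are irrelevant.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another run is still in progress: skip this one
        job()
        return True
    finally:
        # Closing the descriptor releases the lock. The kernel also drops
        # the lock if the process dies or the machine reboots.
        os.close(fd)
```

A cron script would wrap its entry point, e.g. `run_exclusive("/tmp/myjob.lock", main)` (hypothetical path). The lock file itself is left on disk on purpose: only the kernel-held lock matters, so a leftover file never blocks a later run.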

ccannon (OP) · 11y ago

This is completely false. If you actually read the source code, you would discover that once the leftover file is read, Highlander knows that the process is no longer running, removes the old PID file, and creates a new one.
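
The stale-file recovery described here can be sketched roughly as follows. This illustrates the general technique and is not a copy of Highlander's source: `pid_is_running` and `claim_pid_file` are hypothetical names, and the liveness check uses `os.kill(pid, 0)`, which only probes whether some process has that PID. It is therefore subject to the PID-reuse caveat raised elsewhere in the thread, and the read-then-write sequence is not atomic. Unix only.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_pid_file(path):
    """Return True if this process may run, False if another run is alive."""
    try:
        with open(path) as f:
            old_pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        old_pid = None  # no file, or unreadable contents: treat as stale
    if old_pid is not None and old_pid != os.getpid() and pid_is_running(old_pid):
        return False
    # Missing or stale: overwrite the file with our own PID.
    with open(path, "w") as f:
        f.write(str(os.getpid()))
    return True
```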

ccannon (OP) · 11y ago

I think my solution is simpler, but their timeout command line argument has inspired me!

supster · 11y ago

Yep using flock right now, though it was kind of hard to find good information on it besides the example you linked above. I'm also using it for a python job so this repo might work out well.

ccannon (OP) · 11y ago

Yeah, the flock documentation leaves something to be desired, to say the least.

viraptor · 11y ago

Yes. I don't see a point in adding this code to the app itself, if it's already part of the system.

kiallmacinnes · 11y ago

The only argument I can think of for adding it to the app is if you're distributing to a large number of end users. A huge chunk of them won't use flock, even if told.

switch007 · 11y ago

I use flock and timeout for most jobs.

ccannon (OP) · 11y ago · 8 in thread

I keep running into the same problem: I write Python scripts that run as cron jobs, and a run sometimes takes longer than the interval before the next one starts (e.g., a job that is scheduled every hour but takes 2 hours to complete). In that scenario, you want the first run to finish before a second one starts. If Highlander sees that your job is already running, it returns immediately, skipping that run.

sidmitra · 11y ago

Does this work with celery tasks?

My usual solution is to add checks in the cron job to make sure they don't repeat or duplicate anything, by using an audit table. So for example when a celery task triggers an email, I store an event called EMAIL_X_SENT to the audit table with metadata and check it later before sending it again.

Of course it complicates the logic a bit, but I've noticed it's the same pattern that works everywhere, so I just abstracted most of it into a custom task class.

Another way typically is to use a shared lock just like above except in the cache backend. So you could probably extend highlander to use cache backend etc.
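
The audit-table check can be sketched with the standard-library `sqlite3` module. The table layout, the `record_once` helper, and the subject key are assumptions for illustration; the comment does not say which database or schema is used. A uniqueness constraint lets the database decide atomically whether the event was already recorded.

```python
import sqlite3


def make_audit_table(conn):
    """Create the (hypothetical) audit table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit ("
        " event TEXT NOT NULL,"
        " subject TEXT NOT NULL,"
        " UNIQUE (event, subject))"
    )


def record_once(conn, event, subject):
    """Insert the event; return True only if it was not recorded before."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO audit (event, subject) VALUES (?, ?)",
            (event, subject),
        )
    return cur.rowcount == 1  # 0 rows changed means the row already existed
```

A task would then send the email only when `record_once(conn, "EMAIL_X_SENT", "user-42")` returns True. Recording before sending trades duplicate emails for the risk of a missed one if the task crashes in between; which order is right depends on the job.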

ccannon (OP) · 11y ago

I think you could use this with celery tasks as long as each worker used a different PID file.

wyldfire · 11y ago

By the time I reach this level of complexity, I usually think that what I really want is a daemon.

But I suppose this is a good way to split the difference.

mitchty · 11y ago

Easy enough to just use halockrun from the hatools package for that. That is also its raison d'être really. Sure, it is C, but it is pretty much done as a tool over fcntl(3).

http://www.fatalmind.com/software/hatools/

jstoiko · 11y ago

What would be the difference between this and using something like APScheduler? Doesn't it achieve the same thing?

ccannon (OP) · 11y ago

This is a much simpler solution to a less complex problem.

walshemj · 11y ago

So do you not write your scripts to know if they're already running, using file locks?

ccannon (OP) · 11y ago

I do, but now with Highlander I've created a generic solution.

falcolas · 11y ago · 4 in thread

As someone who has had to write this themselves multiple times, there are a few bits that I consider to be missing:

1) Command line verification - is the pid owned by the same type of process as is running now? PIDs are re-used, ensure it's the same (the creation time check helps, but it doesn't say anything about what process wrote it).

2) Process Hang Detection - Has the process actually consumed any CPU ticks in the last minute?

3) Infinite loop detection - Is the other process stuck processing something uselessly?

4) Killing off stuck processes - 2 or 3 true? Behead it and continue on. Optionally do some form of alerting - stderr is probably fine.

Add these, and I would personally find it much more useful.

ccannon (OP) · 11y ago

To address your concerns:

1. I assume that it's the same type of process because by default the PID file is being written to the current working directory of the script. If you'd like, you can specify a location yourself to ensure that each type of process is grouped on one PID file.

2. Out of curiosity, how would you go about doing this?

3. I think this would be really difficult to accomplish.

4. I agree if we could somehow figure out 2 or 3, that would be great.

falcolas · 11y ago

1) As noted by Michael in a sibling comment - and I've had this happen in real life - a PID from a command which exited abnormally (and thus didn't clean up the file) can be picked up by another process, particularly on busy boxes. If your PID gets picked up by, say, an nginx worker, your cron may never start again.

2) /proc/[pid]/stat column 14 - utime. Look for this to increment with every check.

3) An update function call from within the program itself - a particular count of a single location, or the lack of an update while (2) is updating could indicate an infinite loop.
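
Point 2 can be sketched as below. Field 14 of `/proc/[pid]/stat` is `utime` (user-mode CPU time in clock ticks), as stated above; the helper names `parse_utime` and `read_utime` are made up for illustration. Linux only.

```python
def parse_utime(stat_line):
    """Return field 14 (utime, in clock ticks) from a /proc/[pid]/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' before counting fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 sits at index 14 - 3 = 11.
    return int(rest[11])


def read_utime(pid):
    """Read utime for a live process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_utime(f.read())
```

Sampling `read_utime(pid)` on two consecutive checks and comparing the values gives the "did it consume any CPU ticks" signal. An unchanged value is only a hint: a process legitimately blocked on I/O or sleeping also accumulates no user time.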

michaelmior · 11y ago

The point for #1 is that PIDs are reused. Just because the cron job previously started a process with PID 473 doesn't mean that PID 473 is that same process the next time the cron job comes around to check. It's entirely possible that the original process was finished or killed and a new process started with the same PID.

vezzy-fnord · 11y ago

Indeed, the use of PID files alone is racy for what is a glorified script (though programmatic here) that simply creates and removes a lock file. Using the cmdline as an identifier would be more reliable. Though then you can run into multiple instances, so UUID+cmdline might help?
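
A sketch of the cmdline check suggested here, assuming the lock file stores the expected argument list next to the PID (that storage format is an assumption, not something Highlander is known to do). `parse_cmdline` and `same_command` are hypothetical names. Linux only.

```python
def parse_cmdline(raw):
    """Split the raw bytes of /proc/[pid]/cmdline into a list of arguments."""
    if not raw:
        return []  # kernel threads and zombies expose an empty cmdline
    # Arguments are NUL-separated, usually with one trailing NUL.
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    return [part.decode(errors="replace") for part in raw.split(b"\0")]


def same_command(pid, expected_argv):
    """Return True if the process at pid was started with expected_argv."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return parse_cmdline(f.read()) == list(expected_argv)
    except (FileNotFoundError, ProcessLookupError):
        return False  # the process is gone
```

As the comment notes, this still cannot tell apart two instances started with identical arguments; it only rules out an unrelated process that happens to have inherited the PID.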

geertj · 11y ago · 2 in thread

On systemd systems there's an easier alternative to this. You can use the per-user systemd instance (systemctl --user) to install a .timer that activates a .service file. If the .service is still running when the .timer next fires, it will not be started again. Systemd is pretty good at this kind of bookkeeping.

kiallmacinnes · 11y ago

This really isn't meant as piling on systemd - please don't read it that way!

But, has systemd now replaced cron too?

digi_owl · 11y ago

Yep: http://www.freedesktop.org/software/systemd/man/systemd.time...

wc- · 11y ago · 2 in thread

I've had a lot of success using a key in redis with a TTL value instead of a local PID file. Although adding redis to the picture adds a large new point of failure, I can then have a cronjob set up on multiple instances and still ensure it only runs once across all of them.

I'm sure there is a simpler way of doing this, how have other people solved redundantly ensuring a single cronjob runs?
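
A rough sketch of the key-with-TTL pattern. It assumes `client` is a redis-py `redis.Redis` instance (or anything with a compatible `set` method); the `try_acquire` helper and the key name are hypothetical, and the comment does not say how the original setup is implemented.

```python
import uuid


def try_acquire(client, key, ttl_seconds):
    """Try to take the lock; return a token on success, None if it is held."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl: create the key only if it does not exist, and
    # let Redis expire it after ttl_seconds if the holder never cleans up.
    if client.set(key, token, nx=True, ex=ttl_seconds):
        return token
    return None
```

Each instance would call something like `try_acquire(client, "cron:nightly-report", 3600)` at the start of the job and exit if it gets None. The TTL has to exceed the longest expected run, otherwise a second instance can start while the first is still working. A careful release would also compare the stored token before deleting the key, which is not shown here.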

kiallmacinnes · 11y ago

The common and "traditional" way of doing distributed locking is with a coordination service like ZooKeeper. ZooKeeper style services have an advantage of no TTLs - the moment a process dies, the lock is released, and the next in line waiting on the lock is immediately notified.

Redis/Memcache with a TTL serves this purpose for the most part, but if you require as close to a 100% guarantee that 1 and only 1 process holds the lock at any given time, these will eventually fail you. Think network partitions, tasks outlasting the TTL, replication lag/eventual consistency etc.

ZooKeeper and similar services use consensus protocols like ZAB, Paxos or Raft to provide guarantees even in the face of failure.

ccannon (OP) · 11y ago

This is not meant for distributed systems.

snide · 11y ago · 2 in thread

Always bring this up anytime I see "Highlander" being used for a project name.

http://blogs.msdn.com/b/oldnewthing/archive/2014/09/23/10559...

volker48 · 11y ago

That's a great link, but I think the project is named Highlander not because it's the only such project with this function, but because its function is to ensure there is only one instance of a process running.

ccannon (OP) · 11y ago

Haha, I hadn't seen that before, awesome. I still stand by the name as being logical and memorable.

wumbernang · 11y ago · 1 in thread

Windows has had by far the best solution to this since Windows Vista/Server 2008: full instance control provided by the OS, fully scriptable with PowerShell, desired state configuration (like Ansible), clustering, logging, and fully event-driven operation (i.e., it can trigger on network/OS events), with GUI, WMI, script and COM integration.

Genuinely wish someone knocked out something like this. systemd is part of the way there but not quite far enough.

ccannon (OP) · 11y ago

This is a multi-platform solution.

userbinator · 11y ago · 1 in thread

In other words, it effectively makes the process a singleton?

ccannon (OP) · 11y ago

If you want to look at it from an OO perspective, I guess you could say that. It simply lets only one instance of a Python script run at a time.

88e282102ae2e5b · 11y ago

Why not just use the fcntl module from the standard library?

mrfusion · 11y ago

Is there a library to just parse cron strings and know when to fire?
