man page: http://linux.die.net/man/1/flock
Example: https://ma.ttias.be/prevent-cronjobs-from-overlapping-in-lin...
With flock, you're using kernel state, and everything works as expected. Everything is cleared when you reboot, no matter if it's during the cron job or otherwise.
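For anyone who would rather take the same kind of lock from inside the script than wrap it with the flock command, here is a minimal sketch using Python's standard `fcntl.flock`. The helper name `try_lock` and the lock path are made up for illustration; the kernel drops the lock when the file is closed or the process exits.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location; any path writable by the job works.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "myjob.lock")


def try_lock(path):
    """Return an open file holding an exclusive flock, or None if it is already held."""
    f = open(path, "a")  # append mode: create if missing, never truncate
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f  # keep this object alive for as long as the job runs
```

A cron job would call `try_lock(LOCK_PATH)` first and exit quietly if it gets `None` back.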
My usual solution is to add checks in the cron job to make sure it doesn't repeat or duplicate anything, by using an audit table. So for example when a celery task triggers an email, I store an event called EMAIL_X_SENT in the audit table with metadata and check for it later before sending again.
Of course it complicates the logic a bit, but I've noticed it's the same pattern that works everywhere, so I just abstracted most of it into a custom task class.
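A rough sketch of that audit-table pattern, using sqlite3 so it runs anywhere. The table layout and the names `open_audit` and `run_once` are my own assumptions, not the custom task class described above, and this version leans on a primary key to reject duplicates rather than a separate lookup.

```python
import sqlite3


def open_audit(path=":memory:"):
    """Open (or create) a hypothetical audit table keyed by event name and subject."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit ("
        "event TEXT, key TEXT, meta TEXT, PRIMARY KEY (event, key))"
    )
    return conn


def run_once(conn, event, key, action, meta=""):
    """Run action() only if (event, key) is not recorded yet. Returns True if it ran."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute("INSERT INTO audit VALUES (?, ?, ?)", (event, key, meta))
            action()
    except sqlite3.IntegrityError:
        return False  # the event was already recorded, so skip the action
    return True
```

If `action()` raises, the insert is rolled back and the error propagates, so a failed send can be retried on the next run.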
Another typical way is to use a shared lock just like the one above, except stored in the cache backend. So you could probably extend highlander to use a cache backend, etc.
But I suppose this is a good way to split the difference.
1) Command line verification - is the pid owned by the same type of process as is running now? PIDs are reused, so ensure it's the same (the creation time check helps, but it doesn't say anything about what process wrote it).
2) Process Hang Detection - Has the process actually consumed any CPU ticks in the last minute?
3) Infinite loop detection - Is the other process stuck processing something uselessly?
4) Killing off stuck processes - 2 or 3 true? Behead it and continue on. Optionally do some form of alerting - stderr is probably fine.
Add these, and I would personally find it much more useful.
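For point 1, a Linux-only sketch of what a command line check could look like. `/proc/[pid]/cmdline` holds the arguments separated by NUL bytes; the function names and the substring match are assumptions for illustration, not anything the tool does today.

```python
def parse_cmdline(raw: bytes):
    """Split the NUL-separated contents of /proc/[pid]/cmdline into arguments."""
    # The data ends with a trailing NUL; empty pieces are dropped here.
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def pid_runs(pid, expected):
    """Return True if any argument of the process contains the expected string."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            argv = parse_cmdline(f.read())
    except OSError:
        return False  # no such process, or not readable
    return any(expected in arg for arg in argv)
```

Something like `pid_runs(pid_from_file, "myjob.py")` (a hypothetical script name) would run before trusting a stale PID file.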
1. I assume that it's the same type of process because by default the PID file is being written to the current working directory of the script. If you'd like, you can specify a location yourself to ensure that each type of process is grouped on one PID file.
2. Out of curiosity, how would you go about doing this?
3. I think this would be really difficult to accomplish.
4. I agree if we could somehow figure out 2 or 3, that would be great.
2) /proc/[pid]/stat column 14 - utime. Look for this to increment with every check.
3) An update function call from within the program itself - a particular count of a single location, or the lack of an update while (2) is updating could indicate an infinite loop.
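A small sketch of the utime check from (2), assuming Linux. The only subtle part is that field 2 (the command name) sits in parentheses and can contain spaces, so it is safer to split after the last `)` than to split the whole line on whitespace. The function names are hypothetical.

```python
def parse_utime(stat_line: str) -> int:
    """Extract field 14 (utime, in clock ticks) from a /proc/[pid]/stat line."""
    # Everything after the last ')' starts at field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3, so field 14 is rest[11].
    return int(rest[11])


def read_utime(pid) -> int:
    with open(f"/proc/{pid}/stat") as f:
        return parse_utime(f.read())


def made_progress(pid, previous_utime):
    """Return (True if utime increased since the last check, current utime)."""
    current = read_utime(pid)
    return current > previous_utime, current
```

A watchdog would store the returned value between checks and treat several checks with no increase as a possible hang.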
But, has systemd now replaced cron too?
I'm sure there is a simpler way of doing this. How have other people solved redundantly ensuring a single cronjob runs?
Redis/Memcache with a TTL serves this purpose for the most part, but if you require as close to a 100% guarantee that 1 and only 1 process holds the lock at any given time, these will eventually fail you. Think network partitions, tasks outlasting the TTL, replication lag/eventual consistency etc.
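To make the "tasks outlasting the TTL" case concrete, here is a toy in-memory stand-in for a cache key with an expiry. It is not a real Redis or Memcache client; the class, the worker names, and the explicit `now` argument are all assumptions so the timeline is easy to follow.

```python
class TTLLock:
    """In-memory stand-in for a cache key that expires after a TTL."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner, now, ttl):
        """Take the lock if it is free or its TTL has lapsed."""
        if self.owner is None or now >= self.expires_at:
            self.owner = owner
            self.expires_at = now + ttl
            return True
        return False


lock = TTLLock()
a_got = lock.acquire("worker-a", now=0, ttl=30)     # A takes the lock at t=0
b_early = lock.acquire("worker-b", now=10, ttl=30)  # refused, A's TTL is still live
b_late = lock.acquire("worker-b", now=45, ttl=30)   # granted, A's TTL lapsed at t=30
```

If worker-a's task is still running at t=45, both workers now believe they hold the lock, which is exactly the failure described above.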
ZooKeeper and similar systems use consensus protocols like ZAB, Paxos or Raft to provide guarantees even in the face of failure.
http://blogs.msdn.com/b/oldnewthing/archive/2014/09/23/10559...
Genuinely wish someone knocked out something like this. systemd is part of the way there but not quite far enough.