The only caveats are performance (with a traditional server I wouldn't worry about performance until you need to process hundreds of items per second, but on EC2 nodes that threshold is closer to dozens per second), and the need to regularly archive the "done" directory (cron solves this nicely).
If you are really lucky, and your tickets only need to represent a single piece of data (some sort of ID, for example), you can just use the name of the file itself for the data storage and deal only with empty files. Because this only uses a single inode/block, it represents the best-case scenario for speed and scalability in terms of the number of tickets which can accumulate before you need to archive. But more likely, you are going to have to worry about ticket namespace collisions (unless you have some sort of set-like requirement where each ID can only be in the queue once at a time), which means you are using something like mktemp to create the file and then storing the ID inside the file.
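To make the two naming strategies concrete, here is a minimal Python sketch. The function names and the "ticket-" prefix are made up for illustration, and Python's tempfile.mkstemp stands in for the mktemp command. In a real setup the directory passed in would presumably be the staging dir rather than the live queue.

```python
import os
import tempfile


def enqueue_as_filename(directory, ticket_id):
    """Set-like ticket: the file name is the payload and the file stays empty.

    Returns False if a ticket with this ID is already present.
    """
    path = os.path.join(directory, ticket_id)
    try:
        # O_EXCL makes creation fail instead of silently reusing an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def enqueue_with_unique_name(directory, ticket_id):
    """Collision-free ticket: a unique file name, with the ID stored inside the file."""
    fd, path = tempfile.mkstemp(prefix="ticket-", dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(ticket_id)
    return path
```

The first variant gives you the "each ID only once" behavior for free; the second lets the same ID be queued any number of times at the cost of a file with contents.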
Another key is to make sure you create new jobs in a "staging" dir, and then mv them into the "in" dir. Otherwise you have a race condition between your queuing system and whatever creates the tickets.
Here's a basic layout: /stage, /in, /active, /done. Some process on your system creates a ticket (which could be a single file or a dir) in /stage and then moves it into /in. This wakes up your queue, which moves it to /active when it starts processing it, and then moves it to /done and moves on to the next ticket in /in.
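A rough Python sketch of that lifecycle, assuming all four directories live on the same filesystem so that rename is atomic. The function names are hypothetical, and the wake-up mechanism (inotify or similar) is left out; a worker would simply call claim() whenever it is woken.

```python
import os
import tempfile

STATES = ("stage", "in", "active", "done")


def init_queue(root):
    """Create the four state directories under root."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)


def submit(root, payload):
    """Write the ticket in stage/, then rename it into in/.

    Consumers only watch in/, so they never see a half-written ticket.
    """
    fd, staged = tempfile.mkstemp(prefix="ticket-", dir=os.path.join(root, "stage"))
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    name = os.path.basename(staged)
    os.rename(staged, os.path.join(root, "in", name))
    return name


def claim(root):
    """Move one ticket (in no particular order) from in/ to active/.

    Returns the ticket name, or None if in/ is empty.
    """
    for name in os.listdir(os.path.join(root, "in")):
        try:
            os.rename(os.path.join(root, "in", name),
                      os.path.join(root, "active", name))
        except FileNotFoundError:
            continue  # another worker renamed it first; try the next one
        return name
    return None


def finish(root, name):
    """Move a processed ticket from active/ to done/."""
    os.rename(os.path.join(root, "active", name),
              os.path.join(root, "done", name))
```

Every state change is a single rename, which is what keeps the directories trustworthy as a record of where each ticket is.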
Another nice thing this gives you is that recovering from a crash / unclean state amounts to running ls on /stage, /in, and /active.
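As a sketch of what that inspection could look like in Python (function names are hypothetical): the first function is just the ls, the second goes one step further and pushes stranded tickets back into /in, which is only safe under the assumption that your jobs can be retried.

```python
import os


def leftover_tickets(root):
    """List tickets in the directories that can hold unfinished work."""
    return {
        state: sorted(os.listdir(os.path.join(root, state)))
        for state in ("stage", "in", "active")
    }


def requeue_active(root):
    """Move tickets stranded in active/ back to in/.

    Assumption: reprocessing a ticket is harmless. If it is not,
    inspect active/ by hand instead of calling this.
    """
    moved = []
    for name in sorted(os.listdir(os.path.join(root, "active"))):
        os.rename(os.path.join(root, "active", name),
                  os.path.join(root, "in", name))
        moved.append(name)
    return moved
```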
One top tip from personal experience is to make the resulting structure reasonably straightforward to browse manually - having huge numbers of subdirectories is going to be a barrier to this.
For a quick start, I would look at the maildir specification, which includes instructions on how you should read from and write to maildir folders to avoid locking and get good performance: http://www.qmail.org/man/man5/maildir.html
Then, I would dive deeper by looking at the processes used to maintain the mail queues in qmail: http://www.qmail.org/qmail-manual-html/misc/INTERNALS.html . Obviously, you could also look at how postfix or exim handle their own queues.
Anyway, gathering all the experience buried in those systems and summarizing it in a logical way would make a great article...
I felt dirty while doing it, but didn't want to build up a whole ActiveMQ (or similar) queue solution - it was just overkill.
6 years out that simple hack is still working today without needing any sort of maintenance.
A while back I looked at moving part of the queue into mysql, but I got stuck while trying to keep it a polling based system (I should have been able to accomplish this by having a mysql trigger touch a file in the filesystem, which would trigger inotify / wake up the queue, but I couldn't get it to work as described in the docs). After reading the author's mention of postgresql having some sort of listen/notify feature, I'll have to give that a look.
I can't vouch for the performance characteristics, but it's got some nice features around how notification delivery interacts with transactions (notifications within an explicit transaction are not delivered until & unless the transaction commits successfully, order of notification from a single transaction is preserved), guaranteed delivery, and some degree of deduplication of identical notifications.
However... PostgreSQL's "SELECT FOR UPDATE" seems to have significantly better performance than MySQL's version, most likely due to how concurrency & MVCC vs. locking interact. A few years back at a now mostly-defunct social network which shall remain nameless I had to implement a cluster-wide work queue for sending out member emails that couldn't involve installing new software and had no shared disk space to use for that style. A queue based on an existing PostgreSQL installation (the PG process had a 3 year uptime at that point) using "SELECT FOR UPDATE WHERE worker_id IS NULL LIMIT 1" followed by an immediate update of the worker_id and transaction end had quite good performance on mid-2000s hardware. As far as I could tell from my research at the time, the LIMIT 1 with no ordering clause locked only one row and concurrent processes each got a different one, so they didn't have to serialize on grabbing a job. Definitely do your own research and testing, but in my experience SELECT FOR UPDATE used carefully with a thorough reading of the docs is a much more viable solution on PostgreSQL than MySQL for a few hundred worker processes. I wouldn't try it for G+ or Twitter, but if you're dealing with more than the 50-100K daily active visitors and 25M or so customized emails that went out monthly I suspect you know you're going to be putting in some extra engineering time. http://www.postgresql.org/docs/9.1/static/sql-select.html#SQ...
The only caveats are performance (with a traditional server I wouldn't worry about performance until you need to process hundreds of items per second, but on EC2 nodes that threshold is closer to dozens per second), and the need to regularly archive the "done" directory (cron solves this nicely).
...but why would you worry about these problems when other solutions like kestrel, beanstalk, and redis (my personal favorite) are equally easy to set up and understand?
And for that matter, how do you give multiple machines access to this workqueue?
Unix's "everything is a file" philosophy can be stretched pretty damn far.
But I agree. Files and folders are an elegant abstraction that, when combined with the Unix toolset, become extremely powerful.
The big shortcoming I see with this solution, and maybe this is what you are saying in the caveats, is that it doesn't support multiple worker boxes.
Of course you could use NFS, but this complicates it. Suddenly the consistency model is more complex, workers must partition work, and so on. At that point, a mysql backed queue becomes an appealing and easy way to make a distributed queue.
I think the transaction log and the forced structure of using SQL (barring some yutz carelessly using TRUNCATE) add some value managing the data, too. Not as big an issue where it's a single person maintaining the app.
hopefully whatever solution you have is tested and designed defensively so you don't accidentally rm the queue.
Blow their mind and show them join(1)
MySQL, nope I use postmap - http://www.postfix.org/postmap.1.html
For best results, it's good to have at least two redis servers, one with snapshotting as a cache (fast, less durable), one with 1 second Append only files (still fast, but slower) for data you care more about.
Undefined behavior!
Chrome 13.0.782.220 Ubuntu 11.04 (Linux 2.6.38-11-generic) GNOME 2.32.1
Extensions:
- Adblock Plus for Google Chrome™ (Beta) - Version: 1.1.4
- Xmarks Bookmark Sync - Version: 1.0.16
- Reddit Enhancement Suite - Version: 3.4 (Disabled)
Examples:
1. http://i.imgur.com/TD8UU.png
2. http://i.imgur.com/HIXbP.png -- with text selected.
However when I reloaded the page it fixed itself. In fact as the page reloads I can see the text layout first breaking and then immediately fixing itself...
I've personally been down this road many times, and the last time I made the mistake of relying on SELECT FOR UPDATE in a queueing system it broke down somewhere on the road between 1 msg/sec and 50 msgs/sec. That application committed before it dispatched to the worker app, so I would consider it a fairly similar access pattern to yours.
The solution I went with in that case was exactly what Baron describes at "Locking is actually quite easy to avoid." - something along the lines of UPDATE queue SET selected_by = dispatcher_id, selected_time = NOW().. and then SELECT * FROM queue WHERE selected_by = dispatcher_id. I hate putting pseudo-SQL because it's already setting bad ideas in some random reader's head. Anyways, that scaled up to several thousand messages per second and ran happily for years, long after I left that particular company. May still be running depending on who you ask.
Long story short, it's great that your solution is working for you but the weight of public knowledge suggests it's not a great solution for anyone else to pick up on. Ping Brigade looks nifty, I hope it works great for you. Please don't suggest this pattern to other people.
Personally the system I work on day-to-day these days runs a Redis set-based queue similar to Resque to send a few thousand emails per second and I'm ok with it. Not thrilled, but happy enough that I don't read the Resque introduction text and blanch in horror as I did reading your article, especially as a reply to Baron's which is based on... lots and lots of real world experience with many different applications.
I think, as he said, everyone shouldn't run out and replace a mysql job queue for their wordpress blog. In a great many cases it doesn't matter.
I also like how he never said "Don't use mysql as a queuing system" but "be careful of these things". I've used mysql as a queuing system, and it works fine. I looked at replacing it with a different database, but in that situation it was not worth the investment.
Signaling mysql + archiving performed work + no locks that lock more than the exact row that's being updated (and also avoiding concurrent workers acting on the same task) will take a mysql backed queuing system far. I've set up a system that processes well over 5,000 tasks / day using it.
Do I think everyone should use mysql as their queuing backend? No. People should probably use a queuing library, with persistence to a database (redis?) enabled for critical tasks. Of course, as the article said, be careful about the choice of backends.
It should be noted that this is not necessarily a good solution: a concurrent consumer, which may be another incarnation of a given script running with a lag, may hijack the queue element locked this way; as a result you may end up having two or more incarnations of the consumer handling the same queue element.
The most universal approach to DB queues is to assign each consumer process a unique ID which it should use for locking queue elements in their UPDATE ... LIMIT 1.
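Here is one way that claim-by-UPDATE idea could look, as a hedged sketch rather than anyone's production code. It uses Python's built-in sqlite3 instead of MySQL so it runs anywhere, and because stock SQLite builds do not accept UPDATE ... LIMIT, the single row is picked with a subquery instead. The table layout, column names and helper names are assumptions for illustration, and it assumes each consumer holds at most one row at a time and deletes it when done.

```python
import os
import sqlite3
import uuid


def new_consumer_id():
    """Unique ID per consumer process (PID plus a random UUID)."""
    return f"{os.getpid()}-{uuid.uuid4()}"


def make_queue(conn):
    conn.execute(
        "CREATE TABLE queue ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " selected_by TEXT,"
        " selected_time TEXT)"
    )


def claim_one(conn, consumer_id):
    """Mark one unclaimed row with our ID, then read back the row we marked."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE queue SET selected_by = ?, selected_time = datetime('now') "
            "WHERE id = (SELECT id FROM queue WHERE selected_by IS NULL "
            "            ORDER BY id LIMIT 1)",
            (consumer_id,),
        )
        if cur.rowcount == 0:
            return None  # nothing left to claim
    return conn.execute(
        "SELECT id, payload FROM queue WHERE selected_by = ? ORDER BY id LIMIT 1",
        (consumer_id,),
    ).fetchone()


def finish(conn, row_id):
    """Remove a processed row so it is never handed out again."""
    with conn:
        conn.execute("DELETE FROM queue WHERE id = ?", (row_id,))
```

Because the claim is a single UPDATE keyed on the consumer's own ID, a lagging second copy of the same script gets its own ID and therefore its own row rather than someone else's.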
MySpace used it to keep their partitioned databases in sync: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?...
MySQL queues work just fine with the recommendations Baron himself provides: "1) avoid polling; 2) avoid locking; 3) avoid mixing queue and archive tables".
http://patrick.wagstrom.net/weblog/2003/05/23/lpdforfunandmp... http://rendermania.com/building-a-renderfarm-with-cups/
I used to generate the messages and then insert them into the queuing system, but for 100k messages I never managed to make this fast... I have managed to queue all these messages in less than half a second using just one MySQL query.
If anyone has any better ideas, please let me know!
Once you start hitting hundreds of jobs per second, you'll want to scale horizontally, but that shouldn't be the case for 99% of use cases.
Still kind of alpha, but working for my purposes.
I actually implemented this once, partially, on InnoDB, and it worked pretty well, with no waiting for locks, but I abandoned my efforts due to another project.