1. We have a problem, let's use Celery.
2. Now we have one more problem.
I found http://python-rq.org/ much handier, and it covers most cases. It uses Redis as the queue broker. Flask and Django integrations are available: https://github.com/mattupstate/flask-rq/ https://github.com/ui/django-rq
Three days in, I can tell you that it does work, but it takes a lot of searching through the docs to optimize. It's also very hard to run with class objects, so we just created long-scripted functions for the worker.
Even now I'm trying to figure out why the worker is unable to refresh access tokens after 60 minutes, and tempted to just have it run as root.
- use gevent + gunicorn, or Tornado, in order to keep a socket open while the worker is processing the task?
- use polling? (less efficient)
- use websockets (but then the implementation is perhaps a bit more complex)
can you do this simply using Flask?
If your ajax request requires long task processing and requires you to wait for it, then this is not a background task any more. It's done in one of the web server threads, and even if the thread outsources the task to another process, it's still waiting on that process to finish before returning the ajax response. This is bad.
I'm not entirely convinced about websocket solutions in Python yet, but I've been told flask-websockets is awesome. Nevertheless, this doesn't solve the problem for you, because the request is just keeping an open line and waiting for a response... blocking is bad.
The simplest advice I would have is to have the ajax request trigger a background task and return immediately. The background task will then have some kind of side effect (e.g. write some result to a database somewhere) which the ajax request can then look for with some kind of polling mechanism (on some other endpoint). Of course you can complicate this a lot, depending on your needs, but this seemed like the most straightforward solution.
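A minimal sketch of that trigger-then-poll pattern, using only the standard library. Everything here is an assumption for illustration: the in-memory `RESULTS` dict stands in for the "database somewhere", a thread pool stands in for a real worker, and `start_task` / `poll_task` are hypothetical names for what the two endpoints would call.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the "database somewhere".
RESULTS = {}
EXECUTOR = ThreadPoolExecutor(max_workers=2)


def slow_job(x):
    """Placeholder for the long-running work."""
    time.sleep(0.1)
    return x * 2


def _run(task_id, func, args):
    # The side effect: record the outcome where the poller can find it.
    try:
        RESULTS[task_id] = {"state": "done", "result": func(*args)}
    except Exception as exc:
        RESULTS[task_id] = {"state": "failed", "error": str(exc)}


def start_task(func, *args):
    """What the first ajax endpoint would do: hand off work, return at once."""
    task_id = uuid.uuid4().hex
    RESULTS[task_id] = {"state": "pending"}
    EXECUTOR.submit(_run, task_id, func, args)
    return task_id


def poll_task(task_id):
    """What the polling endpoint would do: a cheap lookup, no waiting."""
    return RESULTS.get(task_id, {"state": "unknown"})
```

The client would keep the returned task id and call the polling endpoint until the state is no longer "pending". In a real deployment the dict would be replaced by a shared store, since web and worker processes don't share memory.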
"The most simplest advise I would have is to have the ajax request trigger a background task and return immediately. The background task will then have some kind of side effect (ie. write some result to a database somewhere) which the ajax request can the look for with some kind of polling mechanism (on some other endpoint)." Wow, overkill much? Polling is bad, and is exactly the kind of bad solution that a lot of these libraries are in place to prevent developers from needing to do.
Websockets were made to solve the long-polling and poll-spamming that was prevalent. Now all you have to do is keep a light, open web-socket connection to the server. And the server, being async/evented, will respond when the task is good and ready. Nice and clean.
I was thinking whether using something like gevent or Tornado, a bit like nodejs, would enable the webserver to keep the socket open without blocking while the computation is made in a worker, then return the result simply to the socket, thus avoiding having to write a more complex websocket-based or polling-based system, but rather using AJAX transparently :)
Polling seems to be the best way to do it, as it doesn't leave sockets open, and doesn't require a websocket enabled browser.
The implementation I'm working on involves keeping the task metadata in the DB, and polling against that lookup (it makes it easier to do things like restrict task results to specific users as well).
I was also thinking that another way to do it could be to write the result in its final format to a /ajax_output/ directory with a randomly generated name. Then your polling would depend entirely on nginx, which could end up being much more efficient than running through your application framework. Just make sure you regularly clean unused files if you have privacy concerns.
http://blog.fanout.io/2013/04/09/an-http-reverse-proxy-for-r...
With that experience, we wrote a task queue using Redis & gevent that puts visibility & tooling first: http://github.com/pricingassistant/mrq
Would love to have some feedback on that!
I'm not very happy with the community either. What with the dispersed, incomplete documentation, multiple discussion forums, and snide responses, I'm really getting ready to wash my hands of it.
And don't get me started on RabbitMQ.
I asked about a progress bar, for a long running web request on Stack Overflow, and Celery seemed to be the accepted way to do that.
I managed to get it set up eventually. I realized a month or so back that it hadn't been running, and it has taken me about 3 times as long to get it up again as it did on the first try. I am sure there must be an easier way.
Consider how background jobs are typically managed with RabbitMQ, Redis, etc. They are usually created in an "after commit" hook from whatever gets persisted to your relational database. In this scenario, there is a gap between the database transaction being committed and the job being sent to and persisted by RabbitMQ or Redis; during this gap the only record of that task is being held in a process's memory.
If this process gets killed suddenly during this gap, that background job will be lost forever. It sounds unlikely, but if RabbitMQ or Redis is down and the process has to sit and retry, waiting for them to come back online, the gap can be sizable.
The systems that use these kinds of tools are usually not structured in a way that they need to wait for something in the database to be stored. By nature they are async tasks and they should be able to run whenever and return sometime in the future, and they will most likely produce some kind of result in the database, so there is no reason to store the job information itself in the database.
Jobs are usually not created as hooks after a database commit, so persisting jobs with database transactions is not quite relevant. Celery also has failure mechanisms and ways to recover if it was not able to send a task to the broker (e.g. RabbitMQ was down).
Redis and RabbitMQ do have a mechanism for persisting jobs onto disk as well, so they don't get lost when the process is restarted. So there is no way that a job gets lost forever as you say, if you handle all these cases correctly.
One more thing: Python's database drivers don't work quite as you've described. Namely, they don't (by design) make use of the autocommit feature of the database engine; rather, they wrap every SQL statement in a transaction, so either way each statement gets executed separately in its own transaction. This would not guarantee, let's say, that a db record is added and the job is saved as well. You would have to use explicit atomic blocks (something akin to what Django >= 1.6 has) to get both things or neither to be persisted.
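To make the "both or neither" idea concrete, here is a small sketch using the standard library's `sqlite3` rather than Postgres or Django. The `users` and `jobs` tables and the `register_user` helper are hypothetical; the point is only that the record and the job row are written inside one explicit transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)"
)
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT NOT NULL,"
    " user_id INTEGER NOT NULL)"
)


def register_user(conn, email, job_kind="fetch_friends"):
    """Insert the user and its follow-up job in a single transaction."""
    # Using the connection as a context manager commits on success and
    # rolls back if the block raises, so both rows land or neither does.
    with conn:
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        user_id = cur.lastrowid
        conn.execute(
            "INSERT INTO jobs (kind, user_id) VALUES (?, ?)",
            (job_kind, user_id),
        )
    return user_id
```

If the second insert fails, the user row is rolled back with it, so there is no window in which the user exists but the job record does not. A worker would then pick jobs up from the `jobs` table instead of from a separate broker.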
I agree with your point about polling being bad, however as someone pointed out below it's not an issue with Postgres's LISTEN/NOTIFY (and I added a note to the queue_classic gem which makes this easy to take advantage of in MRI Ruby).
Obviously I'm aware that Redis and RabbitMQ persist jobs. That's not what I was talking about at all.
I think we're on different wavelengths here so I'll let it be. :-)
Not according to Jim Gray. See "THESIS: Queues are Databases"[1][2]
1- http://research.microsoft.com/apps/pubs/default.aspx?id=6849...
2- (pdf) http://research.microsoft.com/pubs/69641/tr-95-56.pdf
Honestly I think that's the ideal way of doing things, however that's not often how you see it done.
- If you're using AMQP/RabbitMQ as your result back end it will create a lot of dead queues to store results in. This can easily overwhelm your RabbitMQ server if you don't clear these out frequently. Newer releases of Celery will do this daily I think - but it's worth keeping in mind if your RMQ instance falls over in prod.
- Use chaining to build up "sequential" tasks instead of calling them one after another inside a single task (or worse, doing one big mouthful of work in one task). Celery can prioritise many small tasks better than it can a "master" task that synchronously calls several others in a row.
- Try to keep a consistent module import pattern for Celery tasks, or explicitly name them, as Celery does a lot of magic in the background so that task spawning is seamless to the developer. This is very important: you should never mix relative and absolute imports when you are dealing with tasks. "from foo import mytask" may be picked up differently than "import foo" followed by "foo.mytask" would, resulting in some tasks not being picked up by Celery(!)
- "Never pass database objects", as OP says, is true; but go one step further and don't pass complex objects at all if you can avoid it. I vaguely remember some of the urllib/httplib exceptions in Python not being serializable and causing very cryptic errors if you didn't capture the exception and sanitise it or re-raise your own.
- Use proper configuration management to set up and configure Celery plus whatever messaging broker/backend you use. There's nothing more frustrating than spending your time trying to replicate somebody's half-assed Celery/Rabbit configuration that they didn't nail down and test properly in a clean-room environment.
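As an illustration of the "don't pass complex objects" point above, here is a rough, Celery-independent sketch. The names `to_task_payload`, `FetchFailed` and `fetch` are made up for this example; the idea is to send only primitive ids across the queue and to flatten exceptions into plain strings before they leave the task.

```python
import json


def to_task_payload(user_id, error=None):
    """Build a JSON-safe message: a primitive id plus a flattened error."""
    payload = {"user_id": int(user_id)}
    if error is not None:
        # Keep only plain strings; the exception object itself stays local.
        payload["error"] = {
            "type": type(error).__name__,
            "message": str(error),
        }
    return json.dumps(payload)


class FetchFailed(Exception):
    """Hypothetical error type that carries nothing but a string message."""


def fetch(url, opener):
    """Call opener(url); re-raise any failure as a simple FetchFailed."""
    try:
        return opener(url)
    except Exception as exc:
        # Drop the original exception object and keep only its text.
        raise FetchFailed(f"{type(exc).__name__}: {exc}") from None
```

The worker would look the user up again by id rather than receiving a database object, and anything it reports back is already a string.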
If task_C returns a value which no other task cares about, the value is still inserted into the queue and never consumed. This is why dead queues (also known as "tombstones") happen.
Always remember to set ignore_result=True for tasks which don't return any consumed value.
EDIT: "Tombstones", not gravestones
With the psycopg2 module, you can use this mechanism together with select(), so your worker thread(s) don't have to poll at all. They even have an example in the documentation.
http://www.postgresql.org/docs/9.3/interactive/sql-notify.ht...
Just slides though. Haven't gotten around to writing a post about it yet.
Unfortunately if you're using JRuby you can't benefit from this, as the Postgres JDBC driver does polling.
a > Redis uses less memory
b > Redis is easier to setup
https://celery.readthedocs.org/en/latest/userguide/signals.h...
Of course, logging_tree is a great tool as well!
Rq is a lot smaller, more than 10x by line count. So if it works just as well, I'd go with the simpler implementation.
That is why I decided to use Rq: it is better to know the limitations of something simple than to know all the possibilities of something complex and be unable to make a choice.
celery is like a .50 caliber machine gun, industrial strength, lots of options, used for a variety of completely different use cases.
For simple stuff, use rq, but celery + rabbitmq work better if you have dozens and dozens of worker nodes (i.e. different servers), whereas with rq you use redis, which could potentially be a SPOF, even with redis sentinel.
It always depends on your use-case but generally you want your application to behave correctly, which means it has to have correct/fresh data...you can't sacrifice correctness because of an inability to scale your database.
If you are processing a lot of data in Celery, you really want to try to avoid performing any database queries. This might mean re-architecting the system. You might for example have insert-only tables (immutable objects) to address this type of concern.
It’s a sadly under-rated ingredient! The flavor is subtle but unmistakable.
Eg, execute other tasks only if there are no pending important tasks.
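One way to picture that rule is a priority queue where ordinary tasks are only handed out once nothing important is waiting. This is a standalone sketch with made-up names (`PriorityTaskQueue`, `IMPORTANT`, `NORMAL`), not how Celery or any particular broker implements priorities.

```python
import heapq
import itertools

IMPORTANT, NORMAL = 0, 1  # lower number is served first


class PriorityTaskQueue:
    def __init__(self):
        self._heap = []
        # Monotonic counter as tie-breaker: FIFO order within a priority.
        self._counter = itertools.count()

    def put(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def get(self):
        """Return the next task name, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A worker loop calling `get()` will drain every `IMPORTANT` entry before it sees a `NORMAL` one, regardless of the order in which they were queued.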
AMQP = Advanced Message Queuing Protocol, so it's wrong to say that a message broker is "an AMQP". Also, give Redis a try: it's much easier to set up and uses fewer resources.
We should probably talk about the elephant in the room when addressing newbies: the Celery daemon needs to be restarted each time new tasks are added or existing ones are modified. I got past that with the ugly hack of having only one generic task[1] but people new to Celery need to know what they're getting into.
[1]: http://docs.celeryproject.org/en/latest/userguide/workers.ht...
1. Use task-specific logging if you have a bunch of tasks: http://blog.mapado.com/task-specific-logging-in-celery/
2. Use statsd counters to keep track of basic statistics (counts + timers) for each task.
3. Use supervisor + monit to restart workers after a lack of activity. (I have seen this happen a few times and have never been able to track down why, but this is an easy fix.)
Distributed just means that you can have your task processing spread out across multiple machines.
A specific example: let's say that after a user registers on your website for the first time, you want to get a list of all their Facebook/Twitter friends. This action will take a long time and is not vital to the whole registration/login process, so you set a task to do that later and let the user proceed to the site instead of making them look at a spinner the whole time. When the friend list becomes available, it will show up on the website (on their profile or wherever). Makes sense?
I will say though Celery is probably overkill for a lot of tasks people think to use it for, in my case it was mandated to support scaling for a startup that never launched, partly because they kept looking at new technologies for problems they didn't have yet.
[1] https://github.com/celery/celery/blob/ee46d0b78d8ffc068d5b80...
Pypi: https://pypi.python.org/pypi/taskflow
Comments, feedback and questions welcome :-)