Celery – Best Practices | Better HN

103 comments

85 comments · 21 top-level

xenator12y ago· 9 in thread

In many projects Celery is overkill. Common scenario I saw:

  1. We have a problem, let's use Celery.
  2. Now we have one more problem.

I found http://python-rq.org/ much handier, and it covers most cases. It uses Redis as its queue broker. Flask and Django integrations are available: https://github.com/mattupstate/flask-rq/ https://github.com/ui/django-rq

p_papageorgiou12y ago

Excellent recommendation. In my experience Celery is overkill most of the time and is guaranteed to make you spend more time on ops.

kapkapkap12y ago

Thanks for this. I had considered using Celery for a recent project but ultimately backed away because I got the feeling it was more trouble than it was worth. As a point of reference, would you say the learning curve for a Celery setup is similar to that of Django? Not that there's anything terribly hard about Django, but I'd agree that it's probably overkill if you're relatively new to Python and are just looking for a quick way to produce some HTML with no intent of developing it further.

bduerst12y ago

I just started using Celery this week for the first time, to handle parallel processing of thousands of tasks in a data pipeline.

Three days in, I can tell you that it does work, but it does take a lot of searching through the docs to optimize. It's very hard to run with class objects too, so we just created long-scripted functions for the worker.

Even now I'm trying to figure out why the worker is unable to refresh access tokens after 60 minutes, and tempted to just have it run as root.

goblin8912y ago

I wouldn't say Celery's learning curve is steeper than Django's, but it definitely seems like overkill for your case. If you need to do some time-consuming action periodically (and making an HTTP request by hand each time is not an option), then you could just use cron for the start if your project is relatively simple. And if you literally need to just produce some HTML when asked for, then why are you considering using an async task processor such as Celery?

collyw12y ago

Django has a fairly slow learning curve, but it is understandable. Celery was quicker for me, but more along the lines of: follow the instructions, and Google / Stack Overflow until it works. A lot less understanding involved.

denibertovicOP12y ago

It's somewhat akin to a Django vs. Flask discussion really. But yes for more light weight stuff I too would recommend rq.

mushfiq12y ago

I had the same feeling. Although I use Celery on several projects, it still takes me time to figure out what is going on in there. In particular, I have used it for a simple task queue, which was overkill. python-rq is definitely a good choice: it does one thing, its API is quite simple and short, and it does the job well.

findjashua12y ago

Thanks! I didn't know about RQ; from a cursory glance it does look a lot simpler than Celery. Sidenote: I decided to follow RQ's author on GitHub, and discovered Gitflow as well. So, double thanks!

infecto12y ago

This! Celery is well supported and powerful, but often it is just too much to manage. Every time an error crops up in our deployment, it takes too much time to figure out what's going on.

oulipo12y ago· 8 in thread

Wondering about something: if you need to have a long task (5s to 10s) in the background, or even longer, for an AJAX request, what should you rather do:

- use gevent + gunicorn, or Tornado, in order to keep a socket open while the worker is processing the task?

- use polling? (less efficient)

- use websockets (but then the implementation is perhaps a bit more complex)

can you do this simply using Flask?

denibertovicOP12y ago

Hmm, seems we're talking about 2 things here.

If your AJAX request requires long task processing and requires you to wait for it, then this is not a background task any more. It's done in one of the web server threads, and even if the thread outsources the task to another process, it's still waiting on that process to finish before returning the AJAX response. This is bad.

I'm not entirely convinced about websocket solutions in Python yet, but I've been told flask-websockets is awesome. Nevertheless, this doesn't solve the problem for you, because the request is just keeping an open line and waiting for a response... blocking is bad.

The simplest advice I have is to have the AJAX request trigger a background task and return immediately. The background task will then have some kind of side effect (e.g. write some result to a database somewhere), which the AJAX request can then look for with some kind of polling mechanism (on some other endpoint). Of course you can complicate this a lot, depending on your needs, but this seems like the most straightforward solution.
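The trigger-then-poll flow described above can be sketched with only the standard library. This is a minimal sketch under stated assumptions, not a Celery setup: a thread pool stands in for the worker, an in-memory dict stands in for the database, and `start_task`, `poll_task` and `slow_computation` are hypothetical names.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-ins (assumptions): a thread pool plays the worker,
# a dict plays the database table that stores results.
executor = ThreadPoolExecutor(max_workers=2)
results = {}


def slow_computation(x):
    """Hypothetical long-running job."""
    time.sleep(0.1)
    return x * 2


def _run(task_id, x):
    # The side effect: write the outcome where the poller can find it.
    results[task_id] = {"status": "done", "result": slow_computation(x)}


def start_task(x):
    """What the first endpoint would do: enqueue and return at once."""
    task_id = str(uuid.uuid4())
    results[task_id] = {"status": "pending", "result": None}
    executor.submit(_run, task_id, x)
    return task_id


def poll_task(task_id):
    """What the polling endpoint would do: report the current state."""
    return results.get(task_id, {"status": "unknown", "result": None})
```

The client keeps the returned ID and calls the polling endpoint until the status changes to "done".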

zo112y ago

"I'm not entirely convinced about websocket solutions in Python yet, but I've been told flask-websockets is awesome. Nevertheless this doesn't solve the problem for you. Cause the request is just keeping an open line and waiting for a response....blocking is bad." Tornado only blocks if you do something silly. It's event based, and can keep hundreds of connections open and waiting for it's async response event before actioning/responding the open connection.

"The most simplest advise I would have is to have the ajax request trigger a background task and return immediately. The background task will then have some kind of side effect (ie. write some result to a database somewhere) which the ajax request can the look for with some kind of polling mechanism (on some other endpoint)." Wow, overkill much? Polling is bad, and is exactly the kind of bad solution that a lot of these libraries are in place to prevent developers from needing to do.

Websockets were made to solve the long-polling and poll-spamming that was prevalent. Now all you have to do is keep a light, open web-socket connection to the server. And the server, being async/evented, will respond when the task is good and ready. Nice and clean.

oulipo12y ago

So you think polling is the most effective solution; that may well be the case.

I was thinking whether using something like gevent or Tornado, a bit like nodejs, would enable the webserver to keep the socket open without blocking while the computation is made in a worker, then return the result simply to the socket, thus avoiding having to write a more complex websocket-based or polling-based system, but rather using AJAX transparently :)

scott_w12y ago

For what it's worth, I'm working on exactly this problem with Django+Celery.

Polling seems to be the best way to do it, as it doesn't leave sockets open, and doesn't require a websocket enabled browser.

The implementation I'm working on involves keeping the task metadata in the DB, and polling against that lookup (it makes it easier to do things like restrict task results to specific users as well).

I was also thinking that another way to do it could be to write the result in its final format to a /ajax_output/ directory with a randomly generated name. Then your polling would depend entirely on nginx, which could end up being much more efficient than running through your application framework. Just make sure you regularly clean unused files if you have privacy concerns.

jaegerpicker12y ago

I really like Tornado and websockets, but keep in mind it gets dicey to scale on one box once you get to about 50 open connections at the same time. You can do things to stretch that out, but it's not the easiest thing. You also still have browser requirement issues. So it really depends on your use case. Polling, which is my least favorite method, is the most versatile. It's easy to use Flask for all of these approaches. That said, I'm a big fan of Tornado.

jkarneges12y ago

Pushpin could be good for keeping sockets open for the duration of a task without tying up threads in the webapp.

http://blog.fanout.io/2013/04/09/an-http-reverse-proxy-for-r...

mctx12y ago

What's the use case? Do you need to know exactly when the task is done? Does it vary in duration significantly? Can you split the call into two - one to start it, another to check the status given an ID?

oulipo12y ago

It could be the user sending a computation to the server and wanting its interface to be updated as soon as the computation is done, it is feasible by regularly polling the backend after launching a worker process, but this adds complexity compared to simply opening a non-blocking socket a la nodejs & waiting for the worker to finish its job & sending the result back to the browser

sylvinus12y ago· 7 in thread

I've worked 4+ years with Celery on 3 different projects and found it incredibly difficult to manage, both from the sysadmin and the coder point of view.

With that experience, we wrote a task queue using Redis & gevent that puts visibility & tooling first: http://github.com/pricingassistant/mrq

Would love to have some feedback on that!

scottc12y ago

I'll check this out. I recently started looking around for an alternative to celery. I literally just got over a celery-related bug that took way too long to diagnose and one that took even longer on my previous project.

I'm not very happy with the community either. What with the dispersed, incomplete documentation, multiple discussion forums, and snide responses, I'm really getting ready to wash my hands of it.

ris12y ago

This. In my experience Celery's capabilities are greatly oversold, both by itself and by others. Most problems Celery purports to solve it tends to just overcomplicate, and often doesn't really solve at all. And most of the time I've dug into the code I'd rather I hadn't (discovering that the thing you'd assumed was implemented fairly bulletproofly really wasn't).

And don't get me started on RabbitMQ.

collyw12y ago

That sounds like exactly what happened to me.

I asked about a progress bar, for a long running web request on Stack Overflow, and Celery seemed to be the accepted way to do that.

I managed to get it set up eventually. I realized a month or so back that it hadn't been running, and it has taken me about 3 times as long to get it up again as it did on the first try. I am sure there must be an easier way.

drbsg12y ago

My major bone of contention is the frequency and stability of releases. There are too many releases and not enough testing before each release. I have frequently found myself trying out a new release because it included a patch I wanted, to only find it has broken something else.

john2x12y ago

Looks interesting. Can't find any links to the docs?

sylvinus12y ago

We are writing more docs now and putting a small website up, currently they are in the README. We provide support by email, there are already a few third-party users using it in production.

bduerst12y ago

I can't find any either - it looks like mrq is a front-end dashboard for python-rq: http://python-rq.org/

waffle_ss12y ago· 7 in thread

I disagree with the characterization in #1 (although I can't speak to the Celery particulars). I feel like if you have a job that is critical to your business process, the job should be persisted to your database and created within the same database transaction as whatever is kicking off the job.

Consider how background jobs are typically managed with RabbitMQ, Redis, etc. They are usually created in an "after commit" hook from whatever gets persisted to your relational database. In this scenario, there is a gap between the database transaction being committed and the job being sent to and persisted by RabbitMQ or Redis; during this gap the only record of that task is being held in a process's memory.

If this process gets killed suddenly during this gap, that background job will be lost forever. It sounds unlikely, but if RabbitMQ or Redis is down and the process has to sit and retry, waiting for them to come back online, the gap can be sizable.
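The idea of creating the job in the same transaction as the data that triggers it can be sketched with `sqlite3` from the standard library. This is only an illustration under assumptions: the `orders` and `jobs` tables, the `send_receipt` job kind and the `fail` switch are all hypothetical, and a real setup would still need a worker that reads the `jobs` table.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, payload TEXT, "
    "status TEXT DEFAULT 'queued')"
)


def create_order_with_job(item, fail=False):
    """Insert the order and its follow-up job atomically."""
    # `with conn` commits on success and rolls back on any exception,
    # so the order and the job are persisted together or not at all.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES (?, ?)",
            ("send_receipt", json.dumps({"order_id": order_id})),
        )
    return order_id
```

With this layout there is no window in which the order exists but the only record of its job lives in process memory.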

denibertovicOP12y ago

I think you're missing the point. The Celery (or any task queue, really) particulars are very important here, because you don't want background workers hammering your database if they don't need to. The workers want to work with an AMQP implementation, which the database is not. It's like using a fork instead of a hammer: sure, you might drive a few nails, but it's not the right tool for the job.

The systems that use these kinds of tools are usually not structured in a way that they need to wait for something in the database to be stored. By nature they are async tasks and they should be able to run whenever and return sometime in the future, and they will most likely produce some kind of result in the database, so there is no reason to store the job information itself in the database.

Jobs are usually not created as hooks after a database commit, so persisting jobs within database transactions is not quite relevant. Celery also has failure mechanisms and ways to recover if it was not able to send a task to the broker (e.g. RabbitMQ was down).

Redis and RabbitMQ do have a mechanism for persisting jobs to disk as well, so they don't get lost when the process is restarted. So there is no way that a job gets lost forever, as you say, if you handle all these cases correctly.

One more thing: Python's database drivers don't work quite as you've described. Namely, they don't (by design) make use of the autocommit feature of the database engine; rather, they wrap every SQL statement in a transaction, so either way each statement gets executed separately in its own transaction. This would not guarantee, let's say, a DB record being added and the job being saved as well. You would have to use explicit atomic blocks (something akin to what Django >= 1.6 has) to get both things or neither to be persisted.

jaegerpicker12y ago

My reply should have been closer to this. I'm on painkillers and a little foggy, but this reply is correct. The DB record being saved does not guarantee the job will be saved, particularly with Django.

waffle_ss12y ago

I'm coming at it from the Ruby angle, in which jobs are often triggered using ActiveRecord after_commit hooks. I admit to being ignorant of the Python/Celery way of doing things so perhaps I am missing the point. I'm talking about jobs being produced atomically with the data that necessitated the background job (I realize not all background jobs are spawned in this fashion).

I agree with your point about polling being bad, however as someone pointed out below it's not an issue with Postgres's LISTEN/NOTIFY (and I added a note to the queue_classic gem which makes this easy to take advantage of in MRI Ruby).

Obviously I'm aware that Redis and RabbitMQ persist jobs. That's not what I was talking about at all.

I think we're on different wavelengths here so I'll let it be. :-)

jaegerpicker12y ago

I disagree with this, in my experience it's almost always a really bad idea to use the DB as a queue. If rabbitmq is down the process should retry a finite amount of times (usually 3 in our use case) then set a status on the db record. Then you have audits running to pick up records in that state and retry the process once the system is back up and running. That way nothing is lost and you gain all of the benefits of Rabbitmq.
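The retry-then-flag approach described above might look roughly like this. It is a sketch under assumptions: records are plain dicts standing in for database rows, `publish` is any callable that sends a job to the broker, and `BrokerDown`, `publish_with_retry` and `audit` are hypothetical names. A real version would probably also wait between attempts.

```python
MAX_ATTEMPTS = 3  # the comment above mentions "usually 3"


class BrokerDown(Exception):
    """Hypothetical error raised when the broker cannot be reached."""


def publish_with_retry(record, publish, max_attempts=MAX_ATTEMPTS):
    """Try to publish; after repeated failures, flag the record for an audit."""
    for _ in range(max_attempts):
        try:
            publish(record["id"])
        except BrokerDown:
            continue  # try again, up to max_attempts times
        record["status"] = "queued"
        return True
    record["status"] = "needs_retry"
    return False


def audit(records, publish):
    """Periodic sweep: re-send everything that was flagged earlier."""
    for record in records:
        if record["status"] == "needs_retry":
            publish_with_retry(record, publish)
```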

queuesaredbs12y ago

bad idea to use the DB as a queue

Not according to Jim Gray. See "THESIS: Queues are Databases"[1][2]

1- http://research.microsoft.com/apps/pubs/default.aspx?id=6849...

2- (pdf) http://research.microsoft.com/pubs/69641/tr-95-56.pdf

waffle_ss12y ago

So you are storing a job in the database then - a job whose job it is to send a job to RabbitMQ/Redis :-)

Honestly I think that's the ideal way of doing things, however that's not often how you see it done.

denibertovicOP12y ago

Precisely. Tnx jaegerpicker.

mickeyp12y ago· 5 in thread

Good, basic practices to follow. Here's a few more:

- If you're using AMQP/RabbitMQ as your result back end it will create a lot of dead queues to store results in. This can easily overwhelm your RabbitMQ server if you don't clear these out frequently. Newer releases of Celery will do this daily I think - but it's worth keeping in mind if your RMQ instance falls over in prod.

- Use chaining to build up "sequential" tasks instead of calling them one after another in the same task (or worse, doing one big mouthful of work in a single task), as Celery can prioritise many small tasks better than one "master" task that synchronously calls several tasks in a row.

- Try to keep a consistent module import pattern for Celery tasks, or explicitly name them, as Celery does a lot of magic in the background so that task spawning is seamless to the developer. This is very important: you should never mix relative and absolute imports when you are dealing with tasks. "from foo import mytask" may be picked up differently than "import foo" followed by "foo.mytask", resulting in some tasks not being picked up by Celery(!)

- OP's advice to never pass database objects is right, but go one step further and don't pass complex objects at all if you can avoid it. I vaguely remember some of the urllib/httplib exceptions in Python not being serializable and causing very cryptic errors if you didn't capture the exception and sanitise it or re-raise your own.

- Use proper configuration management to set up and configure Celery plus what ever messaging broker/backend. There's nothing more frustrating than spending your time trying to replicate somebody's half-assed Celery/Rabbit configuration that they didn't nail down and test properly in a clean-room environment.
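For the point about capturing and sanitising exceptions before they cross a task boundary, one possible shape is shown below. It is a standard-library sketch, not Celery code; `sanitise_exception` and `run_safely` are hypothetical helper names, and the dict layout is an assumption.

```python
import json


def sanitise_exception(exc):
    """Turn any exception into a plain, JSON-serialisable dict."""
    return {"type": type(exc).__name__, "message": str(exc)}


def run_safely(func, *args):
    """Run func and return either its result or a sanitised error."""
    try:
        return {"ok": True, "value": func(*args)}
    except Exception as exc:  # deliberately broad for the sketch
        return {"ok": False, "error": sanitise_exception(exc)}


def fetch_page(url):
    """Hypothetical task body that fails with a library-specific error."""
    raise ConnectionError("could not reach " + url)


outcome = run_safely(fetch_page, "http://example.invalid/")
# The outcome contains only strings and booleans, so it serialises cleanly.
encoded = json.dumps(outcome)
```

Whatever ends up in the queue or result store is then made of plain values only, regardless of which exception class the underlying library raised.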

yen22312y ago

With regards to #1: What happens is that if task_B depends on a value that task_A returns, task_A will insert its value into the queue and task_B will consume it.

If task_C returns a value which no other task cares about, it will insert the value into the queue, and never gets consumed. This is why dead queues (also known as "tombstones") happen.

Always remember to set ignore_result=True for tasks which don't return any consumed value.

EDIT: "Tombstones", not gravestones

denibertovicOP12y ago

In general, using an AMQP for result storage is somewhat of a bad idea, I think. But yes, I agree about ignoring results, seeing as most tasks I've seen in the wild don't return anything at all. Hence #6 in the post.

welder12y ago

I also like to wrap every task with a decorator which sends an email if the task fails:

https://gist.github.com/alanhamlett/dc8cdd4721ea63053f14

mjschultz12y ago

You might want to check out the CELERY_SEND_TASK_ERROR_EMAILS configuration option: http://celery.readthedocs.org/en/latest/configuration.html#c...

denibertovicOP12y ago

Awesome stuff, tnx for these. :)

keosak12y ago· 5 in thread

Points 1 and 2 are only valid because the Celery database backend implementation uses generic SQLAlchemy. Chances are, if you are using a relational database, it's PostgreSQL. And it does have an asynchronous notification system (LISTEN, NOTIFY), and this system allows you to specify which channel to listen/notify on.

With the psycopg2 module, you can use this mechanism together with select(), so your worker thread(s) don't have to poll at all. They even have an example in the documentation.

http://www.postgresql.org/docs/9.3/interactive/sql-notify.ht...

http://initd.org/psycopg/docs/advanced.html#async-notify
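The LISTEN plus `select()` approach can be sketched as follows, modelled on the pattern in the psycopg2 documentation linked above. The function names are hypothetical, and the code only assumes a psycopg2-style connection object (one that offers `cursor()`, `fileno()`, `poll()` and a `notifies` list) that has already been put into autocommit mode.

```python
import select


def listen(conn, channel):
    """Subscribe the connection to a channel.

    LISTEN takes an identifier, not a bound parameter, so `channel`
    must come from trusted code, never from user input.
    """
    cur = conn.cursor()
    cur.execute(f"LISTEN {channel};")


def wait_for_notifications(conn, timeout=5.0):
    """Block in select() until the server pushes something, then drain it."""
    readable, _, _ = select.select([conn], [], [], timeout)
    if not readable:
        return []  # timed out, nothing arrived
    conn.poll()  # let the driver read the pending messages
    received = []
    while conn.notifies:
        note = conn.notifies.pop(0)
        received.append((note.channel, note.payload))
    return received
```

A worker would call `listen(conn, "jobs")` once and then loop on `wait_for_notifications(conn)`, so it sleeps in `select()` instead of polling a table.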

denibertovicOP12y ago

It is true that Postgres supports Pub/Sub, but unfortunately the Celery broker driver does not take advantage of this. It would be great if we could get support for it. Nevertheless, just because it has pub/sub doesn't mean it's a full AMQP implementation. Also, there's the fact that most AMQP solutions are in memory, whereas a database is on disk, which also has its costs.

denibertovicOP12y ago

Anyone that's interested in Postgres's pub/sub might find this useful: https://denibertovic.com/talks/real-time-notifications/#/

Just slides though. Haven't gotten around to writing a post about it yet.

waffle_ss12y ago

Yep, in Ruby there is a background processing gem built around this: https://github.com/ryandotsmith/queue_classic

Unfortunately if you're using JRuby you can't benefit from this, as the Postgres JDBC driver does polling.

ddorian4312y ago

But won't each worker occupy a connection/session/process, which are heavyweight?

denibertovicOP12y ago

It is true that each worker will use up a connection. Not sure how heavy weight it is, depends on your setup and use-case I guess.

mataug12y ago· 4 in thread

What about using Redis as a celery backend ? Redis has a pub sub mechanism which seems quite reliable, so no need to poll.

TwistedWeasel12y ago

I used Redis for celery in production with great success for a year but then we started running some long running jobs that needed the ACKS_LATE setting and the Redis delivery timeout kept hurting us by resending the task to another worker. It's configurable but in the end we just switched to RabbitMQ. I found it quite painless to setup and migrate to.

denibertovicOP12y ago

Redis is still not an AMQP, but yes Redis's Pub/Sub works quite nicely. Out of all the brokers celery supports I'd recommend only RabbitMQ and Redis to people.

mataug12y ago

Yeah, I've been using redis with celery in production to perform lots of network io related tasks on a low end machine because

a > Redis uses less memory

b > Redis is easier to setup

yen22312y ago

Redis works great as a results backend, but I'd still use RabbitMQ for the queue. RabbitMQ is designed to be a message queue, and it does a great job at it.

TomaszZielinski12y ago· 3 in thread

This is not a Celery-specific tip, but as Celery also likes to "tweak" your logging configuration you can use https://pypi.python.org/pypi/logging_tree to see what's going on under the hood.

natedub12y ago

You can disable Celery's automatic logging configuration by connecting a listener to the setup_logging signal.

https://celery.readthedocs.org/en/latest/userguide/signals.h...

Of course, logging_tree is a great tool as well!

TomaszZielinski12y ago

Take a look at https://github.com/celery/celery/blob/v3.0.23/celery/utils/l... - it's an older version that I once checked but it seems to be patching loggers unconditionally (i.e. outside any signal handler).

denibertovicOP12y ago

Awesome, tnx for the tip! :)

geertj12y ago· 3 in thread

I've been looking at Python task queues recently. Does anyone have experience with how Celery and RQ stack up?

Rq is a lot smaller, more than 10x by line count. So if it works just as well, I'd go with the simpler implementation.

xenator12y ago

I used both and ended up with RQ. Freedom of choice can be good, but only when you are able to make a decision. A variety of backends and storages forces you to understand how each component really works, and when you dig into the details you find that they are not all equivalent. But you just need something f--kng working, and you don't want to pay another guy to maintain a zoo of different products.

That is why I decided to use RQ: it is better to know the limitations of something simple than to know the possibilities but be unable to make a choice.

geertj12y ago

That's very helpful, thanks!

SEJeff12y ago

rq is like a luger pistol, light, simple, gets the job done.

celery is like a .50 caliber machine gun, industrial strength, lots of options, used for a variety of completely different use cases.

For simple stuff, use rq, but celery + rabbitmq work better if you have dozens and dozens plus worker nodes (ie: different servers), whereas with rq, you use redis, which could potentially be a SPOF, even with redis sentinel.

zentrus12y ago· 2 in thread

Passing objects to Celery and not querying for fresh objects is not always a bad practice. If you have millions of rows in your database, querying for them is going to slow you way down. In essence, the reason you shouldn't use your database as the Celery backend is the same reason you might not want to query the database for fresh objects. It depends on your use case, of course. Passing plain values/strings should be strongly considered too, since serializing and passing whole objects when you only need a single value is not good either.

denibertovicOP12y ago

Oh absolutely values before objects. I said "serializing" more in the sense that pickle is always used for storing the arguments into the queue (or whatever the default serializer).

It always depends on your use-case but generally you want your application to behave correctly, which means it has to have correct/fresh data...you can't sacrifice correctness because of an inability to scale your database.

zentrus12y ago

Yes. I think saying "you can't sacrifice correctness because of an inability to scale your database" is perhaps conveying the wrong message though. I mean, your very first point is about database scaling issues and the advantages of using something like RabbitMQ to avoid expensive SQL queries.

If you are processing a lot of data in Celery, you really want to try to avoid performing any database queries. This might mean re-architecting the system. You might for example have insert-only tables (immutable objects) to address this type of concern.

Eric_WVGG12y ago· 2 in thread

Am I the only person who was genuinely disappointed that this wasn’t about the vegetable?

It’s a sadly under-rated ingredient! The flavor is subtle but unmistakable.

denibertovicOP12y ago

Sorry to disappoint. Flower isn't about a real flower either. :P

nemo161812y ago

my first thought was "@shit_hn_says is gonna love this..."

peedy12y ago· 2 in thread

Has anybody been able to make a priority queue (with a single worker) in celery?

Eg, execute other tasks only if there are no pending important tasks.

jordonwii11y ago

The FAQ question isn't very clear about it, but it doesn't look like it's possible: http://celery.readthedocs.org/en/latest/faq.html#does-celery...

denibertovicOP12y ago

I don't think it's possible, at least with Celery. The only way I've been able to do this is with more queues (and workers).

zrail12y ago· 2 in thread

Small typo where you define `CELERY_ROUTES`. `my_taskA` should probably have the routing key `for_task_A`, right?

denibertovicOP12y ago

Not really, that's just the name of the actual task itself ie. "def my_taskA(a, b, c)".

zrail12y ago

This line `'my_taskA': {'queue': 'for_task_A', 'routing_key': 'for_task_B'},` "for_task_B" should be "for_task_A" to match the CELERY_QUEUES definition. Unless I'm misunderstanding what you're doing, of course.
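The mismatch being pointed out can be illustrated with a small consistency check. This is a sketch only: `QUEUES` and `ROUTES` are plain-dict stand-ins for the post's `CELERY_QUEUES` and `CELERY_ROUTES`, every name other than the quoted line is an assumption, and `mismatched_routes` is a hypothetical helper.

```python
# Stand-in for CELERY_QUEUES: queue name -> the routing key it is bound with.
QUEUES = {
    "for_task_A": {"routing_key": "for_task_A"},
    "for_task_B": {"routing_key": "for_task_B"},
}

# Stand-in for CELERY_ROUTES, with each routing key matching its queue.
ROUTES = {
    "my_taskA": {"queue": "for_task_A", "routing_key": "for_task_A"},
    "my_taskB": {"queue": "for_task_B", "routing_key": "for_task_B"},
}


def mismatched_routes(routes, queues):
    """Return task names whose routing key differs from their queue's key."""
    bad = []
    for task, route in routes.items():
        queue = queues.get(route["queue"])
        if queue is None or queue["routing_key"] != route["routing_key"]:
            bad.append(task)
    return bad
```

Run against the line as quoted (queue `for_task_A` with routing key `for_task_B`), the check reports `my_taskA`.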

stefantalpalaru12y ago· 2 in thread

> when you have a proper AMQP like RabbitMQ

AMQP = Advanced Message Queuing Protocol, so it's wrong to say that a message broker is "an AMQP". Also, give Redis a try: it's much easier to set up and uses fewer resources.

We should probably talk about the elephant in the room when addressing newbies: the Celery daemon needs to be restarted each time new tasks are added or existing ones are modified. I got past that with the ugly hack of having only one generic task[1] but people new to Celery need to know what they're getting into.

[1]: https://github.com/stefantalpalaru/generic_celery_task

malinoff12y ago

Let me repeat, you don't need this to load/reload tasks. There is 'pool_restart' broadcast command[1].

[1]: http://docs.celeryproject.org/en/latest/userguide/workers.ht...

denibertovicOP12y ago

Noted, the wording is a bit contrived, I'll give you that. I like Redis as well; it was mentioned in the comments here a few times. Good thinking about pointing out reloading, btw. Tnx.

misiti378012y ago· 1 in thread

I would add:

1. Use task-specific logging if you have a bunch of tasks: http://blog.mapado.com/task-specific-logging-in-celery/

2. Use statsd counters to keep track of basic statistics (counts + timers) for each task

3. Use supervisor + monit to restart workers after a lack of activity (I have seen this happen a few times and have never been able to track down why, but this is an easy fix)

denibertovicOP12y ago

More awesome tips. Thank you.

TwistedWeasel12y ago· 1 in thread

Once you scale your worker pool up beyond a couple of machines you need some sort of config management with Celery. We use SaltStack to manage a large pool of celery workers and it does a pretty good job.

denibertovicOP12y ago

Indeed. I use Ansible myself.

stickperson12y ago· 1 in thread

I've heard so much about Celery but still have no clue when it would be used. Could someone give some specific examples of when you have used it? I don't really even know what a distributed task is.

denibertovicOP12y ago

A background task is just something that's computed outside of the standard http request/response process. So it's asynchronous in the sense that the result will be computed sometime in the future, but you don't care when.

Distributed just means that you can have your task processing spread out across multiple machines.

A specific example: let's say that after a user registers on your website for the first time, you want to get a list of all their Facebook/Twitter friends. This action will take a long time and is not vital to the registration/login process, so you set a task to do that later and let the user proceed to the site instead of making them look at a spinner the whole time. When the friend list becomes available, it will show up on the website (on their profile or wherever). Makes sense?

ehurrell12y ago

Excellent resource, I remember wrestling with learning celery and how to do some simple things, loved finding Flower to monitor things.

I will say though Celery is probably overkill for a lot of tasks people think to use it for, in my case it was mandated to support scaling for a startup that never launched, partly because they kept looking at new technologies for problems they didn't have yet.

TomaszZielinski12y ago

If you combine Celery with supervisord it's important to check the official config file[1]. At least two settings there are really important - `stopwaitsecs=600` and `killasgroup=true`. If you don't use them you might end up with a bunch of orphaned child Celery processes and your tasks might be executed more than once.

[1] https://github.com/celery/celery/blob/ee46d0b78d8ffc068d5b80...

harlowja12y ago

As one of the authors of taskflow, I'd like to give a little shout-out for its usage (since it can do similar things to Celery, hopefully more elegantly and easily).

Pypi: https://pypi.python.org/pypi/taskflow

Comments, feedback and questions welcome :-)

DrJ12y ago

I'd also add: Be wary of context dependent actions (e.g. render_template, user.set_password, sign_url, base_url) as you aren't in the application/request context inside of a celery task.
