I'm interested in seeing the code/config used to benchmark Celery. The default settings are not at all optimized for processing lots of small jobs, and you could easily tweak them to get a 100x speed-up for that use case, e.g.:
CELERYD_PREFETCH_MULTIPLIER = 0
CELERY_DISABLE_RATE_LIMITS = True
Also, channels are not reused unless you explicitly pass the publisher, so e.g.:
publisher = task.get_publisher()  # one publisher (and channel) for the whole batch
try:
    for i in xrange(1000):
        task.apply_async(args=(i,), publisher=publisher)
finally:
    publisher.close()  # close it even if a send fails
is known to be a massive speed-up for sending tasks in batch (it seems the creation of channels is very expensive in pyamqplib).
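
To make the channel point concrete, here's a toy sketch with no broker involved. FakeConnection, FakeChannel and the two send_* helpers are made-up names for illustration, not Celery or amqplib APIs, and it only counts channel creations rather than measuring real cost. It assumes the non-reusing path opens a fresh channel per message.

```python
class FakeChannel(object):
    """Hypothetical stand-in for an AMQP channel; records what gets published."""

    def __init__(self, connection):
        self.connection = connection
        self.closed = False

    def publish(self, body):
        self.connection.published.append(body)

    def close(self):
        self.closed = True


class FakeConnection(object):
    """Hypothetical stand-in for a broker connection that counts channel creations."""

    def __init__(self):
        self.channels_opened = 0
        self.published = []

    def channel(self):
        self.channels_opened += 1
        return FakeChannel(self)


def send_one_channel_per_message(connection, bodies):
    # Mimics the case where nothing is reused: open, publish, close for every message.
    for body in bodies:
        channel = connection.channel()
        try:
            channel.publish(body)
        finally:
            channel.close()


def send_with_shared_channel(connection, bodies):
    # Open a single channel up front and reuse it for the whole batch.
    channel = connection.channel()
    try:
        for body in bodies:
            channel.publish(body)
    finally:
        channel.close()


naive = FakeConnection()
send_one_channel_per_message(naive, range(1000))

shared = FakeConnection()
send_with_shared_channel(shared, range(1000))

print(naive.channels_opened, shared.channels_opened)
```

Both runs publish the same 1000 messages, but the first opens 1000 channels and the second opens one. If channel creation is the expensive part, that difference is where the batch speed-up would come from.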