Python Kafka Client Benchmarking

(activisiongamescience.github.io)

62 points · boredandroid · 10y ago · 18 comments

17 comments · 7 top-level

pixelmonkey · 10y ago · 4 in thread

My team at Parse.ly also did a benchmark comparing pykafka (pure Python) to pykafka with the librdkafka C extension enabled. That C module is clearly a huge win for Kafka consumer/producer performance on Python and other dynamic languages.

http://blog.parsely.com/post/3886/pykafka-now/

Unfortunately, as the OP illustrates, there are now 2 widely-used Python + Kafka drivers (pykafka and kafka-python), and as of recently, a third, confluent-kafka-python, which is a thin wrapper over librdkafka.

The reason for all this fragmentation is that Kafka was quite the moving target for non-JVM languages for the past three years. We have used it in production since Kafka 0.7, so we've had to live through it all blow-by-blow. I'm hoping that with Kafka 0.10 recently released, we can finally unify the community around a single driver (somehow).

dpkp · 10y ago

I enjoyed your blog post, but I don't think this is a fair characterization: kafka-python is not "mostly 0.9+ focused..." kafka-python is the only driver that is both forward and backwards compatible w/ kafka 0.8 through 0.10. As I'm sure you remember, kafka-python was the original 0.8 driver, written to support the 0.8 protocol b/c Samsa (pykafka's previous incarnation) was only supporting 0.7 and did not have any plans to upgrade.

jdennison · 10y ago

There is obvious value in having a pure python implementation of a kafka client. Many deployments don't want C extensions or want to use pypy. However, as python's scipy stack has shown, the right python api wrapping C code can have a vibrant community and the speed to boot.

pixelmonkey · 10y ago

@dpkp Apologies for that, I did not mean to mischaracterize. PyKafka also goes from 0.8 => 0.10. I had assumed kafka-python recently switched to be 0.9-only due to all the changes related to consumer groups.

dpkp · 10y ago

No apology required. Though note that pykafka requires >=0.8.2, and is only forwards compatible w/ newer brokers. This means that pykafka implements the 0.8.2 feature set. Newer brokers support that feature set, but you are not taking advantage of 0.9 or 0.10 features if you connect to them. kafka-python, on the other hand, is both forwards and backwards compatible. It supports all feature sets: from no offsets in 0.8, to zk offsets in 0.8.1, to kafka offsets in 0.8.2, to group management in 0.9, to message timestamps and relative-offset compressed messages in 0.10. The feature set to use is chosen based on the broker version we're connected to. As far as I know, no other client supports this approach -- not python, not java, etc. [Though KIP-35 should open this up to other clients for backwards compatibility starting at 0.10]
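To make the "feature set chosen by broker version" idea concrete, here is a minimal sketch of such a lookup. It only encodes the version-to-feature progression listed in the comment above; the names `FeatureSet`, `parse_version`, and `features_for` are hypothetical and are not kafka-python's actual API or implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSet:
    """Hypothetical container for the features named in the comment."""
    offset_storage: str               # "none", "zookeeper", or "kafka"
    group_management: bool            # 0.9+
    message_timestamps: bool          # 0.10+
    relative_offset_compression: bool  # 0.10+


def parse_version(text):
    """Turn '0.8.2' into (0, 8, 2), padding short versions with zeros."""
    parts = tuple(int(p) for p in text.split("."))
    return parts + (0,) * (3 - len(parts))


def features_for(broker_version):
    """Pick a feature set from the broker version (illustrative only)."""
    v = parse_version(broker_version)
    if v < (0, 8, 0):
        raise ValueError("this sketch only covers brokers >= 0.8")
    if v >= (0, 8, 2):
        storage = "kafka"
    elif v >= (0, 8, 1):
        storage = "zookeeper"
    else:
        storage = "none"
    return FeatureSet(
        offset_storage=storage,
        group_management=v >= (0, 9, 0),
        message_timestamps=v >= (0, 10, 0),
        relative_offset_compression=v >= (0, 10, 0),
    )


if __name__ == "__main__":
    for version in ("0.8.0", "0.8.1", "0.8.2", "0.9", "0.10.0"):
        print(version, features_for(version))
```

A real client would also have to discover the broker version in the first place, which this sketch does not attempt.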

1 more reply

fluential · 10y ago · 3 in thread

After a quick glance, the first thing that strikes me is the use of docker to measure the performance of a network-bound application. Docker handles networking differently across versions, and the default setup may have quite a significant impact on your results; a good example comes from the percona guys: https://www.percona.com/blog/2016/02/05/measuring-docker-cpu... I wonder what the results would be without docker, or with docker using --net=host.

jdennison · 10y ago

After rerunning the tests with docker --net=host I see a small bump in the rate, ~1% across all the clients.

Msgs/s (the last figure on each line is the ratio of the two rates, not a percentage)

confluent_kafka_consumer : 277573.293164 / 261407.908007 = 1.062

pykafka_consumer : 33433.342585 / 33976.938217 = 0.984

pykafka_consumer_rdkafka : 164311.503412 / 172008.742201 = 0.955

python_kafka_consumer : 37667.971237 / 38622.727894 = 0.975

So yes, docker network magic adds overhead, but the bias is consistent across all clients.
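For anyone who wants to check the arithmetic, the short sketch below recomputes the ratio and the percentage difference for each pair of rates posted above. The thread does not say which number in each pair is the --net=host run, so the code just labels them `first` and `second`; those labels and the helper name `compare` are assumptions made for illustration.

```python
# Msgs/s pairs copied from the comment above, in the order they were posted.
RATES = {
    "confluent_kafka_consumer": (277573.293164, 261407.908007),
    "pykafka_consumer": (33433.342585, 33976.938217),
    "pykafka_consumer_rdkafka": (164311.503412, 172008.742201),
    "python_kafka_consumer": (37667.971237, 38622.727894),
}


def compare(first, second):
    """Return (ratio, percent difference) of first relative to second."""
    ratio = first / second
    return ratio, (ratio - 1.0) * 100.0


if __name__ == "__main__":
    for client, (first, second) in RATES.items():
        ratio, pct = compare(first, second)
        print(f"{client:26s} ratio={ratio:.3f} diff={pct:+.1f}%")
```

By this calculation the per-client differences range from a few percent down to a few percent up, rather than a uniform shift.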

StreamBright · 10y ago

I guess some performance testers just don't know what they are measuring, in this case: the overhead of docker or the performance of the Python code. To be fair, it is hard to understand a whole system's performance. I would love to see a test without Docker though.

jdennison · 10y ago

Original author here. The docker network point is a good one, I'll give it a try with host network.

There is still value in comparing different clients under the same network constraints. Yeah, it is a contrived setup (noted in the post), but at least it is the same contrived setup for each test.

dpkp · 10y ago · 1 in thread

kafka-python maintainer here. Our library is designed to be correct first, easy to use second, and fast third. It should not be surprising to anyone that using C extensions improves python performance. I have avoided requiring C compilation in kafka-python primarily because I've found that very few python users care about processing >10K messages per second per core (remember, in python w/o C extensions you are generally bound to a single CPU, so spinning up multiple processes usually improves performance; see multiprocessing). I've also found the python infrastructure for distributing C extensions to be not easy (see goal #2 above). But that is changing! I would definitely consider leveraging C extensions for wire protocol decoding given the recent improvements to wheel distribution on linux. I'm not sure whether I would go so far as to delegate the entire client to a C extension. Part of the fun of python is that you can play with all of the guts at runtime. I've found users are very willing to hack up kafka-python internals to help debug issues. I don't think I could expect the same community involvement if it was all distributed as a compiled C extension. But I could be wrong.
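As a rough illustration of the multiprocessing point, the sketch below fans CPU-bound per-message work out over several worker processes using the standard library's `multiprocessing.Pool`. It does not talk to Kafka at all: the in-memory `messages` list, the CRC32 "work", and the helper names `split_round_robin`, `process_partition`, and `run_parallel` are all stand-ins invented for this example.

```python
import multiprocessing as mp
import zlib


def split_round_robin(items, n):
    """Deal items out into n roughly equal chunks, one per worker."""
    return [items[i::n] for i in range(n)]


def process_partition(payloads):
    """Stand-in for one worker: do some CPU work per message.

    Returns (message count, xor of CRC32 checksums).
    """
    checksum = 0
    for payload in payloads:
        checksum ^= zlib.crc32(payload)
    return len(payloads), checksum


def run_parallel(payloads, workers):
    """Process all payloads across `workers` processes; return total count."""
    chunks = split_round_robin(payloads, workers)
    with mp.Pool(processes=workers) as pool:
        results = pool.map(process_partition, chunks)
    return sum(count for count, _ in results)


if __name__ == "__main__":
    # Fake messages; a real consumer would read these from a broker.
    messages = [b"message-%d" % i for i in range(100_000)]
    print(run_parallel(messages, workers=4))
```

Whether this helps in practice depends on how much of the time is spent in Python-level work rather than waiting on the network.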

Anyways, always fun to read benchmarks. I hope kafka-python makes someone out there smile. That's the best benchmark in my book.

pwang · 10y ago

Distributing Python + C extensions is easy with Conda.

https://conda-forge.github.io/

iamspoilt · 10y ago · 1 in thread

I ran a couple of Kafka client benchmarks using Python, Jython and Java and got pretty interesting results. Check them here: http://mrafayaleem.com/2016/03/31/apache-kafka-producer-benc...

yahyaheee · 10y ago

Would have been interesting to add the c-wrappers in there, but still cool. Thanks

willvarfar · 10y ago · 1 in thread

Ah, this reminds me of one of the trickiest bugs I ever tracked down: https://github.com/dsully/pykafka/pull/15

DanWaterworth · 10y ago

You have my condolences.

nerdwaller · 10y ago

Has anyone tried much with the aiokafka library for asyncio (https://github.com/aio-libs/aiokafka)?

sheeshkebab · 10y ago

>I ran these tests within Vagrant hosted on a MacBook Pro 2.2Ghz i7.

Good ole laptop benchmarks
