It is hard to respond to a 6-part blog series - released all at once - in an HN thread.
- what we can deterministically show is data loss on apache kafka with no fsync() [shouldn't be a surprise to anyone] - stay tuned for an update here.
- the kafka partition model of one segment per partition could be optimized architecturally
- the benefit for all of us is that all of these things will be committed to the OMB (Open Messaging Benchmark) and will be on git for anyone interested in running it themselves.
- we welcome all confluent customers (since the post is from the field CTO office) to benchmark against us and choose the best platform. this is how engineering is done. In fact, we will help you run it at no cost. Your hardware, your workload, head-to-head. We'll help you set it up with both... but let's keep the rest of the thread technical.
- log.flush.interval.messages=1 - this is something we took a stance on a long, long time ago, back in 2019. As someone who has personally talked to hundreds of enterprises to date, most workloads in the world should err on the side of safety and flushing to disk (fsync()). Hardware is very good today and you no longer have to choose between safety and reasonable performance. This isn't the high latency you used to see on spinning disks.
Kafka and fsyncs: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...
Then, in Kafka, what if the leader dies from a power failure and comes back instantaneously?
i.e.: Let's say there are 3 replicas A(L), B(F), C(F) (L = leader, F = follower)
- 1) append a message to A
- 2) B and C replicate the message. The message is committed
- 3) A dies and comes back instantaneously, before zk.session.timeout elapses (i.e. no leadership failover happens), having lost the unflushed tail of its log due to no fsync
Then B and C truncate their logs and the committed message could be lost? Or is there any additional safety mechanism for this scenario?
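The scenario can be sketched as a toy model (illustrative only, not Kafka internals; "commit" here just means replicated to all three in-memory logs, and a power failure drops everything past the last fsync):

```python
# Toy model of the no-fsync truncation scenario (not real Kafka code).
# A is leader; B and C are followers.

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []        # in-memory log (page cache)
        self.durable = 0     # entries actually fsynced to disk

    def crash_and_restart(self):
        # Power failure: everything past the last fsync is gone.
        self.log = self.log[:self.durable]

A, B, C = Replica("A"), Replica("B"), Replica("C")

# 1) append a message to leader A (no fsync)
A.log.append("m1")
# 2) B and C replicate it; the message is now "committed"
B.log.append("m1"); C.log.append("m1")

# 3) A power-cycles and returns before the session timeout,
#    so it remains leader, but its unflushed tail is gone
A.crash_and_restart()

# If B and C truncate to match the leader's log end offset,
# the committed message m1 is silently lost on all replicas.
leader_end = len(A.log)
for follower in (B, C):
    follower.log = follower.log[:leader_end]

print(A.log, B.log, C.log)  # all empty: committed message lost
```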
1) What about Kafka + KRaft, doesn't that suffer the same problem you point out in Redpanda? If so, recommending to your customers to run KRaft without fsync would be like recommending running with a Zookeeper that sometimes doesn't work. Or do I fundamentally misunderstand KRaft?
2) You mention simultaneous power failures deep in the fsync blog post. I think this should be more visible in your benchmark blog post, where you write about turning off fsyncs.
Confluent themselves can show this; the part I'm curious about is whether you can show data loss outside of the known, documented failure modes. Because I, as can anyone, can show data loss by running a cluster without fsync and simultaneously pulling the plug on every server.
Woah, yeah that's a serious problem. Data loss under that scenario is nothing to sneeze at.
Try this on your laptop to see global data loss - hint: multi-AZ is not enough
Regardless of the replication mechanism you must fsync() your data to prevent global data loss in non-Byzantine protocols.
1. written to a majority 2. the majority has done an fsync()
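That rule can be sketched as a hypothetical helper (not from any real client library): a write is only durably committed once a majority of replicas have both appended it and fsynced it.

```python
# Hypothetical helper illustrating the durability rule above:
# a write is safe only when a majority has appended AND fsynced it.

def durably_committed(appended: int, fsynced: int, replicas: int) -> bool:
    majority = replicas // 2 + 1
    return appended >= majority and fsynced >= majority

# 3-replica cluster: acked by all 3 but fsynced by none is NOT safe;
# a simultaneous power failure can erase the acknowledged write.
print(durably_committed(appended=3, fsynced=0, replicas=3))  # False
print(durably_committed(appended=2, fsynced=2, replicas=3))  # True
```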
i can see us giving people opt-out options here in the future, though.
For Redpanda:
1. I don't like that they did not include full-disk performance; not sure if that was intentional, but it feels like it... Seems like an obvious gap in their testing. Perhaps most of their workloads have records time out rather than get pushed out by bytes first, not sure.
2. Their benchmark was definitely selective, sure, but they sell via proof of performance for tested workloads IIUC, not via their posted benchmarks. The posted benchmarks just get them into the proof stage of a sales pipeline.
For Kafka (and Confluent, and this test):
1. Don't turn off fsync for Kafka if you leave it on with Redpanda, that's certainly not a fair test.
Batching should be done on the client side anyway, as most client libraries already do by default. If you are worried about too many fsyncs degrading performance, batch harder on your clients. It's the better way to batch anyway.
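For example, with a Python client such as kafka-python, the relevant knobs are `linger_ms` and `batch_size` (the values below are illustrative assumptions, not recommendations; tune for your workload):

```python
# Illustrative client-side batching settings (kafka-python style names).
# Larger linger/batch values amortize per-request (and per-fsync) cost
# on the broker, at the price of a little extra client-side latency.

producer_config = {
    "bootstrap_servers": "localhost:9092",  # placeholder address
    "linger_ms": 10,            # wait up to 10 ms for a batch to fill
    "batch_size": 128 * 1024,   # up to 128 KiB per partition batch
    "acks": "all",              # require full ISR acknowledgement
}

# from kafka import KafkaProducer
# producer = KafkaProducer(**producer_config)  # requires a running broker

print(producer_config["linger_ms"], producer_config["batch_size"])
```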
2. If Confluent Cloud is using Java 11, then I don't like that Java 17 is used for this either. It's not a fair comparison, seeing that most people will want it managed anyway, so it gives unrealistic expectations of what they can get.
3. Confluent charges a stupid amount of money
4. The author works for Confluent, so I'm not convinced that this test would have been posted if they saw Redpanda greatly outperform Kafka
With Both:
1. Exactly-once delivery is total marketing BS. At least Redpanda mentions you need idempotency, but you get exactly-once behavior with full idempotency anyway. What you build should be prepared for this, not the infra you use, IMO, as all it takes is one external system breaking this promise for the whole system to lose it.
I prefer Redpanda as I find it easier to run, and Redpanda actually cares about their users whether they are paid or not. Confluent won't talk to you unless you have a monthly budget of at least $10k; Redpanda has extremely helpful people in their Slack just waiting to talk to you.
Ultimately you don't just buy into the software, you buy into the team backing it, and I'd pick Redpanda easily, knowing that they can actually help me and care without needing to give them $10k.
This is of course why performance suffers with 50 producers and 288 partitions: not because there is any inherent scale issue in supporting 50 clients (Redpanda supports 1000s of clients), but because a 500 MiB/s load spread out among 50 producers and 288 partitions is only ~36 KiB/s per partition-client pair, which is where batching happens. With a linger of 1 ms (the time you'd wait for a batch to form) that's only ~36 bytes per linger period, so this test is designed to ensure there is no batching at all, to maximize the cost of fsyncs and put Redpanda in a bad light.
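The arithmetic checks out:

```python
# Back-of-the-envelope check of the batching math above.
total_rate = 500 * 1024 * 1024    # 500 MiB/s in bytes/s
producers, partitions = 50, 288
pairs = producers * partitions    # 14,400 partition-client pairs

per_pair = total_rate / pairs     # bytes/s per pair
print(per_pair / 1024)            # ~35.6 KiB/s (the "~36 KiB/s" above)

linger_s = 0.001                  # 1 ms linger
print(per_pair * linger_s)        # ~36.4 bytes per linger window
```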
A second problem is that most benchmarks, including the one used here, use uniform timings for everything. E.g., when you set the OpenMessaging benchmark to send 1000 messages per second, it schedules a send of one message every 1 millisecond, exactly: i.e., there is no variance in the inter-message timing.
In the real world, message timing is often likely to be much more random, especially when the messages come from external events, like a user click or market event (these are likely to follow a Poisson distribution).
This actually ends up mattering a lot, because message batching will in general be worse under perfect uniformity. E.g., if you have a linger time of 1 ms, a rate of say 900 messages/sec will get no batching (other than forced batching), because each message arrives ~1.1 ms after the last, missing the linger period. If the arrival times were instead random, or especially if they were bursty, you’d get a fair amount of batching just due to randomness, even though the average inter-message time would still be 1.1 ms.
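A quick simulation of that effect, under a simplified model (my assumption, not the OpenMessaging benchmark's internals) where a batch opens on the first message and closes when the linger window expires:

```python
import random

def avg_batch_size(gaps, linger=0.001):
    # A batch opens at the first arrival and closes `linger` seconds
    # later; every message arriving before the close joins the batch.
    t, close_at = 0.0, -1.0
    msgs, batches = 0, 0
    for gap in gaps:
        t += gap
        msgs += 1
        if t > close_at:          # previous batch closed: open a new one
            batches += 1
            close_at = t + linger
    return msgs / batches

random.seed(0)
n, rate = 100_000, 900            # 900 msg/s -> ~1.11 ms between messages

uniform = avg_batch_size([1 / rate] * n)
poisson = avg_batch_size([random.expovariate(rate) for _ in range(n)])

print(uniform)   # exactly 1.0: every message misses the 1 ms linger
print(poisson)   # ~1.9: random gaps yield real batching at the same rate
```

With uniform spacing every batch holds a single message; with exponential (Poisson-process) gaps the average batch size lands near 1 + rate × linger ≈ 1.9.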
Disclosure: I work at Redpanda.
Respect to the Kafka team as Kafka is an incredible piece of software, but the Mongo guys got torched for eternity for pulling the same shenanigans.
https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...
Kafka has never tried to hide that fact and it does not, in any way, make Kafka unsafe.
Benchmarking a system that fsyncs every write against one that doesn't isn't an apples-to-apples comparison. You are free to make the argument that you might not need fsyncs, but if you are benchmarking systems and one of them fsyncs by default, that is the level of durability I'm going to expect; otherwise I can assume the other guy will be just as fast if he turns off fsyncs as well.
PS: I really miss working with mongodb. It's been almost 7 years since I last used it. I'm surprised I don't see it mentioned very often anymore.
With the usual recommended settings, XFS filesystem, 3 replicas, 2 "in-sync" replicas, etc., it is rather safe. You can also tune background flush to your liking.
The above tradeoffs are very reasonable, and Kafka runs very fast on slow disks (magnetic or in cloud), and even faster on SSD/NVMe disks.
And those that actually had tried it were aware that every client enabled fsync out of the box. So in fact the entire situation was seriously overblown.
But sure let irrational ideology affect your technology decisions. That will work out well.
And then in this article it's explained how Kafka is actually unsafe:
> Kafka may handle simultaneous broker crashes but simultaneous power failure is a problem.
i.e. it's safe just against simultaneous node crashes (whole VM/machine), not power failure.
I mean - sure in practice running in different AZs, etc. will probably be good enough, but technically...
In the tail there are all kinds of things that will lose you data. I've actually seen systems lose data with the fsync every message strategy on simultaneous power loss. There was latent corruption of the filesystem due to a kernel bug. After power cycling a majority of nodes had unrecoverable filesystems.
In my experience, even on modern flash the cost of fsync is non-trivial. It pessimizes IO. You can try to account for this with group commit / batching, but generally the batch window needs to be large relative to network RTT to be effective.
fsync is much more necessary on single primary systems.
My best guess is their storage volumes simply lied about the fsync, which I've heard a few times about different vendors.
Kafka has never pretended that ack'd messages have been persisted to disk, only that they've been replicated per your requested acks.
Just like using HyperLogLog is acceptable in many scenarios, using Kafka is also acceptable. I am quite baffled by how widespread the misuse of technology is.
Need reliable data storage? Use a database.
It seems really disingenuous to use empty-drive performance, since anyone who cares about performance is going to care about continuous use.
With page cache it's OK, because the FTL layer of the drive will work with 32 MiB blocks, but in the case of Redpanda the drive will struggle because FTL mappings are complex and GC has more work to do. If Kafka were doing fsyncs, the behaviour would be the same.
Overall, this looks like a smear campaign against Redpanda. The guy who wrote this article works for Confluent, and he published it on his own domain to look more neutral. The benchmarks are not fair because one of the systems is doing fsyncs and the other is not. Most differences could be explained by this fact alone.
In either case, Confluent Platform is ridiculously expensive and approached the cost (licensing alone) of our entire cloud spend. I'd love to see more run-on-k8s alternatives to CFK.
I have been using Pulsar for new projects not because of performance or anything but because all the features you expect to be built-in are. Georeplication, shared-subscription w/selective ACK, schema registry etc.
Also it's wildly more pluggable; the authn/authz plugin infrastructure in particular is great. I was even able to write a custom Pulsar segment compactor to do GDPR deletions without giving up offloaded segment longevity.
The segment offload is actually huge especially because tools like Secor for Kafka are dead now and you are stuck on the Kafka Connect ecosystem which personally I really find distasteful.
I don't even understand why Confluent prices their offering so high. It's not like real time is an exclusive capability that other platforms don't have.
I've found talking to Confluent about anything is a complete waste of time unless it's a very specific technical issue. They're always pushing their cloud as the solution, and it's very aggressive.
We ended up migrating to Aiven after finding Confluent pricing unreasonable.
Hopefully that can get ironed out in the future. Until then we will stick with the Strimzi operator and kafka.
Also Confluent is absolutely pricing themselves out of the market. We looked at their self hosted confluent operator and they wanted something like $9k per node, when they do nothing but provide an operator. Insanity.
Worked with many operators in the wild, and anything that gives you more control through CRDs/automation and less manual pod intervention is a huge win; it lets us bake it into our already existing pipelines for deployment and releases too. The Confluent($$$$)/Strimzi operators do well on that front. I'm super excited to have competition in this space!
I'll keep an eye out for the new release!
Cloud instances have their own performance pathologies, esp in the use of remote disks.
As for RP and Kafka performance, I'd love to see a parameter sweep over both configuration dimensions as well as workload. I know this is a large space, but it needs to be done to characterize the available capacity, latency and bandwidth.
There are so many dimensions, with configurations, CPU architecture, hardware resources plus all the workloads and the client configs. It gets kind of crazy. I like to use a dimension testing approach where I fix everything but vary one or possibly two dimensions at a time and plot the relationships to performance.
Can the instance do 2 GB/s to disk at the same time it is doing 3.1GB/s across the network? Is that bidirectional capacity or on a single direction? How many threads does it take to achieve those numbers?
That is kind of a nice property, that the network has 50% more bandwidth than the disk. 2x would be even nicer, but that turns out to be 1.5 and 3, so a slight reduction in disk throughput.
Are you able to run a single RP node and blast data into it over loopback? That could isolate the network and show how much of the available disk bandwidth a single node is able to achieve over different payload configurations before moving on to a distributed disk+network test. If it can only hit 1 GB/s on a single node, you know there is room to improve in the write path to disk.
The other thing that people might be looking for when using RP over AK is less jitter due to GC activity. For latency sensitive applications this can be way more important than raw throughput. I'd use or borrow some techniques from wrk2 that makes sure to account for coordinated omission.
https://github.com/giltene/wrk2
This is absolutely rich from the company that keeps promising "exactly once delivery" (with reams of fine print about what "exactly" and "once" and "delivery" mean).
If you don't fsync the batch, it's possible the server would send a response to the client saying the data was written successfully while the batch is still just in memory, and then the server loses power and never writes it to disk.
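The distinction is between data reaching the OS page cache and reaching the disk. A minimal sketch using POSIX-style calls from Python (file name is illustrative):

```python
import os
import tempfile

# os.write() only hands the bytes to the OS page cache; an ack sent at
# this point can precede the data ever reaching the disk. os.fsync()
# blocks until the kernel has flushed it to the device (assuming the
# device itself doesn't lie about its own write cache).

path = os.path.join(tempfile.mkdtemp(), "segment.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

os.write(fd, b"batch-1\n")  # in page cache; lost if power fails now
os.fsync(fd)                # durable, as far as the device is honest
os.close(fd)

with open(path, "rb") as f:
    print(f.read())         # b'batch-1\n'
```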
Maybe the author has a different definition of unsafe, but to me if it's not ACID it's unsafe!
But I don't think anyone would call a configuration where you can lose messages a safe configuration.
Well said
Seems like it'd be valuable to have a trusted third party like https://jepsen.io/ test it out! (not related, just a fan of their work)
I've found nats to be very lightweight, and it can bridge (bidirectional) to kafka.
Edit: Oh it also supports websockets to the browser
I don't think we can get a less reliable or trustworthy set of performance tests than when someone's paycheck depends on the outcome of those tests. If Redpanda's performance were found to be better, would he really publish the test results?
Personally I'm happy to see companies competing on performance like this. If one company puts out benchmarks I want to see their competition come in with their own benchmarks. Ideally we'll see improvements to both products, and a refined benchmarking suite and philosophy.
I do find it interesting that Confluent feels the need to respond to RP given the disparities in size, install base, etc.
For this particular post I like that they explained each settings change they're making and why. In many of these benchmarks people will make some change and either not mention it or won't explain why they made the change and users are left trying to figure it out.
But I've found across the industry that most benchmarks, especially for infrastructure software, are performed by the vendors. The burden of standing up the system(s) to pull off the benchmark is usually high enough that independents are rarely going to take up that banner and do it themselves.
Also, notably for closed-source systems, some vendors don't license their software in a way that allows public benchmarks.
So, transparency is all we can really hope for.
I remember the halcyon days of the database wars, with the vendors publishing new benchmarks seemingly every month. Fun to watch "Lies, damn lies, and statistics" rear up on its hind legs and roar. And some of the monster clusters of hardware these folks put together were legion.
Similarly I enjoyed when Sun was publishing JEE benchmarks on cheap hardware running Glassfish against MySQL. At least they were publishing on these smaller systems more akin to what many companies may run internally in contrast to these million dollar cluster benchmarks BEA and Oracle were publishing.
Finally, just to throw this out, modern hardware is just extraordinary. Hard to appreciate how fast modern machines are if you didn't live with them in the old days.
We're in the glory days where we, most of us, simply don't care. Off-the-shelf hardware running untuned servers with reasonable algorithms has so much bandwidth and capability that it just gets harder and harder to saturate today.
Interestingly that's not necessarily the case in the public cloud. I'm messing around with AWS storage for an upcoming talk. You definitely can saturate storage on AWS, and it's sometimes hard to tell why.
Does this benchmark compare a 3-node Kafka cluster against a 3-node Redpanda cluster? It's unclear.
https://github.com/Vanlightly/openmessaging-benchmark-custom...
I -think- they are saying the original benchmark results done by RedPanda show a 9 node Kafka cluster being beaten by a 3 node RP cluster.
This new benchmark, I assume, is being done on identical hardware (i.e. 3 nodes for both), just with less terrible Kafka settings.
> According to their Redpanda vs Kafka benchmark and their Total Cost of Ownership analysis, if you have a 1 GB/s workload, you only need three i3en.6xlarge instances with Redpanda, whereas Apache Kafka needs nine and still has poor performance.
but scanning it again, I think they are in fact doing 3 node vs 3 node benchmark. It's just a bit unclear.