My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.
A while back I was on team building a non-critical, low volume application. It just involves people sending a message basically. (there is more too it).
The consultants said he had to use Kafka because the messages could come in really fast.
I said we should stick with Postgres.
No, they said, we really need Kakfa to be able to handle this.
Then I went and spun up Postgres on my work laptop (nothing special), and got a loaner to act as a client. I simulated about 300% more traffic than we had any chance of getting. It worked fine. (did tax my poor work laptop).
No, we could not risk it, when we use Kafka we are safe.
Took it to management, Kafka won since Buzzword.
Now of course we have to write a process to feed the data into Postgres. After all its what everything else depends on
And thus we have a world where people have business needs that could be powered by my low end laptop but solutions inspired by the megacorps.
Sorry thats just a clickbait-y statement. I love Postgres, try handling 100-500k rps of data coming in from various sources reading and writing to it. You are going to get bottlenecked on how many connections you can handle, you will end up throwing pgBouncers on top of it.
Eventually you will run out of disk, start throwing more in.
Then end up in VACCUUM hell all while having a single point of failure.
While I agree Kafka has its own issues, it is an amazing tool to a real scale problem.
i love posgresql, but i would not use it to replace a rabbitmq instance -- one is an RDBMS, the other is a queue/event system.
"oh but psql can pretend to be kafka/rabbitmq!" -- sure, but then you need to add tooling to it, create libraries to handle it, and handle all the edge cases.
with rmq/kafka, there already a bunch of tools to handle the exact case of a queue/event system.
Kafka has its use case. Databases have theirs. You can make a DB do what kafka does. But you also add in the programming overhead of getting the DB semantics correct to make an event system. When I see people saying 'lets put the DB into kafka' I make the exact same argument. You will spend more time making kafka act like a database and getting the semantics right. Kafka is more of a data/event transportation system. A DB is an at rest data store that lets you manipulate the data. Use them to their strengths or get crushed by weird edge cases.
If you need queuing, you have libs like celery.
You don't need to go full kafka
Sometimes the restrictions placed on the user are as important. Kafka presents a specific interface to the user that causes users to build their applications in certain way.
While you can replicate almost all functionality of Kafka with Postgres (except for performance, but hardly anybody needs as much of it), we all know what we end up with when we set up Postgres and use it to integrate applications with each other.
If developers had discipline they could of course crate tables with appendable logs of data, marked with a partition, that consumers could process from with basically same guarantees as with Kafka.
But that is not how it works in reality.
It’s not that there’s anything wrong with Kafka, it’s a very good product and extremely robust. Same with Kubernetes, it has it uses and I can’t fault anyone for having it as a consideration.
My problem is when people ignore how capable modern servers are, and when developers don’t see the risks in building these highly complex systems, if something much simpler would solve the same problem, only cheaper and safer.
But there are many many projects where Kafka is used for low value event sourcing stuff where a SQL DB could be easier.
But for a production-grade version of the system I'm going with SQL and, where needed, IaC-defined SQS.
I've found this question very useful when pitched any esoteric database.
"You really can replace Kafka with a database."
They first gave their credentials by mentioning their experience.
Then they basically said ”given what I know about Kafka, with my experience, I require other people who ask for it to show me that they really need it before I accommodate them - often a beefy Postgres is enough”.
Postgres can pretend to be Redis, RabbitMQ, and Kafka, but redis, RabbitMQ, and Kafka would have a hard time pretending to be Postgres.
Postgres has the best database query language man has invented so far (AFAIK), well reasoned persistence and semantics, and as of recently partitioning features to boot and lots of addons to support different usecases. Despite all this postgres is mostly Boring Technology (tm) and easily available as well as very actively developed in the open, with a enterprise base that does consulting first and usually upstreams improvements after some time (2nd Quadrant, EDB, Citus, TimescaleDB).
The other tools win on simplicity for some (I'd take managing a PostgreSQL cluster over RMQ or Kafka any day), but for other things especially feature wise Postgres (and it's amalgamation of mostly-good-enough to great features) wins IMO.
...
You could probably hack any database to perform any task, but why would you? Use the right tool for the right task, not one tool for all tasks.
If the right tool is a relational database, then use Postgres/$OTHER-DATABASE
If the right tool is distributed, partitioned and replicated commit log service, then use Kafka/$OTHER-COMMIT-LOG
Not sure why people get so emotional about the technologies the know the best. Sure you could hack Postgres to be a replicated commit log, but I'm sure it'll be easier to just throw in Kafka instead.
It feels like two different lenses onto the same reality.
Transparency: I work for Timescale
> using a hammer to squash a bug..
Agreed - but Kafka is a much much bigger hammer. SES/Az Queues are also good choices.
But, that's speaking from my light experience with it. I'm also curious if there's a better way :-)
This ends up being required for one of our architectural goals, which is fully repeatable transformations: You must be able to model a transactional decision as a Flow derivation (like "does account X have funds to transfer Y to Z ?", and if you create a _copy_ of that derivation months later, get the exact same result.
Under the hood (and simplifying a bit) Flow always does a streaming shuffled read to map events from partitions to task shards, and each shard maintains a min-heap to process events in their ~wall-time order.
This also avoids the common "Tyranny of Partitioning", where your upstream partitioning parallelism N also locks you into that same task shard parallelism -- a big problem if tasks manage a lot of state. With a read-time shuffle, you can scale them independently.
You get guaranteed ordering at the partition level.
Items are partitioned by key so you also get guaranteed ordering for a key.
If you have guaranteed ordering for a key you can’t get total ordering across all keys but you can get eventual consistency across the keys.
Ultimately if you want ordering you have to design around being eventually consistent.
I don’t read a lot of papers but Leslie Lamports Time, Clocks, and the Ordering of Events in a Distributed System gave me a lot of insight in to the constraints. https://lamport.azurewebsites.net/pubs/time-clocks.pdf
Commit the checkpoint alongside state mutations in a single store transaction. Only then do you publish ACKs to all of the downstream streams.
Of course, you can fail immediately after commit but before you get around to publishing all of those ACKS. So, on recovery, the first thing a task assignment does is publish (or re-publish) the ACKs encoded in the recovered checkpoint. This will either 1) provide a first notification that a commit occurred, or 2) be an effective no-op because the ACK was already observed, or 3) roll-back pending messages of a partial, failed transaction.
More details: https://gazette.readthedocs.io/en/latest/architecture-exactl...
https://docs.oracle.com/en/middleware/fusion-middleware/12.2...
Using a very broad definition of ‘noSQL’ approach that would include solutions like Kafka, the issue becomes clear: A 2PC or ‘distributed transaction manager’ approach ala JTA comes with a performance/scalability cost — arguably a non-issue for most companies who don’t operate at LinkedIn scale (where Kafka was created).
Although, from various code bases I've seen, a lot of people just don't seem to worry about the possibility of data loss.
Data retention time is Kafka config 101. Are you sure it was a bug?
you can either send a kafka message but potentially not commit the db transaction (i.e. an event is published for which the action did not actually occur) or commit the db transaction and potentially not send the kafka message
it sounds like they implemented something like the Transactional Outbox pattern https://microservices.io/patterns/data/transactional-outbox....
i.e. you use the db transaction to also commit a record of your intent to send a kafka message - you can then move the actual event sending to a separate process and implement at-least-once semantics
This is the job queueing system they described in the article
Regarding:
When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved
I don't understand where either 1) the incorrect state or 2) the need to continuously send duplicate messages come from.
Regarding:
The Job might still fail during execution, in which case it’s retried with exponential backoff, but at least no updates are lost. While the issue persists, further state change messages will be queued up also as Jobs (with same group value). Once the (transient) issue resolves, and we can again produce messages to Kafka, the updates would go out in logical order for the rest of the system and eventually everyone would be in sync.
This is the part that is equivalent to Kafka-first, except with all the extra steps of a job scheduling, grouping, tracking, and execution framework on top of it.
It also natively stores data as files in cloud storage. Brokers are ephemeral, you don't need to migrate data between them, and you're not constrained by their disk size. Gazette defaults to exactly-once semantics, and has stronger replication guarantees (your R factor is your R factor, period -- no "in sync replicas").
Estuary Flow [2] is building on Gazette as an implementation detail to offer end-to-end integrations with external SaaS & DB's for building real-time dataflows, as a managed service.
[0]: https://github.com/gazette/core [1]: https://gazette.readthedocs.io/en/latest/consumers-concepts.... [2]: https://github.com/estuary/flow
(2) this problem should be avoided in general by just having idempotent services. Just as a hard restriction, forever, build services to be idempotent. It should be the exception to have a non-idempotent service, and it should be carefully understood.
That said, if you have (1) as a consistent issue, like if every message is flaky, kafka isn't the right solution. Postgres-based queues are perfect for this because you can examine the table as a whole, making more informed decisions about what you want to process (or not process).
Maybe I'm misunderstanding the article, but having "Job tasks" both insert another Job to run as well as updating DB state, and then having the executor pick up the previously inserted Job (whos only purpose is to send a kafka message) seems overly complex. I'm having trouble seeing why this is needed.
As for ensuring transactional consistency that isn't so bad, you can use an table to track offset inserts making sure you verify from that before you update consumer offsets (or Pulsar subscription if you go that route).
Here is an example using Typescript SDK:
async function main(userId, intervals){
// Send reminder emails, e.g. after 1, 7, and 30 days
for (const interval of intervals) {
await sleep(interval * DAYS);
// can take hours if the downstream service is down
await activities.sendEmail(interval, userId);
}
// Easily cancelled when user unsubscribes
}
Disclaimer: I'm one of the creators of the project.