Issues we've encountered while building a Kafka based data processing pipeline (opens in new tab)

(sixfold.medium.com)

158 pointspoolik4y ago93 comments

93 comments

68 comments · 12 top-level

anotherhue4y ago· 38 in thread

I ran a few dozen kafka clusters at MegaCorp in a previous life.

My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

ThinkBeat4y ago

This is exactly how I feel about it.

A while back I was on team building a non-critical, low volume application. It just involves people sending a message basically. (there is more too it).

The consultants said he had to use Kafka because the messages could come in really fast.

I said we should stick with Postgres.

No, they said, we really need Kakfa to be able to handle this.

Then I went and spun up Postgres on my work laptop (nothing special), and got a loaner to act as a client. I simulated about 300% more traffic than we had any chance of getting. It worked fine. (did tax my poor work laptop).

No, we could not risk it, when we use Kafka we are safe.

Took it to management, Kafka won since Buzzword.

Now of course we have to write a process to feed the data into Postgres. After all its what everything else depends on

jghn4y ago

People tend to not realize how big "at scale" problems really are. Instead anything at a scale at the edge of their experience is "at scale" and they reach for the tools they've read one is supposed to use in those situations. It makes sense, people don't know what they don't know.

And thus we have a world where people have business needs that could be powered by my low end laptop but solutions inspired by the megacorps.

2 more replies

sparsely4y ago

There are various Kafka to Postgres adaptors. Of course, now you're running 3 bits of software, with 3 bottlenecks, instead of just 1.

dtech4y ago

Were you advocating an endpoint + DB or different apps directly writing into a shared DB? The latter is not really a good idea for numerous reasons. Kafka a - potentially overkill - replacement for REST or whatever, not for your DB.

i_like_waiting4y ago

But can you stream data from lets say MS SQL directly into Postgres? Easiest way I found is Kafka, I would love some simple python script instead

1 more reply

superyesh4y ago

>My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

Sorry thats just a clickbait-y statement. I love Postgres, try handling 100-500k rps of data coming in from various sources reading and writing to it. You are going to get bottlenecked on how many connections you can handle, you will end up throwing pgBouncers on top of it.

Eventually you will run out of disk, start throwing more in.

Then end up in VACCUUM hell all while having a single point of failure.

While I agree Kafka has its own issues, it is an amazing tool to a real scale problem.

dwohnitmok4y ago

I think anotherhue would agree that half a million write requests per second counts as a valid answer to "you can't do what you need with a beefy Postgres," but that is also a minority of situations.

1 more reply

fernandotakai4y ago

at least for me, every system has its place.

i love posgresql, but i would not use it to replace a rabbitmq instance -- one is an RDBMS, the other is a queue/event system.

"oh but psql can pretend to be kafka/rabbitmq!" -- sure, but then you need to add tooling to it, create libraries to handle it, and handle all the edge cases.

with rmq/kafka, there already a bunch of tools to handle the exact case of a queue/event system.

1 more reply

doliveira4y ago

Yeah, once you do have to scale a relational database you're in for a world of pain. Band-aid after band-aid... I very much prefer to just start with Kafka already. At the very least you'll have a buffer to help you gain some time when the database struggles.

anotherhue4y ago

Indeed! That sounds absolutely like it requires a real pub/sub. 'Actually needing it' is the exception in my experience though.

sumtechguy4y ago

I have done both patterns.

Kafka has its use case. Databases have theirs. You can make a DB do what kafka does. But you also add in the programming overhead of getting the DB semantics correct to make an event system. When I see people saying 'lets put the DB into kafka' I make the exact same argument. You will spend more time making kafka act like a database and getting the semantics right. Kafka is more of a data/event transportation system. A DB is an at rest data store that lets you manipulate the data. Use them to their strengths or get crushed by weird edge cases.

tengbretson4y ago

I'd argue that having the event system semantics layered on top of a sql database is a big benefit when you have an immature product, since you have an incredibly powerful escape hatch to jump in and fire off a few queries to fix problems. Kafka's visibility for debugging is pretty poor in my experience.

1 more reply

BiteCode_dev4y ago

Well, if you want events, you have lighter alternative. Redis pub/sub, crossbar... Even rabbitMQ is lighter.

If you need queuing, you have libs like celery.

You don't need to go full kafka

3 more replies

lmilcin4y ago

Just because you can do it with Postgres doesn't mean it is the best tool for the job.

Sometimes the restrictions placed on the user are as important. Kafka presents a specific interface to the user that causes users to build their applications in certain way.

While you can replicate almost all functionality of Kafka with Postgres (except for performance, but hardly anybody needs as much of it), we all know what we end up with when we set up Postgres and use it to integrate applications with each other.

If developers had discipline they could of course crate tables with appendable logs of data, marked with a partition, that consumers could process from with basically same guarantees as with Kafka.

But that is not how it works in reality.

mrweasel4y ago

We work with a client who has requested a Kafka cluster. They can’t really say why, or what they need it for, but the now have a very large cluster which doesn’t do much. I know why they want it, same reason why they “need” Kubernetes. So far they use it as a sort of message bus.

It’s not that there’s anything wrong with Kafka, it’s a very good product and extremely robust. Same with Kubernetes, it has it uses and I can’t fault anyone for having it as a consideration.

My problem is when people ignore how capable modern servers are, and when developers don’t see the risks in building these highly complex systems, if something much simpler would solve the same problem, only cheaper and safer.

sparsely4y ago

I've also had much more success with "queues in postgres" than "queues in kafka". I'm sure the use cases exist where kafka works better, but you have to architect it so carefully, and there are so many ways to trip up. Whereas most of your team probably already understands a RDMS and you're probably already running one reliably.

gilbetron4y ago

What velocity have you achieved with postgres, in terms of # messages/sec where messages could range between 1KB-100KB in size?

lima4y ago

Yup. That's what Kafka excels at and where it scales to throughput way beyond Postgres.

But there are many many projects where Kafka is used for low value event sourcing stuff where a SQL DB could be easier.

afandian4y ago

I had a great time with Kafka for prototyping. Being able to push data from a number of places, have mulitple consumers able to connect, go back and forth though time, add and remove independent consumer groups. Ran in pre-production very reliably too, for years.

But for a production-grade version of the system I'm going with SQL and, where needed, IaC-defined SQS.

aqme284y ago

> Show me that you can't do what you need with a beefy Postgres.

I've found this question very useful when pitched any esoteric database.

merarischroeder4y ago

I agree so much I wrote an article about that https://link.medium.com/eQqVhuvRtkb

"You really can replace Kafka with a database."

ceencee4y ago

I see posts like this a lot, and it makes me wonder what the heck you were using Kafka for that Postgres could handle, yet you had dozens of clusters? I question if you actually ever used Kafka or just operated it? Sure anyone can follow the "build a queue on a database pattern" but it falls over at the throughputs that justify Kafka. If you have a bunch of trivial 10tps workloads, of course a distributed system is overkill.

Cederfjard4y ago

They didn’t say that the Kafka clusters they personally ran could have been handled with Postgres instead.

They first gave their credentials by mentioning their experience.

Then they basically said ”given what I know about Kafka, with my experience, I require other people who ask for it to show me that they really need it before I accommodate them - often a beefy Postgres is enough”.

anotherhue4y ago

A major ecommerce site. We had hundreds of thousands of messages/s but for most use cases YAGNI.

alephu54y ago

Why Postgres? Why not redis, rabbitMQ or even Kafka itself?

hardwaresofton4y ago

Not those other tools because you can't achieve Postgres functionality with those other tools generally.

Postgres can pretend to be Redis, RabbitMQ, and Kafka, but redis, RabbitMQ, and Kafka would have a hard time pretending to be Postgres.

Postgres has the best database query language man has invented so far (AFAIK), well reasoned persistence and semantics, and as of recently partitioning features to boot and lots of addons to support different usecases. Despite all this postgres is mostly Boring Technology (tm) and easily available as well as very actively developed in the open, with a enterprise base that does consulting first and usually upstreams improvements after some time (2nd Quadrant, EDB, Citus, TimescaleDB).

The other tools win on simplicity for some (I'd take managing a PostgreSQL cluster over RMQ or Kafka any day), but for other things especially feature wise Postgres (and it's amalgamation of mostly-good-enough to great features) wins IMO.

1 more reply

capableweb4y ago

> My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

...

You could probably hack any database to perform any task, but why would you? Use the right tool for the right task, not one tool for all tasks.

If the right tool is a relational database, then use Postgres/$OTHER-DATABASE

If the right tool is distributed, partitioned and replicated commit log service, then use Kafka/$OTHER-COMMIT-LOG

Not sure why people get so emotional about the technologies the know the best. Sure you could hack Postgres to be a replicated commit log, but I'm sure it'll be easier to just throw in Kafka instead.

chartpath4y ago

Something I find peculiar about this truth is that Postgres itself is based on the write ahead log structure for sequencing statements, replication, audit logging.

It feels like two different lenses onto the same reality.

silisili4y ago

Can you kinda high level the setup/processes for making Postgres a replacement for Kafka? I've not attempted such a thing before, and wonder about things like expiration/autodeletion, etc. Does it need to be vacuumed often, and is that a problem?

dionian4y ago

Honest question, how do you expire content in Postgres? Every time I start to use it for ephemeral data I start to wonder if I should have used TimescaleDB or if I should be using something else...

LoriP4y ago

You likely already know that TimescaleDB is an extension to PostgreSQL, so you get everything you'd get with PostgreSQL plus the added goodies of TimescaleDB. All that said, you can drop (or detach) data partitions in PostgreSQL (however you decide to partition...) Does that not do the trick for your use case, though? https://www.postgresql.org/docs/10/ddl-partitioning.html

Transparency: I work for Timescale

chartpath4y ago

Maybe something like https://github.com/citusdata/pg_cron or if it's not a rolling time window, just trigger functions that get called whenever some expression in SQL is true.

mvc4y ago

Record an event log and reliably connect it to a variety of 3rd party sinks using off-the-shelf services

bsaul4y ago

using a sql db for push/pop semantic feels like using a hammer to squash a bug.. How would you model queues & partitions with ordering guarantees with pg ?

throwaway815234y ago

With transactions, and stored procedures if that helps ;). Redis also seems well suited to the use cases I've seen for Kafka. Kafka must have capabilities beyond those use cases, and I've sometimes wondered what they are.

1 more reply

anotherhue4y ago

sql has many conveniences for doing so, it wouldn't be much work.

> using a hammer to squash a bug..

Agreed - but Kafka is a much much bigger hammer. SES/Az Queues are also good choices.

SergeAx4y ago

Read 500m messages in less than an hour?

zbentley4y ago

It depends. Are your reads also writes (e.g. simulating a job queue with a "claimed" state update or a "select for update ... skip locked")? If not, 500m should be no sweat, even on crappy hardware with minimal tuning. If so, things get a little trickier, and you may have to introduce other tools or a clustering solution to get that throughput with postgres.

bsaul4y ago· 7 in thread

Very interested to hear how people here overcome the limits of kafka for ordered events delivery in real world, and what those were.

luxurytent4y ago

I feel as if you're using Kafka and expect guaranteed ordering, then you're using the wrong tool. At best you have guaranteed ordering per partition but then you've tied your ordering/keying strategy to the amount of partitions you've enabled ... which may not ideal.

But, that's speaking from my light experience with it. I'm also curious if there's a better way :-)

jgraettinger14y ago

Not for Kafka, but we are building Flow [1] to offer deterministic ordering, even across multiple logical and physical partitions, and regardless of whether you're back-filling over history or processing in real-time.

This ends up being required for one of our architectural goals, which is fully repeatable transformations: You must be able to model a transactional decision as a Flow derivation (like "does account X have funds to transfer Y to Z ?", and if you create a _copy_ of that derivation months later, get the exact same result.

Under the hood (and simplifying a bit) Flow always does a streaming shuffled read to map events from partitions to task shards, and each shard maintains a min-heap to process events in their ~wall-time order.

This also avoids the common "Tyranny of Partitioning", where your upstream partitioning parallelism N also locks you into that same task shard parallelism -- a big problem if tasks manage a lot of state. With a read-time shuffle, you can scale them independently.

[1]: https://github.com/estuary/flow

BFLpL0QNek4y ago

It depends on what the events are, how they are structured.

You get guaranteed ordering at the partition level.

Items are partitioned by key so you also get guaranteed ordering for a key.

If you have guaranteed ordering for a key you can’t get total ordering across all keys but you can get eventual consistency across the keys.

Ultimately if you want ordering you have to design around being eventually consistent.

I don’t read a lot of papers but Leslie Lamports Time, Clocks, and the Ordering of Events in a Distributed System gave me a lot of insight in to the constraints. https://lamport.azurewebsites.net/pubs/time-clocks.pdf

sumtechguy4y ago

For kafka the default is round robin in each partition. A hash key can let you direct the work to particular partitions. Each partition is guaranteed ordering. Also only one consumer in a consumer group can remove an item from a partition at a time. No two consumers in a consumer group will get the same message.

1 more reply

orobinson4y ago

At lower data volumes (<10,000 events per minute) it’s perfectly feasible to just use single partition topics and then ordered event delivery is no problem at all. If a consuming service has processing times that means horizontal scaling is necessary then the topic can be repartitioned into a new topic with multiple partitions and the processing application can handle sorting the outputted data to some SLA.

csours4y ago

put a timestamp in the message. use a conflict free replicated data type

sethammons4y ago

If you need a guaranteed ordering, timestamps and distributed systems are not friends. See logical / vector clocks.

fafle4y ago· 4 in thread

The issue of running a transaction that spans multiple heterogeneous systems is usually solved with a 2 phase commit. The "jobs" abstraction from the article looks similar to the "coordinator" in 2PC. The article does not talk about how they achieve fault tolerance in case the "job" crashes inbetween the two transactions. Postgres supports the XA standard, which might help with this. Kafka does not support it.

jgraettinger14y ago

I can't speak to their solution, but when solving an equivalent problem within Gazette, where you desire a distributed transaction that includes both a) published downstream messages, and b) state mutations in a DB, the solution is to 1) write downstream messages marked as pending a future ACK, and 2) encode the ACK you _intend_ to write into the checkpoint itself.

Commit the checkpoint alongside state mutations in a single store transaction. Only then do you publish ACKs to all of the downstream streams.

Of course, you can fail immediately after commit but before you get around to publishing all of those ACKS. So, on recovery, the first thing a task assignment does is publish (or re-publish) the ACKs encoded in the recovered checkpoint. This will either 1) provide a first notification that a commit occurred, or 2) be an effective no-op because the ACK was already observed, or 3) roll-back pending messages of a partial, failed transaction.

More details: https://gazette.readthedocs.io/en/latest/architecture-exactl...

eternalban4y ago

IIRC ~2 decades ago we were dequeueing from JMS, updating RDBMS, and then enqueuing all under the cover of JTA (Java Transaction API) for atomic ops.

https://docs.oracle.com/en/middleware/fusion-middleware/12.2...

Using a very broad definition of ‘noSQL’ approach that would include solutions like Kafka, the issue becomes clear: A 2PC or ‘distributed transaction manager’ approach ala JTA comes with a performance/scalability cost — arguably a non-issue for most companies who don’t operate at LinkedIn scale (where Kafka was created).

zeckalpha4y ago

And MySQL didn’t yet support transactions!

1 more reply

nitwit0054y ago

The solution I've seen is to write the message you want to send to the DB along with the transaction, and have some separate thread that tries to send the messages to Kafka.

Although, from various code bases I've seen, a lot of people just don't seem to worry about the possibility of data loss.

HelloNurse4y ago· 3 in thread

The only time I used Kafka, it was involuntarily (included for the sake of fashion in some complicated IBM product, where it hid among WebSphere, DB2 and other bigger elephants) and it ran my server out of disk space because due to a bug ridiculously massive temporary files weren't erased. Needless to say, I wasn't impressed: just one more hazard to worry about.

kitd4y ago

due to a bug

Data retention time is Kafka config 101. Are you sure it was a bug?

geodel4y ago

Considering how half-assed Kafka is in general, that it needs all clients code changes when Kafka servers are upgraded. It is very likely that user hit Kafka bug.

3 more replies

EdwardDiego4y ago

What was the bug, out of curiosity?

LgWoodenBadger4y ago· 2 in thread

I didn't quite follow their explanation for why producing to Kafka first didn't/wouldn't work for them (db state potentially being out of sync requiring continuous messaging until fixed).

anentropic4y ago

it's a chicken and egg problem

you can either send a kafka message but potentially not commit the db transaction (i.e. an event is published for which the action did not actually occur) or commit the db transaction and potentially not send the kafka message

it sounds like they implemented something like the Transactional Outbox pattern https://microservices.io/patterns/data/transactional-outbox....

i.e. you use the db transaction to also commit a record of your intent to send a kafka message - you can then move the actual event sending to a separate process and implement at-least-once semantics

This is the job queueing system they described in the article

LgWoodenBadger4y ago

Their solution seems like a "produce to Kafka first" but with extra steps.

Regarding:

When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved

I don't understand where either 1) the incorrect state or 2) the need to continuously send duplicate messages come from.

Regarding:

The Job might still fail during execution, in which case it’s retried with exponential backoff, but at least no updates are lost. While the issue persists, further state change messages will be queued up also as Jobs (with same group value). Once the (transient) issue resolves, and we can again produce messages to Kafka, the updates would go out in logical order for the rest of the system and eventually everyone would be in sync.

This is the part that is equivalent to Kafka-first, except with all the extra steps of a job scheduling, grouping, tracking, and execution framework on top of it.

1 more reply

jgraettinger14y ago· 1 in thread

If you're in the Go ecosystem, Gazette [0] offers transactional integrations [1] with remote DB's for stateful processing pipelines, as well as local stores for embedded in-process state management.

It also natively stores data as files in cloud storage. Brokers are ephemeral, you don't need to migrate data between them, and you're not constrained by their disk size. Gazette defaults to exactly-once semantics, and has stronger replication guarantees (your R factor is your R factor, period -- no "in sync replicas").

Estuary Flow [2] is building on Gazette as an implementation detail to offer end-to-end integrations with external SaaS & DB's for building real-time dataflows, as a managed service.

[0]: https://github.com/gazette/core [1]: https://gazette.readthedocs.io/en/latest/consumers-concepts.... [2]: https://github.com/estuary/flow

ivanr4y ago

Small suggestion: If Gazette is ready for a wider adoption, it may be useful to bump it up to 1.0 as a signal of confidence.

staticassertion4y ago· 1 in thread

(1) seems best solved by having a 'on heavy task, publish to a secondary topic'. This is good if you have flaky messages that need to be retried in the background, without blocking all of your 'good' messages.

(2) this problem should be avoided in general by just having idempotent services. Just as a hard restriction, forever, build services to be idempotent. It should be the exception to have a non-idempotent service, and it should be carefully understood.

That said, if you have (1) as a consistent issue, like if every message is flaky, kafka isn't the right solution. Postgres-based queues are perfect for this because you can examine the table as a whole, making more informed decisions about what you want to process (or not process).

rajin4444y ago

This seems like a much simpler solution. The "heavy task queue" would process tasks that could simply be retried until they're done.

Maybe I'm misunderstanding the article, but having "Job tasks" both insert another Job to run as well as updating DB state, and then having the executor pick up the previously inserted Job (whos only purpose is to send a kafka message) seems overly complex. I'm having trouble seeing why this is needed.

derekperkins4y ago

Solving both these problems is the best hidden feature of Vitess - Messaging. You can ack a message, do data work, and add to a destination queue all in a single transaction. You can do selective requeuing, and infinite parallelization, since it isn't sequential processing. It's easy to introspect, debug, and have metrics for since it's just MySQL. Vitess lets you shard horizontally, so it can handle any QPS, and has at YouTube. It also supports native message priority. All of that, plus your infrastructure is simplified because you don't have to maintain a separate data store from your main RDBMS. Highly recommended. https://vitess.io/docs/reference/features/messaging/

jpgvm4y ago

What you want is called Apache Pulsar. Log when you need it, work queue when you need that instead.

As for ensuring transactional consistency that isn't so bad, you can use an table to track offset inserts making sure you verify from that before you update consumer offsets (or Pulsar subscription if you go that route).

mfateev4y ago

temporal.io provides much higher level abstraction for building asynchronous microservices. It allows one to model async invocations as synchronous blocking calls of any duration (months for example). And the state updates and queueing are transactional out of the box.

Here is an example using Typescript SDK:

   async function main(userId, intervals){ 
     // Send reminder emails, e.g. after 1, 7, and 30 days
   
     for (const interval of intervals) {
       await sleep(interval * DAYS);
       // can take hours if the downstream service is down
       await activities.sendEmail(interval, userId); 
     } 
     // Easily cancelled when user unsubscribes
   }

Disclaimer: I'm one of the creators of the project.

sam0x174y ago

Couldn't the "state" issue be solved simply by enclosing the database save and kafka message send in the same database transaction block and only doing the kafka send if it reaches that part of the code?

at0mic224y ago

Should you not consider every article as a divine insight. Sixfold is nowhere close from being a classy tech comp.

j / k navigate · click thread line to collapse

93 comments

68 comments · 12 top-level

anotherhue4y ago· 38 in thread

I ran a few dozen kafka clusters at MegaCorp in a previous life.

My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

ThinkBeat4y ago

This is exactly how I feel about it.

A while back I was on team building a non-critical, low volume application. It just involves people sending a message basically. (there is more too it).

The consultants said he had to use Kafka because the messages could come in really fast.

I said we should stick with Postgres.

No, they said, we really need Kakfa to be able to handle this.

No, we could not risk it, when we use Kafka we are safe.

Took it to management, Kafka won since Buzzword.

Now of course we have to write a process to feed the data into Postgres. After all its what everything else depends on

jghn4y ago

And thus we have a world where people have business needs that could be powered by my low end laptop but solutions inspired by the megacorps.

2 more replies

sparsely4y ago

There are various Kafka to Postgres adaptors. Of course, now you're running 3 bits of software, with 3 bottlenecks, instead of just 1.

dtech4y ago

i_like_waiting4y ago

But can you stream data from lets say MS SQL directly into Postgres? Easiest way I found is Kafka, I would love some simple python script instead

1 more reply

superyesh4y ago

>My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

Eventually you will run out of disk, start throwing more in.

Then end up in VACCUUM hell all while having a single point of failure.

While I agree Kafka has its own issues, it is an amazing tool to a real scale problem.

dwohnitmok4y ago

I think anotherhue would agree that half a million write requests per second counts as a valid answer to "you can't do what you need with a beefy Postgres," but that is also a minority of situations.

1 more reply

fernandotakai4y ago

at least for me, every system has its place.

i love posgresql, but i would not use it to replace a rabbitmq instance -- one is an RDBMS, the other is a queue/event system.

"oh but psql can pretend to be kafka/rabbitmq!" -- sure, but then you need to add tooling to it, create libraries to handle it, and handle all the edge cases.

with rmq/kafka, there already a bunch of tools to handle the exact case of a queue/event system.

1 more reply

doliveira4y ago

anotherhue4y ago

Indeed! That sounds absolutely like it requires a real pub/sub. 'Actually needing it' is the exception in my experience though.

sumtechguy4y ago

I have done both patterns.

tengbretson4y ago

1 more reply

BiteCode_dev4y ago

Well, if you want events, you have lighter alternative. Redis pub/sub, crossbar... Even rabbitMQ is lighter.

If you need queuing, you have libs like celery.

You don't need to go full kafka

3 more replies

lmilcin4y ago

Just because you can do it with Postgres doesn't mean it is the best tool for the job.

Sometimes the restrictions placed on the user are as important. Kafka presents a specific interface to the user that causes users to build their applications in certain way.

If developers had discipline they could of course crate tables with appendable logs of data, marked with a partition, that consumers could process from with basically same guarantees as with Kafka.

But that is not how it works in reality.

mrweasel4y ago

It’s not that there’s anything wrong with Kafka, it’s a very good product and extremely robust. Same with Kubernetes, it has it uses and I can’t fault anyone for having it as a consideration.

sparsely4y ago

gilbetron4y ago

What velocity have you achieved with postgres, in terms of # messages/sec where messages could range between 1KB-100KB in size?

lima4y ago

Yup. That's what Kafka excels at and where it scales to throughput way beyond Postgres.

But there are many many projects where Kafka is used for low value event sourcing stuff where a SQL DB could be easier.

afandian4y ago

But for a production-grade version of the system I'm going with SQL and, where needed, IaC-defined SQS.

aqme284y ago

> Show me that you can't do what you need with a beefy Postgres.

I've found this question very useful when pitched any esoteric database.

merarischroeder4y ago

I agree so much I wrote an article about that https://link.medium.com/eQqVhuvRtkb

"You really can replace Kafka with a database."

ceencee4y ago

Cederfjard4y ago

They didn’t say that the Kafka clusters they personally ran could have been handled with Postgres instead.

They first gave their credentials by mentioning their experience.

anotherhue4y ago

A major ecommerce site. We had hundreds of thousands of messages/s but for most use cases YAGNI.

alephu54y ago

Why Postgres? Why not redis, rabbitMQ or even Kafka itself?

hardwaresofton4y ago

Not those other tools because you can't achieve Postgres functionality with those other tools generally.

Postgres can pretend to be Redis, RabbitMQ, and Kafka, but redis, RabbitMQ, and Kafka would have a hard time pretending to be Postgres.

1 more reply

capableweb4y ago

> My answer to anyone who asks for kafka: Show me that you can't do what you need with a beefy Postgres.

...

You could probably hack any database to perform any task, but why would you? Use the right tool for the right task, not one tool for all tasks.

If the right tool is a relational database, then use Postgres/$OTHER-DATABASE

If the right tool is distributed, partitioned and replicated commit log service, then use Kafka/$OTHER-COMMIT-LOG

Not sure why people get so emotional about the technologies the know the best. Sure you could hack Postgres to be a replicated commit log, but I'm sure it'll be easier to just throw in Kafka instead.

chartpath4y ago

Something I find peculiar about this truth is that Postgres itself is based on the write ahead log structure for sequencing statements, replication, audit logging.

It feels like two different lenses onto the same reality.

silisili4y ago

dionian4y ago

Honest question, how do you expire content in Postgres? Every time I start to use it for ephemeral data I start to wonder if I should have used TimescaleDB or if I should be using something else...

LoriP4y ago

Transparency: I work for Timescale

chartpath4y ago

Maybe something like https://github.com/citusdata/pg_cron or if it's not a rolling time window, just trigger functions that get called whenever some expression in SQL is true.

mvc4y ago

Record an event log and reliably connect it to a variety of 3rd party sinks using off-the-shelf services

bsaul4y ago

using a sql db for push/pop semantic feels like using a hammer to squash a bug.. How would you model queues & partitions with ordering guarantees with pg ?

throwaway815234y ago

1 more reply

anotherhue4y ago

sql has many conveniences for doing so, it wouldn't be much work.

> using a hammer to squash a bug..

Agreed - but Kafka is a much much bigger hammer. SES/Az Queues are also good choices.

SergeAx4y ago

Read 500m messages in less than an hour?

zbentley4y ago

bsaul4y ago· 7 in thread

Very interested to hear how people here overcome the limits of kafka for ordered events delivery in real world, and what those were.

luxurytent4y ago

But, that's speaking from my light experience with it. I'm also curious if there's a better way :-)

jgraettinger14y ago

[1]: https://github.com/estuary/flow

BFLpL0QNek4y ago

It depends on what the events are, how they are structured.

You get guaranteed ordering at the partition level.

Items are partitioned by key so you also get guaranteed ordering for a key.

If you have guaranteed ordering for a key you can’t get total ordering across all keys but you can get eventual consistency across the keys.

Ultimately if you want ordering you have to design around being eventually consistent.

sumtechguy4y ago

1 more reply

orobinson4y ago

csours4y ago

put a timestamp in the message. use a conflict free replicated data type

sethammons4y ago

If you need a guaranteed ordering, timestamps and distributed systems are not friends. See logical / vector clocks.

fafle4y ago· 4 in thread

jgraettinger14y ago

Commit the checkpoint alongside state mutations in a single store transaction. Only then do you publish ACKs to all of the downstream streams.

More details: https://gazette.readthedocs.io/en/latest/architecture-exactl...

eternalban4y ago

IIRC ~2 decades ago we were dequeueing from JMS, updating RDBMS, and then enqueuing all under the cover of JTA (Java Transaction API) for atomic ops.

https://docs.oracle.com/en/middleware/fusion-middleware/12.2...

zeckalpha4y ago

And MySQL didn’t yet support transactions!

1 more reply

nitwit0054y ago

The solution I've seen is to write the message you want to send to the DB along with the transaction, and have some separate thread that tries to send the messages to Kafka.

Although, from various code bases I've seen, a lot of people just don't seem to worry about the possibility of data loss.

HelloNurse4y ago· 3 in thread

kitd4y ago

due to a bug

Data retention time is Kafka config 101. Are you sure it was a bug?

geodel4y ago

Considering how half-assed Kafka is in general, that it needs all clients code changes when Kafka servers are upgraded. It is very likely that user hit Kafka bug.

3 more replies

EdwardDiego4y ago

What was the bug, out of curiosity?

LgWoodenBadger4y ago· 2 in thread

I didn't quite follow their explanation for why producing to Kafka first didn't/wouldn't work for them (db state potentially being out of sync requiring continuous messaging until fixed).

anentropic4y ago

it's a chicken and egg problem

it sounds like they implemented something like the Transactional Outbox pattern https://microservices.io/patterns/data/transactional-outbox....

i.e. you use the db transaction to also commit a record of your intent to send a kafka message - you can then move the actual event sending to a separate process and implement at-least-once semantics

This is the job queueing system they described in the article

LgWoodenBadger4y ago

Their solution seems like a "produce to Kafka first" but with extra steps.

Regarding:

I don't understand where either 1) the incorrect state or 2) the need to continuously send duplicate messages come from.

Regarding:

This is the part that is equivalent to Kafka-first, except with all the extra steps of a job scheduling, grouping, tracking, and execution framework on top of it.

1 more reply

jgraettinger14y ago· 1 in thread

If you're in the Go ecosystem, Gazette [0] offers transactional integrations [1] with remote DB's for stateful processing pipelines, as well as local stores for embedded in-process state management.

Estuary Flow [2] is building on Gazette as an implementation detail to offer end-to-end integrations with external SaaS & DB's for building real-time dataflows, as a managed service.

[0]: https://github.com/gazette/core [1]: https://gazette.readthedocs.io/en/latest/consumers-concepts.... [2]: https://github.com/estuary/flow

ivanr4y ago

Small suggestion: If Gazette is ready for a wider adoption, it may be useful to bump it up to 1.0 as a signal of confidence.

staticassertion4y ago· 1 in thread

rajin4444y ago

This seems like a much simpler solution. The "heavy task queue" would process tasks that could simply be retried until they're done.

derekperkins4y ago

jpgvm4y ago

What you want is called Apache Pulsar. Log when you need it, work queue when you need that instead.

mfateev4y ago

Here is an example using Typescript SDK:

   async function main(userId, intervals){ 
     // Send reminder emails, e.g. after 1, 7, and 30 days
   
     for (const interval of intervals) {
       await sleep(interval * DAYS);
       // can take hours if the downstream service is down
       await activities.sendEmail(interval, userId); 
     } 
     // Easily cancelled when user unsubscribes
   }

Disclaimer: I'm one of the creators of the project.

sam0x174y ago

at0mic224y ago

Should you not consider every article as a divine insight. Sixfold is nowhere close from being a classy tech comp.

j / k navigate · click thread line to collapse