Postgres: A better message queue than Kafka? (opens in new tab)

(dagster.io)

65 pointstylermcginnis3y ago45 comments

45 comments

39 comments · 14 top-level

andrewstuart3y ago· 8 in thread

Not just Postgres.

You can do exactly this with MySQL and SQL server too because they both support SKIP LOCKED.

Interestingly, the plain old file system on Linux also makes the basis of a perfectly acceptable message queue for many use cases - the thing that makes it work is that the file move operation is atomic. Atomic moves are what make queuing systems possible.

You could write a file system based message queue in 100 lines of async python, which I did here:

https://github.com/bootrino/arniesmtpbufferserver

File system based message queues can be written in any language, extremely simple and, most importantly - zero configuration. One of the most frustrating things about queuing systems is configuration - that includes database backed queuing systems. They can also be fast - I wrote one in Rust which maxed out the hard disk's random write capability well before maxing out the CPU - from memory it beat most of the common queuing systems in terms of messages per second.

Not all use cases for queues need to be able to globally distributed messages queues with the sort of guarantees needed for financial transaction processing. I would suggest to you that in fact most queues out there are used as outbound SMTP queues, which are then over engineered to use something like Celery, which is a nightmare to configure and debug.

smeagull3y ago

We found difficulty with the purely advisory locking that Linux had. Also the possibility of network filesystems made it a pain.

Do you have any experience with either?

andrewstuart3y ago

You don't lock, you move the file that it next to be processed. File moves are atomic. You move the file out of the list of files that are being picked up for processing.

Lock free.

andrewstuart3y ago

Network file systems do not support atomic moves, but you should not run such an application on a network file system.

remram3y ago

I also found that advisory locking has a lot of gotchas, especially when used in multithread contexts (apparently you can lose the lock because a different thread closed a different file descriptor on the same file).

hbrn3y ago

What makes it atomic is running publishers and consumers on the same box (since you're sharing filesystem between those).

Also listdir is a big bottleneck here:

    while True:
        # get files in outbox
        files_in_outbox = [f'{PREFIX}/outbox/{x}' for x in os.listdir(f'{PREFIX}/outbox')]

andrewstuart3y ago

>> What makes it atomic is running publishers and consumers on the same box (since you're sharing filesystem between those).

It's the move/rename that is atomic.

https://man7.org/linux/man-pages/man2/rename.2.html

1 more reply

whalesalad3y ago

love to see this kind of hackery

motogpjimbo3y ago

I love to see effective yet brutally-simple "redneck engineering" solutions to software problems, particularly ones that straddle the line between genius and stupidity in a way that makes architecture astronauts feel uncomfortable.

I used to work at an brokerage that worked with a panel of around 12 providers, all of whom offered 5+ products that were updated multiple times per year to meet changing regulatory requirements. When a sales adviser recommended product X from provider Y to a customer, the adviser would then need to fill in an application form that was precisely tailored to that particular revision of that particular product. Bear in mind that these were complex, multi-sectioned forms. Needless to say, this created a huge workload for the devteam to keep track of all the product changes and update the UI accordingly each time.

At some point, someone on the devteam had the genius idea to simply take the PDF application forms from the provider, extract the individual pages as PNGs and overlay HTML form elements on top of them. The provider would essentially be doing our UI design for us. Add in an editor tool so the sales managers could set up the forms themselves and a tagging system so specific fields could be pre-filled from customer data we already had stored in the DB and the devteam's workload dropped by maybe 90%. Simple, stupid perhaps, but effective.

bsaul3y ago· 4 in thread

"The write path. We built a daemon that would select log entries that were older than two weeks, copy them into a file in S3, and delete the rows from the database"

Seriously... How can you ever consider saying "a rdbms is just fine as a kafka alternative" under those conditions ?

xupybd3y ago

How does Kafka solve this problem any better?

Your messages expire so if you want to archive you need a consumer to write them to disk. That sounds very similar.

peterhunt3y ago

Why not?

andrewstuart3y ago

Ideology.

There's alot of people who are ideologically opposed to database backed message queues. They're usually reluctant to give detailed explanations why, because it's an emotional thing.

bsaul3y ago

Because that would be equivalent to saying « rdbms are fine as a kafka alternative provided you build a compensation mechanism for all the things it does worse than kafka». It doesn’t make any sense.

mannyv3y ago· 4 in thread

You can spew messages into sqs then have a lambda on sqs sending to pg.

Funny, we were going to use kafka too, but just sending to mysql worked just fine <shrug>. One day that will change, obviously.

eurasiantiger3y ago

It works fine on the happy path, but if mysql goes down, you lose things. Having a distributed, resilient queue has availability benefits.

pritambaral3y ago

> It works fine on the happy path, but if mysql goes down, you lose things.

Why do you have to lose things if the DB goes down? Agreed, untuned & unconfigured MySQL (and MongoDB) out-of-the-box can lose things due to bugs and design issues, but that is the case even when they are running. However, DBs, in general are made precisely for the purpose of safely storing things and not losing them.

OTOH, the number of Kafka setups I have seen that'd lose things when something goes down ... maybe this is not a guaranteed win for the Kafka side of arguments.

> Having a distributed, resilient queue has availability benefits.

High availability is not a function exclusive to Kafka. On the other hand, there's some functions that may come in handy to use in a queue that Kafka simply cannot provide, but DBs can. Off the top of my head: ACID, instant scalability (both up and down) of consumer groups, and the sheer flexibility (and power) that comes with a DB in general.

----

Overall, there's some merits to using a distributed log as a message queue, sure, but there are also merits to using a DB for that.

1 more reply

manv13y ago

That's why we have an RDS cluster. We're not multi-region because, frankly, if the region barfs we can blame it on Amazon. And in any case we'll still deliver just fine as long as cloudfront keeps going.

manv13y ago

Also, we use SQS as an intermediary between our logging and mysql, so even if mysql goes down SQS will keep things for...a while.

hbrn3y ago· 3 in thread

A controversial, but a very pragmatic take.

Queues are great for semi-infinite scalability, but you rarely need it.

There's numerous subtle benefits to using db compared to regular message queues that are often overlooked.

Being able to delete, reorder, or edit specific messages can be a lifesaver when things go wrong.

Most common issue with queue-based systems is getting overwhelmed with messages. Either your consumers went down. Or your producers had a bug / executed too frequently.

Being able to recover with a simple SQL query is a blessing.

hbrn3y ago

Here's just a few examples from my experience:

1. Huge number of messages from test system were accidentally inserted into production.

Queue solution: disable consumers, move messages to a temporary queue while filtering them, move messages back to the old queue, enable consumers

DB solution: just delete rogue messages

2. We want to store some of the messages, but we're not ready to process them yet

Queue solution: create a separate queue for each message type, insert messages to different queues, manually move them when you're ready to process them (and keep in mind that not every queue can persist messages forever, SQS for example can't hold messages longer than 14 days)

DB solution: just skip those messages while processing

3. Consumers went down and the queue now contains a big number of duplicate messages. While it was fine to just wait a couple hours to let it stabilize, a whale customer started complaining

Queue solution: none (any hacky solution would take longer than it takes for the system to naturally stabilize)

DB solution: move whale customer messages to the front of the queue

EdwardDiego3y ago

1. Kafka isn't a message queue, but advance consumer offsets past bad data. Hit person who inserted test data into prod with a cricket bat and then hit person who designed a system where they could do that with the same cricket bat, but harder.

2. Kafka - do what you did with your DB consumers.

3. Kafka - consumers going down can't cause duplicates, what's even going on with your queue?

1 more reply

whalesalad3y ago

Throughput is often where an RDBMS falls over here. Unless you have partitions/isolation around your high throughput workloads (or throw tons of horsepower/$$$ at it) you can’t achieve the same ops/sec as a dedicated queueing mechanism.

But you do have a lot of valid points, particularly about failure recovery. My approach tends to involve a hybrid of the two if high speed is a necessary component somewhere. The system can restore its runtime state from a db, but the active mechanics of it are running on message passing of some kind.

lightbendover3y ago· 3 in thread

File this under "If a headline is asking a question, then the answer is NO." Honestly, I'm not even sure what point the author is trying to make besides "anecdotally and at low scale, unusual-for-the-purpose technology X solved problem Y." A near-infinite number of bad patterns can solve problems along happy paths and resolve plenty of edge cases to boot. 99.9% availability was a goal post here? There are systems where 7 9s is unacceptable. The author didn't even provide data backing the measurement goals.

xupybd3y ago

This solves their problems not some theoretical other problem.

They avoided over engineering and building more complexity than required.

hbrn3y ago

What are those mythical systems where seven nines is unacceptable? What are the chances you're going to work on one of those?

Google Spanner for example is up to five nines. Are there a lot of systems that need to be three orders of magnitude more reliable than Google Ads?

lightbendover3y ago

You're right, nothing is more important than serving ads.

1 more reply

srajabi3y ago· 2 in thread

I don’t think this article made a compelling reason not to use Kafka. In fact it may have made the opposite point.

Wouldn’t it have been easier to just use Kafka?

greymalik3y ago

From TFA Postgres was an established tool in their org they already had expertise with. The author argues that adopting a new tool would have been risky and costly.

manv13y ago

Learning how to deploy something at scale costs time.

EdwardDiego3y ago· 1 in thread

> Postgres: a better message queue than Kafka?

"We didn't actually evaluate Kafka for this use case, because we know Postgres, so we used that."

Weird article angle.

0x6c6f6c3y ago

I suppose they're asking the question too here

chrsig3y ago

Something I think needs to be stated: Kafka is not a message queue. If you need things like work stealing, you're going to have a bad time. It doesn't seem to apply in this case, but the terminology misuse is going to lead some poor developer astray.

It seems like they only tried postgres? so they didn't compare it with kafka at all....so yeah, you can make a message queue or log table in a rdbms.

benadam113y ago

"Many haters want you to buy their message queue product" lol

hk13373y ago

I don’t want to doubt Postgres’ effectiveness as a message queue (however difficult it may be) but I have to wonder if maybe they were doing something weird or wrong in the consumer for it to be ineffective?

Maybe it’s not that Postgres is a better than Kafka as much as Postgres is a better solution for what they were trying to do and use Kafka?

amai3y ago

“ Today these are not major problems for us, and frankly, I’m not sure how we’ll tackle these problems. Maybe we’ll grow multi-region support and buy really big Postgres boxes. Maybe we’ll move to sharded Postgres. Maybe we’ll switch to Kafka or some other distributed queue. ”

LOL

ivankelly3y ago

Since they're already doing offload to S3 for older log messages, it seems like they could do this from the get-go and save a bunch of storage and interzone transfer costs in RDS. Depends on the log message sizes though of course.

kennethh3y ago

There is also commercial products like NServiceBus that can use SQL Server as a connector to provide storage. I always used to think about doing this in the database as poor mans message solution. Perfect for small projects.

steeve3y ago

Honestly that's comparing apples and oranges. Also, that title is clickbait. It feels like a PR article disguised as a tech article.

j / k navigate · click thread line to collapse

45 comments

39 comments · 14 top-level

andrewstuart3y ago· 8 in thread

Not just Postgres.

You can do exactly this with MySQL and SQL server too because they both support SKIP LOCKED.

You could write a file system based message queue in 100 lines of async python, which I did here:

https://github.com/bootrino/arniesmtpbufferserver

smeagull3y ago

We found difficulty with the purely advisory locking that Linux had. Also the possibility of network filesystems made it a pain.

Do you have any experience with either?

andrewstuart3y ago

You don't lock, you move the file that it next to be processed. File moves are atomic. You move the file out of the list of files that are being picked up for processing.

Lock free.

andrewstuart3y ago

Network file systems do not support atomic moves, but you should not run such an application on a network file system.

remram3y ago

hbrn3y ago

What makes it atomic is running publishers and consumers on the same box (since you're sharing filesystem between those).

Also listdir is a big bottleneck here:

    while True:
        # get files in outbox
        files_in_outbox = [f'{PREFIX}/outbox/{x}' for x in os.listdir(f'{PREFIX}/outbox')]

andrewstuart3y ago

>> What makes it atomic is running publishers and consumers on the same box (since you're sharing filesystem between those).

It's the move/rename that is atomic.

https://man7.org/linux/man-pages/man2/rename.2.html

1 more reply

whalesalad3y ago

love to see this kind of hackery

motogpjimbo3y ago

bsaul3y ago· 4 in thread

"The write path. We built a daemon that would select log entries that were older than two weeks, copy them into a file in S3, and delete the rows from the database"

Seriously... How can you ever consider saying "a rdbms is just fine as a kafka alternative" under those conditions ?

xupybd3y ago

How does Kafka solve this problem any better?

Your messages expire so if you want to archive you need a consumer to write them to disk. That sounds very similar.

peterhunt3y ago

Why not?

andrewstuart3y ago

Ideology.

There's alot of people who are ideologically opposed to database backed message queues. They're usually reluctant to give detailed explanations why, because it's an emotional thing.

bsaul3y ago

mannyv3y ago· 4 in thread

You can spew messages into sqs then have a lambda on sqs sending to pg.

Funny, we were going to use kafka too, but just sending to mysql worked just fine <shrug>. One day that will change, obviously.

eurasiantiger3y ago

It works fine on the happy path, but if mysql goes down, you lose things. Having a distributed, resilient queue has availability benefits.

pritambaral3y ago

> It works fine on the happy path, but if mysql goes down, you lose things.

OTOH, the number of Kafka setups I have seen that'd lose things when something goes down ... maybe this is not a guaranteed win for the Kafka side of arguments.

> Having a distributed, resilient queue has availability benefits.

----

Overall, there's some merits to using a distributed log as a message queue, sure, but there are also merits to using a DB for that.

1 more reply

manv13y ago

Also, we use SQS as an intermediary between our logging and mysql, so even if mysql goes down SQS will keep things for...a while.

hbrn3y ago· 3 in thread

A controversial, but a very pragmatic take.

Queues are great for semi-infinite scalability, but you rarely need it.

There's numerous subtle benefits to using db compared to regular message queues that are often overlooked.

Being able to delete, reorder, or edit specific messages can be a lifesaver when things go wrong.

Most common issue with queue-based systems is getting overwhelmed with messages. Either your consumers went down. Or your producers had a bug / executed too frequently.

Being able to recover with a simple SQL query is a blessing.

hbrn3y ago

Here's just a few examples from my experience:

1. Huge number of messages from test system were accidentally inserted into production.

Queue solution: disable consumers, move messages to a temporary queue while filtering them, move messages back to the old queue, enable consumers

DB solution: just delete rogue messages

2. We want to store some of the messages, but we're not ready to process them yet

DB solution: just skip those messages while processing

3. Consumers went down and the queue now contains a big number of duplicate messages. While it was fine to just wait a couple hours to let it stabilize, a whale customer started complaining

Queue solution: none (any hacky solution would take longer than it takes for the system to naturally stabilize)

DB solution: move whale customer messages to the front of the queue

EdwardDiego3y ago

2. Kafka - do what you did with your DB consumers.

3. Kafka - consumers going down can't cause duplicates, what's even going on with your queue?

1 more reply

whalesalad3y ago

lightbendover3y ago· 3 in thread

xupybd3y ago

This solves their problems not some theoretical other problem.

They avoided over engineering and building more complexity than required.

hbrn3y ago

What are those mythical systems where seven nines is unacceptable? What are the chances you're going to work on one of those?

Google Spanner for example is up to five nines. Are there a lot of systems that need to be three orders of magnitude more reliable than Google Ads?

lightbendover3y ago

You're right, nothing is more important than serving ads.

1 more reply

srajabi3y ago· 2 in thread

I don’t think this article made a compelling reason not to use Kafka. In fact it may have made the opposite point.

Wouldn’t it have been easier to just use Kafka?

greymalik3y ago

From TFA Postgres was an established tool in their org they already had expertise with. The author argues that adopting a new tool would have been risky and costly.

manv13y ago

Learning how to deploy something at scale costs time.

EdwardDiego3y ago· 1 in thread

> Postgres: a better message queue than Kafka?

"We didn't actually evaluate Kafka for this use case, because we know Postgres, so we used that."

Weird article angle.

0x6c6f6c3y ago

I suppose they're asking the question too here

chrsig3y ago

It seems like they only tried postgres? so they didn't compare it with kafka at all....so yeah, you can make a message queue or log table in a rdbms.

benadam113y ago

"Many haters want you to buy their message queue product" lol

hk13373y ago

Maybe it’s not that Postgres is a better than Kafka as much as Postgres is a better solution for what they were trying to do and use Kafka?

amai3y ago

LOL

ivankelly3y ago

kennethh3y ago

steeve3y ago

Honestly that's comparing apples and oranges. Also, that title is clickbait. It feels like a PR article disguised as a tech article.

j / k navigate · click thread line to collapse