I'm currently dealing with a queuing-related issue.
I have a series of tasks running across servers that consume from a queue and run a task.
Often, these tasks die mid-execution (but can be resumed by any other server). So, the queue is a database, and the running tasks "touch" a timestamp in the database if they are still executing. When a database document hasn't been updated for a while, the "consumption query" makes it so that it is 'redelivered' to an available server listening to the "queue".
Of course, this is subpar, but we haven't yet come across an elegant (and not too over-engineered) way to replace this.
It's built into some RDBMS. SQL Server has READPAST [1, 2], so you can do:
BEGIN TRANSACTION;
DELETE TOP (1) QueueTable WITH (READPAST) OUTPUT deleted.* ORDER BY QueueId;
-- Do your work
COMMIT TRANSACTION;
And if your process dies midway through, the transaction is rolled back and the row is immediately visible to another worker.[1] https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-tra... READPAST is primarily used to reduce locking contention when implementing a work queue that uses a SQL Server table. A queue reader that uses READPAST skips past queue entries locked by other transactions to the next available queue entry, without having to wait until the other transactions release their locks.
[2] https://docs.microsoft.com/en-us/sql/t-sql/queries/output-cl...
while True:
message = queue.consume()
process(message)
message.ack()
RabbitMQ will automatically put the message back on the queue when the consumer that pulled it disconnects w/o acknowledging it first. Alternatively, you could explicitly reject messages: while True:
message = queue.consume()
try:
process(message)
except Exception:
message.reject()
raise
If you're using Python you might want to check out Dramatiq[1][1]: https://dramatiq.io
With RabbitMQ, I'd need to ack rightaway, otherwise it would re-send the message again after a while.
An item is queued with a specific visibility timeout (it should take 10 seconds to process so we give it 10 minutes), a job picks up that item and it disappears from the queue for 10 minutes. If the job succeeds, the job explicitly deletes that queue item. If the job fails, the item reappears after 10 minutes for another instance of the job to pick up.
We've been using it since April 2014. There's more information built into the queue item, like the number of times dequeued and original queued timestamp, so we can send trigger alerts if items are getting old.
> The first is to be able to track each message individually (i.e. not using a single commit offset) to make suitable for asynchronous tasks. > The second is the ability to schedule messages to be consumed in the future. This make it suitable for retries.
That's a start - I'd love to see how did you solve the problem? How your solution compares to other similar products? Why should I care about stuff you are mentioning? It is not an issue of not mature product - I think one should start with defining exactly and precisely what is being solved here.
When solving a technical problem you always have to tailor your solution to a specific set of requirements and it is never like: "distributed, horizontally scalable, persistent, time sorted message queue." So please stop using such buzzwords.
The most convincing technologies have clear, simple value that's immediately apparent (like zeromq not needing a central coordinator/exchange), or some real-world use-cases which have been improved by using this technology (actual, already-happened use-cases, not hypothetical: I'm more convinced by words from people who've integrated with the product rather than the authors + their innate bias).
I would have put the title as, 'open-source proof-of-concept messaging queue (Go, single author)'
I have had this too many times, exactly what OP said, even in areas I am strong in... often I have no clue or little clue what to expect without doing additional research.
What do you feel is missing or miscommunicated?
This would be a clear violation of the unwritten open source policy to never mention similar/competing projects!
I'm not saying I wouldn't give this project a closer look, but I would much rather a product make it painfully obvious what compromises were made to offer its availability.
My instinct is: either you need strict constraints on ordering and delivery, in which case you use rabbit, or you need at-least-once-no-matter-what semantics, in which case you use something else and make your app less fragile.
> something else
https://www.confluent.io/blog/exactly-once-semantics-are-pos...
It's really hard to be extremely available and to also guarantee there won't be dupes or out-of-order messages. Like databases, someone saying they offer both is usually hiding a trade-off.
EDIT: I think here "really hard" is a polite way of saying "impossible"
i can sorta see how i would build a message queue that ordered by time, but lost messages would be a problem and so forth - just to get n idea of the trade offs
It's worth mentioning the new Apache Pulsar messaging system which can replace Kafka with pub/sub and queueing semantics while providing better scalability and per-message acks, probably better suited to those who want a combined system.
If you dont need/want that, then its just a single ZK cluster as with Kafka or anything else. ZK + Brokers + Bookies = Pulsar.
That's a bit unfair given that the project didn't mention Kafka in it's README and different products have different suitability at different scale and this could simply be a "if your traffic is low and you need this functionality, this will suffice" thing or just an academic interest in producing an ordered distributed message queue.
But as you're asking: Proven ability to consume ~5+ million messages per second with a similar or less hardware requirement than Kafka and high reliability. Well document set of edge cases / compromises where applicable and a high degree of observability. Well understood operational requirements, and SRE runbooks (or just a lot of Github issues that go into how to handle various scenario). An active community of people to assist, and more than 1 committer.
That's the "off the top of my head" thing. YMMV.
> Well document set of edge cases / compromises where applicable and a high degree of observability. Well understood operational requirements, and SRE runbooks (or just a lot of Github issues that go into how to handle various scenario). An active community of people to assist, and more than 1 committer.
Are you talking about Sandglass or Kafka? Because Sandglass seems 3 months old, has 1 contributor and is featured here as "Show HN"... So it probably isn't as mature solution as Kafka is. Or am I missing something?
I'd love to see comparative stress tests, Jepsen-like, to assess the ability to survive partitions, corruption, node restart and node loss, to better estimate the probability that a job gets lost.
[1]: http://nats.io/documentation/streaming/nats-streaming-intro/
https://github.com/celrenheit/sandglass/blob/dev/storage/roc...
The first is to be able to track each message individually (i.e. not using a single commit offset) to make suitable for asynchronous tasks.
The second is the ability to schedule messages to be consumed in the future. This make it suitable for retries.