It's annoying but also understandable how often people just dump stuff into an unbounded queue and punt on making sure things work until the system is falling down. Often queues are more of a tool developers and operators can use to navigate unknown performance characteristics and scale later than they are a requirement for the actual system itself.
I work on a system that can process a million financial transactions a second, and this principle of "always use bounded queues" has definitely made an incredible impact on the design.
We use bounded queues everywhere — and everything is also static allocation: memory, network resources, disk too, even all the messages that might possibly be needed to execute different variants of the distributed consensus protocol. It's really nice coding on it, because all the limits are clearly defined.
It forces a little more upfront thinking. This thinking can be a little difficult for a day or two, but the net result is a better design.
I love Little's law.
Isn’t a retry just extending the queue to the caller?
So, yes. But that's the point.
It's annoying if it is done by the infrastructure team (I mean, they should know the details of the queue they are managing). It's understandable if it is done by product developers (they are more into "inheritance vs composition" kind of things).
Well, they also serve to hold your state while you add processing capacity.
If you look at a market queues, when they start to grow, management deviates people from non time sensitive tasks into cashiers. The queues make keeps things working during the change.
Equivalent situations happen on software too.
503 explicitly mentions server overload.
Solution: "Step 1. Identify the bottleneck. Step 2: ask the bottleneck for permission to pile more data in"
If your system doesn't have a way to shed load, it will eventually overload.
The problem for many is that when it does finally shed load, the load shedder gets blamed (and turned off). See how many people look for ways to turn off the OOMKiller and how few look to figure out how to get more RAM.
> All of a sudden, the buffers, queues, whatever, can't deal with it anymore. You're in a critical state where you can see smoke rising from your servers, or if in the cloud, things are as bad as usual, but more!
The only way we could stay sane was aggressively following Ferd's rule – nobody got to surface an API unless it surfaced HTTP 429s on overload[1], and if you called a downstream API without checking for and responding to a 429, tough noogies, your requests got dropped and you had to figure out what to do to fix it.
Funny enough, at the time I did software for a manufacturing facility that did high-throughput manufacturing for many varying products. I was lucky enough to do a "gemba walk" where they made us engineers put on hard-hats and eye protection and reminded us who actually made the company money, and by god those guys were doing exactly what we did: they had JIT delivery of raw materials / sub-components to assembly lines, and the delivery of resources to the line was always constrained by a fixed buffer, quite literally "how many units you could put on a shelf". You couldn't just toss more product at a line that wasn't ready for it, it would literally not fit.
Sure, they used different terms than we did – muda and muri and value-stream mapping – but the outcome was very similar. Their version of the NOC was a team of managers who watched the cycle time of each line, how long it took material to be transported, which areas were bottlenecking and which were at risk of running into resource starvation, and they had the ability to balance the hot spots by bursting product in, caching upstream components into hold docks, releasing held work product to downstream lines, or even turning on a flex line to work around longer-term constraints. They rarely had fixed, static bottlenecks, because they optimized the life out of them. Their bottlenecks arose when a shipment didn't arrive (and you better believe those were tracked just as aggressively), a machine broke down, an operator got sick, things like that.
But at the end of the day? Same principle: downstream components would only accept as much as they could process from upstream components; you asked before you pushed.
[1]: and if you said "I swear, you can't overload this service, we don't need to send 429s" a swarm of senior engineers descended upon you and only left until you convinced them they were wrong, or – way more likely – they convinced you.
Put another way, in the second "No good" diagram showing one task being worked on with none in the queue, another task could arrive before the current one is finished.
I suspect the counterargument might be that the task arrival rate is not usually that predictable, but even so, I expect that the amount of variance predicts how true this theory is in practice.
> What’s to prevent the system from bouncing between “1 task in processor & 1 task in queue” and “1 task in processor & 2 tasks in queue” while maintaining 100% utilization?
> Nothing! That could totally happen in a queueing system. However, the arrival process would need to be tuned quite precisely to the processing rate. You would need to watch the status of the processor and, when a task finishes, only then insert a new task into the queue. But this implies a shell game: if you always have a task ready to put into the queue when it needs to be padded back out, then isn’t that task in a sense already “queued”?
> Instead, in queueing theory we usually assume a random arrival process: sometimes a minute may pass between arrivals into the queue; sometimes only a second. So the system can’t bounce for arbitrarily long between 1-in-queue and 2-in-queue states. Eventually, one of two things will randomly occur:
> 1. From the 1-in-queue state, the active task finishes processing before a new task arrives, bringing queue size to 0.
> 2. From the 2-in-queue state, a new task arrives in the queue before the active task finishes, causing the queue size to grow to 3.
> When a system has steady-state 100% utilization, its queue will grow to infinity. (Emph mine.)
This is clearly false, there are clearly counterexamples where the queue does not grow that way. You don’t have control over the arrival rate etc but it could just be your lucky day. It would be more accurate to say you can’t guarantee that either the queue length or wait time are finite. Author didn’t include the words “assume the worst case” or anything like that. I feel like that’s a pretty crucial piece of framing that’s missing.
The uncertainty exists in two places though... It's not just the events coming in but also the time to process each queued event, is another source of uncertainty in the same system that can have the same effect.
The author mentions “M/M/1,” this quantifies that there is only one queue-handling process, that events come in with the sort of randomness that raindrops have (no periodicity), and that handling an event does not have any “timeouts” nor any “long tails” where once a handler has been occupied for 3 or 4 T (where T is the average time to resolve) then the prediction for how much time is remaining for them either drops to 0 (timeout) or escalates dramatically (long tail)... The exponential distribution is used, kind of to be nicely in between these, it is a model where over a time dt the handler is asked to pick a random number with success rate p=dt/T, if they pick the right random number then the event is handled, otherwise a new random number is generated, repeat ad nauseam.
The basic lessons are surprisingly transferable though. So for example traffic on the highway does not have this raindrop sort of randomness because cars kind of repel each other, they space out much more uniformly than you would have predicted on short term scales... The handling time is also nothing like this model of random guessing. Nevertheless, traffic jams still form at high utilization! And they form for a really interesting reason which is that the n handlers—lanes, we call them—have correlations between them. All these queuing models optimistically assume that a handler never stops another handler, “hey have you ever seen anything like this? where can I find that password?” etc.
But in traffic, a car can only lane change from one lane to another if both lanes synchronize speed. With a fixed ability to accelerate, this depends the amount of space between you and the next car, and the amount of space in the lane you want to switch to: as utilization increases both of these spaces drop and synchronization becomes a massive source of inefficiency.
Same happens at a company too, it is possible that an employee might spend more time waiting for meetings, and finding convenient times to schedule them, in order to synchronize two streams of work, than the work itself takes. Can be very frustrating!
Assuming the input and output are just a little bit random, and are only equal on average. You will have periods during which the input and output balances out. You will have periods during which the output is faster, and the queue will shrink to 0 (but no lower). And you will have periods where the output is slower, and the queue grows, with no bound.
> an M/M/1 queue represents the queue length in a system having a single server, where arrivals are determined by a Poisson process...
So the arrival times are not arriving at any fixed rate, but according to a Poisson distribution. In the article he did reference this fact right before the first plot:
> (For the wonks: I used a Poisson arrival process and exponentially distributed processing times)
And then finally, he answered this question more directly in the comments to the article:
> What’s to prevent the system from bouncing between “1 task in processor & 1 task in queue” and “1 task in processor & 2 tasks in queue” while maintaining 100% utilization?
> Nothing! That could totally happen in a queueing system. However, the arrival process would need to be tuned quite precisely to the processing rate. You would need to watch the status of the processor and, when a task finishes, only then insert a new task into the queue. But this implies a shell game: if you always have a task ready to put into the queue when it needs to be padded back out, then isn’t that task in a sense already “queued”?
> Instead, in queueing theory we usually assume a random arrival process: sometimes a minute may pass between arrivals into the queue; sometimes only a second. So the system can’t bounce for arbitrarily long between 1-in-queue and 2-in-queue states. Eventually, one of two things will randomly occur:
> 1. From the 1-in-queue state, the active task finishes processing before a new task arrives, bringing queue size to 0.
> 2. From the 2-in-queue state, a new task arrives in the queue before the active task finishes, causing the queue size to grow to 3.
What all of this actually boils down to, though, is that the odds of your next event being completing a job or getting a new one are exactly even.
...and karaoke singer rotations. In 6 years of frequently singing karaoke at bars, I've never known a host to turn away new singers, unless it's almost time to end the show. So you could say the queue is unbounded, with predictable results for the few singers who show up early and then get frustrated with the ever-increasing time until their next turn as more singers arrive. I don't know what the best solution is.
Now that may actually be suboptimal for the business (if they can only let 10 people in they'd rather let the 10 who will spend the most, say) which is why things like reservations, etc come into play. I wonder if restaurants that do both reservations and a queue see one group pay more on average ...
Yes. That's called the "rejection rate".
Unless, some of the time, queue length is zero, you will have a nonzero rejection rate. This is worth bringing up with store managers who want to run understaffed checkouts. One of the things retail consultants do is point out how sales are being lost that way, both in customers who leave and customers who never come back.
Much of my early work on network congestion was based on that. In the early days of networking, everyone was thinking Poisson arrivals, where arrivals are unaffected by queue lengths. This is partly because the original analysis for the ARPANET, by Leonard Klienrock, was done that way. It was done that way because his PhD thesis was based on analyzing Western Union Plan 55-A, which handled telegrams. (Think of Plan 55-A as a network of Sendmail servers, but with queues made from paper tape punches feeding paper tape readers. The queue was a bin between punch and reader.)[1], at 7:00. Queue length was invisible to people sending telegrams, so senders did behave like Poisson arrivals. That's still true of email today.
The IP layer is open loop with rejection. Transport protocols such as TCP are closed loop systems. The two have to be considered together. Everybody gets this now, but it was a radical idea in 1985.[2]
There's also putting restrictions on the input. Have a maximum song length.
Otherwise, require additional constraints before queueing like making it computationally expensive or having a priority system based on criteria. Charge a cover (with increasing costs) or have a membership club that gives more/better access.
Also, for a long time, sending messages to a process with a long queue might be deprioritized (suspended), as a form of backpressure, although that got removed for local processes because it wasn't effective with SMP, if I'm remembering properly.
This argument is incorrect because it proves too much. If tasks arrive predictably, say at the rate of one per minute, and the processor completes tasks at the same rate, you can have maximum throughput and a finite queue -- indeed, the queue never has to exceed two waiting tasks. The unbounded queue growth problem only arises when tasks arrive unpredictably. Since this argument proves the same conclusion without regard to predictability, it's incorrect.
Here's how I think it actually works: if tasks randomly come in and out at the same rate, the size of the queue over time can be modeled as an unbiased random walk with a floor at zero. Elementary techniques prove that such a random walk will hit zero in finite time, and will therefore hit zero infinitely many times. These correspond to the idle times of the processor. The only way to avoid hitting zero over and over is if tasks come in faster than they come out, which leads to tasks piling up over time.
Queues are balloons. If you put more air in than you are taking out, they grow until they pop. As utilisation grows (ie as the input volume of air starts to match the output) it takes much longer to get rid of the backlog.
This means that any traffic anomalies that would otherwise be hidden by the queue are instantly amplified.
I wonder if anyone here has experience with a fixed-size backlog for work tickets.
The problem is that people add things to the backlog during their ordinary work, so as you scale up the number of workers, the pressure on the queue increases.
That said, my experience in doing bug triage is that some large percentage of tickets are never going to be done. Those should just be dropped on the floor.
For example, what if tasks were not monolithic? As the queue size grows, increase the caching timeout and don't hit that DB, or decrease timeout, or something like that.
Or, what if the input task variance was bounded and we just initialized the queue with 10 tasks? This way, the addition of tasks would never dip below 1 and would never exceed 20 (for example).
Processing capacity might not be constant. If you're in a cloud, maybe you can launch more processing power as queue length increases.
Multiple items in the queue might be satisfiable by the same processing, e.g. multiple requests for the same item. In that case, having more requests in queue can increase processing efficiency.
Edit: another one. All requests may not need the same quality of service. For some, best effort might be acceptable.
https://lmax-exchange.github.io/disruptor/disruptor.html
It has a completely opposite pattern: the more you load it the faster it goes.
Not sure what do you mean with that. The article is about the general theoretical properties of (unbounded) queues.
LMAX is a bounded queue, with the quirk that it will drop older messages in favour of newer ones and it assumes that either there is a side channel for recovery or consumer(s) can tolerate message drops.
What I was trying to get at is the specific memory access patterns, batching effects and other nuances related to the physical construction of the CPU which modulate the actual performance of these things. I think this quote better summarizes what I was trying to convey:
> When consumers are waiting on an advancing cursor sequence in the ring buffer an interesting opportunity arises that is not possible with queues. If the consumer finds the ring buffer cursor has advanced a number of steps since it last checked it can process up to that sequence without getting involved in the concurrency mechanisms. This results in the lagging consumer quickly regaining pace with the producers when the producers burst ahead thus balancing the system. This type of batching increases throughput while reducing and smoothing latency at the same time. Based on our observations, this effect results in a close to constant time for latency regardless of load, up until the memory sub-system is saturated, and then the profile is linear following Little’s Law. This is very different to the “J” curve effect on latency we have observed with queues as load increases.
A large list of books (free and online), I cannot speak to the quality of any particular book but you can take some of the titles and authors and perhaps find reviews.
http://web2.uwindsor.ca/math/hlynka/qbook.html
Linked at the top of the first page I shared, but not available online (though a few are available through some services like O'Reilly's digital library).
Client --
If you cache a new submission (message) locally before submitting, you just keep re-submitting until the server returns an ACK for that submission; and your local cache is empty.
Scale clients out as needed.
Server --
If a message is received in good order, return an ACK.
If a message is a duplicate, discard the message and return an ACK.
If a message can only be received once, discard if it already exists on the server, and return an ACK.
If a message is invalid, return a FAIL.
Scale hardware out or up, as needed (ref. "capacity, item 2 in the linked blog post above). Scale queues out by adding service endpoints on the server. Async/await makes the client experience painless. You save your employer $$ because no additional procurement contracts, no licensing fees, no additional server-side infrastructure to run queues, and no consultancy fees to set up and operate Rabbit MQ/Tibco/IBM MQ/Amazon SQS/Azure Queue Storage or whatever other queueing solution the org uses.
Message passing includes concepts auch as durability, security policies, message filtering,delivery policies, routing policies, batching, and so on. The above can support all of that and, if your scenario calls for it, none of it.
The financial argument reduces to dev time vs. procurement, deployment and operational costs of whatever 3rd party product is used, as well as integration, of course.
* I note and welcome the downvotes. However I'd love it more if you'd present a coherent counter argument with your vote.
This has terrible resource use and offers no visibility into how many clients are waiting. And yet it's still a queue. Why would anyone do that?
The rest of your post I can't parse at all.
A directory of files to be processed may be treated as a queue. Add several directories for different stages of processing, and you have a rudimentary state machine.
Distributed systems in particular may benefit from an MQ, but are by no means necessary.
Generally, when we add an MQ we are really regulating an already existing implicit queue. It's such a common and intuitive data structure that one may easily create one without even realizing it.
If, however, you use the queue as a way to process queries more efficiently then that's a different story.
> I won’t go into the M/M/1 part, as it doesn’t really end up affecting the Important Thing I’m telling you about. What does matter, however, is that ‘∞’.
I'm really no expert but if I understand it correctly the `M/M` part defines the distributions of the arrival and service processes, so this definitely is important as others have already mentioned in the comments. E.g. a D/D/1 queue where D stands for deterministic arrival shouldn't suffer from this problem.
This doesn't change the interesting fact the article presents but throwing in technical terms without proper explanation is imo bad style. Either don't mention it (would be better in this case) or explain it and why it is relevant.
This is also a common "mistake" unexperienced people make when writing papers. They mention a fact that is somehow related but is completely irrelevant to the topic and the rest of the paper. I don't want to assume anything but to me this often smells like showing-off.
Relevant xkcd: https://www.explainxkcd.com/wiki/index.php/2501:_Average_Fam...
Instead, the right way to think about limiting queue size is load shedding when you feel back pressure.
Here’s an example at Cronitor: if our main sqs ingestion queue backs up, our pipeline will automatically move from streaming to micro-batching, drastically reducing the number of messages on the queue at the expense of slightly increased latency per message. At the same time, a less critical piece of our infrastructure pauses itself until the queue is healthy again, shedding one of the largest sources of db load and giving that capacity to ingestion.
To me the goal is to feel and respond to back pressure before blowing up and rejecting messages.
.
I appreciate that this truism has nothing to do with the article.
They added "express lanes" back in the 80s to somewhat address this, but then you have the person who ignores the sign and gets in the line with a full cart.
The article by Dan is referenced there too.
FWIW envoy also has an adaptive concurrency experimental plugin that seems similar that I'd also love to hear about any real world experience with: https://www.envoyproxy.io/docs/envoy/latest/configuration/ht...
But tasks are variable. One task requires 5 minutes of processing, another one of 35. They arrive randomly. This is also explicitly given, in a caption under one of the diagrams. "A queueing system with a single queue and a single processor. Tasks (yellow circles) arrive at random intervals and take different amounts of time to process."
People calling the article wrong may be thinking of capacity as "unit time amount of work"; like when the processor is at full capacity, it's doing one second's worth of work every second.
If we define capacity this way, the problem goes away: the processor just becomes a leaky bucket. So that is to say, if we know exactly how long each task will take, then we can measure the queue size in terms of total number of seconds of work in the queue. And so then, as long as no more than one second's worth of work is being added to the queue per second, it will not grow without bound, just like a leaky bucket that is not being refilled faster than its leak rate.
When capacity is given as a maximum number of tasks per second, there has to be some underlying justification for that, like there is some fixed part to servicing a job such as set up time and clean up time that doesn't go away even if the job takes next to zero seconds, such that jobs effectively have a built-in minimum duration. If it takes one second to set up a job, and one second to clean up, then the maximum capacity is half a job per second: 1800 jobs per hour and so on. Of course the queue starts to backlog when you approach capacity, because the jobs also require nonzero real work in relation to the administrative time.
If jobs have no minimum fixed cost attached, then the maximum job rate is unbounded: the shorter the jobs being queued, the more of them can be done per unit time: one million one-microsecond jobs can be done in a second, or a billion one-nanosecond jobs, and so on.
Ted Dzuba Monitoring theory covers unbounded queues pretty well: http://widgetsandshit.com/teddziuba/2011/03/monitoring-theor...
or less elegantly, circuit breaking / fail-retry
But if rates are not well specified, what "a little" means is also not, and rejecting additions to the queue is the only answer.