> With our RabbitMQ servers, you won't have to deal with message loss in the event of a network partition.
Reading on, I found that your answer to partition tolerance is to avoid the possibility of partitions by not supporting clustering at all. So that kind of rules out high availability, practically speaking. Shovel and federation are poor options.
As someone who is actively looking for highly available AMQP without message loss, I have to say that I'm not going to pay someone else for a poor solution to the problem. A managed service has to solve the hard problems to be compelling. I can run my own single instance and hope it doesn't go down at a bad time, which is all you're offering.
I know this is all very negative, and I regret that, but I'm part of your target market and you need to know what your offering looks like from my perspective. A managed service can't sidestep the difficult problems of operating its core technology.
What you have described is not negative at all. It is just what it is with RabbitMQ. I can remember the first time (many years ago) I experienced a RabbitMQ node split, and I was so disappointed with it. But reading their documentation, I came to realise that this is just the way RabbitMQ handles partitions.
> A managed service has to solve the hard problems (high availability) to be compelling.
We hear you, and we are working very hard to come up with high availability solutions/plans that do not have the failure modes that currently exist in RabbitMQ. But we believe that those problems need to be solved in RabbitMQ itself. So we are looking to see if we can come up with other failure handling modes, apart from the 3 outlined in [1]. We can't make any promises, however.
If you are already running your own node and are looking for HA solutions, then we might not be right for you, yet. But we still believe we can offer you value by taking over the headache of running the node from you so that you can concentrate on other things.
Thanks for checking us out!
When Rabbit recovers from a network partition and has to decide between multiple potential master versions of a queue, it picks the largest one to become the new master, and discards the others. It's rather mind-boggling that it can't merge them instead; after all, if your application is capable of handling duplicate deliveries, then merging (which would potentially result in previously ACKed messages becoming visible again) would be a perfectly acceptable solution.
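To make the difference concrete, here is a toy model of the two recovery strategies. It is only a sketch: it does not reflect how RabbitMQ represents queues internally, queues are modeled as plain lists of message IDs, and the `pick_largest` and `merge` helpers and all message IDs are made up for illustration.

```python
def pick_largest(versions):
    """Winner-takes-all: keep the longest version, discard the others."""
    return max(versions, key=len)


def merge(versions):
    """Union of all versions, keeping first-seen order.

    A message ACKed on one side of the partition but still queued on
    the other side reappears, so consumers must tolerate duplicates.
    """
    seen = set()
    merged = []
    for version in versions:
        for msg_id in version:
            if msg_id not in seen:
                seen.add(msg_id)
                merged.append(msg_id)
    return merged


# Both sides start from the same queue [m1, m2], then diverge.
side_a = ["m2", "m3", "m4", "m5"]  # m1 ACKed here; m3, m4, m5 published here
side_b = ["m1", "m2", "m6"]        # m6 published here; m1 never ACKed here

winner = pick_largest([side_a, side_b])
merged = merge([side_a, side_b])

# Everything that exists somewhere but is missing from the winner.
discarded = sorted(set(merged) - set(winner))
```

In this model `discarded` contains `m1` and `m6`. Dropping `m1` is harmless because it was already ACKed on side A, but `m6` is a genuinely lost message. With `merge`, `m6` survives and `m1` comes back as a duplicate delivery, which is exactly the trade-off a duplicate-tolerant application would accept.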
The only way to make it non-lossy is to turn off HA recovery and manually handle network partitions, but it turns out that's not practically feasible, because there are no tools to work with Rabbit queues at a low level; the only way to recover is to discard one or more nodes.
We've also found Rabbit's clustering to be very flaky in general, besides the lack of partition tolerance. We recently had a Rabbit crash where one Rabbit node (not the machine itself) went down, and things got really stuck; the only way to recover was to stop all the nodes, then start them again. After we did that, all the queues were empty. We've also had instances where bindings suddenly go missing, or the bindings are there but attempting to declare them from a client fails with a "bindings already exist" error. And many other weird errors.
In the last year or so, after having to endure all of these issues, we've decided to ditch clustering altogether and run a single node. That's risky, but ironically it's a lot more stable than our previous three-node cluster.
In my opinion, Pivotal really needs to redesign RabbitMQ's clustering.
Has anyone successfully moved off Rabbit? ActiveMQ, NSQ? Disque [1] looked promising, but seems dead (last commit was 18 months ago) at this point.
A good default cluster is 3 x r4.large with 100GB GP2 for RABBITMQ_MNESIA_BASE & pause_minority. For queues that need HA, a good default is ha-mode: exactly, ha-params: 2, ha-sync-mode: automatic. As for the Erlang version, we recommend 19.3.6.2 which has important fixes relevant for RabbitMQ. Today we recommend RabbitMQ 3.6.11, and 3.6.12 as soon as it ships.
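For anyone who wants to try the queue defaults above, here is a minimal sketch of the policy expressed as a request to the management plugin's HTTP API (`PUT /api/policies/{vhost}/{name}`). The host and port, the policy name `ha-two`, and the `^ha\.` queue-name pattern are assumptions picked for illustration, and the `ha_policy_request` helper is hypothetical. The code only builds the URL and JSON body; it does not send anything.

```python
import json
from urllib.parse import quote


def ha_policy_request(base_url, vhost, name, pattern):
    """Build the URL and JSON body for a mirrored-queue policy."""
    definition = {
        "ha-mode": "exactly",         # mirror to a fixed number of nodes
        "ha-params": 2,               # master plus one mirror
        "ha-sync-mode": "automatic",  # new mirrors sync without manual action
    }
    body = {"pattern": pattern, "definition": definition, "apply-to": "queues"}
    # The vhost must be percent-encoded, so the default "/" becomes "%2F".
    url = "{}/api/policies/{}/{}".format(
        base_url, quote(vhost, safe=""), quote(name, safe="")
    )
    return url, json.dumps(body, sort_keys=True)


# Assumed values: local management port, default vhost, example names.
url, body = ha_policy_request("http://localhost:15672", "/", "ha-two", r"^ha\.")
```

Sending `body` to `url` with an authenticated HTTP PUT should create the policy; the same definition can also be passed to `rabbitmqctl set_policy`. Note that `pause_minority` is not part of the policy: it is the value of the `cluster_partition_handling` setting in the node configuration.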
In the past 6 months, I have been focusing on RabbitMQ stability and operability on AWS, GCP & vSphere. Can you tell me more about your RabbitMQ deployment, lobster_johnson? This will help: https://s3-eu-west-1.amazonaws.com/rabbitmq-share/help-us-un...
I wouldn't mind moving this discussion to the rabbitmq-users mailing list, so that it can benefit more people in the RabbitMQ community.
Thanks, Gerhard
As for merging queues after partition recovery, the RabbitMQ devs have been talking about implementing that for years. I understand it's a hard problem, or it would already be part of RabbitMQ, since it's the most obvious and desirable solution for applications that can handle duplicates.
We're doing the same thing wrt avoiding clustering and accepting the brief downtime when the single RabbitMQ instance fails.
The only reason we are not offering a clustering plan is that when you experience a network partition in such a setup, you will lose data. And we do not want to be in a situation where we are explaining to our customers that their data has been lost through no fault of our own, but because that's the way RabbitMQ works.
We would rather have single node plans where failure modes are much easier to deal with.
Hi, sorry about that. We'll fix the comparison page.
> Why not support TLS/amqps at all pricing levels?
When we started out, we offered AMQPS/TLS on all plans through certificates from letsencrypt[1]. However, it became hard to manage certificate renewals, since we had to renew them on one machine and scp them to the respective RabbitMQ servers. This was too labour-intensive and not worth it. We do, however, still plan to roll out TLS for all plans at no extra cost.
It started out as a side project but it has grown to a point where we are now doing it full time.
When we used RabbitMQ, we were never able to restart a fully loaded instance; it always just seemed to hang.