Show HN: Amqphosting, Managed RabbitMQ service

(amqphosting.com)

11 points · RabbitmqGuy · 8y ago · 16 comments


zedpm · 8y ago · 8 in thread

My first question when I saw this was "How do you handle network partitions?", since RabbitMQ's partition handling is, uh, suboptimal. I read bullet points until I found this:

> With our RabbitMQ servers, you wont have to deal with message loss in the event of a network partition.

Reading on, I found that your answer to partition tolerance is to avoid the possibility of partitions by not supporting clustering at all. So that kind of rules out high availability, practically speaking. Shovel and federation are poor options.

As someone who is actively looking for highly available AMQP without message loss, I have to say that I'm not going to pay someone else for a poor solution to the problem. A managed service has to solve the hard problems to be compelling. I can run my own single instance and hope it doesn't go down at a bad time, which is all you're offering.

I know this is all very negative, and I regret that, but I'm part of your target market and you need to know what your offering looks like from my perspective. A managed service can't sidestep the difficult problems of operating their core technology.

RabbitmqGuy (OP) · 8y ago

Hi.

What you have described is not negative at all. It is just how things are with RabbitMQ. I remember my first RabbitMQ node split, many years ago, and how disappointed I was with it. But after reading their documentation, I realised that this is simply the way RabbitMQ handles partitions.

> A managed service has to solve the hard problems(high availability) to be compelling.

We hear you, and we are working very hard to come up with high availability solutions/plans that do not have the failure modes that currently exist in RabbitMQ. But we believe those problems need to be solved in RabbitMQ itself, so we are looking at whether we can come up with other failure handling modes, apart from the 3 outlined in [1]. We can't make any promises, however.

If you are already running your own node and are looking for HA solutions, then we might not be right for you, yet. But we still believe we can offer you value by taking over the headache of running the node from you so that you can concentrate on other things.

Thanks for checking us out!

1. https://www.rabbitmq.com/partitions.html

lobster_johnson · 8y ago

Indeed. In my opinion RabbitMQ is essentially useless in clustered mode.

When Rabbit recovers from a network partition and has to decide between multiple potential master versions of a queue, it picks the largest one to become the new master, and discards the others. It's rather mind-boggling that it can't merge them instead; after all, if your application is capable of handling duplicate deliveries, then merging (which would potentially result in previously ACKed messages becoming visible again) would be a perfectly acceptable solution.
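A minimal sketch of the difference between the two recovery strategies, assuming every message carries a unique publisher-assigned ID (the `message_id` field, the tuple layout, and both function names are hypothetical and not part of RabbitMQ):

```python
from typing import Iterable, List, Tuple

# (message_id, body). The message_id is an assumed unique ID set by the publisher.
Message = Tuple[str, str]


def pick_largest(replicas: Iterable[List[Message]]) -> List[Message]:
    """Keep the longest diverged copy and discard the rest (the lossy behaviour described above)."""
    return max(replicas, key=len)


def merge_replicas(replicas: Iterable[List[Message]]) -> List[Message]:
    """Union all diverged copies, keeping first-seen order and dropping repeated IDs."""
    seen = set()
    merged = []
    for replica in replicas:
        for message_id, body in replica:
            if message_id not in seen:
                seen.add(message_id)
                merged.append((message_id, body))
    return merged


# Two sides of a partition that each accepted different publishes.
side_a = [("m1", "a"), ("m2", "b"), ("m3", "c")]
side_b = [("m1", "a"), ("m4", "d")]

largest = pick_largest([side_a, side_b])
merged = merge_replicas([side_a, side_b])
```

Here `largest` drops `m4`, while `merged` keeps all four messages. A message that was already ACKed on one side can reappear after a merge, so this is only acceptable when consumers are idempotent.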

The only way to make it non-lossy is to turn off HA recovery and manually handle network partitions, but it turns out that's not practically feasible, because there are no tools to work with Rabbit queues at a low level; the only way to recover is to discard one or more nodes.

We've also found Rabbit's clustering to be very flaky in general, besides the lack of partition tolerance. We recently had a Rabbit crash where one Rabbit node (not the machine itself) went down, and things got really stuck; the only way to recover was to stop all the nodes, then start them again. After we did that, all the queues were empty. We've also had instances where bindings suddenly go missing, or the bindings are there but attempting to declare them from a client fails with a "bindings already exist" error. And many other weird errors.

The last year or so, after having to endure all of these issues, we've decided to ditch clustering altogether and run a single node. That's risky, but ironically it's a lot more stable than our previous three-node cluster.

In my opinion, Pivotal really needs to redesign RabbitMQ's clustering.

Has anyone successfully moved off Rabbit? ActiveMQ, NSQ? Disque [1] looked promising, but seems dead (last commit was 18 months ago) at this point.

[1] https://github.com/antirez/disque

gerhardlazu · 8y ago

For a stable RabbitMQ cluster, you want dedicated RabbitMQ hosts with sufficient CPU, disk & network throughput for your workload. Most RabbitMQ users don't know what their workload is, or what their hardware boundaries are. We, the RabbitMQ team, should make this easier - and we will, in due course.

A good default cluster is 3 x r4.large with 100GB GP2 for RABBITMQ_MNESIA_BASE & pause_minority. For queues that need HA, a good default is ha-mode: exactly, ha-params: 2, ha-sync-mode: automatic. As for the Erlang version, we recommend 19.3.6.2 which has important fixes relevant for RabbitMQ. Today we recommend RabbitMQ 3.6.11, and 3.6.12 as soon as it ships.
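A rough sketch of how those defaults might be applied, assuming the classic Erlang-term `rabbitmq.config` format used by the 3.6 series and the `rabbitmqctl set_policy` command. The policy name `ha-two` and the queue name pattern `^ha\.` are hypothetical placeholders:

```python
import json
import shlex

# Partition handling goes in the node configuration file (classic format).
RABBITMQ_CONFIG = "[{rabbit, [{cluster_partition_handling, pause_minority}]}]."

# Queue mirroring defaults from the comment above, expressed as a policy definition.
policy_definition = {
    "ha-mode": "exactly",
    "ha-params": 2,
    "ha-sync-mode": "automatic",
}


def set_policy_command(name: str, pattern: str, definition: dict) -> list:
    """Build the argument list for `rabbitmqctl set_policy`, limited to queues."""
    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", "queues",
        name, pattern, json.dumps(definition),
    ]


# Hypothetical policy name and pattern: mirror every queue whose name starts with "ha."
cmd = set_policy_command("ha-two", r"^ha\.", policy_definition)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell on a cluster node, or `cmd` can be passed to `subprocess.run` directly.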

In the past 6 months, I have been focusing on RabbitMQ stability and operability on AWS, GCP & vSphere. Can you tell me more about your RabbitMQ deployment lobster_johnson? This will help: https://s3-eu-west-1.amazonaws.com/rabbitmq-share/help-us-un...

I wouldn't mind moving this discussion to the rabbitmq-users mailing list, so that it can benefit more people in the RabbitMQ community.

Thanks, Gerhard

zedpm · 8y ago

Antirez said he plans on merging disque into a Redis module, now that such a thing exists. I'm pretty excited, and would love to migrate off of RabbitMQ to Disque or whatever the module version is named, as we're already successfully running redis instances.

As for merging queues after partition recovery, the RabbitMQ devs have been talking about implementing that for years. I understand it's a hard problem, or it would already be part of RabbitMQ, since it's the most obvious and desirable solution for applications that can handle duplicates.

We're doing the same thing wrt avoiding clustering and accepting the brief downtime when the single RabbitMQ instance fails.

RabbitmqGuy (OP) · 8y ago

Yes.

The only reason we are not offering a clustering plan is that when you experience a network partition in such a setup, you will lose data. And we do not want to be in a situation where we are explaining to our customers that their data has been lost through no fault of our own, but because that's the way RabbitMQ works.

We would rather have single node plans where failure modes are much easier to deal with.

xerxes901 · 8y ago

We run our RabbitMQ cluster with pause_minority as the partition handling strategy. This should eliminate most message loss on partition, no?
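For reference, a small sketch of the rule behind pause_minority as the RabbitMQ partitions documentation describes it: a node pauses itself when it can reach half or fewer of the cluster's nodes, counting itself. The function name and parameters are hypothetical, and this models only the decision, not how the broker detects reachability:

```python
def should_pause(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a node on this side of a partition would pause.

    reachable_nodes counts the node itself plus every peer it can still see.
    A side holding exactly half of the cluster is treated as a minority too.
    """
    return reachable_nodes * 2 <= cluster_size


# Three-node cluster split 1 | 2: the lone node pauses, the pair keeps running.
lone_node_pauses = should_pause(1, 3)
pair_pauses = should_pause(2, 3)

# Two-node cluster split 1 | 1: neither side has a majority, so both pause.
even_split_pauses = should_pause(1, 2)
```

This is why the strategy is usually paired with an odd number of nodes: with an even split, no side has a majority and the whole cluster stops serving until the partition heals.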


nicodjimenez · 8y ago

I'm totally with you. I think the problem is AMQP itself. It just doesn't lend itself well to fully managed reliable message passing across network partitions because it was engineered (some say overengineered) to be "zero overhead" and have "delivery guarantees". If you need message passing across network partitions, you really need to ask yourself if RabbitMQ is the right tool for the job. As services like Pubnub get cheaper and gain more acceptance from developers, I think they will take over many of the things RabbitMQ is currently being used for. If you need message passing features that aren't in RabbitMQ, Redis, or hosted services like Pubnub, then you're probably doing something sophisticated and probably want to build your infrastructure from the ground up.

cheald · 8y ago

Totally agreed. A single Rabbit node is brain-dead easy to run - the hardest part is getting the right Erlang packages installed. HA Rabbit is a problem I'd be willing to pay someone else to solve for me.

esdott · 8y ago · 1 in thread

Why not support TLS/amqps at all pricing levels? That's a huge turn-off for me, especially since you only have it at your highest pricing level. I'd also make that clear on your comparison page as it seems like you support amqps at the $55 level but do not on your pricing page. Good luck! (seriously, no sarcasm)

RabbitmqGuy (OP) · 8y ago

> as it seems like you support amqps at the $55 level but do not on your pricing page.

Hi, sorry about that. We'll fix the comparison page.

> Why not support TLS/amqps at all pricing levels?

When we started out, we were offering AMQPS/TLS for all plans through certificates from letsencrypt[1]. However, it became hard to manage certificate renewals, since we had to renew them on one machine and scp them to the respective RabbitMQ servers. This was too labour intensive and not worth it. However, we still have plans to roll out TLS for all plans at no extra cost.

1. https://letsencrypt.org/

kevinsimper · 8y ago · 1 in thread

Looks nice, is it a single side project that you are launching? What is your experience with RabbitMQ vs. SQS or something like it? :)

RabbitmqGuy (OP) · 8y ago

Hi.

It started out as a side project but it has grown to a point where we are now doing it full time.

jv22222 · 8y ago

Great idea. Nice to see a managed service for this.

When we used to use RabbitMQ, we were never able to restart a fully loaded instance; it always just seemed to hang.
