And our log messages are ridiculously big at times (15k to as big as 50k).
Our pipe never has problems. What fails for us is Elastic Search. In fact at one point in the past we did 100k messages/s when embarrassingly had debug turned on in production and RabbitMQ did not fail but Elastic Search and sadly Flume did as well (I tried to get rid of flume with a custom Rust AMQP to Elastic Search client but at the time had some bugs with the libraries.. Maybe I will recheck out Mozilla Heka someday).
There is this sort of beating of the developer chest with a lot of tech companies.. that hey listen we are ultra important and we are dealing with ridiculously traffic and we need ultra high performance. Please tell/show me these numbers.... Or maybe stop logging crap you don't need to log.
Or maybe I'm wrong and we should log absolutely everything and Auth0 made the right choice given their needs (lets assume they have millions of messages a second), I still think I could make a sharded RabbitMQ go pretty far.
This goes with other technology as well. You don't need to pick hot glamorous NoSQL when Postgresql or MySQL and a tiny bit of engineering will get the job done just fine particularly when mature solutions give you such many things free out of the box (RabbitMQ gives you a ton of stuff like a cool admin UI and routing that you would have to build in ZeroMQ).
BTW I didn't mean to denigrate Elastic Search (I assume that is why I'm getting downvoted.... a comment would help). We just haven't had the chance to upgrade it and properly configure it.
In fact Elastic has been pretty darn speedy as of lately particularly since we purge some of the data after 6 months (we still have permanent filesystem storage of logs of course).
How many MB/s are you indexing?
I'd be really interested to hear how you can achieve such a thoughput with rabbitmq
I do know we use multiple queues and even exchanges (and I did not know about the one core to queue).
A simple googling shows though folks have achieved far greater throughput[1] than 100k (and by the way this wasn't sustained.. it was spikes).
[1]: https://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-m...
If you don't need amqps out there are more modern, better supported projects.
import socket
import time
cs = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cs.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
cs.setsockopt(SOL_SOCKET, SO_BROADCAST, 1)
while True:
cs.sendto('Node ID', ('255.255.255.255', 54545))
time.sleep(4)
Everybody listening on the same on port 54545 without knowing Node ID's IP address will get these messages which includes the broadcaster IP address. import socket
s=socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(('',54545))
m=s.recvfrom(1024)
print m[0]
This is a very useful technique when using ZeroMQ generally as you can broadcast services without knowning any IP address so they can come up and down on new addresses if / when necessary.I am wondering if working on solving the actual problems with Kafka would have been the better route. I've never used Kafka and i find ZeroMQ great, but reading that their logging solution does drop log messages is a huge no-go for operations. How can you claim to run a serious business and say "babies will die" when you can't be sure to be able to find problems?
Because, when will you lose logs? Not in normal operation, but when weird things happen. When networking has a hiccup. When Load on the system is too high, so most likely when many people are using your service. Exactly when shit hits the fan. And you just made the decision that it's ok to drop log messages in such cases? That's not good.
I think you should either dive into Kafka/Zookeeper and fix your problems or switch to another logging solution. You should probably just drop that non-sense "streaming and real-time logs" requirement and live with a log delay of a few seconds and build something really stable instead of building something inherently unstable. Honestly, just collecting syslogs on the core vm and sending them to a central server would have been the better solution. Better then looking into fancy real-time, streaming logs on a sunday night because the system is having a breakdown and you can't even be sure that you are not missing essential logs.
How do you end up with a system design that can generate logs so fast that you can't keep up? It seems to be that some fundamental element of capacity planning was missed long ago and we're trying to fix the symptoms and not the cause.
If I have a geographically distributed system, I'm going to have bandwidth and latency issues if I try this, sure. But why do I care? A request to the SF data center shouldn't involve the Munich data center. That is, if I care about response times, and if I don't care about that, then why do I care about instantaneous log availability?
I think sometimes we get so bored with the problems we have that we invent new things to get upset about. Or management does, which is always worse.
ZMQ silently drops messages when subscribers fail or not listening or when buffers fill up, but as they describe later on "access to historical logs", it's much easier to set up separate process/es for just that.
It seems that when shit hits the fan for this reason ZMQ really is a more reliable choice because it's more flexible.
Also, as he describes in the article, historical logs are scoped out and it is "likely" they they will develop something for those logs in the future. Again it looks like the plan is to use ZMQ and a subscriber to put those logs into logstash. That doesn't solve the problem i mentioned at all. ZMQ may still drop the logs! So, as far as i understand they don't have a plan for reliable logging. Even if they would, they'd have one reliable solution and an unreliable solution. The unreliable ZMQ based approach is probably neat and leads to fancy realtime dashboard stuff, but since it's not a reliable source of information it's not a good solution for operating a system where "babies will die".
But if the Auth0 runs their entire operations on AWS, maybe Kinesis would have been a more natural transition.
To anyone reading this with Kafka experience, do you have any tips/advice when it comes to maintaining a Kafka service?
Now about Kafka vs ZeroMQ: you want Kafka if you cannot tolerate the loss of even a single message. The append-only log with committed reader positions is a perfect fit for that.
It is? Can you provide any evidence supporting this claim?
Mesos is mostly used to deploy stateless services.
From what I can see it doesn't really support database deployments except for ArangoDB and Cassandra.
I knew a big part of the reliability problems we were having was related to the distributed state that needed to be kept synchronized. I wanted to move to something simpler that did not rely on any durable, distributed state, while supporting the messaging patters we required. ZeroMQ fit the bill.
While there were other implementations with similar properties, there is no reasonable way to compare them up front given that what makes the real difference at the end of the day is the behavior of the system at 2am one day after a prolonged stress run. As a startup one does not have resources to conduct an up front analysis of that sort. You just take a bet. If it does not pan out, you pivot. This is exactly what we have done with the move from Kafka to ZeroMQ in the first place.
Now that we've been using ZeroMQ for over a year and have been perfectly happy, there is no incentive to look elsewhere.
There was some drama when the maintainer quit briefly before rejoining. Since then the gitter channel has been more active than I remember it being before. The mailing list is quiet it is true. Somebody just released a Rust version, and version 1.0.0 of nanomsg was indeed released.
You can see commit history here: https://github.com/nanomsg/nanomsg/commits/master/src
Did that ever happen? I still like the idea behind nanomsg.
I was surprised by the contrasting sense of importance of delivery guarantees in the article. At the start, losing a message was akin to the death of a child. At the end, shrug. Now every single machine failure (or even ømq process restart) failure will lose you log messages stored in memory :(.
Glad to hear you found a solution that worked for you though! Would love to hear about difficulties you had with the new system, in particular adding brokers.
I've never heard anyone say managing a production Kafka cluster was easy or simple. Well, anyone who has had to actually maintain such clusters hasn't said it anyway.
True, but it appears to me that availability problems and dropped log messages often have the same root cause - network issues.
So whenever they do have availability issues (and dying babies) they won't be able to investigate properly because log messages are being lost as well.
That's obviously a very general observation. It may well be that in their architecture availability issues are mostly caused by something unrelated to networking (e.g. the database).
Running Zk and Kafka on the same nodes is likely not the best thing.
Though for my part I still don't understand why Zookeeper wasn't built as a library to add distributed strongly consistent coordination to software that needs/benefits from it rather than being an external service that needs to be connected to, and thus introduces a gnarly mess of new failure modes that make Zookeeper client behavior extremely critical and often fragile. Something that's more like a "libpaxos/libraft" (e.g. serf for Go-lang or riak_ensemble for Erlang) seems a lot more valuable. /shrug
If we have K1,Z1 -- K2,Z2 -- K3,Z3 -- and one node goes down, you've now taken down both a broker and a Zk node. Remember, the brokers don't care about connecting to any Zk node, they want the leader. So you aren't gaining any more fault tolerant by co-locating them.
If there's a network partition between the leader Zk node and other nodes, the local Kafka broker won't actually be able to do much because the Zk cluster will elect a new leader, on another node, so again, you aren't gaining anything.
Moreover, you're now tying the scalability of Kafka with Zk. Zk doesn't scale linearly, so there's only so many nodes you may have in a cluster. Kafka, on the other hand, scales linearly. So if you're colocating them and you have to bump up Kafka, do you still start up Zk for those nodes (but they don't actually join the cluster)? You're now special casing and adding more edge cases.
I used MQTT but only as a message bus.
Nothing magical/weird about it, just depends on whether or not you've got a nail to hammer with your MQ-hammer.
It's a great read and describes most scenarios well and easy to understand.
The kicker is that Kafka can be rock solid in terms of handling massive throughput and reliability when the wheels are well greased, but there are a lot of largely undocumented lessons to learn along the way RE: configuration and certain surprising behavior that can arise at scale (such as https://issues.apache.org/jira/browse/KAFKA-2063, which our team ran into maybe a year ago & is only being fixed now).
Symptoms of these issues can cause additional knock-on effects with respect to things like leader election (we wound up with a "zombie leader" in our cluster that caused all sorts of bizarre problems) and graceful shutdowns.
Add to that the fact the software is still very much under active development (sporadic partition replica drops after an upgrade from 0.8.1 to 0.8.2; we had to apply some small but crucial patches from Uber's fork) & that it needs a certain level of operational maturity to monitor it all ... it's easy to get nervous about what the next "surprise" will be.
Having said all that, I'd use Kafka again in a heartbeat for those high volume use cases where reliability matters. Not sure I'd advise others without similar operational experience to do the same for anything mission critical, though -- unless you like stress. That stress is why Confluent is in business. :)
Kafka and Confluent Platform are very much still works in progress. I had to patch Kafka Connect HDFS connector because a fix I needed was left out of the last release. Be prepared to do something similar with any of Kafka's components.
http://wanderr.com/jay/tail-error-logs-to-slack-for-fun-and-...
0MQ doesn't sound like it is the right solution either, but yeah... often you pick the wrong tool and learn something in the process.
Kafka gives you features that certain systems cannot live without, like on disk persistence (saved my life couple of times) and topics. Filtering messages on the client side like ZeroMQ does it not an option in many cases, just think about security. I think Kafka has a long way to go before it can be used as a general message queue (many features are not there yet like visibility timeout for example) but if you can manage Zookeeper and have means to work with it (somebody understands it and knows its quirks) it can provide a reliable platform for distributing a large number of messages with low latency and high throughput, just like it does at LinkedIN.
If deployed using the Netflix co-processes both are very durable.
There are about a dozen message systems out there that will handle much more than Kafka with minimal or no operational overhead while supporting everything they need.