From Kafka to ZeroMQ for real-time log aggregation

(tomasz.janczuk.org)

197 points · janczukt · 9y ago · 105 comments

68 comments · 21 top-level

agentgt9y ago· 10 in thread

I don't understand why people need such ridiculously fast systems when we use RabbitMQ and crappy Apache Flume and generate more than 5k messages/second, with spikes of 50k. Author of the article, please tell me your metrics.

And our log messages are ridiculously big at times (15k to as big as 50k).

Our pipe never has problems. What fails for us is Elastic Search. In fact at one point in the past we did 100k messages/s when we embarrassingly had debug turned on in production, and RabbitMQ did not fail, but Elastic Search and sadly Flume did (I tried to get rid of Flume with a custom Rust AMQP to Elastic Search client, but at the time had some bugs with the libraries. Maybe I will check out Mozilla Heka again someday).

There is this sort of beating of the developer chest with a lot of tech companies: hey, listen, we are ultra important and we are dealing with ridiculous traffic and we need ultra high performance. Please tell/show me these numbers.... Or maybe stop logging crap you don't need to log.

Or maybe I'm wrong and we should log absolutely everything and Auth0 made the right choice given their needs (lets assume they have millions of messages a second), I still think I could make a sharded RabbitMQ go pretty far.

This goes for other technology as well. You don't need to pick a hot glamorous NoSQL when PostgreSQL or MySQL and a tiny bit of engineering will get the job done just fine, particularly when mature solutions give you so many things for free out of the box (RabbitMQ gives you a ton of stuff like a cool admin UI and routing that you would have to build in ZeroMQ).

packetized9y ago

We run an average of 14k logs/sec through a two-node RMQ cluster, with max sustained throughput in the ~50k range. You're spot on with the bottleneck being Elasticsearch, but the latest releases in the 2.x train have a lot of fine adjustments that have drastically improved our indexing rate, such that we actually index at a 50k/sec rate. Would be interested to hear about your ES cluster configuration.

agentgt9y ago

I'm embarrassed to say that at the present moment we currently don't use ES clustering but rather a monstrous powerful bare metal machine as we had issues with the cluster failing with some network issues we had with Rackspace.

BTW I didn't mean to denigrate Elastic Search (I assume that is why I'm getting downvoted.... a comment would help). We just haven't had the chance to upgrade it and properly configure it.

In fact Elastic has been pretty darn speedy as of lately particularly since we purge some of the data after 6 months (we still have permanent filesystem storage of logs of course).

2 more replies

gerakinis9y ago

Are you using HA functionality and also on disk backing? These two things bring down performance roughly 5-10x and are mostly required for situations that can't afford message loss. I still like the rabbitmq solution, it is my own, but i've found it takes more hardware than you are suggesting.

1 more reply

trimbo9y ago

> We run an average of 14k logs/sec through a two-node RMQ cluster

How many MB/s are you indexing?

benmccann9y ago

What are the new options that help improve indexing rate?

smetj9y ago

100K msg/s going through RabbitMQ ... Would you mind commenting on how your RabbitMQ is set up? Is it a cluster? Distributed queue(s)? Synced queues? What kind of exchange? How many queues do your messages end up in (because 1 queue is bound to 1 core)? Persistent queue? Lazy queue? What is the "Consumer utilisation" value when doing 100K msg/s?

I'd be really interested to hear how you can achieve such a throughput with RabbitMQ.
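If one queue really is limited to one core, the usual way to use more cores is to spread messages over several queues. A minimal sketch of that idea, assuming messages carry some routing key such as a host name; the `queue_for` helper and the `logs.N` queue names are hypothetical and not part of any RabbitMQ client API:

```python
import zlib


def queue_for(routing_key, num_queues):
    """Pick a queue index deterministically from the routing key."""
    # crc32 is stable across processes, unlike Python's built-in hash() on str
    return zlib.crc32(routing_key.encode('utf-8')) % num_queues


def queue_name(routing_key, num_queues=4):
    # Hypothetical naming scheme: logs.0, logs.1, ...
    return 'logs.%d' % queue_for(routing_key, num_queues)
```

The same key always lands in the same queue, so per-host ordering is kept while different hosts can be consumed in parallel. Changing `num_queues` reshuffles most keys, which is the trade-off of plain modulo hashing.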

agentgt9y ago

I'm not sure how much I can help because I didn't set up the RMQ cluster, so I don't know the configuration details, but I know it is fairly powerful (it's also partly why I can't be entirely critical of Auth0, because ZMQ is probably far less expensive infrastructure-wise).

I do know we use multiple queues and even exchanges (and I did not know about the one core to queue).

A simple googling shows though folks have achieved far greater throughput[1] than 100k (and by the way this wasn't sustained.. it was spikes).

[1]: https://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-m...

gerakinis9y ago

Heka is really stupid to configure, really fast, and now deprecated... I've used it for log tailing and metrics forwarding extensively and can't recommend it enough if you need to use amqps out.

If you don't need amqps out there are more modern, better supported projects.

brightball9y ago

Agree with everything you said, just curious is the GT in your name for Georgia Tech?

agentgt9y ago

Yep... I was the first agent@cc.gatech.edu circa 99-04 (I wonder who owns it now). Advance apologies if you knew me then... I was an inconsiderate a$$hole at the time. I'm still an a$$hole but less inconsiderate.

1 more reply

bachback9y ago· 6 in thread

With ZeroMQ I had the worst possible results and experience. Honestly much of what it claims is bogus. It is highly optimized for certain cases and utterly useless for distributed systems. Try and find out in PUB/SUB what the IP addresses of the subscribers are. Not possible. In many cases you will be much better off learning TCP/IP yourself. In the mentioned case you simply iterate over the vector of subscribers - much more powerful and the sane default. It seems at some point people confused internal networking solutions with the Internet.

vegabook9y ago

It is trivially easy for any node to broadcast its IP address to the whole network periodically (in my case every 2 seconds) using a separate thread and UDP. Using this technique I have a rock solid ZeroMQ topology that reconnects with a max downtime of about 2.5 seconds (because I broadcast every 2 seconds) for any single node failure. I agree that this functionality could be better implemented in zmq, but using this simple technique, the rest of zmq becomes amazing. In Python:

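  import socket
  import time
  # UDP socket that is allowed to send to the broadcast address
  cs = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  cs.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  cs.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
  while True:
      # the payload must be bytes; receivers learn the sender's IP from recvfrom()
      cs.sendto(b'Node ID', ('255.255.255.255', 54545))
      time.sleep(2)  # announce every 2 seconds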

Everybody listening on the same port 54545 will get these messages without knowing Node ID's IP address in advance; each message arrives together with the broadcaster's IP address.

  import socket
  s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  s.bind(('', 54545))
  # recvfrom() returns (payload, (sender_ip, sender_port))
  data, (ip, port) = s.recvfrom(1024)
  print(data, ip)

This is a very useful technique when using ZeroMQ generally, as you can broadcast services without knowing any IP address, so they can come up and down on new addresses if / when necessary.
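To turn those heartbeats into reconnect decisions, a listener needs to remember when it last heard from each address. A minimal sketch under stated assumptions: the `PeerTable` class, the 5-second timeout and the `listen` helper are hypothetical names for illustration, not part of ZeroMQ or of the setup described above.

```python
import socket
import time


class PeerTable:
    """Tracks when each peer address was last heard from."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}  # ip address -> timestamp of last heartbeat

    def heard_from(self, ip, now=None):
        self.last_seen[ip] = time.monotonic() if now is None else now

    def alive(self, now=None):
        now = time.monotonic() if now is None else now
        # Forget peers that have been silent for longer than the timeout.
        self.last_seen = {ip: t for ip, t in self.last_seen.items()
                          if now - t <= self.timeout}
        return sorted(self.last_seen)


def listen(table, port=54545):
    """Block forever, recording the source address of every heartbeat."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', port))
    while True:
        _payload, (ip, _port) = s.recvfrom(1024)
        table.heard_from(ip)
```

A node would run `listen` in a background thread and call `alive()` before deciding which peers to connect its ZeroMQ sockets to. A timeout of a bit more than two heartbeat intervals tolerates one lost UDP packet.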

mej109y ago

This would be great, but I don't think it works on AWS -- I don't think they support broadcast.

2 more replies

kchoudhu9y ago

The ZeroMQ documentation is pretty up front about the need for you to build those pieces yourself. It would appear you chose the wrong tool for your requirements.

bachback9y ago

ZeroMQ wants to be a neutral wrapper in any language, but in the end it's a C++ library enforcing C++ concepts. You can't map process concepts from C++ to other languages in a straightforward way (the OS and VM also sit in between). In the end it's just mapping programming logic to state machines. There are much better ways to do this and end up with something much more powerful (with first class meta-programming).

2 more replies

PieterH9y ago

If you try to use ZeroMQ to replace TCP with the same semantics, then no, it won't work.

bachback9y ago

How should ZeroMQ replace TCP? The Internet runs on TCP/IP (last time I checked). In private networks many problems of public networks don't appear. I'm aware that organisations use it for their internal systems. As soon as one bridges to the outside world, one is going to be hit with a problem.

1 more reply

buster9y ago· 5 in thread

To me it sounds like Kafka was not understood in full detail (maybe because of missing documentation or the high complexity) and they switched to a system they built themselves. Naturally they know in full detail what is going on and can set up the system as needed.

I am wondering if working on solving the actual problems with Kafka would have been the better route. I've never used Kafka and i find ZeroMQ great, but reading that their logging solution does drop log messages is a huge no-go for operations. How can you claim to run a serious business and say "babies will die" when you can't be sure to be able to find problems?

Because, when will you lose logs? Not in normal operation, but when weird things happen. When networking has a hiccup. When Load on the system is too high, so most likely when many people are using your service. Exactly when shit hits the fan. And you just made the decision that it's ok to drop log messages in such cases? That's not good.

I think you should either dive into Kafka/Zookeeper and fix your problems or switch to another logging solution. You should probably just drop that nonsense "streaming and real-time logs" requirement, live with a log delay of a few seconds, and build something really stable instead of building something inherently unstable. Honestly, just collecting syslogs on the core VM and sending them to a central server would have been the better solution. Better than looking into fancy real-time, streaming logs on a Sunday night because the system is having a breakdown and you can't even be sure that you are not missing essential logs.

AYBABTME9y ago

One has only two choices in those situations: drop logs or block receiving more logs. Given their availability requirements, I don't think that blocking is a viable choice. So dropping logs seems to be the only sane choice here. There's no other alternative really so I'm not sure about the consternation.

hinkley9y ago

I come to this discussion much like one would sit down at a bar and find their two friends are deep in a coding discussion they didn't hear the beginning of.

How do you end up with a system design that can generate logs so fast that you can't keep up? It seems to me that some fundamental element of capacity planning was missed long ago and we're trying to fix the symptoms and not the cause.

If I have a geographically distributed system, I'm going to have bandwidth and latency issues if I try this, sure. But why do I care? A request to the SF data center shouldn't involve the Munich data center. That is, if I care about response times, and if I don't care about that, then why do I care about instantaneous log availability?

I think sometimes we get so bored with the problems we have that we invent new things to get upset about. Or management does, which is always worse.

buster9y ago

What? No. You can just save logs on the disk and buffer them. Just dropping logs or blocking because some network resource is not available are both terrible choices, and that's not how logging worked for the last decades. Throwing away logs is a major step backwards.
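The "buffer on disk" option can be sketched as a sender that spools lines to a local file while the network is down and drains the file before sending anything new. This is a minimal illustration, not how syslog daemons or the article's system actually do it: `SpoolingLogger`, the `send` callable (assumed to raise `OSError` on failure) and `DEFAULT_SPOOL` are hypothetical, lines are assumed to contain no newlines, and the spool is unbounded.

```python
import os
import tempfile

# Hypothetical default location for the spool file.
DEFAULT_SPOOL = os.path.join(tempfile.gettempdir(), 'app-spool.log')


class SpoolingLogger:
    """Forward log lines; spool to a local file while the network is down."""

    def __init__(self, send, spool_path=DEFAULT_SPOOL):
        self.send = send              # callable(str); raises OSError on failure
        self.spool_path = spool_path

    def log(self, line):
        try:
            self.flush()              # keep ordering: drain the backlog first
            self.send(line)
        except OSError:
            with open(self.spool_path, 'a', encoding='utf-8') as f:
                f.write(line + '\n')

    def flush(self):
        """Resend spooled lines; keep whatever could not be sent."""
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path, encoding='utf-8') as f:
            pending = f.read().splitlines()
        sent = 0
        try:
            for line in pending:
                self.send(line)
                sent += 1
        finally:
            remaining = pending[sent:]
            if remaining:
                with open(self.spool_path, 'w', encoding='utf-8') as f:
                    f.write('\n'.join(remaining) + '\n')
            else:
                os.remove(self.spool_path)
```

The caller never blocks on the network and nothing is dropped as long as the local disk has room, which is the third choice next to "drop" and "block". A real deployment would also cap the spool size and decide what happens when that cap is hit.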

1 more reply

macns9y ago

> [..] i find ZeroMQ great, but reading that their logging solution does drop log messages is a huge no-go for operations.

ZMQ silently drops messages when subscribers fail or are not listening, or when buffers fill up, but as they describe later on under "access to historical logs", it's much easier to set up separate process/es for just that.

It seems that when shit hits the fan for this reason ZMQ really is a more reliable choice because it's more flexible.

buster9y ago

No. When you can't rely on your collected ZMQ logs and need to "access historical logs" by some other means, why use the ZMQ logs at all? You usually don't know that something was not logged.

Also, as he describes in the article, historical logs are scoped out and it is "likely" that they will develop something for those logs in the future. Again it looks like the plan is to use ZMQ and a subscriber to put those logs into logstash. That doesn't solve the problem i mentioned at all. ZMQ may still drop the logs! So, as far as i understand they don't have a plan for reliable logging. Even if they did, they'd have one reliable solution and one unreliable solution. The unreliable ZMQ based approach is probably neat and leads to fancy realtime dashboard stuff, but since it's not a reliable source of information it's not a good solution for operating a system where "babies will die".

1 more reply

htn9y ago· 5 in thread

FWIW, you can get Kafka packaged as a fully managed and HA service from https://aiven.io on AWS and also Azure, GCE and DigitalOcean.

But if Auth0 runs their entire operations on AWS, maybe Kinesis would have been a more natural transition.

mpd9y ago

Eh, Kinesis has some pretty significant trade-offs to know about if you are comparing it with Kafka (e.g. data retention time and write latency).

janczuktOP9y ago

We need an on-premise and cloud story, so cloud only solutions did not cut it for us.

PieterH9y ago

The article is a little old. How has the system run since you deployed it? Do you have any interesting figures?

1 more reply

ZenoArrow9y ago

I'm in a similar boat. I'm hoping to propose Kafka to help with some data replication and consolidation tasks, but it has to be both on-premise and as low maintenance as possible (low maintenance in the sense of the work local developers would do).

To anyone reading this with Kafka experience, do you have any tips/advice when it comes to maintaining a Kafka service?

1 more reply

abritishguy9y ago

Kinesis is very poor

Nimimi9y ago· 5 in thread

You can deploy Kafka using DC/IO and it takes care of HA for you. DC/IO is quickly becoming the go-to solution for database deployments. ArangoDB even recommends it as default.

Now about Kafka vs ZeroMQ: you want Kafka if you cannot tolerate the loss of even a single message. The append-only log with committed reader positions is a perfect fit for that.
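Why an append-only log with committed reader positions avoids message loss can be shown with a toy model. This is an in-memory sketch only: the `Log` class and its method names are made up for illustration and are not Kafka's client API, and real Kafka adds disk persistence, partitions and replication on top of the same idea.

```python
class Log:
    """In-memory append-only log with per-consumer committed offsets."""

    def __init__(self):
        self.records = []      # only ever appended to, never rewritten
        self.committed = {}    # consumer name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def poll(self, consumer, max_records=10):
        """Return (offset, record) pairs starting at the committed position."""
        start = self.committed.get(consumer, 0)
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, consumer, offset):
        # Everything up to and including `offset` has been processed.
        self.committed[consumer] = offset + 1
```

A consumer that crashes after `poll` but before `commit` simply receives the same records again on restart, so delivery is at-least-once rather than lossy. Each consumer has its own position, so a slow reader does not affect a fast one.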

bogomipz9y ago

>"DC/IO is quickly becoming the go-to solution for database deployments."

It is? Can you provide any evidence supporting this claim?

Mesos is mostly used to deploy stateless services.

oblio9y ago

Do you mean this? https://dcos.io/get-started/ aka DC/OS?

From what I can see it doesn't really support database deployments except for ArangoDB and Cassandra.

ryanmaclean9y ago

Riak and MySQL are in Universe, for example: https://github.com/mesosphere/universe/tree/version-3.x/repo...

Nimimi9y ago

Ok, database was the wrong term. Maybe "things that big data companies use" or something.

mej109y ago

I wish there was more information on how much work it takes to maintain a DC/OS setup. All the marketing makes it out to be the easiest thing in the world.

bdowling9y ago· 4 in thread

But why ZeroMQ and not nanomsg?

janczuktOP9y ago

The answer is rather simplistic and does not even scratch the surface of the drama surrounding zeromq/nanomsg.

I knew a big part of the reliability problems we were having was related to the distributed state that needed to be kept synchronized. I wanted to move to something simpler that did not rely on any durable, distributed state, while supporting the messaging patterns we required. ZeroMQ fit the bill.

While there were other implementations with similar properties, there is no reasonable way to compare them up front given that what makes the real difference at the end of the day is the behavior of the system at 2am one day after a prolonged stress run. As a startup one does not have resources to conduct an up front analysis of that sort. You just take a bet. If it does not pan out, you pivot. This is exactly what we have done with the move from Kafka to ZeroMQ in the first place.

Now that we've been using ZeroMQ for over a year and have been perfectly happy, there is no incentive to look elsewhere.

PieterH9y ago

See http://hintjens.com/blog:112 for my opinion on why nano isn't (wasn't, perhaps, as it seems to be doing better) a good choice.

kal31dic9y ago

With sincerely the greatest respect and admiration for what you achieved with ZeroMQ, Pieter, I think perhaps one might be a bit more nuanced when one isn't a neutral party. Full disclosure - I wrote a D wrapper, but I am not involved in nanomsg development and just a user.

There was some drama when the maintainer quit briefly before rejoining. Since then the gitter channel has been more active than I remember it being before. The mailing list is quiet it is true. Somebody just released a Rust version, and version 1.0.0 of nanomsg was indeed released.

You can see commit history here: https://github.com/nanomsg/nanomsg/commits/master/src

2 more replies

raarts9y ago

From the blog: "Crazy Idea: Clone nanomsg, move to zeromq organization, relicense as MPL, support ZMTP, only new socket types and expose CZMQ API."

Did that ever happen? I still like the idea behind nanomsg.

1 more reply

wcdolphin9y ago· 3 in thread

Did you ever try running 5 ZK's in the ensemble? 3 is the absolute minimum to survive a single machine failure. If you are having trouble with availability, it seems natural to increase your safety factor there.

I was surprised by the contrasting sense of importance of delivery guarantees in the article. At the start, losing a message was akin to the death of a child. At the end, shrug. Now every single machine failure (or even a ØMQ process restart) will lose you log messages stored in memory :(.

Glad to hear you found a solution that worked for you though! Would love to hear about difficulties you had with the new system, in particular adding brokers.
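The ensemble-size point follows from majority quorums: ZooKeeper stays available only while more than half of the ensemble is up. A small worked example of that arithmetic; the function names are just for illustration.

```python
def quorum(n):
    """Smallest majority of an n-node ensemble."""
    return n // 2 + 1


def tolerated_failures(n):
    """Nodes that can fail while a majority is still reachable."""
    return n - quorum(n)


for n in (3, 4, 5):
    print(n, 'nodes: quorum', quorum(n), 'tolerates', tolerated_failures(n))
```

Three nodes tolerate one failure, five tolerate two, and four tolerate no more than three do, which is why odd ensemble sizes are the norm.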

_qc3o9y ago

They said availability was "death of a child", not dropping log messages. The trade-off they've made here in terms of being available with some potential loss of visibility is the right one. The system overall is clearly simpler and simpler systems have simpler failure modes and so it is easier to add mitigation components on top that can recover from those failure modes to guarantee higher uptime.

I've never heard anyone say managing a production Kafka cluster was easy or simple. Well, anyone who has had to actually maintain such clusters hasn't said it anyway.

fauigerzigerk9y ago

>They said availability was "death of a child", not dropping log messages.

True, but it appears to me that availability problems and dropped log messages often have the same root cause - network issues.

So whenever they do have availability issues (and dying babies) they won't be able to investigate properly because log messages are being lost as well.

That's obviously a very general observation. It may well be that in their architecture availability issues are mostly caused by something unrelated to networking (e.g. the database).

1 more reply

kod9y ago

I've managed a production kafka cluster at my current gig for over two years. It has been easy and trouble free with the exception of one incident, which was ultimately our fault.

TheHydroImpulse9y ago· 2 in thread

FYI, Kafka doesn't need to fetch from disk every time as it caches the logs pretty aggressively, as long as you have enough memory.

Running Zk and Kafka on the same nodes is likely not the best thing.

im_down_w_otp9y ago

Why? I would think that, as long as there wasn't massive I/O contention between the two, that co-locating Kafka and Zookeeper on the same machines would mitigate a whole massive class of weird edge cases by removing one of the failure modes; the network boundary between the two critical components.

Though for my part I still don't understand why Zookeeper wasn't built as a library to add distributed strongly consistent coordination to software that needs/benefits from it rather than being an external service that needs to be connected to, and thus introduces a gnarly mess of new failure modes that make Zookeeper client behavior extremely critical and often fragile. Something that's more like a "libpaxos/libraft" (e.g. serf for Go-lang or riak_ensemble for Erlang) seems a lot more valuable. /shrug

TheHydroImpulse9y ago

But co-locating them won't actually remove a class of errors because Zk is not HA. The Kafka brokers need to communicate with the leader in the Zk cluster.

If we have K1,Z1 -- K2,Z2 -- K3,Z3 -- and one node goes down, you've now taken down both a broker and a Zk node. Remember, the brokers don't care about connecting to any Zk node, they want the leader. So you aren't gaining any more fault tolerance by co-locating them.

If there's a network partition between the leader Zk node and other nodes, the local Kafka broker won't actually be able to do much because the Zk cluster will elect a new leader, on another node, so again, you aren't gaining anything.

Moreover, you're now tying the scalability of Kafka with Zk. Zk doesn't scale linearly, so there's only so many nodes you may have in a cluster. Kafka, on the other hand, scales linearly. So if you're colocating them and you have to bump up Kafka, do you still start up Zk for those nodes (but they don't actually join the cluster)? You're now special casing and adding more edge cases.

k__9y ago· 2 in thread

I'm a total message queue noob. What are the usecases for them?

I used MQTT but only as a message bus.

zo19y ago

From my point of view, the main things behind message queues (not zMQ specifically) are guaranteed delivery, persistence, multiple-message atomicity, message passing/forwarding, and sometimes guaranteed message ordering. Other than that, all they do is facilitate communication between different actors.

Nothing magical/weird about it, just depends on whether or not you've got a nail to hammer with your MQ-hammer.

macns9y ago

You should have a look at http://zguide.zeromq.org/page:all

It's a great read, describes most scenarios well, and is easy to understand.

weitzj9y ago· 2 in thread

Did you look at nsq.io or NATS?

tjholowaychuk9y ago

+1 for NSQ, it's not a magic bullet in terms of scalability but you can get quite far. When I was at Segment we were pushing an easy 2-3B messages per day through it, if not more with message "amplification" internally.
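To compare a per-day figure with the per-second numbers quoted elsewhere in the thread (14k/s average, 50k/s spikes), it helps to convert it. A quick calculation, assuming traffic were spread evenly over the day, which real traffic is not:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(messages_per_day):
    """Average message rate implied by a daily total."""
    return messages_per_day / SECONDS_PER_DAY


low = per_second(2_000_000_000)   # roughly 23 thousand messages/second
high = per_second(3_000_000_000)  # roughly 35 thousand messages/second
```

So 2-3B messages per day averages out to somewhere around 23k to 35k messages per second, with peaks presumably higher.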

tapirl9y ago

Do you have any experience with NATS? It (with NATS Streaming) looks great.

thomaslee9y ago· 1 in thread

I used to be on a team responsible for a single small-ish Kafka cluster (between 6-12 nodes) doing non-trivial throughput on bare metal. Without commenting on whether ZeroMQ is the right alternative: I can understand being scared off. Our hand was forced such that we had to go the other way and understand what was going on in Kafka.

The kicker is that Kafka can be rock solid in terms of handling massive throughput and reliability when the wheels are well greased, but there are a lot of largely undocumented lessons to learn along the way RE: configuration and certain surprising behavior that can arise at scale (such as https://issues.apache.org/jira/browse/KAFKA-2063, which our team ran into maybe a year ago & is only being fixed now).

Symptoms of these issues can cause additional knock-on effects with respect to things like leader election (we wound up with a "zombie leader" in our cluster that caused all sorts of bizarre problems) and graceful shutdowns.

Add to that the fact the software is still very much under active development (sporadic partition replica drops after an upgrade from 0.8.1 to 0.8.2; we had to apply some small but crucial patches from Uber's fork) & that it needs a certain level of operational maturity to monitor it all ... it's easy to get nervous about what the next "surprise" will be.

Having said all that, I'd use Kafka again in a heartbeat for those high volume use cases where reliability matters. Not sure I'd advise others without similar operational experience to do the same for anything mission critical, though -- unless you like stress. That stress is why Confluent is in business. :)

BrandonBradley9y ago

I can attest to 'getting nervous about what the next surprise will be' with Kafka. And I'm only dealing with a single node right now.

Kafka and Confluent Platform are very much still works in progress. I had to patch Kafka Connect HDFS connector because a fix I needed was left out of the last release. Be prepared to do something similar with any of Kafka's components.

wanderr9y ago· 1 in thread

I came up with a very different solution for real time access to logs: tail them to slack. It's not an aggregation solution and doesn't work well if you have chatty logs with nothing to filter on, but if you just want to be notified when things are happening in the logs it's pretty nice and doesn't need any infrastructure.

http://wanderr.com/jay/tail-error-logs-to-slack-for-fun-and-...

wanderr9y ago

why the downvote? the article says "Real-time access to server-side logs is what makes backend development palatable in the era of cloud computing. As a developer you want to be able to get real-time feedback from your server side code deployed to the actual execution environment in the cloud, especially during active development or staging." and this is another solution that provides that.

asasidh9y ago· 1 in thread

So you used Kafka for something that should have been handled by MQTT or ZeroMQ in the first place?

cbsmith9y ago

MQTT is just a protocol, so not sure how that helps.

0MQ doesn't sound like it is the right solution either, but yeah... often you pick the wrong tool and learn something in the process.

StreamBright9y ago

The author correctly points out that he is comparing apples to oranges.

Kafka gives you features that certain systems cannot live without, like on-disk persistence (saved my life a couple of times) and topics. Filtering messages on the client side like ZeroMQ does is not an option in many cases; just think about security. I think Kafka has a long way to go before it can be used as a general message queue (many features, like visibility timeout, are not there yet), but if you can manage Zookeeper and have the means to work with it (somebody understands it and knows its quirks), it can provide a reliable platform for distributing a large number of messages with low latency and high throughput, just like it does at LinkedIn.

_halgari9y ago

ZMQ's default behavior (and in some cases only behavior) of dropping new messages when buffers are full, made it a no-go for my client. We ended up switching away from ZMQ to a more traditional durable queue and ended up saving a ton of code complexity and got a lot of reliability in the process. Having now researched it I can't think of a reason I'd ever use ZMQ again. I'll either use a durable queue when I care about message delivery, or something much more traditional when I don't.
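The drop-when-full behavior can be illustrated without ZeroMQ at all. The following is a plain-Python simulation, not pyzmq code: `BoundedBuffer` is a hypothetical stand-in for a socket queue whose limit plays the role of ZeroMQ's high-water mark (the `ZMQ_SNDHWM` / `ZMQ_RCVHWM` socket options).

```python
from collections import deque


class BoundedBuffer:
    """Queue that discards new messages once it holds `high_water_mark` items."""

    def __init__(self, high_water_mark):
        self.hwm = high_water_mark
        self.queue = deque()
        self.dropped = 0

    def publish(self, msg):
        if len(self.queue) >= self.hwm:
            self.dropped += 1   # the newest message is discarded
            return False
        self.queue.append(msg)
        return True

    def consume(self):
        """Return the oldest buffered message, or None if the buffer is empty."""
        return self.queue.popleft() if self.queue else None
```

The simulation returns `False` and counts drops so the effect is visible; a publisher on a real PUB socket gets no such signal, which is what makes the loss silent. Raising the limit only delays the point at which a slow consumer starts losing messages.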

markpapadakis9y ago

Maybe TANK ( https://github.com/phaistos-networks/TANK ) would have been a good alternative there. No feature parity with Kafka, but setting it up is a matter of running one binary and creating a few topics, and it is faster than Kafka for produce/consume operations. (disclosure: I am involved in its development).

siscia9y ago

Did you consider MQTT? Sounds to me like a more natural choice.

jpgvm9y ago

Probably should have been running ZK and Kafka queues separate to CoreOS/container shenanigans.

If deployed using the Netflix co-processes both are very durable.

manigandham9y ago

Why don't all these companies ever just use real enterprise software?

There are about a dozen message systems out there that will handle much more than Kafka with minimal or no operational overhead while supporting everything they need.

jvoorhis9y ago

2015

efangs9y ago

Anyone use collectd + rrd for this purpose? Still trying to understand at what level it's worth moving to something else.

j / k navigate · click thread line to collapse

105 comments

68 comments · 21 top-level

agentgt9y ago· 10 in thread

And our log messages are ridiculously big at times (15k to as big as 50k).

packetized9y ago

agentgt9y ago

BTW I didn't mean to denigrate Elastic Search (I assume that is why I'm getting downvoted.... a comment would help). We just haven't had the chance to upgrade it and properly configure it.

In fact Elastic has been pretty darn speedy as of lately particularly since we purge some of the data after 6 months (we still have permanent filesystem storage of logs of course).

2 more replies

gerakinis9y ago

1 more reply

trimbo9y ago

> We run an average of 14k logs/sec through a two-node RMQ cluster

How many MB/s are you indexing?

benmccann9y ago

What are the new options that help improve indexing rate?

smetj9y ago

I'd be really interested to hear how you can achieve such a thoughput with rabbitmq

agentgt9y ago

I do know we use multiple queues and even exchanges (and I did not know about the one core to queue).

A simple googling shows though folks have achieved far greater throughput[1] than 100k (and by the way this wasn't sustained.. it was spikes).

[1]: https://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-m...

gerakinis9y ago

Heka is really stupid to configure, really fast, and now deprecated... I've used it for log tailing and metrics forwarding extensively and can't recommend it enough if you need to use amqps out.

If you don't need amqps out there are more modern, better supported projects.

brightball9y ago

Agree with everything you said, just curious is the GT in your name for Georgia Tech?

agentgt9y ago

1 more reply

bachback9y ago· 6 in thread

vegabook9y ago

  import socket
  import time
  cs = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  cs.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
  cs.setsockopt(SOL_SOCKET, SO_BROADCAST, 1)
  while True:
      cs.sendto('Node ID', ('255.255.255.255', 54545))
      time.sleep(4)

Everybody listening on the same on port 54545 without knowing Node ID's IP address will get these messages which includes the broadcaster IP address.

  import socket
  s=socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  s.bind(('',54545))
  m=s.recvfrom(1024)
  print m[0]

This is a very useful technique when using ZeroMQ generally as you can broadcast services without knowning any IP address so they can come up and down on new addresses if / when necessary.

mej109y ago

This would be great, but I don't think it works on AWS -- I don't think they support broadcast.

2 more replies

kchoudhu9y ago

The ZeroMQ documentation is pretty up front about the need for you to build those pieces yourself. It would appear you chose the wrong tool for your requirements.

bachback9y ago

2 more replies

PieterH9y ago

If you try to use ZeroMQ to replace TCP with the same semantics, then no, it won't work.

bachback9y ago

1 more reply

buster9y ago· 5 in thread

AYBABTME9y ago

hinkley9y ago

I come to this discussion much like one would sit down at a bar and find their two friends are deep in a coding discussion they didn't hear the beginning of.

I think sometimes we get so bored with the problems we have that we invent new things to get upset about. Or management does, which is always worse.

buster9y ago

1 more reply

macns9y ago

[..] i find ZeroMQ great, but reading that their logging solution does drop log messages is a huge no-go for operations.

It seems that when shit hits the fan for this reason ZMQ really is a more reliable choice because it's more flexible.

buster9y ago

No. When you can't rely on your collected ZMQ logs and need to "access historical logs" by some other means, why use the ZMQ logs at all? You usually don't know that something was not logged.

1 more reply

htn9y ago· 5 in thread

FWIW, you can get Kafka packaged as a fully managed and HA service from https://aiven.io on AWS and also Azure, GCE and DigitalOcean.

But if the Auth0 runs their entire operations on AWS, maybe Kinesis would have been a more natural transition.

mpd9y ago

Eh, Kinesis has some pretty significant trade-offs to know about if you are comparing it with Kafka (e.g. data retention time and write latency).

janczuktOP9y ago

We need an on-premise and cloud story, so cloud only solutions did not cut it for us.

PieterH9y ago

The article is a little old. How has the system run since you deployed it? Do you have any interesting figures?

ZenoArrow9y ago

To anyone reading this with Kafka experience, do you have any tips/advice when it comes to maintaining a Kafka service?

abritishguy9y ago

Kinesis is very poor

Nimimi9y ago· 5 in thread

You can deploy Kafka using DC/IO and it takes care of HA for you. DC/IO is quickly becoming the go-to solution for database deployments. ArangoDB even recommends it as default.

Now about Kafka vs ZeroMQ: you want Kafka if you cannot tolerate the loss of even a single message. The append-only log with committed reader positions is a perfect fit for that.
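
To make the "append-only log with committed reader positions" idea concrete, here is a toy in-memory model. It is not Kafka's API, and every name in it is hypothetical; it only illustrates why a reader that crashes before committing re-reads records instead of losing them.

```python
class AppendOnlyLog:
    """Toy model: records are only ever appended, never modified or removed."""

    def __init__(self):
        self._records = []
        self._committed = {}  # consumer name -> offset of next record to read

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return (start_offset, records) from the consumer's committed position."""
        start = self._committed.get(consumer, 0)
        return start, self._records[start:start + max_records]

    def commit(self, consumer, next_offset):
        """Record that everything before `next_offset` has been processed."""
        self._committed[consumer] = next_offset


log = AppendOnlyLog()
for line in ["a", "b", "c"]:
    log.append(line)

start, batch = log.poll("indexer")        # reads from offset 0
log.commit("indexer", start + 2)          # only "a" and "b" were processed
start, batch = log.poll("indexer")        # resumes at offset 2
```

In this model, a consumer that polls and then dies before calling `commit` gets the same records again after a restart. That gives at-least-once delivery: duplicates are possible, but nothing is skipped as long as the log still retains the records.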

bogomipz9y ago

>"DC/IO is quickly becoming the go-to solution for database deployments."

It is? Can you provide any evidence supporting this claim?

Mesos is mostly used to deploy stateless services.

oblio9y ago

Do you mean this? https://dcos.io/get-started/ aka DC/OS?

From what I can see it doesn't really support database deployments except for ArangoDB and Cassandra.

ryanmaclean9y ago

Riak and MySQL are in Universe, for example: https://github.com/mesosphere/universe/tree/version-3.x/repo...

Nimimi9y ago

Ok, database was the wrong term. Maybe "things that big data companies use" or something.

mej109y ago

I wish there was more information on how much work it takes to maintain a DC/OS setup. All the marketing makes it out to be the easiest thing in the world.

bdowling9y ago· 4 in thread

But why ZeroMQ and not nanomsg?

janczuktOP9y ago

The answer is rather simplistic and does not even scratch the surface of the drama surrounding zeromq/nanomsg.

Now that we've been using ZeroMQ for over a year and have been perfectly happy, there is no incentive to look elsewhere.

PieterH9y ago

See http://hintjens.com/blog:112 for my opinion on why nano isn't (wasn't, perhaps, as it seems to be doing better) a good choice.

kal31dic9y ago

You can see commit history here: https://github.com/nanomsg/nanomsg/commits/master/src

raarts9y ago

From the blog: "Crazy Idea: Clone nanomsg, move to zeromq organization, relicense as MPL, support ZMTP, only new socket types and expose CZMQ API."

Did that ever happen? I still like the idea behind nanomsg.

wcdolphin9y ago· 3 in thread

Glad to hear you found a solution that worked for you though! Would love to hear about difficulties you had with the new system, in particular adding brokers.

_qc3o9y ago

I've never heard anyone say managing a production Kafka cluster was easy or simple. Well, anyone who has had to actually maintain such clusters hasn't said it anyway.

fauigerzigerk9y ago

>They said availability was "death of a child", not dropping log messages.

True, but it appears to me that availability problems and dropped log messages often have the same root cause - network issues.

So whenever they do have availability issues (and dying babies) they won't be able to investigate properly because log messages are being lost as well.

That's obviously a very general observation. It may well be that in their architecture availability issues are mostly caused by something unrelated to networking (e.g. the database).

kod9y ago

I've managed a production Kafka cluster at my current gig for over two years. It has been easy and trouble-free with the exception of one incident, which was ultimately our fault.

TheHydroImpulse9y ago· 2 in thread

FYI, Kafka doesn't need to fetch from disk every time as it caches the logs pretty aggressively, as long as you have enough memory.

Running Zk and Kafka on the same nodes is likely not the best thing.

im_down_w_otp9y ago

TheHydroImpulse9y ago

But co-locating them won't actually remove a class of errors because Zk is not HA. The Kafka brokers need to communicate with the leader in the Zk cluster.

k__9y ago· 2 in thread

I'm a total message queue noob. What are the use cases for them?

I used MQTT but only as a message bus.

zo19y ago

Nothing magical/weird about it, just depends on whether or not you've got a nail to hammer with your MQ-hammer.

macns9y ago

You should have a look at http://zguide.zeromq.org/page:all

It's a great read that describes most scenarios well and is easy to understand.

weitzj9y ago· 2 in thread

Did you look at nsq.io or NATS?

tjholowaychuk9y ago

tapirl9y ago

Do you have any experience with NATS? It (with NATS Streaming) looks great.

thomaslee9y ago· 1 in thread

BrandonBradley9y ago

I can attest to 'getting nervous about what the next surprise will be' with Kafka. And I'm only dealing with a single node right now.

wanderr9y ago· 1 in thread

http://wanderr.com/jay/tail-error-logs-to-slack-for-fun-and-...

wanderr9y ago

asasidh9y ago· 1 in thread

So you used Kafka for something that should have been handled by MQTT or ZeroMQ in the first place?

cbsmith9y ago

MQTT is just a protocol, so not sure how that helps.

0MQ doesn't sound like it is the right solution either, but yeah... often you pick the wrong tool and learn something in the process.

StreamBright9y ago

The author correctly points out that he is comparing apples to oranges.

_halgari9y ago

markpapadakis9y ago

siscia9y ago

Did you consider MQTT? It sounds to me like a more natural choice.

jpgvm9y ago

Probably should have been running ZK and Kafka queues separate to CoreOS/container shenanigans.

If deployed using the Netflix co-processes both are very durable.

manigandham9y ago

Why don't all these companies ever just use real enterprise software?

There are about a dozen message systems out there that will handle much more than Kafka with minimal or no operational overhead while supporting everything they need.

jvoorhis9y ago

2015

efangs9y ago

Anyone use collectd + rrd for this purpose? Still trying to understand at what point it's worth moving to something else.
