One thing is odd, though: there is no mention of disk space at all, only a configurable retention time. One of Kafka's best features is its use of disk to store large amounts of messages; you are not RAM bound. Heroku seems to only let you set retention times? This could be awesome if they are giving you "unlimited" disk space, but it could also be a beta oversight. Interested to see how this progresses.
> Apache Kafka is a distributed commit log for fast, fault-tolerant communication between producers and consumers using message based topics. Kafka provides the messaging backbone for building a new generation of distributed applications capable of handling billions of events and millions of transactions
Can anyone translate this into meaningful English for me?
Very useful if, say, you have some real-world event and dozens of different microservices need to do something about that event, independently.
You can also just use it for logging.
If you're actually interested in Kafka, just read the documentation, it's quite good.
Biggest competitors of Kafka are RabbitMQ and Amazon SQS.
Biggest competitors would be AWS Kinesis, Azure EventHubs and Google PubSub.
https://medium.com/salesforce-engineering/the-architecture-f...
Other than that, OpenShift is nice though, I agree.
[1] https://openshift.uservoice.com/forums/258655-ideas/suggesti...
I explained how we moved a bunch of smaller sites to S3, reluctantly because I really like having a unified platform for all our sites. But even though (or perhaps precisely because) we are spending thousands of dollars a month with Heroku, I find the $20/month SSL charge insulting. SSL is not an option anymore.
The good news is, the sales rep said this has come up a lot, they hear us, and to "stay tuned".
SSL is a pain point, though I do empathize with them; I think they're doing something expensive for that. What I do is use AWS CloudFront and ACM for a free cert and a site speedup. If they are personal projects, the CloudFront bill ought to be in the low few dollars anyway.
If you have, say, 10 apps, then Heroku costs 10 × $7, but you might still only have needed 1-3 servers depending on the memory use of the apps, etc., so Heroku loses on cost.
Naturally I got a total mix: quite a few on Heroku's classic or new free plan, some on their hobby plan, some on AWS, some on Docker Cloud, most proxied behind an SSL certificate running on AWS..... (https://flurdy.com/docs/letsencrypt/nginx.html)
In my experience, Kafka is a solid system when you work in its wheelhouse, which is a relatively static set of servers/topics that you add to slowly and deliberately. If you can't use something like Kinesis, then it's a good choice.
In Kafka, programmatic administration is generally an afterthought. There are APIs for doing things, but they generally involve directly modifying znodes. Simple things don't work or have bugs: deleting topics didn't work at all until 0.8.2, and even now it has bugs. We've seen cases where, if you delete a topic while an ISR is shrinking or expanding, your cluster can get into an unrecoverable state where you have to reboot everything, and even then it doesn't always get fixed. Most of the time you are expected to use scripts to modify everything (there's a wide variety of systems out there that try to build management tooling on top of Kafka).
Its dependency on ZooKeeper is a pain and limits the scalability of topic/partition counts. Rebalancing topics will reset retention periods, because Kafka uses the last-modified timestamp of the segment files to check their age, meaning if you rebalance often, you need extra disk space lying around. ZK has some bugs in its DNS handling, which affect Kafka if you try to use DNS.
It has throttling, but it's by client ID. What you'd like, in some cases, is to say that a node has X throughput, have the broker be able to somewhat guarantee that throughput, and create backpressure when clients are overwhelming it. Otherwise your latency can go through the roof. You also want replication to play nicely with client requests, and it doesn't (if you add a new broker and move a bunch of partitions to it, you'll light up all your other brokers while it replicates, and cause timeouts).
Its replication story can cause issues when network partitions come into play.
It's highly configurable, like many Apache projects, which is a blessing and a curse, as your team has to know all the knobs on the consumer, producer, and broker sides.
The alternative, if you are at a company with the resources to do so (mine is), is to build something that fits your use case better than Kafka, or to use a hosted service like this one, or Kinesis.
The only downside with Google Pub/Sub can be latency (which I'm working on fixing by building a gRPC driver), but Kafka has proven too complicated to maintain in-house. If Heroku can provide the speed without the ops overhead, it'll be good competition for Google's option.
Also want to note that Jay Kreps, who helped build Kafka at LinkedIn, is now behind http://www.confluent.io/ which is like a better/enterprise version of Kafka.
When creating a Kinesis consumer, I can specify where I want to start reading a stream:
a) TRIM_HORIZON: the earliest events in the stream which haven't yet been expired, aka "trimmed";
b) LATEST: the only capability Cloud Pub/Sub offers;
c) AT_SEQUENCE_NUMBER {x}: from the event in the stream with the given offset ID;
d) AFTER_SEQUENCE_NUMBER {x}: the event immediately after c);
e) AT_TIMESTAMP: read records from an arbitrary point in time.
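To make the semantics of those iterator types concrete, here is a toy, pure-Python sketch of how each one picks a starting position in a list of records (this models the behavior described above; it is not the boto3 API, and the record values are made up):

```python
import bisect

def start_index(records, iterator_type, seq=None, ts=None):
    """Pick a starting index in a list of (sequence_number, timestamp, data)
    records, mimicking Kinesis shard-iterator semantics (toy model)."""
    seqs = [r[0] for r in records]
    if iterator_type == "TRIM_HORIZON":       # oldest record not yet trimmed
        return 0
    if iterator_type == "LATEST":             # only records published from now on
        return len(records)
    if iterator_type == "AT_SEQUENCE_NUMBER":
        return seqs.index(seq)
    if iterator_type == "AFTER_SEQUENCE_NUMBER":
        return seqs.index(seq) + 1
    if iterator_type == "AT_TIMESTAMP":       # first record at or after ts
        return bisect.bisect_left([r[1] for r in records], ts)
    raise ValueError(iterator_type)

# Hypothetical stream contents: (sequence number, timestamp, payload).
records = [("s1", 100, "a"), ("s2", 200, "b"), ("s3", 300, "c")]
```

Cloud Pub/Sub, as described above, effectively only gives you the LATEST behavior: the subscription's sync point.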
A Kinesis stream (like a Kafka topic) is a very special form of database - it exists independently of any consumers. By contrast, with Google Cloud Pub/Sub [1]:
> When you create a subscription, the system establishes a sync point. That is, your subscriber is guaranteed to receive any message published after this point.
[1] https://cloud.google.com/pubsub/subscriber
So the stream is not a first class entity in Cloud Pub/Sub - it's just a consumer-tied message queue.
I think the only way to replay events in Google Cloud Pub/Sub is to create multiple subscriptions in advance, right after topic creation. But then I think you need to pay for the storage and the event-traversal requests.
The biggest advantage of Kafka is that the whole Heroku marketplace all of a sudden becomes "plug and play".
Essentially, it's the "backend data" equivalent of what Segment does for "frontend data".
Example: what's the benefit of having a graph DB service in the marketplace if most people don't want to / can't invest the engineering to keep the data in (realtime) sync?
With Kafka they can establish standards that all partners can adapt to; they could simply offer piping of all Heroku Postgres/Redis changes.
A quote from the article: "At the end of the run, Kafka typically acknowledges 98–100% of writes. However, half of those writes (all those made during the partition) are lost."
My operating theory is that the people who would really make use of something like this have grown beyond managed offerings and would take it in house. For smaller operations Redis is more than enough for pub/sub. Ditto for SQS for externally triggered eventing.
I didn't find that to be so at my last job, one of those smaller operations.
With Redis you're forced to pick between two severely constrained options:
1. Use PUBLISH/SUBSCRIBE. This is nice if you want to have several listeners all receive the same message. But if a listener is down, there's no way for it to recover a message that it missed. If there is no one listening, messages are just dropped.
2. Use LPUSH/BRPOP. This is nice if you want to have several workers all pulling from the same queue, but it isn't sufficient if you want to have several queues streaming from the same topic. (E.g. one listener is responsible for syncing to Elasticsearch and another one is syncing to your analytics DB.)
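The two constraints above can be shown with a toy in-memory model (pure Python, not redis-py; the class and listener names are made up for illustration):

```python
from collections import deque

class ToyPubSub:
    """Fire-and-forget fan-out, like Redis PUBLISH/SUBSCRIBE: messages sent
    while a listener is absent are simply lost."""
    def __init__(self):
        self.listeners = {}           # listener name -> received messages
    def subscribe(self, name):
        self.listeners[name] = []
    def publish(self, msg):
        for inbox in self.listeners.values():
            inbox.append(msg)         # no listeners -> msg goes nowhere

class ToyWorkQueue:
    """Competing consumers, like LPUSH/BRPOP: each message reaches exactly
    one worker, so two independent downstream systems can't both see it."""
    def __init__(self):
        self.q = deque()
    def push(self, msg):
        self.q.appendleft(msg)        # LPUSH
    def pop(self):
        return self.q.pop()           # BRPOP (non-blocking here)

ps = ToyPubSub()
ps.publish("missed")                  # nobody subscribed yet: dropped forever
ps.subscribe("indexer")
ps.publish("seen")

wq = ToyWorkQueue()
wq.push("job1")
taken = wq.pop()                      # delivered to exactly one worker
```

Kafka sidesteps both problems because the log is persistent and each consumer group tracks its own offset.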
I strongly prefer RabbitMQ. Its model of exchanges and queues supports mixing and matching these semantics much more flexibly.
However, RabbitMQ is also pretty fragile and terrible at scaling. NATS.io is another system that has got the messaging right and is adding persistence soon.
MessageHub's lead engineer Oliver Deakin gave a talk at Unified Log London recently where he explained how MessageHub is architected under the hood; it was super interesting. Slides available from here: http://www.meetup.com/unified-log-london/events/229693782/
SQS, Kinesis, and other proprietary ones not so much. You can insulate your code base but if you're really going to leverage the ecosystem of those services then you're going to be stuck there. That's why I find something like this interesting. The "out" is there so it makes it easier to accept getting in.
https://azure.microsoft.com/en-us/services/event-hubs/
They also have simpler Queues and Service Bus for RPC/lightweight message handling.
Basically, if you want to get data from one place to another and care about order, Kafka is a good solution. It acts as a middleman between services.
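A note on that ordering guarantee: Kafka preserves order within a partition, and records with the same key are routed to the same partition. A toy sketch of key-based partitioning (the real default partitioner uses murmur2; CRC32 here is just a stand-in, and the events are made up):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Route a record to a partition by hashing its key (toy stand-in for
    Kafka's default key-based partitioner)."""
    return zlib.crc32(key.encode()) % num_partitions

# All events for one user land on one partition, so their relative order
# is preserved even though different users' events interleave freely.
partitions = [[] for _ in range(4)]
for key, event in [("alice", "login"), ("bob", "login"),
                   ("alice", "purchase"), ("alice", "logout")]:
    partitions[partition_for(key, 4)].append((key, event))

alice_events = [e for p in partitions for (k, e) in p if k == "alice"]
```

So "caring about order" usually means choosing a key such that everything that must stay ordered shares that key.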
Redis is a database; Kafka is a data logging system built for scale and throughput. Event processing of any kind (stocks, ad impressions, e-commerce purchases) is a great fit. It's also good as a message queue unless you need ultra-low-latency RPC.
Hey, what is Kafka?
"It's a distributed logging system, not a message queue"
Ok, what's the use case?
(describes a case where it's used as a message queue)
Kafka is a distributed logging system that can ingest large amounts of data straight to disk, then allows multiple consumers to read that data through a simple abstraction of topics and partitions. Consumers maintain their own position of where they last read up to (or re-read things if they want), and everything is sequential I/O, which creates very high throughput.
Lots of places also use it just as a message queue, some places for example write time series metrics to Kafka for monitoring.
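The consumer-tracked-offset model described above can be sketched in a few lines of plain Python (a toy model of one topic-partition, not a Kafka client; real consumers rewind with a seek()-style call):

```python
class ToyLog:
    """Toy model of one Kafka topic-partition: an append-only record list.
    The log never tracks readers; consumers keep their own offsets."""
    def __init__(self):
        self.records = []
    def append(self, record):
        self.records.append(record)

class ToyConsumer:
    def __init__(self, log, offset=0):
        self.log, self.offset = log, offset
    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)   # advance our own position only
        return batch
    def seek(self, offset):                   # rewind to replay old records
        self.offset = offset

log = ToyLog()
for r in ["e1", "e2", "e3"]:
    log.append(r)

a, b = ToyConsumer(log), ToyConsumer(log)
first = a.poll()        # a reads everything
a.seek(1)
replayed = a.poll()     # a re-reads from offset 1
fresh = b.poll()        # b's position is independent of a's
```

Because reading never mutates the log, any number of consumers can replay the same data independently, which is exactly what makes the "database for events" framing fit.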
It's better to think of Kafka as a database for events, not as a transport mechanism for those events.
As for being bloated: Kafka lives in a very empty space, in that it supports fully ordered events to all consumers (and it has good HA options). The only other tool I've come across that gives you the same data guarantees is Kinesis, and it requires AWS.
I've found that, yes, Kafka is complex, but it's complex because it's solving a complex problem, not because it's bloated.
That said, if you want a non-ordered message queue, use NSQ instead of Kafka.