Flink vs Spark vs Storm vs Kafka vs Samza vs Apex
How do they compare? How would you choose which one to use?
1. Flink - Focused on stateful stream processing.
2. Spark - Focused on batch processing. Can be used for continuous streams, but approaches them as "micro-batches".
3. Kafka - A message queue system (for all practical purposes). Has an optional stream processing add-on for basic needs.
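To make the difference between items 1 and 2 concrete, here is a minimal plain-Python sketch. It is not Flink or Spark code: the function names, the `(key, value)` event shape, and the batch size are all assumptions made up for illustration. Real micro-batch engines typically cut batches by time interval; a fixed count is used here only to keep the example deterministic.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, int]  # (key, value); hypothetical event shape


def per_event_counts(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Event-at-a-time style: update keyed state and emit after every event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield dict(totals)


def micro_batch_counts(events: Iterable[Event], batch_size: int) -> Iterator[Dict[str, int]]:
    """Micro-batch style: buffer events, then process each buffer as a small batch."""
    totals: Dict[str, int] = defaultdict(int)
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            for key, value in batch:
                totals[key] += value
            yield dict(totals)
            batch = []
    if batch:  # flush the final partial batch
        for key, value in batch:
            totals[key] += value
        yield dict(totals)


events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
streamed = list(per_event_counts(events))                 # one result per event
batched = list(micro_batch_counts(events, batch_size=2))  # one result per batch
```

Both versions end with the same totals. What differs is how often a result becomes visible: the per-event version emits after each of the five events, while the micro-batch version emits only when a batch closes.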
Separate use cases and strengths aside, it's worth calling out that all of these products are primarily backed by completely different companies. Apache is a consortium made of many companies, and serves as common branding for "community editions" of their "enterprise edition" products. There can be quite a lot of overlap between sponsored products in this consortium.
The Apache Software Foundation is not a consortium made of many companies but a single non-profit that provides organizational support for open source projects. Some of those projects have contributors who are paid by other companies to work on them, and some have only volunteer contributors.
I am genuinely interested, as I use KStreams a lot, but the engineering discipline in the API leaves a lot to be desired, and I'd be more than happy to switch APIs if Flink is that much better.
There's also Apache Beam, which is an API for streaming, and has Flink and Apex execution engines. Google's Cloud Dataflow is another implementation of Apache Beam.
As to which one to choose, you need to evaluate them; there are no simple answers. If you have Hadoop already, then Apex may be a better fit than Flink; OTOH, if you do Akka stuff already, then Flink might integrate better with your stack. If you have more batch than streaming use cases, maybe you want Spark. Etc.