Show HN: Jet – in-memory, fault-tolerant, distributed stream processing

(github.com)

129 points · cangencer · 5y ago · 55 comments

jmnicolas · 5y ago · 7 in thread

>High-throughput, large-state stream processing. For example, tracking GPS locations of millions of users, inferring their velocity vectors.

It baffles me they're so casual about it ...

rswail · 5y ago

Inferring velocity vectors would be very useful for analyzing traffic flows, impacts of lane widening/reducing, signal timing, ML for adaptive traffic management, etc.

None of those things are nefarious, and they don't necessarily provide additional knowledge, as long as care is taken to fully anonymize the data, fuzz the start/stop/end locations of trips, and avoid associating trips with one another.

People agree to provide this information to services like Waze etc for exactly these tasks.

luminadiffusion · 5y ago

Hmmm... I think there are only a handful of nefarious uses for this technology but a plethora of real-world applications. Almost all of the nefarious uses revolve around GPS and individuals. If you take that out, the set of applications is enormous-1.

It’s strange to me that people read something like this and infer the absolute inverse of the actual situation. That is definitely a “thinking fast” reaction.

kgraves · 5y ago

Which is still a disgusting and unethical use of technology.

haxen · 5y ago

This statement comes from our benchmarking work:

https://jet-start.sh/blog/2020/06/09/jdk-gc-benchmarks-part1

The point is that Jet can track several million distinct keys, even on a single machine, and finding velocity vectors boils down to a sliding-window linear regression over two floating-point variables.

If your concern is why you would specifically want to track locations, the answer is that there are plenty of location-based apps that track locations with the user's consent.
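
To make the sliding-window regression concrete, here is a minimal sketch in plain Python. It is not Jet code and not taken from the benchmark: the names `slope` and `VelocityTracker`, the `(t, x, y)` sample layout, and the time-based eviction rule are all assumptions made for illustration. Each coordinate is regressed against time over the most recent window, and the two slopes form the velocity vector.

```python
from collections import deque


def slope(points):
    """Least-squares slope of value against time for (t, value) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return 0.0  # a single timestamp carries no velocity information
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return cov / var_t


class VelocityTracker:
    """Hypothetical helper: keeps one sliding window of samples per key."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.windows = {}  # user id -> deque of (t, x, y) samples

    def update(self, user_id, t, x, y):
        window = self.windows.setdefault(user_id, deque())
        window.append((t, x, y))
        # Evict samples that fell out of the window.
        while window and window[0][0] < t - self.window_seconds:
            window.popleft()
        vx = slope([(ts, px) for ts, px, _ in window])
        vy = slope([(ts, py) for ts, _, py in window])
        return vx, vy
```

Feeding one user's samples through `update` in timestamp order returns the current `(vx, vy)` estimate after each sample; the per-key state is just the window contents, which is what makes the number of distinct keys the main scaling factor.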

jmnicolas · 5y ago

Yes, my concern is about how casual you sound about tracking millions of GPS locations.

By user consent, you mean someone clicked a button without thinking to get to the app?

netgusto · 5y ago

See no evil :) There could be non-shady reasons to do this.

Besides, I think this statement is just meant to give a sense of the kind of processing that can be done, and the scale it can reach.

onion2k · 5y ago

> There could be non-shady reasons to do this.

I can't think of one.

loremipsium · 5y ago · 4 in thread

spark, storm, flink, beam, hazelcast... and then there are all the vendor-locked choices: confluent, kinesis, and azure probably has something in that space too

The whole cloud computing space has got me confused. I don't know what horse to bet on and don't have the time to get familiar with every new framework. Is this the new javascript world? If so, I'd like to skip the next couple of years until we've found our react equivalent.

edit: Not to be read as an invitation to discuss how react is not the de-facto standard of ui web frameworks

imglorp · 5y ago

Streaming Systems (the O'Reilly trout book) has a nice overview of the streaming landscape (the first four you mentioned). The first several chapters are a general technical background on stream processing: events, watermarks, redundancy, etc.

http://streamingsystems.net/fig/10-36

nwsm · 5y ago

I already have Designing Data-Intensive Applications (2017), do you think I would get much more out of that book?

cangencer (OP) · 5y ago

This is very true. Stream processing is both old and new, and I think it takes time for technology like this to really mature. There's currently a standardisation effort around Streaming SQL which may bear some fruit, but it is probably still many years away. Right now, even if you want to use a standard language like SQL to describe streaming queries, there are differences between the tools in both syntax and semantics.

chrisjc · 5y ago

Are you referring to this? https://arxiv.org/abs/1905.12133

HN Link: https://news.ycombinator.com/item?id=20059006

KptMarchewa · 5y ago · 3 in thread

Why not use Apache Flink?

cangencer (OP) · 5y ago

While Flink is a fully-featured stream processing framework, I think there are some notable differences. Off the top of my head:

- Flink uses Zookeeper for metadata and coordination, Jet doesn't require any external systems for resilience.

- Flink uses RocksDB and HDFS for checkpointing/snapshotting, Jet stores snapshots in a distributed, replicated in-memory store.

- Flink allocates operators to slots, while Jet uses green threads/cooperative multi-threading. This means you can run many concurrent streaming jobs on the same cluster, with very low overhead.

- Jet is basically a single, self-contained JAR. It's all you need to run a production-grade service (+ some connectors, if you'd like)

- Jet can scale up/down with very little friction. You start a couple of processes and they will form a cluster automatically. Kill a couple of the processes, and the cluster goes on.

That said, Flink has a great set of overall features, especially around persistence and huge states. This is another area we're currently investing in, as well as SQL support.
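
The green threads/cooperative multi-threading point in the list above can be illustrated with a toy scheduler. This is only a sketch built on Python generators, not how Jet is implemented; `tasklet` and `run_cooperatively` are hypothetical names. Each unit of work does a small amount of processing and then hands the thread back voluntarily, so many jobs can share one OS thread without a per-job thread or slot.

```python
from collections import deque


def tasklet(name, items, log):
    """A unit of work that processes one item, then yields control."""
    for item in items:
        log.append((name, item))
        yield  # voluntarily hand the thread back to the scheduler


def run_cooperatively(tasklets):
    """Round-robin over tasklets on a single thread until all finish."""
    ready = deque(tasklets)
    while ready:
        current = ready.popleft()
        try:
            next(current)  # run until the tasklet's next yield
        except StopIteration:
            continue  # tasklet finished, drop it
        ready.append(current)


log = []
run_cooperatively([
    tasklet("job-a", [1, 2], log),
    tasklet("job-b", [10, 20, 30], log),
])
```

After the run, `log` shows the two jobs interleaved item by item. The trade-off of this style is that a tasklet which never yields (for example, one that blocks on I/O) stalls every other job sharing the thread.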

abeppu · 5y ago

> Flink allocates operators to slots, while Jet uses green threads/cooperative multi-threading. This means you can run many concurrent streaming jobs on the same cluster, with very low overhead.

How does the shift to cooperative multi-threading change the way that the cluster is used? In the "slot" approach, Alice and Bob can run concurrent jobs with relatively little coordination needed to "share" effectively -- e.g. they might use different branches of the same shared repo. In exchange for the lower-overhead, does Jet's approach require that multiple use cases are more carefully planned?

Aeolun · 5y ago

I’m not really sure how to imagine SQL support for something like this. Can you point me anywhere that will give me a better idea?

davewritescode · 5y ago · 2 in thread

Jet looks really cool, but I think we'll stick with Flink for the time being.

I say this as someone who got burned hard with weird bugs using Hazelcast 2.X as a distributed lock manager. I'll have a hard think before adopting any part of the Hazelcast ecosystem in the future after that experience. When the analysis of Hazelcast 3.x was posted on jepsen.io (https://jepsen.io/analyses/hazelcast-3-8-3) I had a good laugh, because a number of the issues it exposed were ones we had seen in production in older versions: locks claimed on both sides of a cluster partition, locks never getting released when a node crashed while running, memory leaks, etc. In the end, we had the option of upgrading to 3.X or dumping it entirely in favor of ZooKeeper + Curator. We chose the latter and haven't had issues with our locking system once, and nobody has gotten paged in the middle of the night because of a ZooKeeper issue.

After that experience, I'll take every guarantee made by Hazelcast with a giant grain of salt. I've heard good things about later versions so I'm going to assume things have improved but I implore people to look very closely at solutions like these and in particular, the guarantees they make before picking any of them.

jerrinot · 5y ago

The truth is the original Hazelcast replication protocol was not a good fit for some data-structures. We took the analysis seriously. I know every project and vendor claims that. Here is what we did in recent years:

1. Re-implemented concurrency primitives on top of Raft protocol. This includes Distributed Locks, Semaphores, AtomicLong, etc. Raft provides linearizability and that's what you usually want for concurrency primitives. See our epic blog post about locking: https://hazelcast.com/blog/long-live-distributed-locks/ or our Jepsen testing story: https://hazelcast.com/blog/testing-the-cp-subsystem-with-jep...

2. Added a FlakeID generator. This is on the opposite side of the consistency spectrum - it's a k-ordered Available (wrt CAP) ID generator. It won't generate duplicates even when there is a split-brain. See: https://docs.hazelcast.org/docs/4.0.2/manual/html-single/ind...

3. PNCounter - CRDT-based eventually consistent data structure, suitable for .. well, counting things:) See: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...

4. Significantly extended documentation, to be more explicit about Hazelcast replication models and guarantees. The goal is clear: Avoid Surprises. See: https://docs.hazelcast.org/docs/4.0.2/manual/html-single/ind...
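
For readers unfamiliar with the CRDT mentioned in item 3, a PN-Counter can be sketched in a few lines. This shows the general technique only, not Hazelcast's implementation or API; the class and method names here are hypothetical. Each replica tracks its own increments and decrements separately, and merging takes the per-replica maximum, so merges are commutative and idempotent and replicas converge regardless of delivery order.

```python
class PNCounter:
    """Toy PN-Counter: per-replica increment and decrement totals."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.increments = {}  # replica id -> total added by that replica
        self.decrements = {}  # replica id -> total subtracted by that replica

    def add(self, amount):
        # A replica only ever grows its own entry, so entries are monotonic.
        target = self.increments if amount >= 0 else self.decrements
        target[self.replica_id] = target.get(self.replica_id, 0) + abs(amount)

    def value(self):
        return sum(self.increments.values()) - sum(self.decrements.values())

    def merge(self, other):
        # Element-wise maximum: applying the same state twice changes nothing.
        pairs = (
            (self.increments, other.increments),
            (self.decrements, other.decrements),
        )
        for mine, theirs in pairs:
            for replica, count in theirs.items():
                mine[replica] = max(mine.get(replica, 0), count)
```

Two replicas that update independently during a partition and later exchange state with `merge` in either order end up reporting the same `value()`, which is the eventual-consistency property the comment refers to.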

Disclaimer: Obviously I am biased as I work for Hazelcast.

drej · 5y ago

I remember a presentation by Kyle Kingsbury (the Jepsen guy), where he talked about the various inconsistencies he found in databases. He mentioned Zookeeper and said "I found no issues, which I consider a personal failure" :-)

victor106 · 5y ago · 2 in thread

I am new to this space, so sorry if this is not a valid comparison, but how does this compare to Kafka?

tyingq · 5y ago

It's in this "Dataflow Programming" category: https://en.m.wikipedia.org/wiki/Dataflow_programming

So, more comparable to Apache Beam, like a fancy ETL. Programming via pipes, transformations, etc.

It would hook to a Kafka (or other) stream.

dominotw · 5y ago

compares to kafka streams which is built on top of kafka.

forgotmyhnacc · 5y ago · 2 in thread

How does this compare to Apache beam?

haxen · 5y ago

An Apache Beam Runner is already implemented in Jet: https://beam.apache.org/documentation/runners/jet/

Beam is just an API layer with different backing implementations. But you don't typically use Beam to work with Jet, instead you use its own Pipeline API which is mostly like Java Streams. Jet will also soon get an SQL API.

netgusto · 5y ago

Very cool! Is it possible to mix apis in a single project with Jet Beam Runner? This would make it easier to port Beam projects to Jet, as the migration could be progressive.

grillorafael · 5y ago · 1 in thread

This has ways to handle all the problems I currently implement manually. Any plans for a Python API?

haxen · 5y ago

Hazelcast Jet will get an SQL API soon, and we're actively considering first-class support from other languages as well.

drej · 5y ago · 1 in thread

Regarding the two licences, one for the library itself, one for the connectors - what does it mean for users, in practice? Thanks.

cangencer (OP) · 5y ago

The license is meant to prevent service-wrapping by cloud providers, other than that it doesn't have any implications for standard usage. The core library / server is Apache 2 and the rest of the connectors are community license. You can use and embed both the core module and the connectors for free.

The license itself is similar to the licenses from Confluent, Elastic among many others. You can read more about it here: https://hazelcast.org/blog/announcing-the-hazelcast-communit...

liminal · 5y ago

I'm a bit surprised all these systems continue to be built on the JVM. For these sorts of tasks I'd expect something without a VM like Rust to be a better choice
