toisanji · 14y ago · 0 points

Isn't stream programming what Storm is supposed to be?

8 comments · 1 top-level

scott_s · 14y ago · 7 in thread

Yes, it is a similar programming model. Some differences (please put "to the best of my knowledge" in front of all of these):

- Storm does not allow arbitrary state in operators (what Nathan Marz calls "bolts"). This makes implementing the runtime easier, such as being able to replay tuple sends for fault tolerance, but it limits what kinds of applications one can make. Yes, I'm on board with the idea that we should avoid mutable state as much as possible, but people who build real applications want it. Yet, fault tolerance in our system requires more work, so it's a trade-off.

- Storm programs are implemented in Java. Streams applications are implemented in our programming language, which has the rather pedestrian name Streams Programming Language, but usually just SPL. This may seem minor, but it's a big deal. Marz is working on a higher level language in Clojure. Implementing programs in a higher level language enables developers to abstract away many issues related to high performance, distributed systems. I compare it to the difference between writing assembly code and writing C code. (Or the difference between writing Python code and writing C code.) The code that we generate is similar in principle to how one writes a Storm application. Which brings me to...

- Storm runs on the JVM; we generate C++ code, which then gets compiled.

Neither Storm nor Streams is the first or the only system in this area. Stream programming is also popular for hardware, but that is usually synchronous and, if there's state, it's shared-memory. Storm and Streams are distributed and asynchronous. There are academic distributed streaming systems such as Borealis. The research name for Streams is System S, and there are many academic papers about it, or that use it as a platform for other research: http://dl.acm.org/results.cfm?h=1&cfid=66087472&cfto...

And for the record, I am impressed with Storm.

nathanmarz · 14y ago

You can have any state you want in bolts. What Storm does not provide is a persistence mechanism for that state. For that, you can just use an external database that knows how to handle distributed state and the associated tradeoffs (such as Riak or Cassandra).

The "state spout" abstraction, a future feature for Storm, will alleviate the performance problems with using an external database. In the meantime, though, smart use of batching/checkpointing is sufficient for most applications.

Also, Storm topologies can be written in any language. Storm has great multi-language support.

I do agree though that higher level abstractions are important. That will come later, once we're confident that we've mastered the primitives for doing fault-tolerant realtime computation.
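To make the batching/checkpointing idea above concrete, here is a minimal sketch. It does not use Storm's API: `CheckpointingCounter`, `batch_size`, and the `writes` counter are hypothetical names, and a plain dict stands in for an external database such as Riak or Cassandra.

```python
class CheckpointingCounter:
    """Hypothetical bolt-like object that batches writes to an external store."""

    def __init__(self, store, batch_size):
        self.store = store            # a dict stands in for the external database
        self.batch_size = batch_size
        self.pending = {}             # increments not yet written out
        self.since_flush = 0
        self.writes = 0               # number of store writes, for illustration

    def process(self, key):
        # Accumulate in memory; only touch the store once per batch.
        self.pending[key] = self.pending.get(key, 0) + 1
        self.since_flush += 1
        if self.since_flush >= self.batch_size:
            self.flush()

    def flush(self):
        # One write per distinct key in the batch, not one per tuple.
        for key, delta in self.pending.items():
            self.store[key] = self.store.get(key, 0) + delta
            self.writes += 1
        self.pending.clear()
        self.since_flush = 0


store = {}
bolt = CheckpointingCounter(store, batch_size=4)
for key in ["a", "b", "a", "a", "b", "a", "a", "a"]:
    bolt.process(key)
```

Eight tuples result in four store writes here. The trade-off is that anything still sitting in `pending` is lost if the process dies before the next flush.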

samstokes · 14y ago

> Storm does not allow arbitrary state in operators (what Nathan Marz calls "bolts").

Maybe I misunderstand your point, but Storm does allow state in Bolts - a Bolt is just a Java object, so it can have member variables. That's how aggregation (e.g. counting events per user) is done. Of course, if you want a Bolt that scales horizontally, you need to account for the state being split across several instances of the Bolt class; and if you need the state to survive a restart, you need to keep it in an external database instead of in the object's memory.
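A rough sketch of the point above about state being split across instances, written in Python rather than against Storm's actual Java API. `CountingBolt` and `route` are hypothetical names, and hashing the key with `crc32` is just one assumed way to make sure a given user always reaches the same instance.

```python
from collections import Counter
from zlib import crc32


class CountingBolt:
    """Hypothetical stand-in for a bolt: a plain object with member state."""

    def __init__(self):
        self.counts = Counter()  # lives only in this instance's memory

    def process(self, user_id):
        self.counts[user_id] += 1


def route(user_id, num_instances):
    """Pick an instance by hashing the key, so one user always hits the same instance."""
    return crc32(user_id.encode("utf-8")) % num_instances


instances = [CountingBolt() for _ in range(3)]
events = ["alice", "bob", "alice", "carol", "alice", "bob"]
for user in events:
    instances[route(user, len(instances))].process(user)

# Each user's count lives on exactly one instance; the full view is the merge.
merged = sum((bolt.counts for bolt in instances), Counter())
```

Because routing is by key, no two instances ever hold a partial count for the same user, so merging is a plain sum. None of this survives a restart, which is where the external database comes in.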

scott_s · 14y ago

Then I stand corrected. Is there any means to declare partitioned state that the runtime then handles for the user, or does the user always have to manage it themselves? This is one of those things that is, I think, easier to do declaratively one level up in abstraction. (You can see examples of this by looking at invocations of our Aggregate operator in the above documents.)

I had assumed there was no arbitrary state because of the replay semantics. Let's say bolt A sends tuples to bolt B. B has internal state. A sends tuples t1, t2, t3 and t4. A receives acknowledgements that t1, t3 and t4 were processed. So t2 needs to be replayed. But the semantics of that replay are undefined: B has internal state that already incorporates, for certain, t3 and t4, and maybe t2. (While it's unlikely, you never know where a tuple got lost.) So replaying t2 is problematic. Do you just blindly replay it, and allow potentially broken semantics? The alternative is to do rollback, which is quite hairy.
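The replay scenario above can be sketched in a few lines. This assumes the unlucky case where t2 was in fact processed and only its acknowledgement was lost. `StatefulBolt` and the `dedupe` flag are hypothetical; remembering tuple ids is just one possible mitigation, swapped in for illustration, and not a claim about what Storm or Streams does.

```python
class StatefulBolt:
    """Hypothetical bolt B that keeps a running sum in memory."""

    def __init__(self, dedupe=False):
        self.total = 0
        self.dedupe = dedupe
        self.seen_ids = set()  # only used when dedupe is on

    def process(self, tuple_id, value):
        if self.dedupe:
            if tuple_id in self.seen_ids:
                return  # already folded into the state; ignore the replay
            self.seen_ids.add(tuple_id)
        self.total += value


def run(bolt):
    sends = [("t1", 10), ("t2", 20), ("t3", 30), ("t4", 40)]
    for tuple_id, value in sends:
        bolt.process(tuple_id, value)
    # Assumed failure: t2 was processed but its ack was lost, so A replays it.
    bolt.process("t2", 20)
    return bolt.total


blind_total = run(StatefulBolt(dedupe=False))   # t2 is counted twice
deduped_total = run(StatefulBolt(dedupe=True))  # the replay is ignored
```

The blind replay ends at 120 instead of 100. The deduplicating version gets 100, but its set of seen ids grows with every tuple, and it does nothing for state updates that depend on the order in which tuples arrive.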

jwr · 14y ago

Also, Storm is freely available as open-source software with a permissive license.

scott_s · 14y ago

Yes, I meant to mention this. Streams is software that IBM sells. But there is a free academic license.

wicknicks · 14y ago

Very cool. Is it possible to add operators to Streams? I commonly run into the problem of batch resizing a lot of images. If there were an easy way to integrate ImageMagick as an operator into a system that can push this task to different cores, that would be a big, big win.

scott_s · 14y ago

Yes, user-defined operators are a big part of the design of the system. But, as noted elsewhere in this thread, this is software that IBM sells, and right now we're targeting large companies.
