undefined | Better HN

0 pointsiskander12y ago0 comments

I'm really curious to find out in what situations Spark actually works for people. So far, no one in my lab seems to be having a terribly productive time using it. Maybe it's better for simple numerical computations? How large are the datasets you're working with?

0 comments

3 comments · 1 top-level

evancasey12y ago· 2 in thread

I did most of my benchmarking with the 10M MovieLens Dataset http://grouplens.org/datasets/movielens/ consisting of 10 million movie ratings on 10,000 movies from 72,000 users. So not necessarily "big data", but big enough to warrant a distributed approach.

Spark is ideally suited for iterative, multi-stage jobs. In theory, anything that requires doing multiple operations an a working dataset (i.e. graph processing, recommender systems, gradient descent) will do well on Spark due to the in-memory data caching model. This post explains some of the applications Spark is well-suited for: http://www.quora.com/Apache-Spark/What-are-use-cases-for-spa...

iskanderOP12y ago

So the central piece of data is something like a 10 million element RDD of (UserId, (MovieId, Rating))? If so, it sounds like that data would fit into a single in-memory sparse array, how does Spark's performance compare with a local implementation?

By comparison, I'm trying (and failing) to work with RDDs of 100+ billion elements.

platz12y ago

What is the difference between Spark and Storm? They both seem like "realtime compute engines"

*edit - from what I can see Spark is a replacement for hadoop (offline jobs), where Storm deals with online stream processing

1 more reply

j / k navigate · click thread line to collapse