Apache Spark 1.0.0 (opens in new tab)

(spark.apache.org)

157 pointssteveb12y ago39 comments

39 comments

27 comments · 8 top-level

agibsonccc12y ago· 9 in thread

Spark is an interesting technology, from what I've heard it doesn't actually have traction in industry yet though.

Anyone here actually using it in production? I know it's blazing fast etc, and I like it as a map reduce replacement. It has all the makings of a great distributed system, I'm still waiting to see a major deployment yet..

subprotocol12y ago

May be of relevance: https://cwiki.apache.org/confluence/display/SPARK/Powered+By...

I don't know what you would count as major deployment, but I've deployed a 30-node cluster on HW for running sub-second real-time adhoc queries. I've also run many smaller 10-20 node virtual clusters on open stack. It is a rock solid platform. Our hosted ops loves it because it just works.

The amazing thing about spark is how insanely expressive and hackable it is. The best way I can describe it is this:

* Hadoop: You spend all of your time telling it how to do what you want (it is the assembly language of bigdata)

* Spark: you spend your time telling it what you want, and it just does it

iskander12y ago

> It is a rock solid platform. Our hosted ops loves it because it just works.

I'm really confused by how different our experiences have been. Above you said you wrote your own shuffle implementation, presumably that was prompted by poor performance at some point. And, when you encountered that poor performance, you presumably also saw what happens to Spark when it's overwhelmed: a sea of exceptions. In a short period of time I've encountered lots of the following:

- FileNotFound exceptions when shuffle files couldn't be created

- Too many open file handles (also related to shuffle files)

- Infinite procession of out-of-memory errors on a cluster with 12TB of memory.

- Executor disconnected

- Weird akka errors

- Mysterious serialization errors (Map.values isn't serializable, making a nested partitioner class for some reason didn't work)

These errors are sometimes recoverable and other times kill all the workers on the cluster. Did none of these things happen to your team?

agibsonccc12y ago

This does help actually. And yes: it doesn't have to be a 1000 node cluster or anything crazy. I've just talked to a lot of people at bigger companies and they've all said it falls over yet.

Great to hear success stories!

1 more reply

espeed12y ago

Cloudera now includes Spark...

http://vision.cloudera.com/mapreduce-spark/

http://blog.cloudera.com/blog/category/spark/

frak_your_couch12y ago

So does Hortonworks as of 2.1

http://hortonworks.com/blog/announcing-hdp-2-1-tech-preview-...

curiousfiddler12y ago

Yahoo was initially playing around with Spark. They opted for Tez on Yarn instead: http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-bet...

nemothekid12y ago

Ooyala has a huge deployment that they use alongside their Cassandra cluster (something like ~100 nodes, and ~50TB of data IIRC)

iskander12y ago

I'd love to learn more about how they're using Spark, are there are any blog posts or tech talks floating around?

1 more reply

espeed12y ago

eBay posted this two days ago...

Using Spark to Ignite Data Analytics ( http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite...)

eranation12y ago· 6 in thread

I wonder if anyone with experience with Spark can comment / rebut this post: http://blog.explainmydata.com/2014/05/spark-should-be-better...

subprotocol12y ago

I use spark a lot and my experience has been quite the opposite. The queries I run against spark are billions of events and results are sub-second.

I could only speculate as to what this users issues were. One difference between hadoop and spark is that it is more sensitive in that you sometimes need to tell it how many tasks to use. In practice it is no big deal at all.

Perhaps the user was running into this- the data for a task in spark runs all in memory, whereas hadoop will load and spill to disk within a task. So if you give a single hadoop reducer 1TB of data, it will complete after a very long time. In spark if you did this you would need to have 1TB of memory on the executor. I wouldn't give an executor/JVM anything over 10GB. So if you have lots of memory, just be sure to balance it with cores and executors.

I have seen spark use up all the inodes on systems before. A job with 1000 map and 1000 reduce tasks would create 1M spill files on disk. However that was on an earlier version of spark and I was using ext3. I think this has since been improved.

For me spark runs circles around hadoop.

iskander12y ago

>The queries I run against spark are billions of events and results are sub-second.

This is interesting, I haven't gotten Spark to do anything at all in less than a second. How big is this dataset (what does each event consist of)? How is the data stored? How many machines / cores are running across? What sort of queries are you running?

>I could only speculate as to what this users issues were.

I'm the author of the above post and unfortunately I can also "only speculate" what my issues were. Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap? Maybe large-scale joins don't work well? Who knows. The problem, however, definitely doesn't seem to be anything from the tuning guide(s).

2 more replies

evancasey12y ago

My experience using spark has also been nothing but positive. I recently built a similarity-based recommendation on Spark (https://github.com/evancasey/sparkler), and found it to be significantly faster than comparable implementations on Hadoop.

subprotocol's point about specifying the number of tasks/data partitions to use is true - you need to manually set this in order to get good results even on a small dataset. However, other than that, spark will give you good results pretty much out of the box. More advanced features such as broadcast objects, cache operations, and custom serializers will further optimize your application, but are not critical when first starting out as the author seems to believe.

iskander12y ago

I'm really curious to find out in what situations Spark actually works for people. So far, no one in my lab seems to be having a terribly productive time using it. Maybe it's better for simple numerical computations? How large are the datasets you're working with?

1 more reply

lmm12y ago

Spark's abstractions are indeed really nice; a Spark job is much more readable than the same thing expressed in raw MapReduce, as that post acknowledges at the end.

I can't really comment on or rebut "my code runs slow and I don't know why", except to say that Spark performance has been great when I've used it. But yeah, if the abstraction should fail (and again all I can say is it hasn't for me) then I can imagine it's not much fun to debug performance and there's no distributed profiler (though I think you'd be in much the same boat with vanilla Hadoop).

iskander12y ago

> that Spark performance has been great when I've used it.

Can you say more about your use case? What sort of data did you start with? What did you do with it? How large was the cluster you were running on?

1 more reply

kovrik12y ago· 3 in thread

Any active Clojure bindings?

clj-spark seems to be abandoned (last commit was a year ago)...

gphil12y ago

I'm curious about this too--clj-spark didn't work for me so I'm currently prototyping a job using the Java bindings in Clojure. If I end up wrapping the Java bindings in a useful way I would consider putting together some kind of release if there's community interest in that.

terranstyler12y ago

Me too.

However, shouldn't there be a much more flexible approach where you just send your functions to an execution server (just like an agent)? You might want to define some keywords to refer to previously used functions or data.

Then again, such an agent would be pretty much a REPL, so you might just want to ssh to a REPL that does load balancing and has sub-repls (on other machines) that fail over.

Thoughts on that?

dgrnbrg12y ago

I am interested in Clojure bindings for spark!

krallin12y ago· 1 in thread

Note that Spark 1.0.0 makes it possible to trivially submit spark jobs to an existing Hadoop cluster.

It leverages HDFS to distribute archives (e.g. your app JAR) and store results / state / logs, and YARN to schedule itself and acquire compute resources.

It's pretty amazing to see how you use Spark's API to write functional applications that are then distributed across multiple executors (e.g. when you use Spark's "filter" or a "map" operations, then the operation potentially gets distributed and distributed on totally different nodes).

Great tool — exciting to see it reach 1.0.0!

metronius12y ago

Do you mean SIMR or something another ?

MoOmer12y ago

For new entrants, here's an email I sent out to some colleagues of mine just getting into ML. I'm wrapping up a project that's using Mahout, and am getting into Spark & MLlib now. I've regurgitated this on reddit already.

I've been following Apache Spark [0], a new-ish Apache project created by UC Berkeley to replace Hadoop MapReduce [1], for about a month now; and, I finally got around to spending some time with it last night and earllllllly this morning.

Added into the Spark mix about a year ago was a strong Machine Learning library (MLlib) [2] similar to Mahout [3] that promises much better performance (comparable/better than Matlab [4]/Vowpal Wabbit [5])

MLlib is a lower level library, which offers a lot of control/power for developers. However, Berkeley's Amplab has also created a higher level abstraction layer for end users called MLI [6]. It's still being actively developed, and although updates are in the works, they haven't been made available to the public repository for a while [7]

Check out an introduction to the MLlib on youtube here: https://www.youtube.com/watch?v=IxDnF_X4M-8

Getting up to speed with Spark itself is really pain-free compared to some tools like Mahout etc. There's a quick-start guide for Scala [8], a getting started guide for Spark [9], and lots of other learning/community resources available for Spark [10] [11]

[0] http://spark.apache.org/

[1] http://hadoop.apache.org/

[2] http://spark.apache.org/mllib/

[3] https://mahout.apache.org/

[4] http://www.mathworks.com/products/matlab/

[5] https://github.com/JohnLangford/vowpal_wabbit/wiki

[6] http://www.mlbase.org/

[7] http://apache-spark-user-list.1001560.n3.nabble.com/Status-o...

[8] www.artima.com/scalazine/articles/steps.html

[9] http://spark.apache.org/docs/latest/quick-start.html

[10] http://ampcamp.berkeley.edu/4/exercises/

[11] https://spark.apache.org/community.html

stevebOP12y ago

I gave a 30-minute overview of Spark yesterday at StampedeCon. Spark is generating a lot of excitement in the big data community:

https://speakerdeck.com/stevendborrelli/introduction-to-apac...

alexatkeplar12y ago

Great to see Spark hitting 1.0.0. You can actually run Spark on Elastic MapReduce pretty easily - check out our tutorial project for how: https://github.com/snowplow/spark-example-project

Nasiruddin12y ago

Great...new era of distributed computing

j / k navigate · click thread line to collapse

39 comments

27 comments · 8 top-level

agibsonccc12y ago· 9 in thread

Spark is an interesting technology, from what I've heard it doesn't actually have traction in industry yet though.

subprotocol12y ago

May be of relevance: https://cwiki.apache.org/confluence/display/SPARK/Powered+By...

The amazing thing about spark is how insanely expressive and hackable it is. The best way I can describe it is this:

* Hadoop: You spend all of your time telling it how to do what you want (it is the assembly language of bigdata)

* Spark: you spend your time telling it what you want, and it just does it

iskander12y ago

> It is a rock solid platform. Our hosted ops loves it because it just works.

- FileNotFound exceptions when shuffle files couldn't be created

- Too many open file handles (also related to shuffle files)

- Infinite procession of out-of-memory errors on a cluster with 12TB of memory.

- Executor disconnected

- Weird akka errors

- Mysterious serialization errors (Map.values isn't serializable, making a nested partitioner class for some reason didn't work)

These errors are sometimes recoverable and other times kill all the workers on the cluster. Did none of these things happen to your team?

agibsonccc12y ago

This does help actually. And yes: it doesn't have to be a 1000 node cluster or anything crazy. I've just talked to a lot of people at bigger companies and they've all said it falls over yet.

Great to hear success stories!

1 more reply

espeed12y ago

Cloudera now includes Spark...

http://vision.cloudera.com/mapreduce-spark/

http://blog.cloudera.com/blog/category/spark/

frak_your_couch12y ago

So does Hortonworks as of 2.1

http://hortonworks.com/blog/announcing-hdp-2-1-tech-preview-...

curiousfiddler12y ago

Yahoo was initially playing around with Spark. They opted for Tez on Yarn instead: http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-bet...

nemothekid12y ago

Ooyala has a huge deployment that they use alongside their Cassandra cluster (something like ~100 nodes, and ~50TB of data IIRC)

iskander12y ago

I'd love to learn more about how they're using Spark, are there are any blog posts or tech talks floating around?

1 more reply

espeed12y ago

eBay posted this two days ago...

Using Spark to Ignite Data Analytics ( http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite...)

eranation12y ago· 6 in thread

I wonder if anyone with experience with Spark can comment / rebut this post: http://blog.explainmydata.com/2014/05/spark-should-be-better...

subprotocol12y ago

I use spark a lot and my experience has been quite the opposite. The queries I run against spark are billions of events and results are sub-second.

For me spark runs circles around hadoop.

iskander12y ago

>The queries I run against spark are billions of events and results are sub-second.

>I could only speculate as to what this users issues were.

2 more replies

evancasey12y ago

iskander12y ago

1 more reply

lmm12y ago

Spark's abstractions are indeed really nice; a Spark job is much more readable than the same thing expressed in raw MapReduce, as that post acknowledges at the end.

iskander12y ago

> that Spark performance has been great when I've used it.

Can you say more about your use case? What sort of data did you start with? What did you do with it? How large was the cluster you were running on?

1 more reply

kovrik12y ago· 3 in thread

Any active Clojure bindings?

clj-spark seems to be abandoned (last commit was a year ago)...

gphil12y ago

terranstyler12y ago

Me too.

Then again, such an agent would be pretty much a REPL, so you might just want to ssh to a REPL that does load balancing and has sub-repls (on other machines) that fail over.

Thoughts on that?

dgrnbrg12y ago

I am interested in Clojure bindings for spark!

krallin12y ago· 1 in thread

Note that Spark 1.0.0 makes it possible to trivially submit spark jobs to an existing Hadoop cluster.

It leverages HDFS to distribute archives (e.g. your app JAR) and store results / state / logs, and YARN to schedule itself and acquire compute resources.

Great tool — exciting to see it reach 1.0.0!

metronius12y ago

Do you mean SIMR or something another ?

MoOmer12y ago

Check out an introduction to the MLlib on youtube here: https://www.youtube.com/watch?v=IxDnF_X4M-8

[0] http://spark.apache.org/

[1] http://hadoop.apache.org/

[2] http://spark.apache.org/mllib/

[3] https://mahout.apache.org/

[4] http://www.mathworks.com/products/matlab/

[5] https://github.com/JohnLangford/vowpal_wabbit/wiki

[6] http://www.mlbase.org/

[7] http://apache-spark-user-list.1001560.n3.nabble.com/Status-o...

[8] www.artima.com/scalazine/articles/steps.html

[9] http://spark.apache.org/docs/latest/quick-start.html

[10] http://ampcamp.berkeley.edu/4/exercises/

[11] https://spark.apache.org/community.html

stevebOP12y ago

I gave a 30-minute overview of Spark yesterday at StampedeCon. Spark is generating a lot of excitement in the big data community:

https://speakerdeck.com/stevendborrelli/introduction-to-apac...

alexatkeplar12y ago

Great to see Spark hitting 1.0.0. You can actually run Spark on Elastic MapReduce pretty easily - check out our tutorial project for how: https://github.com/snowplow/spark-example-project

Nasiruddin12y ago

Great...new era of distributed computing

j / k navigate · click thread line to collapse