However, if you use a lot of UDFs, where Spark has to serialize data back and forth between the JVM and Python workers, you might consider rewriting those UDFs in a JVM language. That serialization overhead is still fairly substantial. Arrow is trying to address this with a common in-memory columnar format, but it's still early days.
I would still recommend PySpark to most people. It's more than good/fast enough for most data munging tasks. Scala does buy you two things: type safety and much lower serialization overhead (a significant saving!), which can be critical in some situations, but not all.
Also, the Python way has always been to prototype fast, profile, and rewrite bottlenecks in a faster language, and PySpark conforms to that pattern.
2) Spark MLlib is still fairly rudimentary in its coverage of major ML algorithms, and Spark's linear algebra support, while serviceable, is currently not very sophisticated. There are a few functions that are useful in the data prep stage (encoding, tokenizers, etc.), but overall we don't really use MLlib very much.
Companies that have simple needs (e.g. a simple recommender) and that don't have a lot of in-house expertise, might use MLlib though -- I believe someone from a startup said that they did at a recent meetup.
Most of us need better algorithmic coverage, and scikit-learn's coverage is currently much broader, plus it is more mature. We also have NumPy at our disposal, which lets us do matrix-vector manipulation easily. There is some serialization cost, but we can usually just throw cloud computational power at it.
Also note that for most workloads, the majority of the cost is incurred in training. For models in production, one is typically processing a much smaller amount of data using a trained model, so less horsepower is required.
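The train-heavy / serve-light split can be sketched like this with scikit-learn and NumPy (synthetic data; the sizes are illustrative only): the one-time fit is where the compute goes, while scoring a trained model on a small batch is cheap matrix-vector work.

```python
# Sketch: fit once on the full training set (the expensive step), then
# score small production batches with the trained model (the cheap step).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20))              # bulk of the data
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)   # training: the heavy part

X_live = rng.normal(size=(5, 20))                    # production: tiny batch
preds = model.predict(X_live)                        # just a matrix-vector product + threshold
```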