
0 points · CorvusCrypto · 7y ago · 0 comments

Learned Scala, Kafka, and Spark (including Spark Streaming and Structured Streaming).

I will say that after using Spark, the big data community badly needs better distributed computing tools. It was just awful to work with. Also, good luck debugging anything easily when you build things with the RDD APIs.

I did come to like Scala and Kafka, though.


2 comments · 1 top-level

mywrathacademia · 7y ago · 1 in thread

Are you using Scala, Kafka and Spark for Machine/Deep Learning? I've been wanting to learn Spark, but I only have a vague idea of it. What's the best way to start, and can you explain how Kafka ties in with Scala and Spark?

CorvusCrypto (OP) · 7y ago

For the first question, we were looking at it to perform distributed matrix factorization in the immediate use case, but were planning to use the platform for more advanced techniques later on. We also tried to implement a quick and dirty distributed ALS algorithm, and that's where the RDD API came in, along with my subsequent hatred for said API. Kafka was our streaming platform, through which all our customer data was sent.
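
For readers who haven't met ALS (alternating least squares) before, here is a minimal single-machine sketch of the idea in plain Python. It is a rank-1 toy, not the distributed implementation described above and not Spark's API: the function name `als_rank1`, the `ratings` dict layout, and the `reg` and `iters` defaults are all assumptions made for illustration. With one latent factor, each half-step reduces to a one-dimensional ridge regression with a closed-form solution.

```python
from collections import defaultdict


def als_rank1(ratings, reg=1e-6, iters=50):
    """Toy rank-1 ALS: approximate ratings[(user, item)] by u[user] * v[item].

    `ratings` maps (user, item) pairs to observed values; unobserved
    pairs are simply absent. All names here are hypothetical.
    """
    # Group the observed entries by user and by item once, up front.
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for (user, item), value in ratings.items():
        by_user[user].append((item, value))
        by_item[item].append((user, value))

    u = {user: 1.0 for user in by_user}
    v = {item: 1.0 for item in by_item}

    for _ in range(iters):
        # Hold the item factors fixed and solve for each user factor.
        for user, obs in by_user.items():
            num = sum(value * v[item] for item, value in obs)
            den = sum(v[item] ** 2 for item, _ in obs) + reg
            u[user] = num / den
        # Hold the user factors fixed and solve for each item factor.
        for item, obs in by_item.items():
            num = sum(value * u[user] for user, value in obs)
            den = sum(u[user] ** 2 for user, _ in obs) + reg
            v[item] = num / den
    return u, v
```

Each user update only needs that user's observations plus the current item factors (and vice versa), which is the property that makes the algorithm a natural candidate for distribution across a cluster.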

For starting out, the tutorials on the Spark documentation pages are actually pretty good. For streaming sources, however, using Spark with our Kafka setup felt hacky, since we basically had to scour Maven/Google for a package that did anything useful with Avro-serialized data. Even after finding relevant packages, we had to copy and paste some Avro-to-Spark-struct library code into an internal library, since it didn't play nice with our versions of Spark in its public form. We were using Structured Streaming so that we could leverage the DataFrame API, which is much easier for data scientists coming from pandas/R to use and grasp. It's a lot nicer for a data engineer as well (imo).

JSON-serialized data actually worked okay, so we ended up using Kafka Streams (kstreams) to deserialize the Avro data and write it to another topic as JSON, since Spark handled that better (read: more easily). Then we used the DataFrame APIs to implement some of the higher-order operations, such as building interaction matrices and factoring them, then writing the resulting factors to cloud storage for the collaborative-filtering recommendation system we had elsewhere.
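
To make "building interaction matrices" concrete, here is a rough sketch in plain Python of turning JSON-serialized events into a sparse user-by-item interaction count. It says nothing about how this was done with the DataFrame APIs; the function name `build_interactions` and the `user_id` / `item_id` field names are hypothetical, since the actual event schema isn't given.

```python
import json
from collections import Counter


def build_interactions(json_lines):
    """Count interactions per (user, item) pair from JSON event strings.

    Assumes every event carries `user_id` and `item_id` fields
    (hypothetical names chosen for illustration).
    """
    counts = Counter()
    for line in json_lines:
        event = json.loads(line)
        counts[(event["user_id"], event["item_id"])] += 1
    return counts


events = [
    '{"user_id": "u1", "item_id": "a"}',
    '{"user_id": "u1", "item_id": "a"}',
    '{"user_id": "u2", "item_id": "b"}',
]
interactions = build_interactions(events)
```

Keeping only the observed pairs matters in practice, because most users never touch most items and a dense matrix would be almost entirely zeros.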

I guess to me Spark seemed wholly immature for Kafka-sourced data, and it made me think that if we had dumped the data into Hadoop and then operated on microbatches, it would have been much easier.

About Scala vs Java vs Python... We saw that Python still wasn't a first-class language, contrary to what they say in the docs, and that to use any newer, fancier algorithms that came along it would be beneficial to use Scala, since that is the language most developed against (as of our use of Spark). There are plenty of libraries to connect with Kafka in any language, so Kafka didn't factor into the language decision at all.

I will leave out some details about managing the resources on a cluster as well as implementing CI/CD. That could be a whole blog post on its own.

Hope this helps. Spark may be the best tool at the moment for what it offers, but it was painful to use, and I hope someone makes something better with the knowledge gained from using it.
