undefined | Better HN

0 pointsb9a2cab54y ago0 comments

Abstractions almost always end up leaky. Spark SQL, for example, does whole-stage codegen which collapses multiple project/filter stages into a single compiled stage, but your underlying data format still needs to be memory friendly (i.e. linear accesses, low branching, etc.). The codegen is very naive and the JVM JIT can only do so much.

What I've seen is that you need people who deeply understand the system (e.g. Spark) to be able to tune for these edge cases (e.g. see [1] for examples of some of the tradeoffs between different processing schemes). Those people are expensive (think $500k+ annual salaries) and are really only cost effective when your compute spend is in the tens of millions or higher annually. Everyone else is using open source and throwing more compute at the problem or relying on their data scientists/data engineers to figure out what magic knob to turn.

[1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf

0 comments

3 comments · 1 top-level

paulluuk4y ago· 2 in thread

What's wrong with relying on data engineers for data engineering?

disgruntledphd24y ago

Spark is very very odd to tune. Like, it seems (from my limited experience) to have the problems common to distributed data processing (skew, it's almost always skew) but because it's lazy, people end up really confused as to what actually drives the performance problems.

That being said, Spark is literally the only (relatively) easy way to run distributed ML that's open source. The competitors are GPU's (if you have a GPU friendly problem) and running multiple Python processes across the network.

(I'm really hoping that people will now school me, and I'll discover a much better way in the comments).

b9a2cab5OP4y ago

Data engineers should be building pipelines and delivering business value, not fidgeting with some JVM or Spark parameter that saves them runtime on a join (or for that matter, from what I've seen at a certain bigco, building their own custom join algorithms). That's why I said it's only economical for big companies to run efficient abstractions and everyone else just throws more compute at the issue.

j / k navigate · click thread line to collapse