
0 points · ramraj07 · 4y ago · 0 comments

My experience has been that Spark doesn’t actually work with real big data without significant babysitting. Its algorithms, be it window functions or joins, can take forever, if they finish at all, on actual data that’s hundreds of terabytes large (or even tens of TB). You immediately need to worry about crap like garbage collectors, worker memory, caching, and data distribution. The majority of data engineers out there cannot actually deal with these problems, but Spark plus not-actually-big data lets them think they’re good at their jobs when in reality they’re not.
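For anyone wondering what that babysitting looks like in practice, here is a rough sketch of where those four worries tend to show up as Spark settings. The property names are standard Spark configuration keys, but every value is a placeholder assumed for illustration, not tuning advice, and `SPARK_TUNING` and `to_submit_args` are hypothetical names made up for this example.

```python
# Illustrative only: the values are placeholders, not recommendations.
# The keys are standard Spark configuration properties.
SPARK_TUNING = {
    # Worker memory: executor heap size plus off-heap overhead.
    "spark.executor.memory": "16g",
    "spark.executor.memoryOverhead": "4g",
    # Garbage collection: JVM flags passed to each executor.
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    # Cache: share of the heap used for execution and storage (cached data).
    "spark.memory.fraction": "0.6",
    # Data distribution: shuffle partition count and skew handling (Spark 3.x).
    "spark.sql.shuffle.partitions": "2000",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_submit_args(conf):
    """Render a settings dict as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(SPARK_TUNING)))
```

None of these numbers transfer between workloads, which is rather the point: each one has to be revisited once the data stops fitting comfortably.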


1 comment · 1 top-level

gxt · 4y ago

This is usually the product of not having to worry about costs, combined with elastic platforms. Bad or inefficient methodology will still work but cost more, which can mean a surprisingly slow feedback loop before their work comes back to bite them.
