In several years of big data engineering work, I believe I've seen only one application that couldn't be refactored into a simple multi-instance, framework-free program. People use the big data frameworks as glorified distributed-job-management tools, and the resulting systems are more fragile, more complex, more vulnerable to weird version-compatibility errors, and less efficient.
The Data Engineering team at my old job used it (in concert with notebooks), and it resulted in some of the worst code I've ever seen, and the most inappropriate use of resources:
A 9-node Databricks cluster to push 200 GB of JSON into an Elasticsearch cluster. This process consisted of:
* close to 5 notebooks
* things getting serialised to S3 at every possible opportunity
* a hand-rolled JSON serialisation method that would string-concat all the parts together: “but it only took me 2 minutes to write, what’s the problem?”
* hand-rolled logging functions
* zero appropriate dependency management; packages were installed globally, never updated, etc
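The string-concat serializer above is the kind of thing `json.dumps` replaces in one line. A hypothetical reconstruction of the anti-pattern, showing why it "only took 2 minutes" but doesn't survive contact with real data:

```python
import json

def concat_json(record):
    # Hand-rolled concatenation in the spirit of that notebook (hypothetical):
    # no escaping, so any quote or backslash in a value produces invalid JSON.
    parts = ['"%s": "%s"' % (k, v) for k, v in record.items()]
    return "{" + ", ".join(parts) + "}"

record = {"title": 'He said "hi"', "path": "C:\\data"}

broken = concat_json(record)   # unescaped quotes/backslashes: not valid JSON
correct = json.dumps(record)   # stdlib handles escaping, types, and unicode

json.loads(correct)            # round-trips fine
try:
    json.loads(broken)
except json.JSONDecodeError:
    print("concat version is not even parseable")
```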
Nothing about that workflow inherently needed Spark, which was the most egregious part. The whole thing could have been done in a Python app with some joblib/multiprocessing thrown in, run as a single container, etc.
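The single-box version is not much code. A minimal sketch with stdlib `multiprocessing`, assuming newline-delimited JSON files as input; `clean` and `bulk_index` are hypothetical stand-ins for the real cleanup logic and an Elasticsearch bulk call:

```python
import json
from multiprocessing import Pool

def clean(doc):
    # Placeholder cleanup step (hypothetical); real logic would live here.
    return {k: v for k, v in doc.items() if v is not None}

def transform(path):
    # Per-file work: parse newline-delimited JSON and clean each record.
    with open(path) as f:
        return [clean(json.loads(line)) for line in f if line.strip()]

def bulk_index(docs):
    # Stand-in for something like elasticsearch.helpers.bulk() in the
    # elasticsearch-py client; just reports progress in this sketch.
    print(f"indexed {len(docs)} docs")

def run(paths, workers=8):
    # One process per core on a single box instead of a 9-node cluster;
    # each worker handles whole files independently, results stream back.
    with Pool(workers) as pool:
        for docs in pool.imap_unordered(transform, paths):
            bulk_index(docs)
```

For 200 GB of JSON this is I/O-bound more than anything; a bigger box with fast disks covers it.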
Eventually it was realized that getting a larger box and spending some time thinking about cleaning the data was enough. But that didn't sound as good.
I had a project in college where we tried to add a feature to Hadoop. Half the battle was spent trying to pass their test cases and figuring out why we couldn't build the program due to dependency issues.
Even though we were trying to build w/Hadoop's docker image, each team member had issues unique to them. The documentation definitely didn't help.
I haven't used it personally, but the documentation is definitely there around distributing work and getting results back from nodes, etc., and the community is very helpful.
FWIW, I feel your pain on a daily basis, but I do like Hadoop as a low-cost, massively distributed DB.
Do you have any tools you like for job management without all the distributed-systems baggage?
I've heard folks advocate for Make for this kind of thing; perhaps that or some other orchestration tool that deals with job dependency graphs would be the unix way? (Having a nice way to visualize failed steps would of course be a plus; a common use-case is "re-run the intermediate step of the pipeline, and everything downstream".)
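Make's file-timestamp model covers the "re-run an intermediate step and everything downstream" case almost for free. A hypothetical three-stage pipeline (the script names are made up for illustration):

```make
# Files are the nodes of the dependency graph; timestamps drive re-runs.
raw.json:
	./extract.sh > $@

clean.json: raw.json
	./clean.py < $< > $@

index.done: clean.json
	./load.py < $< && touch $@

# "re-run the intermediate step and everything downstream":
#   rm clean.json && make index.done
# or just force a refresh of the input:
#   touch raw.json && make index.done
```

Visualizing failures is where plain Make is weaker; the dependency graph itself, though, is exactly what it was built for.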
However, so far I haven't switched away from Rundeck & make.
I think it's likely that some subset of Spark users are the type to over-engineer a project, but I'm also confident they'd over-engineer a much simpler framework as well.