In several years of big data engineering work, I believe I've seen only one application that couldn't be refactored into a simple multi-instance, framework-free program. People use the big data frameworks as glorified distributed-job-management tools, and the resulting systems are more fragile, more complex, more vulnerable to weird version-compatibility errors, and less efficient.
The Data Engineering team at my old job used it (in concert with notebooks), and it resulted in some of the worst code I've ever seen, and the most inappropriate use of resources:
A 9-node Databricks cluster to push 200 GB of JSON into an Elasticsearch cluster. This process consisted of:
* close to 5 notebooks
* things getting serialised to S3 at every possible opportunity
* a hand-rolled JSON serialisation method that would string-concat all the parts together: “but it only took me 2 minutes to write, what’s the problem?”
* hand-rolled logging functions
* zero appropriate dependency management; packages were installed globally, never updated, etc
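The string-concat serializer above is the kind of thing `json.dumps` replaces in one line. A hypothetical reconstruction of the anti-pattern, showing why it "only took 2 minutes" but doesn't survive contact with real data:

```python
import json

def concat_json(record):
    # Hand-rolled concatenation in the spirit of that notebook (hypothetical):
    # no escaping, so any quote or backslash in a value produces invalid JSON.
    parts = ['"%s": "%s"' % (k, v) for k, v in record.items()]
    return "{" + ", ".join(parts) + "}"

record = {"title": 'He said "hi"', "path": "C:\\data"}

broken = concat_json(record)   # unescaped quotes/backslashes: not valid JSON
correct = json.dumps(record)   # stdlib handles escaping, types, and unicode

json.loads(correct)            # round-trips fine
try:
    json.loads(broken)
except json.JSONDecodeError:
    print("concat version is not even parseable")
```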
Nothing about that workflow inherently needed Spark, which was the most egregious part. The whole thing could have been done in a Python app with some joblib/multiprocessing thrown in, run as a single container, etc.
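The single-box version is not much code. A minimal sketch with stdlib `multiprocessing`, assuming newline-delimited JSON files as input; `clean` and `bulk_index` are hypothetical stand-ins for the real cleanup logic and an Elasticsearch bulk call:

```python
import json
from multiprocessing import Pool

def clean(doc):
    # Placeholder cleanup step (hypothetical); real logic would live here.
    return {k: v for k, v in doc.items() if v is not None}

def transform(path):
    # Per-file work: parse newline-delimited JSON and clean each record.
    with open(path) as f:
        return [clean(json.loads(line)) for line in f if line.strip()]

def bulk_index(docs):
    # Stand-in for something like elasticsearch.helpers.bulk() in the
    # elasticsearch-py client; just reports progress in this sketch.
    print(f"indexed {len(docs)} docs")

def run(paths, workers=8):
    # One process per core on a single box instead of a 9-node cluster;
    # each worker handles whole files independently, results stream back.
    with Pool(workers) as pool:
        for docs in pool.imap_unordered(transform, paths):
            bulk_index(docs)
```

For 200 GB of JSON this is I/O-bound more than anything; a bigger box with fast disks covers it.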
Eventually it was realized that getting a larger box and spending some time thinking about cleaning the data was enough. But that didn't sound as good.
I had a project in college where we tried to add a feature to Hadoop. Half the battle was spent trying to pass their test cases and figuring out why we couldn't build the program due to dependency issues.
Even though we were trying to build w/Hadoop's docker image, each team member had issues unique to them. The documentation definitely didn't help.
I haven't used it personally, but the documentation is definitely there around distributing work and getting results back from nodes, etc., and the community is very helpful.
FWIW, I feel your pain on a daily basis, but I do like Hadoop as a low-cost, massively distributed DB.
Do you have any tools you like for job management without all the distributed-systems baggage?
I've heard folks advocate for Make for this kind of thing; perhaps that or some other orchestration tool that deals with job dependency graphs would be the unix way? (Having a nice way to visualize failed steps would of course be a plus; a common use-case is "re-run the intermediate step of the pipeline, and everything downstream".)
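Make's file-timestamp model covers the "re-run an intermediate step and everything downstream" case almost for free. A hypothetical three-stage pipeline (the script names are made up for illustration):

```make
# Files are the nodes of the dependency graph; timestamps drive re-runs.
raw.json:
	./extract.sh > $@

clean.json: raw.json
	./clean.py < $< > $@

index.done: clean.json
	./load.py < $< && touch $@

# "re-run the intermediate step and everything downstream":
#   rm clean.json && make index.done
# or just force a refresh of the input:
#   touch raw.json && make index.done
```

Visualizing failures is where plain Make is weaker; the dependency graph itself, though, is exactly what it was built for.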
However, so far I haven't switched away from Rundeck & make.
I think it's likely that some subset of Spark users are the type to over-engineer a project, but I'm also confident they'd over-engineer a much simpler framework as well.