Show HN: Arc, an open-source Databricks alternative

(arc.tripl.ai)

175 points · seddonm1 · 5y ago · 36 comments

33 comments · 12 top-level

seddonm1 (OP) · 5y ago · 7 in thread

After being frustrated with building 'traditional' ETL (Extract-Transform-Load) pipelines - and around the same time as the famous 'Engineers Shouldn’t Write ETL' blog post - we started building a framework/toolkit that allows Technical Business Analysts to build reliable data pipelines without much developer support: Arc. It has been implemented as a Jupyter Notebooks extension.

Arc is declarative and currently targets the Apache Spark execution engine, but the abstracted API allows the execution engine to be replaced in future without having to rewrite the logic or intent of the pipeline. It supports parameterized notebooks to build complex pipelines which can be executed in CI/CD environments for safe deployment.

We would be interested to hear your feedback.

> the famous 'Engineers Shouldn’t Write ETL' blog post

for anyone else wondering, this appears to be Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department, by Jeff Magnusson in 2016 https://multithreaded.stitchfix.com/blog/2016/03/16/engineer....

armanboyaci · 5y ago

We were also inspired by the same blog post ('Engineers Shouldn’t Write ETL') and built our own internal ETL tools.

Our primary design goal was for the system to be self-service for data scientists. Since our data scientists use pandas dataframes and Jupyter notebooks all the time, we built the system around these two: (1) we have a library (that we call pype) acting as an interface between the database and Python dataframes (similar to the .to_csv method), so there are no SQL queries in ETL scripts, and (2) we schedule (parametrized) notebooks using some special keywords.

We have a demo screencast: https://drive.google.com/file/d/1SVTduaIH_3IsJ-QoGI4mLYZE8Jv...

seddonm1 (OP) · 5y ago

Looks good. It is nice to see how much influence the 'Engineers Shouldn't Write ETL' post had!

With Apache Arrow (https://arrow.apache.org/) I think the future looks very bright for both of our projects. It is important to have standard open source libraries and my early experiments have shown very good performance results.

> repeatable in that if a job is executed multiple times it will produce the same result

Can you please elaborate on this statement? Does it mean you dump the data the first time the job is executed? If the data is evolving, at some point execution should produce different results.

seddonm1 (OP) · 5y ago

The better statement would be that this facilitates the development of idempotent jobs and aims to minimise side-effects.
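
To make "idempotent" concrete, here is a minimal sketch of one common pattern (replace the whole slice for a run date inside a single transaction), not Arc's own implementation. The `daily_totals` table, the `load_daily_totals` function and the sample rows are hypothetical and exist only for illustration; SQLite stands in for whatever target store a real pipeline would use.

```python
import sqlite3


def load_daily_totals(conn, run_date, rows):
    """Replace the slice for run_date, so re-running the job leaves the same state."""
    # Hypothetical table and columns, used only for this illustration.
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM daily_totals WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_totals (run_date, customer, total) VALUES (?, ?, ?)",
            [(run_date, customer, total) for customer, total in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (run_date TEXT, customer TEXT, total REAL)")

rows = [("acme", 10.0), ("globex", 25.5)]
load_daily_totals(conn, "2021-03-25", rows)
load_daily_totals(conn, "2021-03-25", rows)  # the second run does not duplicate rows

row_count = conn.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
```

If the source data changes between runs, the output changes too; the point of the pattern is only that repeating a run for the same input does not add side-effects such as duplicate rows.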

johnx123-up · 5y ago

What is your opinion of https://min.io/? (It also has S3 compatibility AFAIK.)

tyingq · 5y ago

I think you replied to the wrong thread. Perhaps you meant here: https://news.ycombinator.com/item?id=26577176

superyesh · 5y ago · 3 in thread

>Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines;

I am confused by the title `Arc, an open-source Databricks alternative`. One of the main benefits of Databricks is the managed Spark. This isn't replacing Databricks as such; it is probably an alternative to one of the features in Databricks.

Yeah, agreed. I was a Databricks skeptic when I first came across it, but its value goes a LONG way beyond just managing Spark.

For example, we found that Databricks' Spark (or their 'Delta engine' or whatever it's called) had 50-60% better performance on our workloads than 'core' Spark. I guess that's not surprising when a large proportion of Spark contributors work for you and can performance tune! Not to mention things like MLflow and all their data engineering stuff.

This is a cool project, and I admire its ambition, but saying it's a real 'alternative' to Databricks is a bit disingenuous.

bostonsre · 5y ago

Databricks writes some good tools, but it can get pretty expensive. Kubeflow has been evolving well and is gaining lots of traction. It's pretty neat from my experience so far.

seddonm1 (OP) · 5y ago

We provide multiple Docker images (https://github.com/orgs/tripl-ai/packages) that make the Spark deployment easy:

- arc-jupyter: allows you to develop on your local machine (and offline) or you can easily integrate it with a JupyterHub deployment on Kubernetes (https://zero-to-jupyterhub.readthedocs.io/en/stable/index.ht...). We have built JupyterHub on GCP Kubernetes (GKE) with full user-level auth via GCP IAM. If anyone is interested I can publish a secrets-removed version of our script.

- arc: is the execution-only Docker image (so it is smaller than arc-jupyter). We have this orchestrated on Kubernetes too, and now that Spark officially supports Kubernetes deployment it is actually really easy to create and destroy clusters on demand.

lordgroff · 5y ago · 2 in thread

I'm in the process of doing something like this internally, at a smaller scale, and it's interesting to see that many of the concepts I've been experimenting with and playing around with are formalized here in a similar manner. My "solution" doesn't build on Spark, as I just don't have enough data to necessitate it. I think the big difference is really the project's SQL first approach, which is probably going to polarize: personally, it's a decision I can't abide by, but I'm sure a lot of people will love that.

seddonm1 (OP) · 5y ago

Cool!

Most of the time we run Spark on a single node (i.e. --master local[*]), as we can now easily utilise large nodes (e.g. 128 cores, 512GB). Spark scales vertically well, but it also runs relatively well on a small node, like the Docker example on the website running on a laptop. The ability to run SQL against separate storage is Spark's killer feature in my view.

Arc does support the full Scala API which you can implement as a plugin (https://arc.tripl.ai/plugins/) so for advanced teams they have full control.

The reason we went for SQL-first is that we are trying to find the balance that allows Business Analysts to develop their own logic without having to learn Scala or even Python - as they probably already know SQL.

Hopefully some of the ideas are relevant to what you are building.

lordgroff · 5y ago

I'm reading the docs thoroughly, many excellent ideas, and I'm sure I'll be borrowing some concepts. I also want to commend you for putting in the time in developing proper documentation, always greatly appreciated.

0x008 · 5y ago · 2 in thread

The idea makes sense, but Databricks exposes the complete Spark API, is that true for this project as well? Spark is a lot more than Spark SQL.

seddonm1 (OP) · 5y ago

Yes. Most of the simple stages just invoke the Spark Scala API - for example MLTransform invokes a pretrained SparkML model against a dataframe and returns a new one. You can see the standard Spark ML call: https://github.com/tripl-ai/arc/blob/master/src/main/scala/a.... You can add any plugin you want via the interface: https://arc.tripl.ai/plugins/

This is really about defining a dialect that is simpler for Technical Business Analysts to consume and safer than code, plus a notebook environment to build with interactively.

0x008 · 5y ago

For example, we do a lot of low-level RDD operations through Databricks. From skimming the website, I feel something like this is not in the scope of this project.

In the end, I feel, it is about wording. Databricks is a serverless Spark environment with Azure integration and notebooks. Unless the product copies all the aspects (i.e. the hosting), it may not be wise to call it a Databricks alternative.

If I read the title as it is here on HN, I would think it is about the infrastructure and not about a custom low-code JSON-based template language on top of Spark SQL.

xupybd · 5y ago · 1 in thread

I like the look of this but worry about adopting something as big as this. That said, things tend to grow, and then I wish I'd started with something like this.

seddonm1 (OP) · 5y ago

A completely valid concern.

You can see that all stages in the video implement the PipelineStagePlugin: https://arc.tripl.ai/plugins/#pipeline-stage-plugins. This means you can safely remove them from the code base and recompile without that stage at all. These are all dynamically loaded at runtime, so it should be easy to remove stages (and to implement your own custom logic).

Similarly the Dockerfile https://github.com/tripl-ai/docker/blob/master/arc/Dockerfil... just includes the relevant plugins (if not in the main Arc repository) so you can easily remove them or the Cloud SDKs/JDBC drivers to reduce your surface area.

We have endeavoured to write a large number of tests but there is always room to add more.
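
The idea of stages as removable plugins can be sketched in a few lines. This is a language-neutral illustration in Python, not Arc's Scala plugin interface: the `STAGES` registry, the `register` decorator, `run_pipeline` and both stage names are hypothetical. It only shows why a pipeline that looks stages up by name keeps working when an unused stage is left out of the build.

```python
from typing import Callable, Dict, List

Rows = List[dict]
Stage = Callable[[Rows], Rows]

# Hypothetical registry: stage name -> implementation.
STAGES: Dict[str, Stage] = {}


def register(name: str) -> Callable[[Stage], Stage]:
    """Decorator that adds a stage to the registry under the given name."""
    def decorator(fn: Stage) -> Stage:
        STAGES[name] = fn
        return fn
    return decorator


@register("DropNulls")
def drop_nulls(rows: Rows) -> Rows:
    # Keep only rows where every value is present.
    return [r for r in rows if all(v is not None for v in r.values())]


@register("UppercaseNames")
def uppercase_names(rows: Rows) -> Rows:
    return [{**r, "name": r["name"].upper()} for r in rows]


def run_pipeline(stage_names: List[str], rows: Rows) -> Rows:
    """Run the named stages in order; fail clearly if a stage is not installed."""
    for name in stage_names:
        if name not in STAGES:
            raise KeyError(f"stage {name!r} is not installed")
        rows = STAGES[name](rows)
    return rows


result = run_pipeline(["DropNulls", "UppercaseNames"], [{"name": "ada"}, {"name": None}])
```

Deleting the `uppercase_names` definition removes that stage from the registry without touching `run_pipeline`; only pipelines that name the missing stage fail, and they fail with an explicit error.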

crimsoneer · 5y ago · 1 in thread

As a data person who despairs at the terrible data pipelines I have to work with, this seems cool! Shall follow with interest.

seddonm1 (OP) · 5y ago

Yes I think as a community we have largely got our levels of abstraction incorrect:

- Code for pipelines without frameworks leads to huge repetition of logic - or worse, people reimplementing the same 'logic' differently. You also end up with a massive upgrade problem when new versions of the underlying execution engines change.

- Databricks provides a low-level API which leads to a lot of duplication of common code across notebooks (and reusability is difficult).

- The GUI based tools are often too high level so have very high reusability but are difficult to customise - and hard to source control.

We have tried to build an abstraction somewhere in between which gives you the reusability of the GUI tools, plays nicely with source control and has the power to add custom logic via the plugin interface: https://arc.tripl.ai/plugins/ if required.

(formatting)

marcinzm · 5y ago · 1 in thread

I'm curious how this compares to www.getdbt.com which seems to target a similar audience (technical analysts wanting to do ETL) with a similar approach (SQL first).

seddonm1 (OP) · 5y ago

Thanks. dbt is very cool and evolved at the same time but focuses on the Transform step of ETL only. Unfortunately, as data engineers, we still spend a lot of time consolidating the many input sources to perform that transformation and also want to load it to places.

You can see the work we have done to build standardised methods for Extract (https://arc.tripl.ai/extract/) and Load (https://arc.tripl.ai/load/) in the documentation.

psing · 5y ago · 1 in thread

Can you choose between complete pulls of the source and incremental pulls based on a timestamp field?

seddonm1 (OP) · 5y ago

Yes. We usually use a ConfigExecute (https://arc.tripl.ai/execute/#configexecute) stage to dynamically calculate a runtime parameter and pass that into the JDBCExtract query for example. There is an example here: https://arc.tripl.ai/solutions/#delta-processing
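
The two-step pattern described above (first calculate a runtime parameter, then pass it into the extract query) can be sketched generically. This is not Arc configuration: SQLite stands in for the JDBC source and the target, and the `orders` table, `compute_watermark` and `extract_since` are hypothetical names chosen for the illustration.

```python
import sqlite3

# Source system with three rows (hypothetical table and data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2021-03-01T00:00:00"),
        (2, "2021-03-02T00:00:00"),
        (3, "2021-03-03T00:00:00"),
    ],
)

# Target that has already loaded the first row.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
target.execute("INSERT INTO orders VALUES (1, '2021-03-01T00:00:00')")


def compute_watermark(conn):
    """Step 1: derive the runtime parameter from what has already been loaded."""
    latest = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    # An empty target yields a very old watermark, which turns the run into a complete pull.
    return latest or "1970-01-01T00:00:00"


def extract_since(conn, watermark):
    """Step 2: pass the parameter into the extract query."""
    return conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (watermark,),
    ).fetchall()


watermark = compute_watermark(target)
new_rows = extract_since(source, watermark)
with target:  # commit the incremental load as one transaction
    target.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)
```

The sketch relies on ISO 8601 timestamps stored as text, which sort correctly as strings; other timestamp formats would need a proper date comparison.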

justosophy · 5y ago · 1 in thread

Good to see more attention to this. AWS did a presentation on it last year.

seddonm1 (OP) · 5y ago

Cool! I was not aware.

robobro · 5y ago · 1 in thread

Remember when arc was a lisp that powered hackernews? Glad to read she's all grown up

seddonm1 (OP) · 5y ago

It was actually named after an electric arc as it was initially developed at a large power company - but yes, naming is heavily overloaded now.

ozten · 5y ago · 1 in thread

Arc as a project name on HN ?!? OP account created November 13, 2018... okay, alright.

user-the-name · 5y ago

Given just how massive a flop Arc the language was, no wonder nobody would have heard of it.

glogla · 5y ago

I like it a lot, but how large scale can it be?

If I want to move a whole JDBC-accessible database to a warehouse or lakehouse (like Postgres or Oracle to S3 with Iceberg or Snowflake or something), do I have to build a set of configuration for every table, or can I use wildcards, autodetection, etc.?
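
The thread does not say whether Arc supports wildcards or autodetection. Independent of that, one generic workaround is to generate the per-table configuration from the database catalogue instead of writing it by hand. The sketch below assumes a SQLite source for the sake of a runnable example; `list_tables`, `build_extract_stages`, the JDBC URL and every key in the generated dictionaries are hypothetical and are not taken from Arc's documented schema.

```python
import json
import sqlite3


def list_tables(conn):
    """Read user table names from the SQLite catalogue (other databases expose similar metadata)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows if not name.startswith("sqlite_")]


def build_extract_stages(tables, jdbc_url):
    """Emit one extract-stage dictionary per table; all keys here are hypothetical."""
    return [
        {
            "type": "JDBCExtract",
            "name": f"extract {table}",
            "jdbcURL": jdbc_url,
            "tableName": table,
            "outputView": table,
        }
        for table in tables
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER)")

stages = build_extract_stages(list_tables(conn), "jdbc:postgresql://example-host/db")
config_text = json.dumps({"stages": stages}, indent=2)
```

Generating the file keeps the pipeline declarative and reviewable in source control, at the cost of a regeneration step whenever tables are added or dropped.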