Arc is declarative and currently targets the Apache Spark execution engine, but the abstracted API allows the execution engine to be replaced in future without having to rewrite the logic or intent of the pipeline. It supports parameterized notebooks for building complex pipelines, which can be executed in CI/CD environments for safe deployment.
We would be interested to hear your feedback.
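To illustrate the idea of keeping pipeline intent separate from the execution engine, here is a minimal Python sketch. It is not Arc's API or configuration format: the stage types ("filter", "select") and both engine classes are hypothetical names made up for this example. The same declarative pipeline is handed to two different engines and should give the same result.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]

# The pipeline is plain data: it states intent, not how to execute it.
# Stage types and keys are hypothetical, not Arc's configuration format.
PIPELINE: List[dict] = [
    {"type": "filter", "column": "amount", "min": 10},
    {"type": "select", "columns": ["id", "amount"]},
]


def _apply(stage: dict, rows: Iterable[Row]) -> Iterator[Row]:
    """Translate one declarative stage into a lazy row transformation."""
    if stage["type"] == "filter":
        return (r for r in rows if r[stage["column"]] >= stage["min"])
    if stage["type"] == "select":
        return ({c: r[c] for c in stage["columns"]} for r in rows)
    raise ValueError(f"unknown stage type: {stage['type']!r}")


class ListEngine:
    """Eager engine: materialises a full list after every stage."""

    def run(self, pipeline: List[dict], rows: Iterable[Row]) -> List[Row]:
        current = list(rows)
        for stage in pipeline:
            current = list(_apply(stage, current))
        return current


class StreamingEngine:
    """Lazy engine: chains generators and materialises only at the end."""

    def run(self, pipeline: List[dict], rows: Iterable[Row]) -> List[Row]:
        stream: Iterable[Row] = iter(rows)
        for stage in pipeline:
            stream = _apply(stage, stream)
        return list(stream)
```

Swapping `ListEngine` for `StreamingEngine` changes how the work is done but leaves `PIPELINE` untouched, which is the property described above.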
For anyone else wondering, this appears to be "Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department", by Jeff Magnusson in 2016: https://multithreaded.stitchfix.com/blog/2016/03/16/engineer....
Our primary design goal was for the system to be self-service for data scientists. Since our data scientists use pandas dataframes and Jupyter notebooks all the time, we built the system around these two: (1) we have a library (that we call pype) acting as an interface between the database and Python dataframes (similar to the .to_csv method), so there are no SQL queries in ETL scripts; (2) we schedule (parametrized) notebooks using some special keywords.
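As a rough sketch of what such a dataframe-to-database interface could look like (this is not pype itself, whose API is not shown here), the example below wraps pandas' real `DataFrame.to_sql` and `read_sql_query` calls behind two hypothetical helpers, `save_table` and `load_table`. An in-memory SQLite database stands in for the real database, which is an assumption made purely so the example is self-contained.

```python
import sqlite3

import pandas as pd


def save_table(df: pd.DataFrame, con: sqlite3.Connection, table: str) -> None:
    """Write a dataframe to a table, replacing any existing table."""
    df.to_sql(table, con, if_exists="replace", index=False)


def load_table(con: sqlite3.Connection, table: str) -> pd.DataFrame:
    """Read a whole table into a dataframe."""
    # The SQL lives here, inside the library, not in the ETL script.
    return pd.read_sql_query(f'SELECT * FROM "{table}"', con)


# The "ETL script" below contains only dataframe operations.
con = sqlite3.connect(":memory:")  # stand-in for the real database
orders = pd.DataFrame({"id": [1, 2, 3], "amount": [5.0, 20.0, 12.5]})
save_table(orders, con, "orders")

big_orders = load_table(con, "orders").query("amount > 10")
save_table(big_orders, con, "big_orders")
```

The design choice is that the notebook author works only with dataframes, while table naming and SQL generation stay in one library.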
We have a demo screencast: https://drive.google.com/file/d/1SVTduaIH_3IsJ-QoGI4mLYZE8Jv...
With Apache Arrow (https://arrow.apache.org/) I think the future looks very bright for both of our projects. It is important to have standard open source libraries and my early experiments have shown very good performance results.
Can you please elaborate on this statement? Does it mean you dump the data the first time the job is executed? Because if the data is evolving, at some point execution should produce different results.
I am confused by the title `Arc, an open-source Databricks alternative`. One of the main benefits of Databricks is the managed Spark. This isn't replacing Databricks as such; it is probably giving an alternative to one of the features in Databricks.
For example, we found that Databricks' Spark (or their 'Delta engine' or whatever it's called) had 50-60% better performance on our workloads than 'core' Spark. I guess that's not surprising when a large proportion of Spark contributors work for you and can performance tune! Not to mention things like MLflow and all their data engineering stuff.
This is a cool project, and I admire its ambition, but saying it's a real 'alternative' to Databricks is a bit disingenuous.
- arc-jupyter: allows you to develop on your local machine (and offline) or you can easily integrate it with a JupyterHub deployment on Kubernetes (https://zero-to-jupyterhub.readthedocs.io/en/stable/index.ht...). We have built JupyterHub on GCP Kubernetes (GKE) with full user-level auth via GCP IAM. If anyone is interested I can publish a secrets-removed version of our script.
- arc: is the execution-only Docker image (so it is smaller than arc-jupyter). We have this orchestrated on Kubernetes too, and now that Spark officially supports Kubernetes deployment it is actually really easy to create and destroy clusters on demand.
Most of the time we run Spark on a single node (i.e. --master local[*]), as we can now easily utilise large nodes (e.g. 128 cores, 512GB). Spark scales vertically well, but it also runs relatively well on a small node, like the Docker example on the website running on a laptop. The ability to run SQL against separate storage is Spark's killer feature in my view.
Arc does support the full Scala API, which you can implement as a plugin (https://arc.tripl.ai/plugins/), so advanced teams have full control.
The reason we went for SQL-first is that we are trying to find the balance that allows Business Analysts to develop their own logic without having to learn Scala or even Python - as they probably already know SQL.
Hopefully some of the ideas are relevant to what you are building.
This is really defining a dialect that is simpler for Technical Business Analysts to consume and safer than code, plus a notebook environment to build with interactively.
In the end, I feel, it is about wording. Databricks is a serverless Spark environment with Azure integration and notebooks. Unless the product copies all the aspects (i.e. the hosting) it may not be wise to call it a Databricks alternative.
If I read the title as it is here on HN, I would think it is about the infrastructure and not about a custom low-code JSON-based template language on top of Spark SQL.
You can see that all stages in the video implement the PipelineStagePlugin: https://arc.tripl.ai/plugins/#pipeline-stage-plugins. This means you can safely remove them from the code base and recompile without that stage at all. These are all dynamically loaded at runtime, so it should be easy to remove them (and to implement your own custom logic).
Similarly the Dockerfile https://github.com/tripl-ai/docker/blob/master/arc/Dockerfil... just includes the relevant plugins (if not in the main Arc repository) so you can easily remove them or the Cloud SDKs/JDBC drivers to reduce your surface area.
We have endeavoured to write a large number of tests but there is always room to add more.
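For readers unfamiliar with the pattern, the dynamically loaded stage plugins mentioned above can be pictured with a small Python sketch. This is an analogy only, not Arc's Scala PipelineStagePlugin interface: the registry, the decorator and the stage names ("DropNulls", "UppercaseTransform") are all hypothetical. Stages are looked up by name at run time, so a stage that has been removed simply cannot be resolved.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
StageFn = Callable[[List[Row], dict], List[Row]]

# Registry of available stages, filled at import time (hypothetical design).
STAGE_PLUGINS: Dict[str, StageFn] = {}


def stage_plugin(name: str) -> Callable[[StageFn], StageFn]:
    """Decorator that registers a stage implementation under a type name."""
    def register(fn: StageFn) -> StageFn:
        STAGE_PLUGINS[name] = fn
        return fn
    return register


@stage_plugin("DropNulls")
def drop_nulls(rows: List[Row], params: dict) -> List[Row]:
    col = params["column"]
    return [r for r in rows if r.get(col) is not None]


@stage_plugin("UppercaseTransform")
def uppercase(rows: List[Row], params: dict) -> List[Row]:
    col = params["column"]
    return [{**r, col: str(r[col]).upper()} for r in rows]


def run(config: List[dict], rows: List[Row]) -> List[Row]:
    """Resolve each configured stage by name at run time and apply it."""
    for stage in config:
        try:
            plugin = STAGE_PLUGINS[stage["type"]]
        except KeyError:
            raise LookupError(
                f"no plugin registered for stage type {stage['type']!r}"
            ) from None
        rows = plugin(rows, stage)
    return rows


CONFIG = [
    {"type": "DropNulls", "column": "name"},
    {"type": "UppercaseTransform", "column": "name"},
]
ROWS = [{"name": "a"}, {"name": None}]
result = run(CONFIG, ROWS)
```

In this sketch, deleting a plugin from the registry means any configuration that still references it fails with a `LookupError`, while custom logic is added by registering one more function.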
- Code for pipelines without frameworks leads to huge repetition of logic - or worse, people reimplementing the same 'logic' differently. Also you end up with a massive upgrade problem when new versions of the underlying execution engines change.
- Databricks provides a low-level API which leads to a lot of duplication of common code across notebooks (and reusability is difficult).
- The GUI-based tools are often too high-level, so they have very high reusability but are difficult to customise - and hard to source control.
We have tried to build an abstraction somewhere in between which gives you the reusability of the GUI tools, plays nicely with source control and has the power to add custom logic via the plugin interface: https://arc.tripl.ai/plugins/ if required.
You can see the work we have done to build standardised methods for Extract (https://arc.tripl.ai/extract/) and Load (https://arc.tripl.ai/load/) in the documentation.
If I want to move a whole JDBC-accessible database to a warehouse or lakehouse (like Postgres or Oracle to S3 with Iceberg or Snowflake or something), do I have to build a set of configurations for every table, or can I use wildcards, autodetection, etc.?