(I have used all of the above tools in my 15+ yr career. Code as ETL was a huge industry shift)
I think fundamentally, the problem is that any DSL lives on a design tradeoff spectrum between domain-specificity and programming-generality. There is only so much savings that a generic ETL tool will give you, because it doesn't want to be domain specific. On the other hand, the more domain specific a tool is, the more limited it will be in it's capacity to handle other domains.
No amount of graphic or textual interface is gonna save you from this fundamental design tradeoff (it's kind of complexity tradeoff - things inside domain are easy but outside the domain are hard). But then, you can as well handle this tradeoff in a decent library/framework for a general programming language, which is, effectively, a "return to code".
Other way to look at it is that we want limited programming languages, because the implicit assumptions make things easier for the reader of the code (the person who wants to understand). However, limitations make things harder for the writer of the code (the person who wants to solve some problem).
In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc. I haven't met an engineer that likes the service. Some projects have resorted to writing code generation tools, so that they can take config files and programmatically generate the JSON serialization of the pipelines that you're supposed to develop by clicking and dragging.
This graph-based visual programming somehow fails to deliver on speed of development. Mouse has 2 buttons, the keyboard approx. 100. Not to mention that LLMs work on the language/code level, and are expected to stay so for a while. We dont' have universal means to express things visually. Except for the graph notation of edgex/virtices. But then there is no universal knowledge, people dont usually disambiguate between sequence diagram, bpmn and state diagram. these are all graphs, right, but not the same semantically.
I'd rather go for a standardized ETL langauge a-la-markdown, and only then get to appreciate the GUI.
Why? We are pretty deep into the ecosystem.
Abstraction -> the only thing data factory does not allow you is to reference a previous activity as a variable, which makes sense if you don't want to let your customer blow up your product. Parametrize all you want.
Unit testing -> test all you want each activity, pipeline, flow, resume it from where it broke. Clone the entire thing into a test data factory, then deploy that once ready.
CI/CD -> the first step it nags you about is setting up CI/CD. If you want to get fancy, you setup a dev environment and deploy that to production after testing and sign-off.
Abstracting ETL only works when you remember or have the same people on staff that abstracted that ETL process. Data factory 'could' be visual but does not let you pull the same level of non-sense that SSIS would.
For example, we call data factory via API, the pipeline is fully abstracted, it does one thing, but it's inputs and outputs are controlled by the request.
this is reminiscent of https://mikehadlow.blogspot.com/2012/05/configuration-comple...
ETL is hot again with machine learning. Companies have massive amounts of data they need to get in shape for models. The promise of a GUI anyone can use with little experience (low costs) means these tools are gaining traction again for the same lessons to be learnt from the last generation who moved on/out the industry.
That said, if this thing is customizable enough, a good data engineer can prepare canned steps that fit the general structure of the customer data process and it may have its place.
My personal pain is Tableau trying to move uo the stack into data orchestration and compute. Code is completely inaccessible.
When I transitioned to ETL as a code 12 years ago, I felt for the first time what it looks like to be a programmer/developer, since in my case the technology was literally a means to an end. To be honest, I did not even think about tech at all.
Just to give an idea, I was responsible for 5 instances of SQL Server, and with another person we were responsible for the whole end-to-end data operation (i.e. ETL via SSIS + OLAP cube via SSAS + and reporting via SSRS), and today I struggle to have the same level of scale with less than 7 people.
Fast forward to 2024, the inverse is true: I spend way more time thinking about the tools and their specificities than about the problem that I am solving at the end of the day.
I'm not telling that ETL as a code is bad (actually there's a lot of good stuff out there) but for folks in this kind of ecosystem (e.g. azure), definitely the juice to have some homemade code and all the stuff that comes with that, does not worth the squeeze.
No it’s not. Try and see what banks, retail, manufacturing, various large enterprises still use. They need scale, observability, modularity, and maintainability.
> They need scale, observability, modularity, and maintainability.
Seems orthogonal to code-vs-gui dimension.
From my experience SQL is basically the lowest common denominator everything speaks and even then the Oracle SQL dialect is subtly different to Microsoft SQL for example - things are subtly different enough it introduces frustrations.
There has been movement in last couple of years to hoist everything into a common "datalake" but my understanding has been that ingestion into this lake is not a simple process by any means and requires batch processes that need demanding compute resources and is slow (i.e. takes many hours and runs over night).
I would also add that in my 25 years in the industry I have never actually come across an ETL workflow that was complex enough that it required people working it in code. Those opinions seem to occur before proper analysis happens.
Six years ago I used Pentaho to do it. And it worked really well. It was easy and quick. Though maintenance was hard sometimes and it felt very dated: The javascript version was ancient, I could find a lot of questions answered online, but they were usually 5-10years old. I am wondering whether I should use something like Amphi for my next simple-ETLs.
If you can get a true CDC stream from the database to analytics, that would be ideal, but when that isn't available you spend 100x more time trying to bodge together an equivalent batch/retry system.
I built a solid career replacing no-code-ETL tools with Airflow.
You can open a pretty complicated large DAG in and right away you’ll understand the data flow and processing steps. If you were to do similar in code, it becomes a lot harder unless you comply to good modular design practices.
This is also why common game engine and 3d rendering tools come with a UI for flow driven scripting. It’s intuitive and much easier to organize.
https://successfulsoftware.net/2024/01/16/visual-vs-text-bas...
TLDR: Best depends on the situation and the people involved. Given the wide range of both situations and people, I think there are is plenty of room for both approaches.
Tldr: as with dashboards, self-serve ETLs don't work outside very specific use cases or very simple pipeline.
Reason for that, as with bi tools, is that the complexity is not in manipulating data. There are tons of frameworks to do that efficiently.
The issue is interpreting data and it's semantics and evolving data pools to the business needs.
In other words, if writing python and sql is the skill requirement that stops you from making an etl pipeline, maybe do something else.
For me it’s more accurate, too. At $work, there’s no difference to how an SWE vs a “DE” works. Same interview process too, DSA, distributed systems etc.
However, having done this for more than a decade, that is relatively rare. It’s usually a mix of GUI tools with zero reproducibility / infra-as-code, untyped python, copy pasted shell scripts, zero tests, zero ci/cd, no lifting/static analysis/code reviews etc., paired with generally zero understanding of the underlying tech. It’s all very formulaic with little to no actual understanding.
I will spare you my usual rant on why a language without a solid type system like python is a horrible idea for this field, too.
Which is why I much appreciate dbt. While some people scoff at the idea of “SQL with jinja templating”, their approach has certainly helped to move DE closer to SWE work, purely by virtue of their value prop mostly being exactly that. And it works out great.
They actually implied that you should try upskilling first — but if that fails, you shouldn't be doing ETL yourself.
I mostly disagree with the parent comment because there's so many things one can easily do up to a level, and then when the going gets tough, you need to call in an expert. Eg. most people can operate a screwdriver or impact driver to fix things, but to fix some problems, you really need a trained technician (or well, an experienced DIY person, but that's not everybody).
The fact that you are not strong enough to screw in an M14 bolt does not mean you should be forbidden from using an impact driver: tools are there to help you. The logic of the parent comment was seemingly that if you are not strong enough to tighten an M14 bolt, you probably don't know what you are doing regardless of the type of the bolt you are tightening, so you should simply not do it.
The point I agree with in a parent comment is that not everybody can achieve a similar level of proficiency: while upskilling and improving/simplifying tools can get you most of the way there, there's always going to be that extra bit that requires a sudden, sharp jump in knowledge, smartness or experience to be able to deal with it.
I think the point is that these tools have their own learning curve, and non-tech business people are not doing it well, either; how much different is it from learning SQL? Which one is more broadly valuable and transferrable as skill?
If this is the career you want (data or data-adjacent), why not just learn SQL? There are far more learning resources and the value of the knowledge will assuredly outlast any low-code tool.
To give some context, Amphi is a low-code ETL tool for both structured and unstructured data. The key use cases include file integration, data preparation, data migration, and creating data pipelines for AI tasks like data extraction and RAG. What sets it apart from traditional ETL tools is that it generates Python code that you own and can deploy anywhere. Amphi is available as a standalone web app or as a JupyterLab extension.
Visit the GitHub: https://github.com/amphi-ai/amphi-etl Give it a try and let me know what you think
Possibly more of a subtle miscommunication or misunderstanding than a deliberate lie.
custom extension for jupyterlab is a great way to leverage existing jupyterlab install base: not everyone will be willing to install and jump through hoops to install software X, but installing extension is one pip install away and no need to run separate process, since you are running inside jupyterlab server.
this reminds of ALTERYX (another drag and drop ETL tool)
Some users pointed out they were Alteryx users but liked the Python code generation from Amphi :)
Wikipedia:
https://en.m.wikipedia.org/wiki/Extract,_transform,_load
I'll just add:
It is a common term and practice among enterprise software users, i.e. generally medium or large companies that use packaged plus custom software for their business needs.
ETL is not common among startups, because they have a different focus, infrastructure and scale.
Extract, transform and load...
vs
ETL: Extract, transform and load data...
Extract, transform and load (ETL) data...
Extract, Transform and Load data...
what makes it not OSS?
we[0] use meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.
However, Amphi is a low-code solution while those two are code-based. Also, those two focus on the ingestion part (EL) while Amphi is focusing on different ETL use-cases (file integration, data preparation, AI pipelines).
Good luck! Looks cool.
It seems Definite's use case is focused on connecting to lots of data sources. For much smaller scale, how does Amphi compare?
It looks like Amphi could handle some low code transformations (the "T" in ETL), but calling it ETL feels like a stretch.
So to rephrase a bit, if you're looking for an open source, python based Fivetran alternatives, dlt and meltano would be my picks.
Still, Amphi should not claim to be 'Open Source'.
You could just use SQL but then you'd have to develop and test the entire infrastructure to support your component-oriented architecture from scratch, and at that point you're kind of just reinventing the wheel because that's basically just pandas with less features.
Low-code is kind of just Authorware for a new generation... assuming you're old enough to remember that technology.
1) Are there any"standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?
2) Is there any such format that's also reasonably human readable / writable?
3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?
Thanks and sorry if these are dumb questions.
1. As far as I know, there isn't a "standard" file format for low-code pipelines.
2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.
3. Funny you ask, I actually tried to make Amphi run in the browser with WASM. I think it's still too early in terms of both performance and limitations. Performance will likely improve soon, but browser limitations currently prevent the use of sockets, which are indispensable for database connections, for example.
My advice is to avoid, in general, low code tools if you plan to have software engineers involved. And once there aren't any software engineers whatever gets built is going to be a mess by software engineering standards so just roll with it. Any tool is equally likely to hit your pain points (and generate an unmanageable mess).
If my data pipeline is "take this table, filter it, output it", I really don't want to use a "csv file input" or a "excel file output".
I want to say "anything here in the pipeline that I will define that behaves like a table, apply it this transformation", so that I can swap my storage later without touching the pipeline.
Same things for output. Personally I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.
That being said, can't complain about a free tool, kudos on making it available !
It looks nice though.
ML in ETL is needed for raw initial classification of documents received in various formats from various sources, to clean-up scanned crap, no more than that, all the effort to plug LLMs was so far and i bet will be for the next 10 years a disaster.
ETL is something that should not exists in a modern world because we should exchange data in usable formats instead of having to import the with all sort of gimmick, we do not have such acculturated world but at least we can try to simplify and teaching instead of adding entropy.