Open Source Python ETL (opens in new tab)

(amphi.ai)

247 pointsjustjico2y ago124 comments

124 comments

101 comments · 27 top-level

iblaine2y ago· 24 in thread

Low code ETL tools (informatica, Appworx, talend, pentaho, ssis) were the original services for ELT/ETL. A lot of progress was made to go towards ETL-as-code starting with Airflow/Luigi. Going back to low code seems backwards as this point.

(I have used all of the above tools in my 15+ yr career. Code as ETL was a huge industry shift)

js82y ago

It's fascinating that somebody sees these things as "original". I used to maintain a 40-year old ETL codebase in SAS. IMHO the original low code movement was the 4GL movement from the 70s (and, according to Wikipedia, rapid application development movement from the 90s, things like Visual Basic).

I think fundamentally, the problem is that any DSL lives on a design tradeoff spectrum between domain-specificity and programming-generality. There is only so much savings that a generic ETL tool will give you, because it doesn't want to be domain specific. On the other hand, the more domain specific a tool is, the more limited it will be in it's capacity to handle other domains.

No amount of graphic or textual interface is gonna save you from this fundamental design tradeoff (it's kind of complexity tradeoff - things inside domain are easy but outside the domain are hard). But then, you can as well handle this tradeoff in a decent library/framework for a general programming language, which is, effectively, a "return to code".

Other way to look at it is that we want limited programming languages, because the implicit assumptions make things easier for the reader of the code (the person who wants to understand). However, limitations make things harder for the writer of the code (the person who wants to solve some problem).

rdeboo2y ago

I work as a Data Engineer and in my country Azure is pretty big, and as a consequence their Data Factory service has become a common choice for enterprises. It's a GUI based ETL tool, architects prefer it since it is a managed cloud service and supposedly is easy to use.

In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc. I haven't met an engineer that likes the service. Some projects have resorted to writing code generation tools, so that they can take config files and programmatically generate the JSON serialization of the pipelines that you're supposed to develop by clicking and dragging.

larodi2y ago

While visual representation of ETLs can be of great help understanding the data flow, engineers tend to eventually start using commands - either in the VSCode, or the Cisco iOS, or local shell. Applications subject to scripted automation and having command line tend to be well respected - a good example is AutoCAD which had a prompt from day one, which is like many years ago. This prompt still stays and is used by architects and alike.

This graph-based visual programming somehow fails to deliver on speed of development. Mouse has 2 buttons, the keyboard approx. 100. Not to mention that LLMs work on the language/code level, and are expected to stay so for a while. We dont' have universal means to express things visually. Except for the graph notation of edgex/virtices. But then there is no universal knowledge, people dont usually disambiguate between sequence diagram, bpmn and state diagram. these are all graphs, right, but not the same semantically.

I'd rather go for a standardized ETL langauge a-la-markdown, and only then get to appreciate the GUI.

what-the-grump2y ago

>In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc.

Why? We are pretty deep into the ecosystem.

Abstraction -> the only thing data factory does not allow you is to reference a previous activity as a variable, which makes sense if you don't want to let your customer blow up your product. Parametrize all you want.

Unit testing -> test all you want each activity, pipeline, flow, resume it from where it broke. Clone the entire thing into a test data factory, then deploy that once ready.

CI/CD -> the first step it nags you about is setting up CI/CD. If you want to get fancy, you setup a dev environment and deploy that to production after testing and sign-off.

Abstracting ETL only works when you remember or have the same people on staff that abstracted that ETL process. Data factory 'could' be visual but does not let you pull the same level of non-sense that SSIS would.

For example, we call data factory via API, the pipeline is fully abstracted, it does one thing, but it's inputs and outputs are controlled by the request.

shoo2y ago

once those custom codegen tools become established and popular, someone will have the idea that if they slap a GUI on the front it'll be much easier for non-engineers in the business to pick up and use.

this is reminiscent of https://mikehadlow.blogspot.com/2012/05/configuration-comple...

tiew9Vii2y ago

You see this a lot. A new generation comes in to relearn all the old lessons. It’s also happening on the frontend “server side rendering is bad, it’s slow” to the latest frontend frameworks and thought leaders “you should be using SSR” (server side rendering)

ETL is hot again with machine learning. Companies have massive amounts of data they need to get in shape for models. The promise of a GUI anyone can use with little experience (low costs) means these tools are gaining traction again for the same lessons to be learnt from the last generation who moved on/out the industry.

thegiogi2y ago

But that’s the point - it’s just a promise. I have a similar career as the top comment, and turns out that Low code approaches only work when people who understand code use them. Be that Python or sql, real use cases are not the toy examples shown in all typical introductions to these tools.

That said, if this thing is customizable enough, a good data engineer can prepare canned steps that fit the general structure of the customer data process and it may have its place.

1 more reply

banku_brougham2y ago

What is your estimate of the tine required to begin building on the new(old) foundation, discover all the problems after a large investment of effort, and then declare to all that ita the wrong path?

My personal pain is Tableau trying to move uo the stack into data orchestration and compute. Code is completely inaccessible.

2 more replies

braza2y ago

I share the same enthusiasm related to ETL as a code, but as a heavy user of SQL Server 2005 (SSIS) at that time, the main advantages for me were that there was a closed and concise ecosystem around the technology, less fragmentation and multiple pieces that could fail. It was way simpler to reach people to solve the issues on the platform, and less time was needed to think about the technology itself and to think more about the problem.

When I transitioned to ETL as a code 12 years ago, I felt for the first time what it looks like to be a programmer/developer, since in my case the technology was literally a means to an end. To be honest, I did not even think about tech at all.

Just to give an idea, I was responsible for 5 instances of SQL Server, and with another person we were responsible for the whole end-to-end data operation (i.e. ETL via SSIS + OLAP cube via SSAS + and reporting via SSRS), and today I struggle to have the same level of scale with less than 7 people.

Fast forward to 2024, the inverse is true: I spend way more time thinking about the tools and their specificities than about the problem that I am solving at the end of the day.

I'm not telling that ETL as a code is bad (actually there's a lot of good stuff out there) but for folks in this kind of ecosystem (e.g. azure), definitely the juice to have some homemade code and all the stuff that comes with that, does not worth the squeeze.

hbarka2y ago

>> Code as ETL was a huge industry shift

No it’s not. Try and see what banks, retail, manufacturing, various large enterprises still use. They need scale, observability, modularity, and maintainability.

stoperaticless2y ago

Gui Etl makes a promise that you dont need a programmer to wield it. But it is a false promise.

> They need scale, observability, modularity, and maintainability.

Seems orthogonal to code-vs-gui dimension.

1 more reply

bigger_cheese2y ago

I work in manufacturing (large industrial plant) and the data processes we have are honestly not great - mostly it is because there are a heap of legacy system and not a lot of commonality between our data sources we have a hideous mashup of Oracle, DB2, Microsoft SQL Server etc and different versions of the different databases. There's also more bespoke industry stuff like time series historians and SCADA systems/PLCs (ABB, Citect etc) to complicate the process.

From my experience SQL is basically the lowest common denominator everything speaks and even then the Oracle SQL dialect is subtly different to Microsoft SQL for example - things are subtly different enough it introduces frustrations.

There has been movement in last couple of years to hoist everything into a common "datalake" but my understanding has been that ingestion into this lake is not a simple process by any means and requires batch processes that need demanding compute resources and is slow (i.e. takes many hours and runs over night).

5 more replies

happymellon2y ago

Completely agree.

I would also add that in my 25 years in the industry I have never actually come across an ETL workflow that was complex enough that it required people working it in code. Those opinions seem to occur before proper analysis happens.

1 more reply

thibautdr2y ago

Thanks for your comment! I do believe it depends on who you ask and ultimately both will co-exist. I also think low-code solutions democratize access to ETL development offering a significant productivity advantage for smaller teams. With Amphi, I'm trying to avoid the common pitfalls of other low-code ETL tools, such as scalability issues, inflexibility, and vendor lock-in, while embracing the advantages of modern ETL-as-code: - Pipelines are defined as JSON files (git workflow available) - Generates non-proprietary Python code: This means the pipelines can be deployed anywhere, such as AWS Lambda, EC2, on-premises, or Databricks.

banku_brougham2y ago

Im very leery of low code, but I like the idea of ETL defined as configuration.

1 more reply

kaon_2y ago

I would love to have your advice. What tool would you recommend to do straightforward ETL's as a single developer? Think of tasks like ETL-ing data from Production to Test or Local. Or quickly combining data from 2 databases to answer some business question.

Six years ago I used Pentaho to do it. And it worked really well. It was easy and quick. Though maintenance was hard sometimes and it felt very dated: The javascript version was ancient, I could find a lot of questions answered online, but they were usually 5-10years old. I am wondering whether I should use something like Amphi for my next simple-ETLs.

NortySpock2y ago

I've gotten some quick wins with Benthos (now RedPanda Connect) but I agree it's an unsolved problem as there are typically gotchas.

If you can get a true CDC stream from the database to analytics, that would be ideal, but when that isn't available you spend 100x more time trying to bodge together an equivalent batch/retry system.

rubslopes2y ago

I also want to know that. The BI team where I work still uses Pentaho. It's buggy and ugly, but it gets the job done most of the time. A few of them know a little of python, so a tool like Amphi could be the next stage.

hipadev232y ago

clickhouse can enable all the things you mentioned

mywittyname2y ago

Agreed. Well designed Airflow Operators (or taskflow) are basically the same level of effort as creating a box with a UI (honestly, easier IMHO), but the ability to go into code is important for every non-trivial pipeline.

I built a solid career replacing no-code-ETL tools with Airflow.

sqlcook2y ago

You’re missing the point of the benefits of solutions like these, and the original set of tools like the Informatica of the kind. Those tools come with limitations and constrains, like a box of legos you can build a very powerful pipeline without having to wire up a lot of redundant code as you pass data frames between validation stages. Tools like Airflow/Spark etc are great for what they are, but they don’t come with guidelines or best practices when it comes to reusable code at scale, your team has to establish that early on.

You can open a pretty complicated large DAG in and right away you’ll understand the data flow and processing steps. If you were to do similar in code, it becomes a lot harder unless you comply to good modular design practices.

This is also why common game engine and 3d rendering tools come with a UI for flow driven scripting. It’s intuitive and much easier to organize.

mrtranscendence2y ago

I used to be an Informatica ETL dev, briefly. I laugh at the description of it as a low-code solution, as we used it as a simple frontend for staggering amounts of hand-coded SQL. I'm not sure I ever used Informatica as it was intended.

hermitcrab2y ago

The debate about GUI vs code solutions has raged for decades, generating more heat than light. I've tried to give a summary of the pros and cons of the 2 approaches here:

https://successfulsoftware.net/2024/01/16/visual-vs-text-bas...

TLDR: Best depends on the situation and the people involved. Given the wide range of both situations and people, I think there are is plenty of room for both approaches.

gchamonlive2y ago

Somewhat related to this discussion https://news.ycombinator.com/item?id=40646312

Tldr: as with dashboards, self-serve ETLs don't work outside very specific use cases or very simple pipeline.

Reason for that, as with bi tools, is that the complexity is not in manipulating data. There are tons of frameworks to do that efficiently.

The issue is interpreting data and it's semantics and evolving data pools to the business needs.

ic_fly22y ago· 10 in thread

With all the data issues strong quality and normalisation I often get the impression that enabling more people with non CS backgrounds to do this work is not necessarily a good thing.

In other words, if writing python and sql is the skill requirement that stops you from making an etl pipeline, maybe do something else.

otter-in-a-suit2y ago

While I’m not usually on board with gatekeeping, this field already struggles with a huge amount of very non-technical folks and their respective managers, producing overall mediocre results and giving the profession a bad rep, to the point where I now avoid the Data Engineer title and just call it “SWE specialized in large data processing” or something equally as fluffy.

For me it’s more accurate, too. At $work, there’s no difference to how an SWE vs a “DE” works. Same interview process too, DSA, distributed systems etc.

However, having done this for more than a decade, that is relatively rare. It’s usually a mix of GUI tools with zero reproducibility / infra-as-code, untyped python, copy pasted shell scripts, zero tests, zero ci/cd, no lifting/static analysis/code reviews etc., paired with generally zero understanding of the underlying tech. It’s all very formulaic with little to no actual understanding.

I will spare you my usual rant on why a language without a solid type system like python is a horrible idea for this field, too.

Which is why I much appreciate dbt. While some people scoff at the idea of “SQL with jinja templating”, their approach has certainly helped to move DE closer to SWE work, purely by virtue of their value prop mostly being exactly that. And it works out great.

hermitcrab2y ago

So if Bob from accounts needs a new report generating, he has to wait for 6 month for an IT guy to do it? Who probably won't do a very good job, because he doesn't understand what Bob needs as well as Bob does? Bob is going to hack something horrific together in Excel instead. Better surely to let Bob have a GUI point and click tool more appropriate to the job?

morkalork2y ago

On the other hand, Bob keeps asking for a self-serve reporting tool but in my experience, he doesn't actually want to use it. We went the route of putting all the data into a lake and hooking up GUI reporting tools to it and what did we get? Bob doesn't understand this or that column, Bob made a report that is fundamentally flawed, Bob sent a request for the engineers to make him a report using the tool and so on. Bob wanted something, or someone else, to do the work for him. When it became apparent that the tool isn't magic and can neither read minds nor divinate the true meaning of data in the DB, it became the engineer's problem again. So why not let engineers use the tools they prefer?

1 more reply

pelasaco2y ago

Bob doesn't write a specification, because Bob doesn't know as well what he wants. He will have to explore, try out until he reaches something that he can work with. Nobody is willing to spend time planning and documenting stuff. Everyone feels one youtube away from being expert in software development.

pm902y ago

With this argument, Computer Science wouldn’t have progressed beyond assemblers.

chasd002y ago

I sort of see your point. if you’re not willing to at least try to learn something new then, yeah, probably better off doing something else.

anakaine2y ago

This is elitist and frankly, unhelpful. The answer to a skills shortage is not a practitioner lockdown, but policy, training, guidance and mentoring. If you're stuck in start up land and you have this issue, you have hired the wrong skills. If you're encountering this in enterprise land, your organisation, and potentially you depending on your position of influence, should be angling to improve compliance and literacy not through obstruction but through policy and upskilling. Failing to do so will kill your ability to innovate.

necovek2y ago

FWIW, while I disagree with the parent comment, I don't see you arguing against it.

They actually implied that you should try upskilling first — but if that fails, you shouldn't be doing ETL yourself.

I mostly disagree with the parent comment because there's so many things one can easily do up to a level, and then when the going gets tough, you need to call in an expert. Eg. most people can operate a screwdriver or impact driver to fix things, but to fix some problems, you really need a trained technician (or well, an experienced DIY person, but that's not everybody).

The fact that you are not strong enough to screw in an M14 bolt does not mean you should be forbidden from using an impact driver: tools are there to help you. The logic of the parent comment was seemingly that if you are not strong enough to tighten an M14 bolt, you probably don't know what you are doing regardless of the type of the bolt you are tightening, so you should simply not do it.

The point I agree with in a parent comment is that not everybody can achieve a similar level of proficiency: while upskilling and improving/simplifying tools can get you most of the way there, there's always going to be that extra bit that requires a sudden, sharp jump in knowledge, smartness or experience to be able to deal with it.

1 more reply

itsoktocry2y ago

>This is elitist and frankly, unhelpful. The answer to a skills shortage is not a practitioner lockdown, but policy, training, guidance and mentoring.

I think the point is that these tools have their own learning curve, and non-tech business people are not doing it well, either; how much different is it from learning SQL? Which one is more broadly valuable and transferrable as skill?

If this is the career you want (data or data-adjacent), why not just learn SQL? There are far more learning resources and the value of the knowledge will assuredly outlast any low-code tool.

pydry2y ago

Skills shortage?

thibautdr2y ago· 6 in thread

Hi everyone, thanks for posting Amphi :)

To give some context, Amphi is a low-code ETL tool for both structured and unstructured data. The key use cases include file integration, data preparation, data migration, and creating data pipelines for AI tasks like data extraction and RAG. What sets it apart from traditional ETL tools is that it generates Python code that you own and can deploy anywhere. Amphi is available as a standalone web app or as a JupyterLab extension.

Visit the GitHub: https://github.com/amphi-ai/amphi-etl Give it a try and let me know what you think

OutOfHere2y ago

You know what does not set it apart? AI-washing. Also, lying about being open source when it isn't.

isjamesalive2y ago

To be fair, the only place the words ‘open’ and ‘source‘ appear in the readme are once in a sub-heading, where it’s phrased ‘open-source’. It’s clearly labelled ELv2.

Possibly more of a subtle miscommunication or misunderstanding than a deliberate lie.

1 more reply

slt20212y ago

i liked the idea of leveraging jupyterlab as server. data engineers/scientists already use jupyter, so this is neat idea.

custom extension for jupyterlab is a great way to leverage existing jupyterlab install base: not everyone will be willing to install and jump through hoops to install software X, but installing extension is one pip install away and no need to run separate process, since you are running inside jupyterlab server.

this reminds of ALTERYX (another drag and drop ETL tool)

thibautdr2y ago

Thanks! Being based on JupyterLab also allows Amphi to benefit from the vast ecosystem of extensions already available, such as the Git extension or using different file systems (S3).

Some users pointed out they were Alteryx users but liked the Python code generation from Amphi :)

1 more reply

johhns42y ago

Wow amazing work! How does the inputs work, are they created for you or does it support custom as well?

thibautdr2y ago

Thank you! Inputs components are pre-built for now but the ability to add custom inputs is coming soon!

1 more reply

awesomebytes2y ago· 5 in thread

I was not familiar with the acronym ETL and it is not explained anywhere in the website! My feedback would be to at least write it once, on the first instance so others like me will know what they are reading :)

fuzztester2y ago

Others already replied about what ETL is.

Wikipedia:

https://en.m.wikipedia.org/wiki/Extract,_transform,_load

I'll just add:

It is a common term and practice among enterprise software users, i.e. generally medium or large companies that use packaged plus custom software for their business needs.

ETL is not common among startups, because they have a different focus, infrastructure and scale.

m4632y ago

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed (including cleaning), and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations.

https://en.wikipedia.org/wiki/Extract,_transform,_load

ljouhet2y ago

Didn't understand either:

Extract, transform and load...

ETL: Extract, transform and load data...

Extract, transform and load (ETL) data...

Extract, Transform and Load data...

thibautdr2y ago

Thanks for pointing that out, it's actually mentioned (Extract, transform and load ...) in the very first sentence below the tagline, but if you didn't get it then it's not clear.

ghoshbishakh2y ago

It is written on the website: Extract, transform and load. Yes, an illustrative example description would help I agree.

jamesblonde2y ago· 4 in thread

#dang The title needs changing - it's not open-source, it is license ELv2 - Elastic License v2.

maleldil2y ago

While you're right (it's indeed not open-source), the project advertises itself as such, so the title is "right", even if it's a lie.

lma212y ago

isn't the code available here? https://github.com/amphi-ai/amphi-etl

what makes it not OSS?

uneekname2y ago

That is source available, not open source. The term "open-source" is widely used to describe software that is licensed using a specific set of software licenses that grant certain freedoms to users. You can read more here[0]

[0] https://opensource.org/licenses

mrtranscendence2y ago

It's open source if you're using language like a normal human being. If you're a bit of a pedant and wish everyone to adhere to definitions imposed from on high regardless of real-world usage, it's absolutely not open source.

1 more reply

mritchie7122y ago· 4 in thread

If you're looking for "open source Python ETL", two things that are better options:

https://dlthub.com/

https://hub.meltano.com/

we[0] use meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.

0 - https://www.definite.app/

thibautdr2y ago

Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

However, Amphi is a low-code solution while those two are code-based. Also, those two focus on the ingestion part (EL) while Amphi is focusing on different ETL use-cases (file integration, data preparation, AI pipelines).

mritchie7122y ago

I understand that. I'd change the title / H1 though, "Open Source Python ETL" doesn't describe what you're building very well.

Good luck! Looks cool.

jdnier2y ago

Are you able to describe what makes them better? (Honest question, I'm not familiar with either or with Amphi.)

It seems Definite's use case is focused on connecting to lots of data sources. For much smaller scale, how does Amphi compare?

mritchie7122y ago

most data engineers would think of something like Fivetran when you say "ETL" (look at the ETL section here[0]).

It looks like Amphi could handle some low code transformations (the "T" in ETL), but calling it ETL feels like a stretch.

So to rephrase a bit, if you're looking for an open source, python based Fivetran alternatives, dlt and meltano would be my picks.

0 - https://mattturck.com/landscape/mad2024.pdf)

cvalka2y ago· 3 in thread

THIS IS NOT OPEN SOURCE!

maphew2y ago

It's published on GitHub under license ELv2 - Elastic License v2. This does not meet the open source definition, so indeed it's not Open Source. ELv2 is an open source sibling though, closer than many other openish licenses: https://www.elastic.co/pricing/faq/licensing

Still, Amphi should not claim to be 'Open Source'.

runningmike2y ago

IMHO this title is primarily chosen for promotion. Not needed. But unfortunately many people, young and old, have never heard of OSI.

anakaine2y ago

paulvnickerson2y ago· 2 in thread

Very cool, thanks for sharing. Does it support the pandas-like rapidsai dask_cudf framework? (https://docs.rapids.ai/api/dask-cudf/stable/)

thibautdr2y ago

Great, thanks for sharing. I was familiar with Dask and cudf separately but not this one.I was planning to implement dask support through Modin but I'll definitely take a look at dask_cudf.

paulvnickerson2y ago

Cool. We use it a lot at work for working with large data sets on a GPU cluster.

tayloramurphy2y ago· 2 in thread

I'm curious as to the story of how things like this come to be. It seems like there are already a ton of "open source python ETL" tools on the market. Was this a passion project by the author? Was this born out of academia? Was there a specific problem they were trying to solve that others didn't? It's not necessary to answer these questions in the docs but it is useful for folks who may be familiar with the other options out there.

thibautdr2y ago

Thanks for your comment, those are valid points. I come from the industry, having worked for an ETL vendor for 6 years. I've personally witnessed a need for a low-code (graphical) ETL for Python environments. In short, traditional ETLs are GUI ETLs for Java environments while modern data tools are either focusing on the EL part or are code-oriented (dbt). With Amphi, I want to offer a low-code graphical alternative to develop Python-based pipelines. I also believe that modern data stack tools don't effectively address use cases for unstructured data, which is another focus of Amphi (with extensive file integration and RAG support).

tayloramurphy2y ago

Appreciate the reply! Thanks for the context.

gregw22y ago· 2 in thread

Isn’t pandas centric ETL much more memory intensive and less compute efficient than using SQL?

rkozik19892y ago

That's kind of the tradeoff you make with any low-code/no-code technology. You leverage prebuilt components and string them together to achieve some kind of task. Which isn't most efficient thing in the world to do but it does work assuming you have enough compute resources to throw at it, and return what you generally achieve is an end product that's completed faster than the traditional development route.

You could just use SQL but then you'd have to develop and test the entire infrastructure to support your component-oriented architecture from scratch, and at that point you're kind of just reinventing the wheel because that's basically just pandas with less features.

Low-code is kind of just Authorware for a new generation... assuming you're old enough to remember that technology.

thibautdr2y ago

I wrote an article questioning the use of Pandas for ETL. I invite you to read it: https://medium.com/@thibaut_gourdel/should-you-use-pandas-fo...

Joeboy2y ago· 2 in thread

Since there are "ETL" people here, I have a couple of naive questions, in case anybody can answer:

1) Are there any"standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?

2) Is there any such format that's also reasonably human readable / writable?

3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?

Thanks and sorry if these are dumb questions.

thibautdr2y ago

Thanks for the great questions:

1. As far as I know, there isn't a "standard" file format for low-code pipelines.

2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.

3. Funny you ask, I actually tried to make Amphi run in the browser with WASM. I think it's still too early in terms of both performance and limitations. Performance will likely improve soon, but browser limitations currently prevent the use of sockets, which are indispensable for database connections, for example.

roenxi2y ago

They're good questions, but they are not answerable blind. The correct choices depend too much on what problems you are trying to solve, the formats and scale of the data involved, the tolerances for downtime and what other software is being used.

My advice is to avoid, in general, low code tools if you plan to have software engineers involved. And once there aren't any software engineers whatever gets built is going to be a mess by software engineering standards so just roll with it. Any tool is equally likely to hit your pain points (and generate an unmanageable mess).

vekker2y ago· 2 in thread

Does this also manage the infrastructure side of ETL? Usually some parts in a complex ETL process take a lot more processing power, so are run on different machines. From a quick glance at this, it seems like a WYSIWYG ETL tool for running ETL jobs on one machine?

thibautdr2y ago

Thanks for your question. Amphi generates Python code using Pandas and can scale on a single machine or even multiple machines using Modin, but the process is manual for now. Future plans include deploying pipelines on Spark clusters and other services such as Snowflake.

Kalanos2y ago

what about dask?

1 more reply

v3ss0n2y ago· 2 in thread

What's the difference compare to Windmill.

thibautdr2y ago

Hi, thanks for your question. I'm not familiar with Windmill, but after checking it seems to be an open source developer platform to build applications. Amphi is a low-code tool to develop data pipelines (or ETL pipelines).

esafak2y ago

Windmill is a low-code workflow engine: https://www.windmill.dev/flows

whalesalad2y ago· 1 in thread

Been happy with Dagster but this looks interesting.

c0brac0bra2y ago

Considering switching some ancient Talend and Airflow processes over to this if I can get the time.

C4stor2y ago· 1 in thread

It's a good idea, but from the docs it looks like the high level abstractions are wrong.

If my data pipeline is "take this table, filter it, output it", I really don't want to use a "csv file input" or a "excel file output".

I want to say "anything here in the pipeline that I will define that behaves like a table, apply it this transformation", so that I can swap my storage later without touching the pipeline.

Same things for output. Personally I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.

That being said, can't complain about a free tool, kudos on making it available !

thibautdr2y ago

Hey, not sure I get your point here. I believe the abstraction provides what you're describing. You can swap a file input with a table input without touching the rest of the components (provided you don't have major structural changes). Let me know what you meant :)

tiraz2y ago· 1 in thread

How does it distinguish itself from Dagster or Prefect? Both are there for quite some time, also have a GUI, but a much larger feature set.

FranzFerdiNaN2y ago

Both don’t have the drag and drop feature of this, you have to write python yourself.

anakaine2y ago· 1 in thread

Hey, I really like the design. I currently have a lot of ETL going on through various mechanisms, but the thing that is always difficult to communicate to BAs and PMs, and any other individual is a graphical "what is this thing doing and how". This is neat for those of us who are visual.

thibautdr2y ago

Thanks! Don't hesitate to give it a try and reach out if you need anything :)

mitjafelicijan2y ago· 1 in thread

This is actually exactly what I needed for my current project!

thibautdr2y ago

Thanks for your comment, don't hesitate to share your use case! Also, you can reach out on Slack if you have any questions or need help.

Kalanos2y ago· 1 in thread

Reminds me of Elyra

thibautdr2y ago

Yes, there are similarities, but Elyra allows you to develop orchestration pipelines for Python scripts and notebooks, so you still have to write your own code. With Amphi, you design your data pipelines using a graphical interface, and it generates the Python code to execute. Hope that helps.

mrwyz2y ago

Not open source. Misleading title.

whazor2y ago

"Python ETL", Github language statistics: TypeScript 87.1%

It looks nice though.

febed2y ago

Which open source Python based ETL tool would one recommend for someone starting an ETL project today? It’s a data volume heavy project with lot of interdependencies between import tasks.

nextworddev2y ago

If you are enterprise, just go with Databricks lakeflow

olavgg2y ago

This looks visually similar to Apache Nifi.

deknos2y ago

Is it true opensource / free software, or are there non opensource parts?

rldjbpin2y ago

as open source as open weights models, but will companies adopt it solely on pricing?

kkfx2y ago

Do not take me wrong, I appreciate and thanks anyone who contribute to FLOSS, but all low/no code approaches I see turn out to be garbage. IMVHO the reality is that people need to be trained and became capable of fishing alone instead of giving them fishes all days.

ML in ETL is needed for raw initial classification of documents received in various formats from various sources, to clean-up scanned crap, no more than that, all the effort to plug LLMs was so far and i bet will be for the next 10 years a disaster.

ETL is something that should not exists in a modern world because we should exchange data in usable formats instead of having to import the with all sort of gimmick, we do not have such acculturated world but at least we can try to simplify and teaching instead of adding entropy.

j / k navigate · click thread line to collapse

124 comments

101 comments · 27 top-level

iblaine2y ago· 24 in thread

(I have used all of the above tools in my 15+ yr career. Code as ETL was a huge industry shift)

js82y ago

rdeboo2y ago

larodi2y ago

I'd rather go for a standardized ETL langauge a-la-markdown, and only then get to appreciate the GUI.

what-the-grump2y ago

>In practice you lose all the benefits of abstraction, unit testing, proper CI/CD, etc.

Why? We are pretty deep into the ecosystem.

Unit testing -> test all you want each activity, pipeline, flow, resume it from where it broke. Clone the entire thing into a test data factory, then deploy that once ready.

CI/CD -> the first step it nags you about is setting up CI/CD. If you want to get fancy, you setup a dev environment and deploy that to production after testing and sign-off.

For example, we call data factory via API, the pipeline is fully abstracted, it does one thing, but it's inputs and outputs are controlled by the request.

shoo2y ago

this is reminiscent of https://mikehadlow.blogspot.com/2012/05/configuration-comple...

tiew9Vii2y ago

thegiogi2y ago

That said, if this thing is customizable enough, a good data engineer can prepare canned steps that fit the general structure of the customer data process and it may have its place.

1 more reply

banku_brougham2y ago

What is your estimate of the tine required to begin building on the new(old) foundation, discover all the problems after a large investment of effort, and then declare to all that ita the wrong path?

My personal pain is Tableau trying to move uo the stack into data orchestration and compute. Code is completely inaccessible.

2 more replies

braza2y ago

Fast forward to 2024, the inverse is true: I spend way more time thinking about the tools and their specificities than about the problem that I am solving at the end of the day.

hbarka2y ago

>> Code as ETL was a huge industry shift

No it’s not. Try and see what banks, retail, manufacturing, various large enterprises still use. They need scale, observability, modularity, and maintainability.

stoperaticless2y ago

Gui Etl makes a promise that you dont need a programmer to wield it. But it is a false promise.

> They need scale, observability, modularity, and maintainability.

Seems orthogonal to code-vs-gui dimension.

1 more reply

bigger_cheese2y ago

5 more replies

happymellon2y ago

Completely agree.

1 more reply

thibautdr2y ago

banku_brougham2y ago

Im very leery of low code, but I like the idea of ETL defined as configuration.

1 more reply

kaon_2y ago

NortySpock2y ago

I've gotten some quick wins with Benthos (now RedPanda Connect) but I agree it's an unsolved problem as there are typically gotchas.

If you can get a true CDC stream from the database to analytics, that would be ideal, but when that isn't available you spend 100x more time trying to bodge together an equivalent batch/retry system.

rubslopes2y ago

hipadev232y ago

clickhouse can enable all the things you mentioned

mywittyname2y ago

I built a solid career replacing no-code-ETL tools with Airflow.

sqlcook2y ago

This is also why common game engine and 3d rendering tools come with a UI for flow driven scripting. It’s intuitive and much easier to organize.

mrtranscendence2y ago

hermitcrab2y ago

The debate about GUI vs code solutions has raged for decades, generating more heat than light. I've tried to give a summary of the pros and cons of the 2 approaches here:

https://successfulsoftware.net/2024/01/16/visual-vs-text-bas...

TLDR: Best depends on the situation and the people involved. Given the wide range of both situations and people, I think there are is plenty of room for both approaches.

gchamonlive2y ago

Somewhat related to this discussion https://news.ycombinator.com/item?id=40646312

Tldr: as with dashboards, self-serve ETLs don't work outside very specific use cases or very simple pipeline.

Reason for that, as with bi tools, is that the complexity is not in manipulating data. There are tons of frameworks to do that efficiently.

The issue is interpreting data and it's semantics and evolving data pools to the business needs.

ic_fly22y ago· 10 in thread

With all the data issues strong quality and normalisation I often get the impression that enabling more people with non CS backgrounds to do this work is not necessarily a good thing.

In other words, if writing python and sql is the skill requirement that stops you from making an etl pipeline, maybe do something else.

otter-in-a-suit2y ago

For me it’s more accurate, too. At $work, there’s no difference to how an SWE vs a “DE” works. Same interview process too, DSA, distributed systems etc.

I will spare you my usual rant on why a language without a solid type system like python is a horrible idea for this field, too.

hermitcrab2y ago

morkalork2y ago

1 more reply

pelasaco2y ago

pm902y ago

With this argument, Computer Science wouldn’t have progressed beyond assemblers.

chasd002y ago

I sort of see your point. if you’re not willing to at least try to learn something new then, yeah, probably better off doing something else.

anakaine2y ago

necovek2y ago

FWIW, while I disagree with the parent comment, I don't see you arguing against it.

They actually implied that you should try upskilling first — but if that fails, you shouldn't be doing ETL yourself.

1 more reply

itsoktocry2y ago

>This is elitist and frankly, unhelpful. The answer to a skills shortage is not a practitioner lockdown, but policy, training, guidance and mentoring.

If this is the career you want (data or data-adjacent), why not just learn SQL? There are far more learning resources and the value of the knowledge will assuredly outlast any low-code tool.

pydry2y ago

Skills shortage?

thibautdr2y ago· 6 in thread

Hi everyone, thanks for posting Amphi :)

Visit the GitHub: https://github.com/amphi-ai/amphi-etl Give it a try and let me know what you think

OutOfHere2y ago

You know what does not set it apart? AI-washing. Also, lying about being open source when it isn't.

isjamesalive2y ago

To be fair, the only place the words ‘open’ and ‘source‘ appear in the readme are once in a sub-heading, where it’s phrased ‘open-source’. It’s clearly labelled ELv2.

Possibly more of a subtle miscommunication or misunderstanding than a deliberate lie.

1 more reply

slt20212y ago

i liked the idea of leveraging jupyterlab as server. data engineers/scientists already use jupyter, so this is neat idea.

this reminds of ALTERYX (another drag and drop ETL tool)

thibautdr2y ago

Thanks! Being based on JupyterLab also allows Amphi to benefit from the vast ecosystem of extensions already available, such as the Git extension or using different file systems (S3).

Some users pointed out they were Alteryx users but liked the Python code generation from Amphi :)

1 more reply

johhns42y ago

Wow amazing work! How does the inputs work, are they created for you or does it support custom as well?

thibautdr2y ago

Thank you! Inputs components are pre-built for now but the ability to add custom inputs is coming soon!

1 more reply

awesomebytes2y ago· 5 in thread

fuzztester2y ago

Others already replied about what ETL is.

Wikipedia:

https://en.m.wikipedia.org/wiki/Extract,_transform,_load

I'll just add:

It is a common term and practice among enterprise software users, i.e. generally medium or large companies that use packaged plus custom software for their business needs.

ETL is not common among startups, because they have a different focus, infrastructure and scale.

m4632y ago

https://en.wikipedia.org/wiki/Extract,_transform,_load

ljouhet2y ago

Didn't understand either:

Extract, transform and load...

ETL: Extract, transform and load data...

Extract, transform and load (ETL) data...

Extract, Transform and Load data...

thibautdr2y ago

Thanks for pointing that out, it's actually mentioned (Extract, transform and load ...) in the very first sentence below the tagline, but if you didn't get it then it's not clear.

ghoshbishakh2y ago

It is written on the website: Extract, transform and load. Yes, an illustrative example description would help I agree.

jamesblonde2y ago· 4 in thread

#dang The title needs changing - it's not open-source, it is license ELv2 - Elastic License v2.

maleldil2y ago

While you're right (it's indeed not open-source), the project advertises itself as such, so the title is "right", even if it's a lie.

lma212y ago

isn't the code available here? https://github.com/amphi-ai/amphi-etl

what makes it not OSS?

uneekname2y ago

[0] https://opensource.org/licenses

mrtranscendence2y ago

1 more reply

mritchie7122y ago· 4 in thread

If you're looking for "open source Python ETL", two things that are better options:

https://dlthub.com/

https://hub.meltano.com/

we[0] use meltano in production and I'm happy with it. I've played around with dlt and it's great, just not a ton of sources yet.

0 - https://www.definite.app/

thibautdr2y ago

Hey, Amphi's developer here. Those two tools are great, big fan of dlt myself :)

mritchie7122y ago

I understand that. I'd change the title / H1 though, "Open Source Python ETL" doesn't describe what you're building very well.

Good luck! Looks cool.

jdnier2y ago

Are you able to describe what makes them better? (Honest question, I'm not familiar with either or with Amphi.)

It seems Definite's use case is focused on connecting to lots of data sources. For much smaller scale, how does Amphi compare?

mritchie7122y ago

most data engineers would think of something like Fivetran when you say "ETL" (look at the ETL section here[0]).

It looks like Amphi could handle some low code transformations (the "T" in ETL), but calling it ETL feels like a stretch.

So to rephrase a bit, if you're looking for an open source, python based Fivetran alternatives, dlt and meltano would be my picks.

0 - https://mattturck.com/landscape/mad2024.pdf)

cvalka2y ago· 3 in thread

THIS IS NOT OPEN SOURCE!

maphew2y ago

Still, Amphi should not claim to be 'Open Source'.

runningmike2y ago

IMHO this title is primarily chosen for promotion. Not needed. But unfortunately many people, young and old, have never heard of OSI.

anakaine2y ago

paulvnickerson2y ago· 2 in thread

Very cool, thanks for sharing. Does it support the pandas-like rapidsai dask_cudf framework? (https://docs.rapids.ai/api/dask-cudf/stable/)

thibautdr2y ago

Great, thanks for sharing. I was familiar with Dask and cudf separately but not this one.I was planning to implement dask support through Modin but I'll definitely take a look at dask_cudf.

paulvnickerson2y ago

Cool. We use it a lot at work for working with large data sets on a GPU cluster.

tayloramurphy2y ago· 2 in thread

thibautdr2y ago

tayloramurphy2y ago

Appreciate the reply! Thanks for the context.

gregw22y ago· 2 in thread

Isn’t pandas centric ETL much more memory intensive and less compute efficient than using SQL?

rkozik19892y ago

Low-code is kind of just Authorware for a new generation... assuming you're old enough to remember that technology.

thibautdr2y ago

I wrote an article questioning the use of Pandas for ETL. I invite you to read it: https://medium.com/@thibaut_gourdel/should-you-use-pandas-fo...

Joeboy2y ago· 2 in thread

Since there are "ETL" people here, I have a couple of naive questions, in case anybody can answer:

1) Are there any"standard"-ish (or popular-ish) file formats for node-based / low-code pipelines?

2) Is there any such format that's also reasonably human readable / writable?

3) Are there low-code ETL apps that (can) run in the browser, probably using WASM?

Thanks and sorry if these are dumb questions.

thibautdr2y ago

Thanks for the great questions:

1. As far as I know, there isn't a "standard" file format for low-code pipelines.

2. Some formats are more readable than others. YAML, for example, is quite readable. However, it's often a tradeoff: the more abstracted it is, the less control you have.

roenxi2y ago

vekker2y ago· 2 in thread

thibautdr2y ago

Kalanos2y ago

what about dask?

1 more reply

v3ss0n2y ago· 2 in thread

What's the difference compare to Windmill.

thibautdr2y ago

esafak2y ago

Windmill is a low-code workflow engine: https://www.windmill.dev/flows

whalesalad2y ago· 1 in thread

Been happy with Dagster but this looks interesting.

c0brac0bra2y ago

Considering switching some ancient Talend and Airflow processes over to this if I can get the time.

C4stor2y ago· 1 in thread

It's a good idea, but from the docs it looks like the high level abstractions are wrong.

If my data pipeline is "take this table, filter it, output it", I really don't want to use a "csv file input" or a "excel file output".

I want to say "anything here in the pipeline that I will define that behaves like a table, apply it this transformation", so that I can swap my storage later without touching the pipeline.

Same things for output. Personally I want to say "this goes to a file" at the pipeline level, and the details of the serialization should be changeable instantly.

That being said, can't complain about a free tool, kudos on making it available !

thibautdr2y ago

tiraz2y ago· 1 in thread

How does it distinguish itself from Dagster or Prefect? Both are there for quite some time, also have a GUI, but a much larger feature set.

FranzFerdiNaN2y ago

Both don’t have the drag and drop feature of this, you have to write python yourself.

anakaine2y ago· 1 in thread

thibautdr2y ago

Thanks! Don't hesitate to give it a try and reach out if you need anything :)

mitjafelicijan2y ago· 1 in thread

This is actually exactly what I needed for my current project!

thibautdr2y ago

Thanks for your comment, don't hesitate to share your use case! Also, you can reach out on Slack if you have any questions or need help.

Kalanos2y ago· 1 in thread

Reminds me of Elyra

thibautdr2y ago

mrwyz2y ago

Not open source. Misleading title.

whazor2y ago

"Python ETL", Github language statistics: TypeScript 87.1%

It looks nice though.

febed2y ago

Which open source Python based ETL tool would one recommend for someone starting an ETL project today? It’s a data volume heavy project with lot of interdependencies between import tasks.

nextworddev2y ago

If you are enterprise, just go with Databricks lakeflow

olavgg2y ago

This looks visually similar to Apache Nifi.

deknos2y ago

Is it true opensource / free software, or are there non opensource parts?

rldjbpin2y ago

as open source as open weights models, but will companies adopt it solely on pricing?

kkfx2y ago

j / k navigate · click thread line to collapse