One of the most significant pitfalls of data is failing to interrogate the value it provides and assuming that if you give everyone access all the time the magic will happen. The truth is that value does not simply materialize, just as value does not magically spring from a computer because a human powered it on (okay sure, you may have already automated the value, but that’s actually the point I’m about to make). In both cases it requires an experienced practitioner who collaborates with a larger team to intersect their work with the business needs.
Data is tricky, all the more so because it’s often seen as a panacea by business leaders who aren’t connected with the work of extracting that value.
It's not that big data tools aren't useful. It's that, when you just start amassing huge piles of data without a clear up-front plan for how it will be used, and assume that a whole bunch of people who have never heard of sampling bias or multiple comparisons bias or Coase's Law [2] can figure out what to do with it later, you're setting yourself up for a Bad Time.
1: https://research.google/pubs/pub43146/
2: "If you torture the data long enough, it will confess."
It's also worth noting that, over the past few decades, most academic fields have been getting increasingly skeptical of the value of correlative research on pre-existing data sets. Even among people who have been extensively trained in how to do it properly. And yet, the vast majority of big data business plans I've seen in practice boil down to "collect a huge data set and then let people do correlative research on it."
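To put a rough number on the multiple comparisons problem, here's a back-of-the-envelope sketch. It assumes the tests are independent and that every hypothesis is actually null; the function name and the 5% threshold are my own illustrative choices, not anything from the linked paper.

```python
def chance_of_false_discovery(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent null tests.

    Each test wrongly "finds" something with probability alpha, so the chance
    that none of them do is (1 - alpha) ** num_tests.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # The more slices of the data people poke at, the likelier a bogus "insight".
    for n in (1, 20, 100):
        print(n, round(chance_of_false_discovery(n), 3))
```

Under those assumptions, twenty unplanned looks at the same pile of data already give you roughly a 64% chance of at least one spurious "finding".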
I like this but it's kinda like the payday loan of business operations.
There's much ongoing discussion about this in the data world, often revolving around "self-service analytics".
Unless you're talking about "our analysts don't have to clean data all the time", which, for a large enough organization makes sense, "self-service" for non-technical folks is futile and pointless. They need specific answers to specific questions, not the ability to infinitely explore the data. Organizations should desire that kind of focus, not prevent it.
Reality smacked that shit down hard. I left data engineering because the projects were all over the place, wildly undisciplined and unfocused.
You were lucky to have source control let alone an understanding from the business that these projects were in fact software development.
I switched back to software engineering because at least there is a faint realization that we are...building software.
I might go back when the dust clears.
"Why do we need to hire programmers...I thought we needed data engineers?"
"Because the data pipelines are all built with thousands of lines of code. Java, python, Fortran, you name it...and your job post only mentioned SQL and data modelling"
I could go on forever.
You don't need to expose more dimensions or get the users more access to the raw data. You need to understand what their business is and what their business problems are and help them answer those specific questions quickly and succinctly.
Yes, there are certainly times where people use huge amounts of raw data to uncover the answer to a question they didn't know they had. But it's rare, it's expensive to support, and most businesses aren't going to be able to do anything with it anyway (a whole org built to do X isn't suddenly going to shift to do Y because you discovered some insight in a random report).
I've also tried it on Cloud Composer (google managed) and automated upgrades always trashed the cluster. It's not well designed for GKE because it writes logs to files and requires stateful sets. Testing the code is a huge burden due to the vast environment and dependencies needed to make it work locally.
I'm eager to rid my life of it and test out temporal for some of the high concurrency/frequency cases we have.
It's the one thing I like about our airflow. Everything else you said is echoed.
Also, the toil of dealing with many airflow instances when you have engineers who don't want to automate it.
Airflow packages those things together and adds some additional features:
- UI with Graph, gantt, logs and other views of the workflow
- Users and permissions
- Places to store config
- Mechanisms for passing small data between tasks
- Various "sensors" for triggering workflows
- Various operators that interact with common data-oriented systems (bigquery, snowflake, s3, you name it). These are basically libraries that expose a config-forward API.
Probably the main selling point is the pre-made operators, but in short it is a complete solution with bells and whistles that aligns itself with the data ecosystem.
My sibling comment did a good job explaining, but the UI + configurable storage + configurable triggers all out of the box make life a lot easier.
a b c d vs. a (bc) d
They make different design decisions about what to surface via UX and what to make easy as a consequence of thinking of the problems in terms of different data structures.
Replace “Airflow” with “Linux,” “data engineers” with “systems programmers,” and “Astronomer” with your hypervisor of choice (Xen/VMWare/etc.), and you can see how absurd the author’s point is:
My problem is that ~Airflow~ Linux was not designed to address [high-level systems architecture] problems. We don’t need a better [Linux], but we need a higher-level one: a system that enables ~data engineers~ systems programmers to think at a platform level.
In fact, [Linux] is already displaced. [Linux] qua [Linux] is already obsolete, and it happened right within the [Linux] ecosystem. It’s called ~Astronomer~ Xen/VMWare/etc.
If it sounds like you could simply replace [Linux] with basically any other ~job execution engine~ operating system, that’s because you could.
This is where the argument falls apart. Yes, for very large, complex deployments, higher-level orchestration is important, but the choice of low-level execution engine is also still hugely relevant, just as the choice of guest OS is still hugely relevant when discussing large deployments of VMs.
Furthermore, very few people actually need very large scale deployments; user experience and capabilities at the low-level are what most users actually care about.
Honestly, the article is so disingenuous that it comes off like a paid-for puff piece for Astronomer. It's the article-equivalent of the late-night infomercial guy who rips open a bag of potato chips like the hulk because he doesn't have this special tool that's just four easy payments of $9.99.
Not saying infomercial people are angels, of course, but I wanted to share this somewhat nonobvious context.
(To stretch the metaphor, an Airflow management system that gives everyone their own Airflow might be ridiculous, but it could make sense for companies where cooperation is difficult :))
I will admit it's not easy to figure out best practices with Airflow, but if you make bad decisions and your system doesn't scale with the problem, you didn't understand the problem or how to solve it in the first place. The tools you chose are second to that.
I'm not saying Airflow is bad (we did set up a lot of hadoop clusters and other apache products at my old job, and our clients used airflow a lot), but i think the evangelists are so good they push airflow for everything, and this is bad. OP did use airflow for something it was not really designed for, and it sucked, but i do have this impression that tech writers and apache evangelists deserve some of the blame.
In fact, I did end up doing all those things, but we opted for Dagster Cloud, because of their focus on improving developer efficiency. Their team provided pre-built Github actions for CI/CD and recently introduced PR-specific branch deployments, which has been amazing. They're moving towards serverless execution, built-in ECR repositories, managed secrets. Prefect and Astronomer I expect are moving in this direction, too, but I liked the Dagster project's energy quite a bit.
As I've waded into the MLOps world as well, it just keeps looking like every platform basically devolves into an orchestrator that provisions compute resources and logs metadata into an opinionated data model. Catalog tools like Atlan are metadata sinks that are trying to build out orchestration/workflow capabilities. dbt Cloud of course is just an orchestrator for a specific type of data product that is aiming to operationalize metadata with its metrics layer.
Orchestration + a metadata data model is a common denominator here, and I think the fact that Airflow is so inevitable has made it really hard for people to imagine the category as anything other than a scheduler, but perhaps some of these new companies can break new ground.
One Q - it seems to me that another possible solve (and probably how the big guys tend to do it) is to use a dataflow engine like Spark/Flink. Did you compare a managed platform like Google Dataproc? They also have serverless if you don’t want a heavy managed cluster, which might make this approach more viable for non-huge companies that wouldn’t utilize a min-spec cluster. (When I last evaluated this they didn’t have serverless which was a dealbreaker for my small scale).
This sounds like an issue not with Airflow but with integration.
DAGs can be published to S3 for cutting down on like half of these dependencies. And the nice thing about MWAA is log & stats publishing over cloudwatch, which should flow into any existing amazon integrated tooling.
For our team setting up terraform for iam & mwaa, some deploy pipelines to s3, and connecting some config bits to wire up splunk logs / monitoring pieces was not that much work. Initiating a separated vendor relationship & pricing out data ingress/egress costs would blow that work out of the water but maybe it’s a difference in company size/placement.
Why would I need a glorified server-side crontab if something like MS DTS from 1998 could do the same, but better? Sure, Python is probably better than whatever DTS generated, but the ops don't care either way, since Airflow doesn't care what it's running.
Something as simple as "job A must run after job B and job C, but if it doesn't start by 2am, wake up team X. If it doesn't finish by 4am, wake up team Y" isn't Airflow's problem, it's your problem.
"What's the overall trend for job D's finish time, what is the main reason for that?" isn't Airflow's problem, it's your problem. "What jobs are on the critical path for job E?" isn't Airflow's problem, it's your problem.
"Job F failed for date T and then recursively restart everything that uses its results for date T" isn't Airflow's problem, it's your problem.
https://news.ycombinator.com/item?id=9224
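For what it's worth, two of the questions quoted above ("what's on the critical path for job E?" and "what needs to be rerun if job F failed?") are plain graph problems once you have the dependency map in hand. Here's a rough sketch using only the standard library (`graphlib` needs Python 3.9+). The job names, the dependency map and the runtimes are all made up for illustration, and none of this is Airflow's API.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: job -> jobs it depends on (its upstream jobs).
UPSTREAM = {
    "A": ["B", "C"],
    "B": ["F"],
    "C": [],
    "D": ["F"],
    "E": ["A", "D"],
    "F": [],
}
# Hypothetical average runtimes in minutes.
MINUTES = {"A": 30, "B": 10, "C": 45, "D": 20, "E": 5, "F": 15}


def jobs_to_rerun(failed, upstream):
    """Return the failed job plus everything that transitively uses its
    results, ordered so each job comes after its own upstream jobs."""
    # Invert the edges so we can walk from a job to its consumers.
    downstream = {job: [] for job in upstream}
    for job, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(job)

    affected = {failed}
    stack = [failed]
    while stack:
        current = stack.pop()
        for child in downstream[current]:
            if child not in affected:
                affected.add(child)
                stack.append(child)

    order = TopologicalSorter(upstream).static_order()
    return [job for job in order if job in affected]


def critical_path(target, upstream, minutes):
    """Return (total minutes, path) for the slowest dependency chain ending
    at target, i.e. the chain that decides when target can finish."""
    best = {}  # job -> (total minutes, path ending at that job)
    for job in TopologicalSorter(upstream).static_order():
        prev_total, prev_path = max(
            (best[dep] for dep in upstream[job]),
            default=(0, []),
            key=lambda entry: entry[0],
        )
        best[job] = (prev_total + minutes[job], prev_path + [job])
    return best[target]


if __name__ == "__main__":
    print(jobs_to_rerun("F", UPSTREAM))
    print(critical_path("E", UPSTREAM, MINUTES))
```

The hard part in practice is getting a trustworthy dependency map and runtime history out of the scheduler, per date, which is exactly the bit this sketch assumes away.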
>Something as simple as "job A must run after job B and job C, but if it doesn't start by 2am, wake up team X. If it doesn't finish by 4am, wake up team Y" isn't Airflow's problem, it's your problem.
I guess that's one approach to job security. And why not make data egress manual too? Why transfer data through the network, when you can print them, mail the papers, and type them back in? Data input is not the computer's problem, it's your problem!
>Something as simple as "job A must run after job B and job C, but if it doesn't start by 2am, wake up team X. If it doesn't finish by 4am, wake up team Y" isn't Airflow's problem, it's your problem. "What's the overall trend for job D's finish time, what is the main reason for that?" isn't Airflow's problem, it's your problem. "What jobs are on the critical path for job E?" isn't Airflow's problem, it's your problem. "Job F failed for date T and then recursively restart everything that uses its results for date T" isn't Airflow's problem, it's your problem.
The whole idea of writing programs is making things automatable. That is, making them the computer's problem, not our problem. We get the higher level problem of writing the automation once, and fixing any bugs in our code, then we get to enjoy putting it to work for us...
This is a baffling statement.
Airflow can certainly be frustrating and it doesn't solve _all_ workflow orchestration problems. Surely the same thing can be said of many tools? This seems mostly like a mismatch of expectations.
Obviously there are trade-offs with either approach, but then I'd argue that making Airflow solve more problems will introduce more trade-offs too.
It is rarely clear what the hard problems will be when new to a domain. Only as scale kicks in.
We are constantly pitched frameworks that sell themselves as a good approach to a domain, but then obstruct engagement with the hardest problems when it matters. The developer becomes captive of the system that claimed it would steer them right.
This is particularly true of fields where the hard problems are integration problems which, by their nature, cannot be outsourced to frameworks.
Argo is another over-engineered "CNCF" thing trying to ride the Kubernetes hype train. It's all "eventually consistent", which makes it extraordinarily difficult to see when any particular thing actually happened. Is my code deployed? Who knows, Argo is "syncing".
Check out these great docs: https://argoproj.github.io/argo-workflows/rest-api/
> API reference docs :
> Latest docs (maybe incorrect)
> Interactively in the Argo Server UI.<https://localhost:2746/apidocs> (>= v2.10)
Yes, that is a localhost URL on their website.
If you’re looking for something that’s a bit more high level and friendly to expose directly to your data team (data scientists/data engineers/data analysts) you can check out https://github.com/orchest/orchest
You can think of it as a browser UI/workbench for Argo scheduled pipelines. Disclaimer: author of the project
In the end we replaced our data orchestration with a stateless lambda that for a configured time interval 1/ looks at what output data is missing, 2/ cross-references that with running jobs (in AWS Batch), and 3/ submit jobs for missing data that has no job. Jobs themselves are essentially stateless. They are never restarted and we don't even look at their status. If one fails we notice because there will be a hole in the output and we therefore submit a new one. Some safety precautions are added to prevent a job from repeatedly failing, but that's the exception.
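The core of that loop is small enough to sketch. This is only an illustration of the three steps described, not the actual lambda: it assumes daily partitions, takes the "what output exists" and "what is running" lookups as plain sets (so no S3 or AWS Batch calls are shown), and the attempt counter is my guess at what a safety precaution against repeated failures could look like.

```python
from datetime import date, timedelta


def expected_partitions(start: date, end: date) -> list:
    """Every daily partition that should exist in [start, end], inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def plan_submissions(expected, existing_outputs, running_jobs,
                     attempts, max_attempts=3):
    """Decide which partitions need a new job.

    expected:         partitions that should have output
    existing_outputs: partitions whose output is already present
    running_jobs:     partitions that currently have a job in flight
    attempts:         partition -> number of jobs submitted so far
                      (hypothetical safety valve, stored wherever you like)
    """
    to_submit = []
    for partition in expected:
        if partition in existing_outputs:
            continue  # 1/ no hole in the output, nothing to do
        if partition in running_jobs:
            continue  # 2/ someone is already filling this hole
        if attempts.get(partition, 0) >= max_attempts:
            continue  # keeps a poisoned partition from looping forever
        to_submit.append(partition)  # 3/ missing data with no job
    return to_submit
```

Job status never appears anywhere in it: a failed job just shows up as a hole on the next pass.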
Maybe Airflow has moved on from when we last tried it. But this was our experience.
That sounds more like an architecture-at-scale problem than something that is Airflow's 'fault.' Airflow may never have been the right tool for the job but it's getting all the blame.
Bro I can't even get my company to the _first_ part, and we're collectively already having issues with the second? What is everyone else's read on this situation in general? Do you all have row and table level lineages for your data? For pipelines that people are actively using? Every company I've ever been in can hardly figure out where finance gets last year's "magical excel sheet", let alone be close to a spot where they're actively using data lineage tools.
I also don't like Airflow, but for somewhat different reasons.
I think it couples orchestration and transformation too tightly, I don't understand the desire to integrate everything with your actual runtime Python code - I think it's markedly the wrong level of abstraction/integration and limits your engineering capacity. There's undoubtedly some good engineering, it's come a long way, and it's mighty popular, but every time I look at a repo that uses it, the only read I get is "cross-cutting-chaos".
In life sciences research to support synthetic control arms, the FDA is caring more about the lineage/manipulation of the data than the data science models used to predict X/Y/Z.
IE - what was the data originally, what did it end up as prior to ingestion into AIML, why was it changed, what steps were involved, etc.
There are not a ton of good out of the box solutions for data lineage and it's driving me nuts.
We have Apache NIFI which promises data lineage out of the box and _appears_ to deliver. I've never implemented it though.
We have pachyderm which has some support here but I don't know about it.
Besides that it appears roll-your-own.
I kind of wish there was an accepted best practice for data lineage but it's - surprisingly - wild west. And it's completely 100% required for industry use.
I honestly have no idea how SaaS billing isn't so buggy customers leave. Those data pipelines can be pretty complicated with lots of nuances around the data, and hand-wavy consequences for getting it wrong.
On Dagster and Prefect you communicate between tasks as if you were writing pure Python. On Airflow on the other hand ...
Use airflow as cron runner for dbt.
If you don't need realtime metrics, this formula works way better than convoluted airflow dags.
I don't have time to investigate other solutions like dagster and prefect and migrate jobs to it for testing.
Trouble with Airflow starts when multiple teams and user types start to share it.
People can potentially overwrite each other's DAGs. Credential management is complicated. Broken DAG can stop whole Airflow. Slow DAG can impact performance of whole Airflow. Getting DAGs to wait for each other (like one team prepares data up to a point and then other team builds on that) is kind of a nightmare. Sometimes people want features from newer Airflow, but some other team built DAG that isn't forward compatible. Etc etc.
But I'm not sure there actually is a better solution elsewhere. At least I have not seen it yet, maybe Dagster is on a good road.
But as I said, for centralized solutions it works really well.
Of course, nothing stops Airflow or other tools from thinking this way as well.
This is an oversimplification but IMO the easiest way of picturing it is instead thinking of defining your graph as a forward moving thing w/ the orchestrator telling things they can be run you shift to defining your graph nodes to know their dependencies and they let the orchestrator know when they're runnable.
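One way to read that in code, as a toy sketch only: each node carries its own list of needs and answers "am I runnable?", and the orchestrator just keeps asking. The `Node` class and `run_all` loop are invented for illustration and aren't any real tool's API.

```python
class Node:
    """A unit of work that knows what it depends on."""

    def __init__(self, name, needs=()):
        self.name = name
        self.needs = tuple(needs)

    def runnable(self, done):
        """True once everything this node needs is done and it hasn't run."""
        return self.name not in done and all(dep in done for dep in self.needs)


def run_all(nodes):
    """Keep asking nodes whether they are runnable until all have run.

    Returns the order in which nodes were run. There is no forward-moving
    graph definition here: the order falls out of what each node declares.
    """
    done = set()
    order = []
    while len(done) < len(nodes):
        ready = [node for node in nodes if node.runnable(done)]
        if not ready:
            raise RuntimeError("nothing is runnable: cycle or missing dependency")
        for node in ready:
            order.append(node.name)  # a real system would execute work here
            done.add(node.name)
    return order
```

Note that the nodes can be listed in any order; `run_all([Node("report", ["clean"]), Node("clean", ["raw"]), Node("raw")])` still runs raw first.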
[1] https://towardsdatascience.com/apache-airflow-in-2022-10-rul...
In particular I transpile to Airflow code (can also deploy to Lambda) because I think it's still the most robust and well supported "runtime", I just don't think the developer experience is that good.
> The tool data engineers need to be effective in this new world does not run scripts, it organizes systems. 100%. You'll still need to run independent scripts, but today's data challenges focus on "how do I connect the stages of data operations together". Teams need to figure out how to connect data ingestion -> data transformation -> data visualization -> alerting and reporting -> ML model deployment -> metadata + catalogs -> data augmentation -> API actions.
The larger goal of orchestration is to prevent downstream processes from running if the data being processed upstream fails. Each stage could be performed with a series of scripts, a SaaS tool, or a mix. Each team is responsible for their own stages, but they need to know how their work connects to the larger picture so when something goes wrong, there's ownership and clarity that drives a quick resolution. Unfortunately, this still doesn't exist in most organizations because the current tooling isn't solving the orchestration and visualization of connected systems super effectively. It's instead enabling one-off, disconnected data processes.
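As a minimal sketch of that "don't run downstream if upstream failed" goal, assuming each stage is just a Python callable and dependencies are a plain dict (the stage names in the usage note are hypothetical, and this is not how any particular product implements it; `graphlib` needs Python 3.9+):

```python
from graphlib import TopologicalSorter


def run_pipeline(stages, upstream):
    """Run stages in dependency order, skipping anything whose upstream
    did not succeed.

    stages:   stage name -> zero-argument callable that raises on failure
    upstream: stage name -> list of stage names it depends on
    Returns stage name -> "success", "failed" or "skipped".
    """
    status = {}
    for name in TopologicalSorter(upstream).static_order():
        deps = upstream.get(name, [])
        if any(status[dep] != "success" for dep in deps):
            # A failed or skipped dependency means the input data is suspect.
            status[name] = "skipped"
            continue
        try:
            stages[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status
```

With four stages where a transformation step raises, a dashboard that depends on it ends up "skipped", while an alerting step that only depends on ingestion still runs. The returned status map is also the start of the visibility piece: it tells you which stage to go look at.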
Disclaimer: I built Shipyard (www.shipyardapp.com) to address many of these concerns of simplifying the ability to connect data tools and quickly automate and action on data.
I've really enjoyed using taskflow (https://github.com/taskflow/taskflow) it allows us to employ our existing logging and deployment paradigms.
I'd say data consumers, such as data analysts and business users, care primarily about the production of data assets. On the other hand, data engineers with Airflow focus on modeling the dependencies between tasks (instead of data assets). How can we reconcile both worlds?
In my latest article, I review Airflow, Prefect, and Dagster and discuss how data orchestration tools introduce data assets as first-class objects. I also cover why a declarative approach with higher-level abstractions helps with faster developer cycles, stability, and a better understanding of what’s going on pre-runtime. I explore five different abstractions (jobs, tasks, resources, triggers, and data products) and see if it all helps to build a Data Mesh. If that sounds interesting, make sure to check out https://airbyte.com/blog/data-orchestration-trends.
I'm running into the same issue with this guy's post, although a little less so. The question he seems to ask is "With a complex pattern of data flows, if something breaks, how do you recover?" His argument is that Airflow does not offer enough visibility into the full data trace nor enough tools to apply recovery rules for repairing broken bits.
I think I agree, but prometheus doesn't really solve that. Nor necessarily does better management of the automated job queue backlog and job retries.
He also complains about some syntax and design choices that predate MyPy and Pydantic and modern Async Python coding. Those seem fairly easy things to drag Airflow forward with in future releases.
I'm kind of a fan of Prefect as an alternative: https://docs-v1.prefect.io/core/about_prefect/why-not-airflo...
Not completely sure if most of the issues I've faced were resolved in the future releases, but I don't fully agree with the take of the article. Like go with the scheduler that works for your current and potential future needs. The reason why we continue to use Airflow despite the issues is because it works so well with our workflows. This does mean that I would recommend it to another team.
We had lots of lessons learned. For instance, why does PythonOperator even exist? It takes a callable and thus you're likely not going to see good coding pattern emerge for something that needs to be 1000+ LoC. Instead, we just subclassed BaseOperator and used tried-and-true OO principles.
Has anyone tried Luigi for data engineering pipelines?
Airflow successors must figure out how to distribute the cron and all dependencies should be self contained in a Docker image.
I agree airflow is old, legacy and ideally folks should not use it, but the reality is there are a lot of pipelines already built with it, sadly. I think as a community we have to start moving away from it for more complicated problems.
Disclaimer: I created Flyte.org and heavily believe in decentralized development of DAGs and centralized management of infrastructure
ETL seems just like one of those perennial challenges that resist humanity's efforts to categorize the world into neat and tidy boxes