> We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers.
> Create the data transformation steps as pure Python functions
Personally, I feel mystified why you would use something like this rather than a more mature product like say, Spark, that natively supports clustering, etc, which is what I would really like to see in the FAQ.
Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?
It's pretty hard for me to see the use case.
1. https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h...
I actually had the same questions when I was first introduced to Kedro! In my case, I didn't understand the value proposition over something like Apache Beam. After using it, I feel like Kedro provides:
1. a consistent structure across analytics pipelines. It's easy to start and pick up other Kedro projects after you've
used it once.
2. convenient and consistent I/O via the data catalog. The fact that we can configure and swap out data sources with ease
is a huge plus, and we also rely heavily on data versioning.
3. easy integration with existing frameworks (PySpark, vanilla Pandas, Dask, Airflow, Luigi, etc.)
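To make point 2 a bit more concrete, here is a rough sketch of the data catalog idea in plain Python. This is not Kedro's actual API: `CSVDataSet`, `JSONDataSet`, `build_catalog` and `count_customers` are hypothetical names made up for illustration, and the only assumption is that every dataset exposes `load()` and `save()`.

```python
import csv
import json
import tempfile
from pathlib import Path


class CSVDataSet:
    """Hypothetical wrapper: a list of dict rows stored as a CSV file."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        with self.filepath.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows):
        # Sketch only: assumes at least one row, all with the same keys.
        with self.filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)


class JSONDataSet:
    """Same load/save interface, different storage format."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        return json.loads(self.filepath.read_text())

    def save(self, rows):
        self.filepath.write_text(json.dumps(rows))


DATASET_TYPES = {"csv": CSVDataSet, "json": JSONDataSet}


def build_catalog(config):
    """Turn {name: {"type": ..., "filepath": ...}} into dataset objects."""
    return {
        name: DATASET_TYPES[entry["type"]](entry["filepath"])
        for name, entry in config.items()
    }


def count_customers(catalog):
    # Pipeline code asks for data by name and never sees paths or formats.
    return len(catalog["customers"].load())


workdir = Path(tempfile.mkdtemp())
rows = [{"id": "1", "name": "Ada"}, {"id": "2", "name": "Grace"}]

csv_catalog = build_catalog(
    {"customers": {"type": "csv", "filepath": workdir / "customers.csv"}}
)
csv_catalog["customers"].save(rows)

# Swapping the storage backend is a config change only.
json_catalog = build_catalog(
    {"customers": {"type": "json", "filepath": workdir / "customers.json"}}
)
json_catalog["customers"].save(rows)

print(count_customers(csv_catalog), count_customers(json_catalog))
```

The point is that `count_customers` is identical in both cases; only the config entry changes.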
Additionally, it aligns well with standards we have internally, like data layering. (edit: Apparently this is also part of the FAQ: https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h... Who knew!)
> Personally, I feel mystified why you would use something like this rather than a more mature product like say, Spark, that natively supports clustering, etc, which is what I would really like to see in the FAQ.
I'd say 80-90% of projects at QuantumBlack use (Py)Spark, so we've built out `SparkDataSet`s, `pandas_to_spark` and `spark_to_pandas` utility decorators, etc. There's a brief integration tutorial here: https://github.com/quantumblacklabs/kedro/tree/develop/kedro...
Full disclosure: I'm a data engineer at QuantumBlack (if it wasn't obvious already!)
The logic being that once you've finished experimenting and iterating, it's much easier to move to Airflow.
The likelihood of achieving commercial objectives is tied to the commercial usefulness and accuracy of your analysis and predictions, not the ease of deployment or (even more curiously) the ability to be left unattended.
Just like your website being stable and easy to update helps your business use it to make money. Of course it also needs to be tied to commercial usefulness.
Are there other techniques for data catalogs that are file-based, or at least open-standard-based, that scale all the way up from a single developer?
[0] https://kedro.readthedocs.io/en/latest/04_user_guide/04_data...
- pipeline syntax is absolutely minimal (even supporting lambdas for simple transitions), inspired by the Clojure library core.graph https://github.com/plumatic/plumbing
- sequential and parallel runners are built-in (don't have to rely on Airflow)
- io provides wrappers for existing, familiar data sources, but directly borrows arguments from the Pandas and Spark APIs, so there is no new API to learn
- flexibility, in the sense that you could rip out anything, for example replacing the whole Data Catalog with another mechanism for data access like Haxl
- there's a project template which serves as a framework with built-in conventions from 50+ analytics engagements
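For anyone wondering what "pure functions wired together by name, plus a sequential runner" looks like in practice, here is a minimal sketch in plain Python. It is not Kedro's API: `Node` and `run_sequential` are hypothetical stand-ins, and the runner simply executes any node whose named inputs are already available.

```python
from collections import namedtuple

# Hypothetical node: a function plus the names of its inputs and its output.
Node = namedtuple("Node", ["func", "inputs", "output"])


def run_sequential(nodes, data):
    """Run every node once, in an order determined by data dependencies."""
    data = dict(data)  # do not mutate the caller's dict
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("some nodes have inputs that nothing produces")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def clean(raw):
    return [x for x in raw if x is not None]


def mean(values):
    return sum(values) / len(values)


# Declared out of order on purpose: the runner sorts it out from the names.
pipeline = [
    Node(mean, ["clean_values"], "average"),
    Node(clean, ["raw_values"], "clean_values"),
    Node(lambda avg: round(avg, 1), ["average"], "rounded_average"),
]

result = run_sequential(pipeline, {"raw_values": [1, None, 2, 4]})
print(result["rounded_average"])  # prints 2.3
```

Because the functions know nothing about each other or about storage, each one can be unit tested on its own, and a different runner could execute independent nodes in parallel.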