Kedro: open-source library for production-ready machine learning code

(medium.com)

74 points · ereli1 · 7y ago · 18 comments

wokwokwok · 7y ago

tldr, if you really dig past the marketing (from the FAQ (1)):

> We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers.

> Create the data transformation steps as pure Python functions

Personally, I feel mystified why you would use something like this rather than a more mature product like say, Spark, that natively supports clustering, etc, which is what I would really like to see in the FAQ.

Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, eg. spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?

Pretty hard for me to see the use case.

1. https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h...

deepyaman · 7y ago

> Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, eg. spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?

I actually had the same questions when I was first introduced to Kedro! In my case, I didn't understand the value proposition over something like Apache Beam. After using it, I feel like Kedro provides:

    1. a consistent structure across analytics pipelines. It's easy to start and pick up other Kedro projects after you've
       used it once.
    2. convenient and consistent I/O via the data catalog. The fact that we can configure and swap out data sources with ease
       is a huge plus, and we also rely heavily on data versioning.
    3. easy integration with existing frameworks (PySpark, vanilla Pandas, Dask, Airflow, Luigi, etc.)

Additionally, it aligns well with standards we have internally, like data layering. (edit: Apparently this is also part of the FAQ: https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h... Who knew!)
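To make point 2 concrete, here is a minimal sketch of the catalog idea in plain Python. It is not Kedro's API: the `CSVText` and `JSONText` classes, the `catalog` dict, and the `sales` name are all hypothetical stand-ins. The only point is that the transformation stays a pure function while the data source is swapped by changing a mapping.

```python
import csv
import io
import json


class CSVText:
    """Hypothetical dataset: parses CSV text into a list of row dicts."""

    def __init__(self, text):
        self._text = text

    def load(self):
        return list(csv.DictReader(io.StringIO(self._text)))


class JSONText:
    """Hypothetical dataset: parses a JSON array of objects."""

    def __init__(self, text):
        self._text = text

    def load(self):
        return json.loads(self._text)


def total_amount(rows):
    """Pure function: it has no idea where the rows came from."""
    return sum(float(row["amount"]) for row in rows)


# The catalog maps a logical name to a concrete source (an assumption
# of this sketch, not Kedro's actual DataCatalog class).
catalog = {"sales": CSVText("amount\n10\n32\n")}
csv_total = total_amount(catalog["sales"].load())

# Swapping the source only touches the mapping, not total_amount.
catalog["sales"] = JSONText('[{"amount": 10}, {"amount": 32}]')
json_total = total_amount(catalog["sales"].load())

print(csv_total, json_total)
```

In a real project the mapping would presumably live in configuration rather than in code, which is what makes the swap cheap.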

> Personally, I feel mystified why you would use something like this rather than a more mature product like say, Spark, that natively supports clustering, etc, which is what I would really like to see in the FAQ.

I'd say 80-90% of projects at QuantumBlack use (Py)Spark, so we've built out `SparkDataSet`s, `pandas_to_spark` and `spark_to_pandas` utility decorators, etc. There's a brief integration tutorial here: https://github.com/quantumblacklabs/kedro/tree/develop/kedro...

Full disclosure: I'm a data engineer at QuantumBlack (if it wasn't obvious already!)

FridgeSeal · 7y ago

Because running Spark to do anything that doesn’t actually require a whole cluster is like using earthmoving equipment to assemble a series of small IKEA tables?

wokwokwok · 7y ago

If you're doing something that trivial, you don't need anything more complicated than Airflow.

joelschw · 7y ago

I think one of the big differences is that during development the pipeline DAG is inferred from the data catalog rather than explicitly coded, as it has to be in something like Airflow.

The logic is that once you've finished experimenting and iterating, it's much easier to move to Airflow.
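A rough sketch of what "inferred" can mean here, assuming each step declares the dataset names it reads and writes. This is plain Python using the standard library's `graphlib` (Python 3.9+), not Kedro's implementation; the node and dataset names are made up.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: (name, input dataset names, output dataset names).
# They are deliberately listed out of order.
nodes = [
    ("train", ["features"], ["model"]),
    ("clean", ["raw"], ["cleaned"]),
    ("featurize", ["cleaned"], ["features"]),
]


def infer_order(nodes):
    """Derive an execution order purely from dataset names."""
    # Which node produces each dataset?
    producer = {out: name for name, _, outs in nodes for out in outs}
    # A node depends on whichever nodes produce its inputs. Inputs that
    # no node produces (like "raw") are treated as external data.
    graph = {
        name: {producer[i] for i in ins if i in producer}
        for name, ins, _ in nodes
    }
    return list(TopologicalSorter(graph).static_order())


order = infer_order(nodes)
print(order)
```

No edge is written by hand: renaming a dataset or adding a step changes the graph automatically, which is the contrast with wiring task dependencies explicitly.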

FridgeSeal · 7y ago

> Machine learning models which can be deployed effortlessly and operate unattended are far more likely to achieve commercial objectives.

Likelihood of achieving commercial objectives is tied to the commercial usefulness and accuracy of your analysis and predictions, not the ease of deployment or, even more curiously, the ability to be left unattended.

IanCal · 7y ago

It's surely not a particularly contentious point that hard-to-deploy systems that require lots of attention to keep running are less likely to achieve commercial objectives.

Just like your website being stable and easy to update helps your business use it to make money. Of course it also needs to be tied to commercial usefulness.

joelschw · 7y ago

This is a wider point for anyone looking to take advantage of machine learning, but reproducibility is also a problem which needs to be catered for.

prepend · 7y ago

I really like how they implemented the data catalog [0]: it’s YAML-based and uses a path-style cascade of files, so some can be shared across or within teams while others stay personal to individual projects. I think this makes it easy to build tools on top for meta-analysis (how many data sets are used, etc.) and even visualization, using a variety of tools, rather than having the metadata management tied to a single system or product.

Are there other techniques for data catalogs that are file-based, or at least based on open standards, and that scale all the way up from a single developer?

[0] https://kedro.readthedocs.io/en/latest/04_user_guide/04_data...
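For anyone unfamiliar with the cascading idea, a minimal sketch follows, with plain dicts standing in for parsed YAML files. The `team` and `personal` layers, the entry names, and the paths are hypothetical, and the recursive merge is an assumption of this sketch: the thread doesn't say how deep Kedro's own override goes.

```python
def merge(base, override):
    """Return a new dict where values in override win over base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Shared layer, e.g. checked into the team repository (hypothetical).
team = {
    "sales": {"type": "csv", "path": "s3://team-bucket/sales.csv"},
    "customers": {"type": "csv", "path": "s3://team-bucket/customers.csv"},
}

# Personal layer: only the entries one developer wants to change.
personal = {"sales": {"path": "data/sample_sales.csv"}}

catalog = merge(team, personal)
print(catalog["sales"])
```

Because every layer is just a file of plain mappings, counting or visualizing data sets needs nothing more than a parser.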

infinite8s · 7y ago

There's the intake project from the Anaconda folks.

domenicrosati · 7y ago

Conjecture: production quality of ML code has mostly to do with how heuristics are designed and battle tested and almost nothing to do with how the training/inference pipeline is constructed.

stichers · 7y ago

Just because the challenge is relatively trivial to solve doesn't make it any less important, though. Experiment management, and the transition to production, is recognised as having potentially high impact on successful delivery. My understanding is that this takes care of details which can otherwise get forgotten in the race for the best model. But YMMV.

bserial · 7y ago

I’m curious whether anyone can say how this compares to Dagster, since both libraries seem to rely on deploying to engines like Airflow.

Peteris · 7y ago

Kedro puts emphasis on a seamless transition to prod without jeopardizing work in the experimentation stage:

- pipeline syntax is absolutely minimal (even supporting lambdas for simple transitions), inspired by the Clojure library core.graph https://github.com/plumatic/plumbing

- sequential and parallel runners are built-in (don't have to rely on Airflow)

- io provides wrappers for existing familiar data sources, but directly borrows arguments from the Pandas and Spark APIs, so there is no new API to learn

- flexibility in the sense that you could rip out anything, for example, replacing the whole Data Catalog with another mechanism for data access like Haxl

- there's a project template which serves as a framework with built-in conventions from 50+ analytics engagements
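To illustrate the first two bullets, here is a sketch of lambda-based nodes run by a sequential and a parallel runner. It is plain Python with `concurrent.futures`, not Kedro's runner classes; the node tuples, dataset names, and both runner functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical nodes: (function, input dataset names, output dataset name).
nodes = [
    (lambda raw: [x * 2 for x in raw], ["raw"], "doubled"),
    (lambda raw: [x + 1 for x in raw], ["raw"], "incremented"),
    (lambda a, b: sum(a) + sum(b), ["doubled", "incremented"], "total"),
]


def run_sequential(nodes, data):
    """Run nodes one by one; assumes they are listed in dependency order."""
    data = dict(data)
    for func, inputs, output in nodes:
        data[output] = func(*(data[name] for name in inputs))
    return data


def run_parallel(nodes, data, max_workers=2):
    """Run every node whose inputs are available, one wave at a time."""
    data = dict(data)
    pending = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = [n for n in pending if all(i in data for i in n[1])]
            if not ready:
                raise ValueError("some node inputs can never be satisfied")
            # Submit the whole wave first, then collect the results.
            futures = [
                (output, pool.submit(func, *(data[i] for i in inputs)))
                for func, inputs, output in ready
            ]
            for output, future in futures:
                data[output] = future.result()
            pending = [n for n in pending if n not in ready]
    return data


seq = run_sequential(nodes, {"raw": [1, 2, 3]})
par = run_parallel(nodes, {"raw": [1, 2, 3]})
print(seq["total"], par["total"])
```

The same node list feeds both runners, so switching between them is a one-line change. A production runner would also need error handling and persistence, which this sketch leaves out.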

coverman · 7y ago

Starting to see a lot of these frameworks pop up to simplify deployment of machine learning models. I’m really hoping one or two start to stand out...but it doesn’t feel like this one.
