I like how the article notes that the stuff we were talking about with Reverse ETL (mostly activating your data in SaaS systems like Salesforce, Zendesk, etc.) is one important part of Publishing. But we are also seeing traditional use cases like file uploads and new fancy stuff like vector databases.
1. Acquire bytes (void → unstructured)
2. Parse bytes to events (unstructured → structured)
3. Transform events (structured → structured)
4. Print events (structured → unstructured)
5. Send bytes (unstructured → void)
The "Publish" part is a combination of (4) and (5). Sometimes they are fused because not all APIs differentiate those steps. We're currently focusing on building blocks (engine, connectors, formats) as opposed to application-level integrations, so turnkey Reverse ETL is not near. But the main point is that the symmetry reduces cognitive effort for the user, because they worked that muscle on the "E" side already and now just need to find the dual in the docs.
[1] https://docs.tenzir.com/blog/five-design-principles-for-buil...
The primitives of many of these ETL systems are structured tables (Snowflake, Parquet, pandas DataFrames, whatever), and I don't think I'd ever choose bytes over structured tables. The unstructured parts of data systems I've worked on have always chewed up an outsize portion of labor, with difficult-to-diagnose failure modes. The biggest cognitive-effort win of Reverse ETL solutions has been to make external systems and applications "speak table".
In security, binary artifacts are common, e.g., scanning malware samples with YARA rules to produce a structured report ("table"). Turning packet traces into structured logs is another example. Typically you have to switch between a lot of tools for that, which makes the process complex.
(The "void" type exists only for symmetry, so that every operator has an input and an output type. The presence of void makes an operator a source or a sink. The invariant is that a "closed" pipeline has both a source and a sink, and only closed pipelines can execute in our mental model.)
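The closed-pipeline invariant can be checked mechanically from the operator types. A small sketch, assuming a made-up operator table (the names and types here are illustrative, not Tenzir's real operators):

```python
# name -> (input_type, output_type); void marks a source or sink end.
OPERATORS = {
    "load_file":  ("void", "bytes"),
    "parse_json": ("bytes", "events"),
    "head":       ("events", "events"),
    "write_json": ("events", "bytes"),
    "save_file":  ("bytes", "void"),
}

def is_closed(pipeline: list[str]) -> bool:
    types = [OPERATORS[op] for op in pipeline]
    # Adjacent operators must agree on their connecting type ...
    if any(a[1] != b[0] for a, b in zip(types, types[1:])):
        return False
    # ... and the whole pipeline must begin and end in void.
    return types[0][0] == "void" and types[-1][1] == "void"
```

Only a pipeline for which `is_closed` holds would be executable under this model; `["parse_json", "head"]` type-checks internally but is open at both ends.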
ETL1: gather the raw data from the data source, mapping it to the schema required to load it into the data store.
ETL2: pull the normalized data, process it in some way, and load into a downstream data store.
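The two hops can be made concrete with a toy sketch (all field names and records are made up): ETL1 maps raw source records onto the warehouse schema; ETL2 pulls the normalized rows, aggregates them, and produces what gets loaded downstream.

```python
raw_source = [
    {"ts": "2024-01-01T00:00:00Z", "amt": "19.99", "cur": "USD"},
    {"ts": "2024-01-01T01:00:00Z", "amt": "5.00",  "cur": "USD"},
]

def etl1(records: list[dict]) -> list[dict]:
    """Hop 1: map source fields onto the warehouse schema."""
    return [
        {"timestamp": r["ts"], "amount": float(r["amt"]), "currency": r["cur"]}
        for r in records
    ]

def etl2(warehouse_rows: list[dict]) -> dict:
    """Hop 2: process normalized rows for a downstream data store."""
    total = sum(r["amount"] for r in warehouse_rows)
    return {"currency": "USD", "total_amount": round(total, 2)}

warehouse = etl1(raw_source)
downstream = etl2(warehouse)  # -> {"currency": "USD", "total_amount": 24.99}
```

Structurally both hops are the same move (extract, shape, load), which is the "largely arbitrary distinction" point below.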
I suppose that ETL is typically bound to getting data into a warehouse, but that feels like a largely arbitrary distinction. We are just moving data from source to sink.
Here are my thoughts though on the importance of the distinction...
The first place you land the data is almost always a place you control - either a data warehouse or a data lake that you have tuned for fast and flexible data processing. The second (publish) process pushes to a location you most likely can't control, and which is not prepared to receive raw/unshaped data.
This is important because the business logic in our transformations will almost always evolve over time. Running the transforms between EL and P (the second "EL") gives us the reproducibility and efficiency to iterate, using the location where we have the best performance profile for running them.
What do you think?
I'm not convinced the distinction is important enough to warrant anything other than bucketing it under Reverse ETL, and I think the terms introduced (ELTP and "EL Pairs") create less clarity, not more.
> pushes to a location you most likely can't control
Even for internal data hand-offs, this is usually the case. Unless the same team does both the ETL work and builds the app that consumes the output, the data team is delivering something that was signed off by the receiving team.
> not prepared to receive raw/unshaped data
So like all Reverse ETL, which requires some sort of integration boundary for data delivery. That could be an API, or a CSV file uploaded to an FTP server, or reading schema'd JSON from Kafka. In every instance, the data team needs to tailor the output specifically to the receiver.
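Tailoring the same records to two different integration boundaries might look like this sketch (the receivers, field names, and envelope shape are invented for illustration):

```python
import csv
import io
import json

records = [{"user_id": 1, "plan": "pro"}, {"user_id": 2, "plan": "free"}]

def to_csv_for_ftp(rows: list[dict]) -> str:
    """One receiver wants a headered CSV file dropped on an FTP server."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["user_id", "plan"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json_for_api(rows: list[dict]) -> str:
    """Another receiver wants a JSON body wrapped in an envelope."""
    return json.dumps({"items": rows})
```

Same structured data, two different "print" steps, each dictated entirely by the receiver.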
Regarding control, that's something I've never felt with production data. It's such a wild beast. Once the data leaves your team/code, all bets are off.
#AcronymAllTheThings