I like how the article notes that the stuff we were talking about with Reverse ETL (mostly activating your data in SaaS systems like Salesforce, Zendesk, etc.) is one important part of Publishing. But we are also seeing traditional use cases like file uploads and new fancy stuff like vector databases.
1. Acquire bytes (void → unstructured)
2. Parse bytes to events (unstructured → structured)
3. Transform events (structured → structured)
4. Print events (structured → unstructured)
5. Send bytes (unstructured → void)
The "Publish" part is a combination of (4) and (5). Sometimes they are fused because not all APIs differentiate those steps. We're currently focusing on building blocks (engine, connectors, formats) as opposed to application-level integrations, so turnkey Reverse ETL is not near. But the main point is that the symmetry reduces cognitive effort for the user, because they worked that muscle on the "E" side already and now just need to find the dual in the docs.
[1] https://docs.tenzir.com/blog/five-design-principles-for-buil...
The primitives of many of these ETL systems are structured tables (Snowflake, Parquet, pandas DataFrames, whatever), and I don't think I'd ever choose bytes over structured tables. The unstructured parts of data systems I've worked on have always chewed up an outsize portion of labor, with difficult-to-diagnose failure modes. The biggest cognitive-effort win of Reverse ETL solutions has been to make external systems and applications "speak table".
In security, binary artifacts are common, e.g., scanning malware samples with YARA rules to produce a structured report ("table"). Turning packet traces into structured logs is another example. Typically you have to switch between a lot of tools for that, which makes the process complex.
(The "void" type exists only for symmetry, so that every operator has an input and an output type. The presence of void makes an operator a source or a sink. The invariant is that a "closed" pipeline has both a source and a sink, and only closed pipelines can execute in our mental model.)
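The closed-pipeline invariant can be checked mechanically from the operator types. A small sketch, assuming a made-up operator table (the names and types here are illustrative, not Tenzir's real operators):

```python
# name -> (input_type, output_type); void marks a source or sink end.
OPERATORS = {
    "load_file":  ("void", "bytes"),
    "parse_json": ("bytes", "events"),
    "head":       ("events", "events"),
    "write_json": ("events", "bytes"),
    "save_file":  ("bytes", "void"),
}

def is_closed(pipeline: list[str]) -> bool:
    types = [OPERATORS[op] for op in pipeline]
    # Adjacent operators must agree on their connecting type ...
    if any(a[1] != b[0] for a, b in zip(types, types[1:])):
        return False
    # ... and the whole pipeline must begin and end in void.
    return types[0][0] == "void" and types[-1][1] == "void"
```

Only a pipeline for which `is_closed` holds would be executable under this model; `["parse_json", "head"]` type-checks internally but is open at both ends.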
ETL1: gather the raw data from the data source, mapping it to the schema required to load it into the data store.
ETL2: pull the normalized data, process it in some way, and load into a downstream data store.
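The two hops can be made concrete with a toy sketch (all field names and records are made up): ETL1 maps raw source records onto the warehouse schema; ETL2 pulls the normalized rows, aggregates them, and produces what gets loaded downstream.

```python
raw_source = [
    {"ts": "2024-01-01T00:00:00Z", "amt": "19.99", "cur": "USD"},
    {"ts": "2024-01-01T01:00:00Z", "amt": "5.00",  "cur": "USD"},
]

def etl1(records: list[dict]) -> list[dict]:
    """Hop 1: map source fields onto the warehouse schema."""
    return [
        {"timestamp": r["ts"], "amount": float(r["amt"]), "currency": r["cur"]}
        for r in records
    ]

def etl2(warehouse_rows: list[dict]) -> dict:
    """Hop 2: process normalized rows for a downstream data store."""
    total = sum(r["amount"] for r in warehouse_rows)
    return {"currency": "USD", "total_amount": round(total, 2)}

warehouse = etl1(raw_source)
downstream = etl2(warehouse)  # -> {"currency": "USD", "total_amount": 24.99}
```

Structurally both hops are the same move (extract, shape, load), which is the "largely arbitrary distinction" point below.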
I suppose that ETL is typically bound to getting data into a warehouse, but that feels like a largely arbitrary distinction. We are just moving data from source to sink.
Here are my thoughts though on the importance of the distinction...
The first place you land the data is almost always a place you control - either a data warehouse or a data lake that you have tuned for fast and flexible data processing. The second (publish) process pushes to a location you most likely can't control, and which is not prepared to receive raw/unshaped data.
This is important because the business logic in our transformations will almost always evolve over time. Running the transforms between EL and P (the second "EL") gives us the reproducibility and efficiency to iterate, using the location where we have the best performance profile for running them.
What do you think?
I'm not convinced the distinction is important enough to warrant anything other than bucketing it under Reverse ETL, and I think the terms introduced (ELTP and "EL Pairs") create less clarity, not more.
> pushes to a location you most likely can't control
Even for internal data hand-offs, this is usually the case. Unless the same team does both the ETL work and builds the app that consumes the output, the data team is delivering something that was signed off by the receiving team.
> not prepared to receive raw/unshaped data
So like all Reverse ETL, which requires some sort of integration boundary for data delivery. That could be an API, or a CSV file uploaded to an FTP server, or reading schema'd JSON from Kafka. In every instance, the data team needs to tailor the output specifically to the receiver.
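Tailoring the same records to two different integration boundaries might look like this sketch (the receivers, field names, and envelope shape are invented for illustration):

```python
import csv
import io
import json

records = [{"user_id": 1, "plan": "pro"}, {"user_id": 2, "plan": "free"}]

def to_csv_for_ftp(rows: list[dict]) -> str:
    """One receiver wants a headered CSV file dropped on an FTP server."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["user_id", "plan"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json_for_api(rows: list[dict]) -> str:
    """Another receiver wants a JSON body wrapped in an envelope."""
    return json.dumps({"items": rows})
```

Same structured data, two different "print" steps, each dictated entirely by the receiver.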
Regarding control, that's something I've never felt with production data. It's such a wild beast. Once the data leaves your team/code, all bets are off.
#AcronymAllTheThings