I'm in the same boat - looking to consolidate a few smaller data pipelines that have grown organically over the years into a single pipeline, and taking the opportunity to get the org to rethink how we emit events across the stack. As teams operate fairly independently, we need something that supports a flexible data structure whilst still allowing joins on common, well-defined fields.
The current path we are going down is a 'nested' JSON structure that lets sub-processes in the various systems inherit values from the parent. Something like:
    {
      "type": "schemaX",
      "version": 1,
      "payload": {
        "k1": "v1",
        "type": "schemaY",
        "version": 3,
        "payload": {
          "kk1": "v1"
        }
      }
    }
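To make the inheritance idea concrete, one way to resolve it might look like the sketch below: each nesting level yields its own (type, version, payload), with the child's payload merged over whatever the parent exposed. This is just an illustration of the structure above; `resolve` is a hypothetical helper, not part of any existing system.

```python
def resolve(event, inherited=None):
    """Yield (type, version, merged_payload) for each nesting level,
    letting child payloads inherit fields from the parent."""
    inherited = dict(inherited or {})
    payload = dict(event.get("payload", {}))
    # Peel off the nested child envelope (if any) from the plain fields.
    child = None
    if "type" in payload and "payload" in payload:
        child = {
            "type": payload.pop("type"),
            "version": payload.pop("version", None),
            "payload": payload.pop("payload"),
        }
    merged = {**inherited, **payload}  # this level's fields win on conflict
    yield (event["type"], event.get("version"), merged)
    if child is not None:
        yield from resolve(child, merged)

event = {
    "type": "schemaX", "version": 1,
    "payload": {"k1": "v1",
                "type": "schemaY", "version": 3,
                "payload": {"kk1": "v1"}},
}
levels = list(resolve(event))
```

With the example event, this yields schemaX with {"k1": "v1"} and schemaY with {"k1": "v1", "kk1": "v1"} - the inner event sees the outer fields without repeating them.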
The structure itself will be documented using JSON Schema, and hopefully we will be able to validate events as they are processed, though this might be too expensive in high-volume scenarios.
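For the high-volume path, a cheap hand-rolled structural check on the envelope might be enough, falling back to a full JSON Schema validator only on sampled or lower-volume streams. A minimal sketch (field names taken from the example above; `validate_envelope` is a hypothetical helper):

```python
def validate_envelope(event):
    """Cheap structural check on one envelope level: required keys and types.
    A full JSON Schema validator would replace this where cost allows."""
    errors = []
    if not isinstance(event.get("type"), str):
        errors.append("'type' must be a string")
    if not isinstance(event.get("version"), int):
        errors.append("'version' must be an integer")
    if not isinstance(event.get("payload"), dict):
        errors.append("'payload' must be an object")
    return errors
```

This only checks one level; nested payloads would be checked as they are unwrapped, so the cost scales with how deep you actually look.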
The goal is then to build a low-latency router that takes routing policies and forwards subsets of events to further data pipelines. The policies themselves will be defined in some kind of DSL (possibly JSONPath?).
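A toy version of that router might look like the following, with dotted paths standing in for full JSONPath expressions. Everything here (the policy shape, the `sink` field, the dotted-path syntax) is an assumption for illustration, not a proposal for the actual DSL:

```python
def get_path(event, path):
    """Resolve a dotted path like 'payload.type' (a stand-in for JSONPath)."""
    cur = event
    for part in path.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur

def route(event, policies):
    """Return the sinks whose policy predicates all match the event.
    Each policy: {"sink": name, "match": {path: expected_value}}."""
    return [p["sink"] for p in policies
            if all(get_path(event, path) == want
                   for path, want in p["match"].items())]

policies = [
    {"sink": "pipeline-x", "match": {"type": "schemaX"}},
    {"sink": "pipeline-y", "match": {"payload.type": "schemaY"}},
]
event = {"type": "schemaX", "version": 1,
         "payload": {"k1": "v1", "type": "schemaY", "version": 3,
                     "payload": {"kk1": "v1"}}}
sinks = route(event, policies)
```

Because the nested envelopes carry their own "type", policies can target either the outer or the inner event without the router needing to know the schemas.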
As a whole, this seems to be a fairly common problem[0] that other companies are trying to solve, but many of the low-level details are not discussed publicly. One component that does seem common to services like this is Apache Flink.[1]
[0] Netflix Keystone - https://www.youtube.com/watch?v=sPB8w-YXX1s
[1] https://flink.apache.org/