I'm in the same boat - looking to consolidate a few smaller data pipelines that have grown organically over the years into a single pipeline, and taking the opportunity to get the org to rethink how we emit events across the stack. As teams operate fairly independently, we need something that supports a flexible data structure whilst still allowing joins on common, well-defined fields.
The current path we are going down is a 'nested' JSON structure that lets sub-processes in the various systems inherit values from the parent. Something like:
    {
      "type": "schemaX",
      "version": 1,
      "payload": {
        "k1": "v1",
        "type": "schemaY",
        "version": 3,
        "payload": {
          "kk1": "v1"
        }
      }
    }
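To make the inheritance idea concrete, one way to resolve it might look like the sketch below: each nesting level yields its own (type, version, payload), with the child's payload merged over whatever the parent exposed. This is just an illustration of the structure above; `resolve` is a hypothetical helper, not part of any existing system.

```python
def resolve(event, inherited=None):
    """Yield (type, version, merged_payload) for each nesting level,
    letting child payloads inherit fields from the parent."""
    inherited = dict(inherited or {})
    payload = dict(event.get("payload", {}))
    # Peel off the nested child envelope (if any) from the plain fields.
    child = None
    if "type" in payload and "payload" in payload:
        child = {
            "type": payload.pop("type"),
            "version": payload.pop("version", None),
            "payload": payload.pop("payload"),
        }
    merged = {**inherited, **payload}  # this level's fields win on conflict
    yield (event["type"], event.get("version"), merged)
    if child is not None:
        yield from resolve(child, merged)

event = {
    "type": "schemaX", "version": 1,
    "payload": {"k1": "v1",
                "type": "schemaY", "version": 3,
                "payload": {"kk1": "v1"}},
}
levels = list(resolve(event))
```

With the example event, this yields schemaX with {"k1": "v1"} and schemaY with {"k1": "v1", "kk1": "v1"} - the inner event sees the outer fields without repeating them.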
The structure itself will be documented using JSON Schema, and hopefully we will be able to validate events as they are processed, though this might be too expensive in high-volume scenarios.
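For the high-volume path, a cheap hand-rolled structural check on the envelope might be enough, falling back to a full JSON Schema validator only on sampled or lower-volume streams. A minimal sketch (field names taken from the example above; `validate_envelope` is a hypothetical helper):

```python
def validate_envelope(event):
    """Cheap structural check on one envelope level: required keys and types.
    A full JSON Schema validator would replace this where cost allows."""
    errors = []
    if not isinstance(event.get("type"), str):
        errors.append("'type' must be a string")
    if not isinstance(event.get("version"), int):
        errors.append("'version' must be an integer")
    if not isinstance(event.get("payload"), dict):
        errors.append("'payload' must be an object")
    return errors
```

This only checks one level; nested payloads would be checked as they are unwrapped, so the cost scales with how deep you actually look.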
The goal is then to build a low-latency router that takes routing policies and forwards subsets of events to further data pipelines. The policies themselves will be defined in some kind of DSL (possibly JSONPath?).
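A toy version of that router might look like the following, with dotted paths standing in for full JSONPath expressions. Everything here (the policy shape, the `sink` field, the dotted-path syntax) is an assumption for illustration, not a proposal for the actual DSL:

```python
def get_path(event, path):
    """Resolve a dotted path like 'payload.type' (a stand-in for JSONPath)."""
    cur = event
    for part in path.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur

def route(event, policies):
    """Return the sinks whose policy predicates all match the event.
    Each policy: {"sink": name, "match": {path: expected_value}}."""
    return [p["sink"] for p in policies
            if all(get_path(event, path) == want
                   for path, want in p["match"].items())]

policies = [
    {"sink": "pipeline-x", "match": {"type": "schemaX"}},
    {"sink": "pipeline-y", "match": {"payload.type": "schemaY"}},
]
event = {"type": "schemaX", "version": 1,
         "payload": {"k1": "v1", "type": "schemaY", "version": 3,
                     "payload": {"kk1": "v1"}}}
sinks = route(event, policies)
```

Because the nested envelopes carry their own "type", policies can target either the outer or the inner event without the router needing to know the schemas.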
As a whole, this seems to be a fairly common problem[0] that other companies are trying to solve, but many of the low-level details are not discussed publicly. One component that does seem common to services like this is Apache Flink.[1]
[0] Netflix Keystone - https://www.youtube.com/watch?v=sPB8w-YXX1s
[1] https://flink.apache.org/