Change Data Capture: The Magic Wand We Forgot

(martin.kleppmann.com)

40 points · timclark · 10y ago · 13 comments

9 comments · 3 top-level

boothead · 10y ago · 4 in thread

I have come to believe that storing your data as the semantic events that happen, rather than the state at a given point in time, is the way to go. From what I've seen, change data capture is the opposite process: trying to extract an event stream from the data changes.

A_Beer_Clinked · 10y ago

Lots of databases are configured to do both. The tables store what we normally think of as "the data" and the log stores the changes. Tables are like HEAD in git, and the transaction log is like the chain of commits.

In principle you could just query the transaction log for every change to your data and compute the final state every time. Obviously this would be onerous, so in normal operation we just use the latest state.
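
As a rough sketch of that "compute the final state from the log" idea: the tuple-based log format and the `replay` function below are made up for illustration and do not correspond to any real database's transaction log.

```python
# Hypothetical log format: (operation, key, value) tuples in commit order.
def replay(log):
    """Rebuild the current table state by applying every change in order."""
    state = {}
    for op, key, value in log:
        if op in ("insert", "update"):
            state[key] = value        # latest write wins
        elif op == "delete":
            state.pop(key, None)      # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


log = [
    ("insert", "alice", {"balance": 10}),
    ("insert", "bob", {"balance": 5}),
    ("update", "alice", {"balance": 7}),
    ("delete", "bob", None),
]
table = replay(log)  # the "HEAD" view derived from the chain of changes
```

Replaying only a prefix of the log gives the state at an earlier point, which is the rewinding case.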

When things go wrong the transaction log is useful for understanding why and also rewinding/replaying the database to the correct state.

Some databases ship these transaction logs around between replicas to keep them all in sync.

The work presented here is an interesting application of the same basic mechanism to keep different flavours of datastores in sync.

Recently we very briefly explored the idea of using this mechanism to implement partial replication for partitioned reporting data stores. Unfortunately our current platform, SQL Azure, doesn't grant access to the transaction log directly. (On balance this is a good thing, because it handles all the replication etc. for us.)

ZenoArrow · 10y ago

Why do you believe capturing semantic events (update statements, delete statements, alter statements, etc...) is superior to capturing a log of the data changes?

Whilst capturing semantic events has an element of compactness, the benefit of a simpler mechanism like logs is that you don't need a full database engine to parse the data, and it may end up offering better performance. For example, there is no need to calculate what a commit rollback entails on every node: just do it on the master node and let the other nodes read the logs to know what to update.

boothead · 10y ago

I didn't explain properly. I meant that I think the things that should be stored are things like:

  CustomerCreated { stuff }
  CustomerMadeOrder { custId, stuff }
  ItemAddedToOrder { orderId, stuff }

  etc..

This is the event sourcing view of the world.
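
A minimal sketch of what that could look like, assuming Python dataclasses for the events. The event names come from the list above; the fields (`cust_id`, `name`, `order_id`, `sku`) and the `project_orders` helper are hypothetical stand-ins for "stuff".

```python
from dataclasses import dataclass


# Field names below are illustrative assumptions, not part of the original list.
@dataclass(frozen=True)
class CustomerCreated:
    cust_id: int
    name: str


@dataclass(frozen=True)
class CustomerMadeOrder:
    cust_id: int
    order_id: int


@dataclass(frozen=True)
class ItemAddedToOrder:
    order_id: int
    sku: str


def project_orders(events):
    """Fold the event stream into a current-state view of orders."""
    orders = {}
    for event in events:
        if isinstance(event, CustomerMadeOrder):
            orders[event.order_id] = {"cust_id": event.cust_id, "items": []}
        elif isinstance(event, ItemAddedToOrder):
            orders[event.order_id]["items"].append(event.sku)
        # Events this view does not care about (e.g. CustomerCreated) are skipped.
    return orders


events = [
    CustomerCreated(cust_id=1, name="Ada"),
    CustomerMadeOrder(cust_id=1, order_id=100),
    ItemAddedToOrder(order_id=100, sku="book"),
    ItemAddedToOrder(order_id=100, sku="pen"),
]
orders = project_orders(events)
```

The events are the stored record; the `orders` dictionary is just one view that can be rebuilt from them at any time.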

strictfp · 10y ago

Aka "event sourcing". See for instance https://geteventstore.com

baseballmerpeak · 10y ago · 2 in thread

Essentially, one database to rule them all?

brianxq3 · 10y ago

It is very much the opposite. With this pattern, you're going to have lots of copies of your data in different transformations in potentially many different data stores. The idea is that you take the stream of changes from something like Postgres and use that stream to populate caches, indexes, denormalizations/representations, counts, etc.
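
To make the fan-out concrete, here is a small sketch with one change stream feeding two derived stores. Everything in it is assumed for illustration: the `(op, key, row)` change format, the `RowCache` and `StatusCounts` classes, and the `status` column. It is not how Postgres or any particular CDC tool actually exposes changes.

```python
from collections import Counter


class RowCache:
    """Derived store 1: a key -> latest row cache."""

    def __init__(self):
        self.rows = {}

    def apply(self, op, key, row):
        if op == "delete":
            self.rows.pop(key, None)
        else:  # insert or update
            self.rows[key] = row


class StatusCounts:
    """Derived store 2: a running count of rows per status value."""

    def __init__(self):
        self.status_by_key = {}
        self.counts = Counter()

    def apply(self, op, key, row):
        # Undo the key's previous contribution, if any, then add the new one.
        old = self.status_by_key.pop(key, None)
        if old is not None:
            self.counts[old] -= 1
        if op != "delete":
            self.status_by_key[key] = row["status"]
            self.counts[row["status"]] += 1


def fan_out(changes, consumers):
    """Deliver every change, in order, to every derived store."""
    for op, key, row in changes:
        for consumer in consumers:
            consumer.apply(op, key, row)


changes = [
    ("insert", 1, {"status": "open"}),
    ("insert", 2, {"status": "open"}),
    ("update", 1, {"status": "shipped"}),
    ("delete", 2, None),
]
cache, counts = RowCache(), StatusCounts()
fan_out(changes, [cache, counts])
```

Each consumer keeps its own representation, and adding another index or denormalization means adding another consumer of the same stream rather than touching the source database.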

akkartik · 10y ago

One append-only data structure to rule all them databases.

adamtj · 10y ago

See also, "The Log: What every software engineer should know about real-time data's unifying abstraction"

http://engineering.linkedin.com/distributed-systems/log-what...

https://news.ycombinator.com/item?id=6916557