We at Datafold are excited to announce a new release of data-diff (https://github.com/datafold/data-diff), an open-source tool that efficiently compares tables within or across a wide range of SQL databases. This release includes a lot of new features, improvements and bugfixes.
We released the first version 6 months ago because we believe that diffing data is as fundamental of a capability as diffing code in data engineering workflows. Over the past few months, we have seen data-diff being adopted for a variety of use-cases, such as validating migration and replication of data between databases (diffing source and target) and tracking the effects of code changes on data (diffing staging/dev and production environments).
With this new release data-diff is significantly faster at comparing tables within the same database, especially when there are a lot of differences between the tables. We've also added the ability to materialize the diff results into a database table, in addition to (or instead of) outputting them to stdout. We've added support for DuckDB, and for diffing schemas. Improved support for alphanumerics, and threading, and generally improved the API, the command-line interface, and stability of the tool.
We believe that data-diff is a valuable addition to the open source community, and we are committed to continue growing it and the community around it. We encourage you to try it out and let us know what you think!
You can read more about data-diff on our GitHub page at the following link: https://github.com/datafold/data-diff/
To see the list of changes for the 0.3.0 release, go here: https://github.com/datafold/data-diff/releases/tag/v0.3.0
As a company, Datafold builds tools for data engineers to automate the most tedious and error-prone tasks falling through the cracks of the modern data stack, such as data testing and lineage. We launched two years ago with a tool for regression-testing changes to ETL code https://news.ycombinator.com/item?id=24071955. It compares the produced data before and after the code change and shows the impact on values, aggregate metrics, and downstream data applications.
While working with many customers on improving their data engineering experience, we kept hearing that they needed to diff their data across databases to validate data replication between systems.
There were 3 main use cases for such replication:
(1) To perform analytics on transactional data in an OLAP engine (e.g. PostgreSQL > Snowflake) (2) To migrate between transactional stores (e.g. MySQL > PostgreSQL) (3) To leverage data in a specialized engine (e.g. PostgreSQL > ElasticSearch).
Despite multiple vendors (e.g., Fivetran, Stitch) and open-source products (Airbyte, Debezium) solving data replication, there was no tooling for validating the correctness of such replication. When we researched how teams were going about this, we found that most have been either:
Running manual checks: e.g., starting with COUNT(*) and then digging into the discrepancies, which often took hours to pinpoint the inconsistencies. Using distributed MPP engines such as Spark or Trino to download the complete datasets from both databases and then comparing them in memory – an expensive process requiring complex infrastructure.
Our users wanted a tool that could:
(1) Compare datasets quickly (seconds/minutes) at a large (millions/billions of rows) scale across different databases (2) Have minimal network IO and database workload overhead. (3) Provide straightforward output: basic stats and what rows are different. (4) Be embedded into a data orchestrator such as Airflow to run right after the replication process.
So we built Data Diff as an open-source package available through pip. Data Diff can be run in a CLI or wrapped into any data orchestrator such as Airflow, Dagster, etc.
To solve for speed at scale with minimal overhead, Data Diff relies on checksumming the data in both databases and uses binary search to identify diverging records. That way, it can compare arbitrarily large datasets in logarithmic time and IO – only transferring a tiny fraction of the data over the network. For example, it can diff tables with 25M rows in ~10s and 1B+ rows in ~5m across two physically separate PostgreSQL databases while running on a typical laptop.
We've launched this tool under the MIT license so that any developer can use it, and to encourage contributions of other database connectors. We didn't want to charge engineers for such a fundamental use case. We make money by charging a license fee for advanced solutions such as column-level data lineage, CI workflow automation, and ML-powered alerts.
One of the biggest pain points in developing ETL pipelines – chains of jobs that move, clean, merge and aggregate analytical data – has been regression testing: verifying how a change in source code (mostly, SQL) affects the produced data.
Early in my career, as an on-call data engineer at Lyft, I accidentally introduced a breaking code change while attempting to ship a hotfix at 4AM to a SQL job that computed tables for core business analytics. A seemingly small change in filtering logic ended up corrupting data for all downstream pipelines and breaking dashboards for the entire company. Apart from being a silly mistake, this highlighted the lack of proper tooling for testing changes. If there had been a way to quickly compare the data computed by production code vs. the hotfix branch, I would have immediately spotted the alarming divergence and avoided merging the breaking change.
Without a diffing tool, the typical options for regression testing are: (1) Data “unit tests” (e.g. check primary key uniqueness, ensure values are within interval, etc.) – these are helpful, but costly investment. Frameworks such as dbt make it easier, but it’s often still prohibitively hard to verify all assumptions in a large table. (2) Write custom SQL queries to compare data produced by the prod and dev versions of the source code (e.g. compare counts, match primary keys). This can easily take up 100+ lines of SQL and hours of unsatisfying work, which no one really wants to do. (3) "Fuck It, Ship It" is always an option but too risky nowadays as analytical data not only powers dashboards but also production ML models.
As this problem is common in data engineering, some large organizations have built and open-sourced their solutions – for example, BigDiffy by Spotify. However, most of these tools are CLI-based and produce results in a plain-text format which is hard to comprehend when you are dealing with complex data.
To fit existing workflows of our users, we’ve built a web interface with interactive charts showing both diff summary statistics (e.g. % of different values by column) and value-level side-by-side comparison (git diff style). But since the mission of the tool is to save engineers as much time as possible, we also opened an API for automation through Airflow or other orchestrators, and built a Github workflow that runs diff on every pull request with changes to ETL code. Since billion-row-scale datasets are not uncommon nowadays, there is an optional sampling feature that helps keep compute costs low and get results within a few minutes no matter how large the dataset is.
We've found Datafold to be a good fit for the following workflows: (1) Developing data transformations – before an ETL job is shipped to production, it undergoes multiple iterations. Often it’s important to see how data changes between every iteration, and particularly useful if you have 1M+ rows and 100+ columns where “SELECT *” becomes useless. (2) Code review & testing: large organizations have hundreds of people committing to ETL codebases. Understanding the impact of even a modest SQL diff is daunting. Datafold can produce a data diff for every commit in minutes so changes are well understood. (3) Data transfer validation: moving large volumes of data between databases is error-prone, especially if done via change data capture (CDC): a single lost event can affect the resulting dataset in a way that is tricky to debug. We allow comparing datasets across different databases, e.g. PostgreSQL & Snowflake.
We've set up a sandbox at https://app.datafold.com/hackernews so you can see how diffing works. Shoot us an email (hn@datafold.com) to set up a trial and use it with your own data.
We are passionate about improving tooling for data engineers and would love to hear about your experience with developing data pipelines and ensuring data quality. Also, if you think that dataset diffing can be helpful in other domains, we are very curious to learn from you!