Tracking SQLite Database Changes in Git (opens in new tab)

(garrit.xyz)

347 pointsvldr2y ago110 comments

110 comments

84 comments · 29 top-level

redleader552y ago· 11 in thread

The filter[0] attribute with "smudge" and "clean" would be a better solution for storing SQLite into git.

[0] - https://git-scm.com/docs/gitattributes#_filter

But the key disadvantage is that now getting started with the repository requires more effort: you must have sqlite3 installed locally, and you must apply certain config values manually (and “remember to run this helper script” or “the first time you try running it it’ll⸺” are still manual).

(This is for security reasons—you don’t want commands like clone and diff to be able to execute arbitrary code without your knowledge. Hence the attributes/config split.)

Otherwise, you’ll find that your .sqlite file is SQL instead of a SQLite database, and if you try running stuff while it’s in that condition you’ll get “file is not a database” errors from SQLite.

redleader552y ago

You are right, there is a bootstrap cost and potential security implications. In general, also, .gitattributes is pretty evil, but:

1. don't store binary files in Git 2. don't run programs that keep a git file open for too long, otherwise clowntown. For example, you run git pull while GnuCash keeps the file open and then you loose all your updates when you save (also a problem with the code below, but fixable).

The correct solution to this is running the "smudge" and "clean" operations as a backup and restore operation, outside the repo, when starting and closing GnuCash, in this case:

```

#!/bin/bash

cat $1.sql > $1.sqlite3

gnucash $1.sqlite3

echo ".dump" | sqlite3 $1.sqlite3 > $1.sql

# optionally:

# git add $1.sql

# git commit -m "Update $(date) $1.sql"

```

1 more reply

JNRowe2y ago

You have to be careful with this, as using .dump as your filter will not necessarily get you an equivalent database on restore. For example, .dump will not include database metadata such as application_id¹ or user_version²(perhaps other things too but these are the ones that have bitten me before).

The custom diff driver in the OP will not reflect changes to those values either obviously, but at least it won't lose them in the actual database file stored in the git repository.

¹ https://sqlite.org/pragma.html#pragma_application_id

² https://sqlite.org/pragma.html#pragma_user_version

OskarS2y ago

Is there no capability in SQLite to dump these things as well? Seems like a pretty big thing missing if that's the case, a thing I would expect SQLite to support pretty well.

1 more reply

bambax2y ago

Ah yes, I made a similar observation here https://news.ycombinator.com/item?id=38111408

It's better to store text in git IMHO.

rewmie2y ago

> It's better to store text in git IMHO.

I think the bias towards text files in Git is only a reflection of Git configuring its defaults to handle text files, such as the diff tool and also file type. If you add a custom file type and configure it to not handle it as text, which includes specifying git attributes to prevent it to update things like newline characters, then Git works just as well.

https://git-scm.com/docs/gitattributes

2 more replies

michaelmior2y ago

This is what I was expecting. Store the dump in git but then have the binary in the working directory.

foobarbecue2y ago

Is there a web interface like GitHub that supports this? I wanted to use it with GitHub Enterprise a few years ago and found that it meant I couldn't see diffs properly in GitHub.

massysett2y ago

Why is this better?

sshine2y ago

Maintaining a big, binary blob will make the git history grow in proportion to the size of the binary blob, not the size of the logical/abstract diff, regardless of how comprehensible the diff looks.

To minimize the disk space to only the changes, you want to store the changes and not the complete, binary before/after images. By filtering, you reduce the stored object to the difference.

rjmunro2y ago

You won't be storing the sqlite file in git. So you won't be storing indexes or un-vacuumed data in git. You also won't be storing a database with some of it's rows locked or anything like that.

Git will be able to compress the text better, and compress between versions better.

Merging and resolving conflicts would have a chance of working.

simonw2y ago· 9 in thread

This approach works by storing the actual SQLite binary files in Git and then using a custom "diff" configuration to dump each file as SQL and compare the result.

It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.

I built my own tooling to solve this problem: https://datasette.io/tools/sqlite-diffable - which outputs a “diffable” copy of the data in a SQLite database, precisely so you can store it in Git and look at the differences later.

I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-diffable).

Here’s an example diff: https://github.com/simonw/simonwillisonblog-backup/commit/72...

chlorion2y ago

I don't think there will be any huge difference between a text dump and binary formats here, git doesn't really care about the underlying data when it stores and packs it.

Maybe a text format that is sorted before saved would compress a lot better though, both with zlib compression and gits delta compression. You can't really sort a binary file and put it back together!

astrobe_2y ago

I'm more wondering about whether or not this Sqlite DB is actually a "source file", that is something that is not generated by something else, which is normally the base material for both make and Git.

Obviously an SQLite DB cannot be created ex nihilo, so it is not a source file. Isn't it a bit dangerous to store the database file in the repo, and not the mean to generate it? If both are included in the repo, it then seems redundant (reproducible builds concerns, maybe?).

Seeing that many people are interested in this topic, I wonder in which cases they would however want to track that file in Git.

1 more reply

infamia2y ago

> I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-

I'm curious, what is the reason you chose not to use pgdump, but instead opted to convert to to sqlite and then dump the DB using sqlite-diffable?

On a project I'm working on, I'd like to dump our Postgres schema into individual files for each object (i.e., one file for each table, function, stored proc, etc.), but haven't spent enough time to see if pgdump could actually do that. We're just outputting files by object type for now (one tables, function, and stored procs files). It looks like sqlite-diffable does something similar to what we're looking for (separate files for each object).

simonw2y ago

Mainly it's because all of my other projects are SQLite and this felt like a good opportunity to try out some of my own tooling against the main PostgreSQL database I still use for an active project.

scorpioxy2y ago

pgdump can do that if you go with the directory format output. According to the docs, it will "create a directory with one file for each table and large object being dumped, plus a so-called Table of Contents file describing the dumped objects in a machine-readable format that pg_restore can read."

tlb2y ago

> It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.

Why is that? I would have expected git's delta compression to work well on sqlite files, since small table changes usually cause only a small number of page changes.

dunham2y ago

I would think so too. As a quick test I added chinook.db to a fresh repository, committed it, added a row to a table, committed again, and then did a git gc.

The resulting pack file, with two revisions of the database, is 329k. The raw database is 864k and gzips to 328k.

So it does look like git's delta encoding works with sqlite's blocks.

thisismyswamp2y ago

> storing binary files like that in Git isn't as space efficient as using a plain text format

Why? Because git stores deltas?

At a first glance I'd expect a binary file to be radically more compact than the plain text equivalent.

lubesGordi2y ago

Does this custom diff driver also help git do a merge of two dbs?

michaelmior2y ago· 6 in thread

It's rather annoying that part of the config is stored in your local checked out copy of the repository (.git/config) and part of it is in a configuration file that is checked into version control (.gitattributes).

tom_2y ago

That's because part of it is for you personally (the bit that dictates exactly which program gets run to render the diff), and part of it is for everybody (the bit that specifies the diff-related file type).

michaelmior2y ago

Sure, but unless I'm missing something, adding `sqlite3` into my `.gitattributes` is useless without manually doing extra work to specify the config. I get that it would be nice to be able to override this locally, but having no way to specify a default way to make this work feels broken.

1 more reply

KolmogorovComp2y ago

The usual workaround is to create a config file, check it and symlink it to .git/config.

chrismorgan2y ago

That’s not a usual workaround. That’s a crippled technique that doesn’t scale beyond single-user, single-branch usage, because .git/config includes things like the remotes and branches.

IshKebab2y ago

That sucks because then you can't have project and user-specific configs (e.g. setting your email address for a particular project).

2 more replies

liveoneggs2y ago

Even better - use an include/includeIf in your config

olvy02y ago· 4 in thread

Repeating myself from several days ago (https://news.ycombinator.com/item?id=37958006):

4 years ago I wrote an internal tool to serialize records of our hierarchical database to a bunch of hierarchical json files, which are then managed in a normal git repo. Each record is serialized to json, "root" entities in the hierarchy are serialized as to contain all their child entities. This is done as the user works - the database is updated and then the correct json file is read, updated and written back. The current SHA is always updated in the database, so there's a way to recognize users worked outside our app with git on the repo, we can detect that and automatically the database based on the diff. There's also an opposite sync.

This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".

In fact users now complain that this duality of database <-> repo is confusing and we should ditch the database. Our users are unfortunately not very technical and sometimes require a lot of hand holding.

In theory they're correct but in practice this would mean ditching our ORM and rewriting one from scratch that will work on top of the json files. While keeping the same API outwards. Which is potentially a lot of work. Maybe that will happen, we haven't decided yet.

I wrote much more about it here:

https://news.ycombinator.com/item?id=25005993

scotty792y ago

From time to time I'm searching for a database engine for small databases that stores the data as diffable text, just provides up to date indexing and sql on top of that to make the operations convenient and fast.

Didn't find any. I keep encountering BerkleyDB but I don't know how good it is for that purpose.

bachmeier2y ago

You might want to look at tsv-utils, or a similar project: https://github.com/eBay/tsv-utils

For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)

o11c2y ago

HSQLDB does that, but 1. ew, Java, and 2. there are reasons people don't do that.

The major problem with text formats is that it means either you have to rewrite the whole file on every change ... or else you have to append arbitrarily and be able to survive that (if you're committing to git you probably are capable of dealing with the storage, at least).

If the former is feasible for you (because incremental updates are rare), just use your favorite database's load/dump around every session, and do some fsync testing (or just keep full backups).

If the latter compromise is acceptable there might be something you can do with streaming logical replication on some DBs, but again there's a reason people generally don't.

zffr2y ago

For what use-cases do you need this?

Have you considered Fossil (link below)? it is a SCM tool built on top of sqlite.

I have not used it myself, but I would assume it stores file revisions as diffs, and since its all just sqlite you should be able to run arbitrary queries on it.

https://www2.fossil-scm.org/home/doc/trunk/www/index.wiki

1 more reply

chrismorgan2y ago· 4 in thread

Because it’s fun, here’s an alternative spelling of `echo .dump | sqlite3`:

  <<<.dump sqlite3

This avoids invoking echo unnecessarily. Every time you do it you might just save a picowatt-hour or two! (It shouldn’t be more expensive. But be sure to spend a few watt-hours benchmarking it rigorously.)

This uses two things that shell users are commonly unfamiliar with:

• <<<: as `> filename` redirects stdout to the named file and `< filename` pipes the named file into stdin, `<<< word` pipes the word into stdin. (The whitespace is optional, and word follows normal argument splitting so you can do things like `<<<"Hello, world!"`.)

• Ordering: although most commonly written at the end of commands, redirections can be placed at the start as well. So if you do things like `2>/dev/null >&2` to suppress all output¹, that bit can go at the start or the end. I’ve become increasingly partial to putting redirections at the start of the line, especially in cases where I’m appending to a file for a while, because a leading `>>filename` supports visual alignment better than a trailing.²

And since we’re talking about different ways: as given, this is feeding the SQL to sqlite3 via stdin, but you can also pass it on the command line, after the filename. But things like this in Git (aliases, diff textconv, probably more) work by taking the string and appending the filename to the end, so you need a way of reordering the arguments. The solution is an immediately-invoked function:

  f() { sqlite3 "$@" .dump; }; f

So when you diff mydb.sqlite, it runs `sh -c 'f() { sqlite3 "$@" .dump }; f mydb.sqlite'` or equivalent, which winds up executing `sqlite3 mydb.sqlite .dump`, as desired.

I use this technique a number of times in my Git aliases, saving the bother of putting them in separate shell scripts somewhere where path management is a bother, at the cost of maintaining a one-liner with sometimes too many semicolons.³

—⁂—

¹ “Take stderr (2) and redirect it (>) to /dev/null, then take stdout (default/implicit, could also write 1 explicitly) and redirect it (>) to stderr (&2).” There are plenty of other ways of writing this!

² Lists can be a better solution for this specific case, allowing you to redirect to the file only once for a whole bunch of commands:

  {
    <<<"Line one"
    some-command
    <<<"End of $thing"
  } > filename

³ My longest is thirteen lines, though half of them barely count as lines. The line from my ~/.config/git/config, within [alias]:

  # Revise into the commit that last changed File
  rf = "!f() { if [ $# -eq 0 ]; then REV=\"$(git status --porcelain --untracked-files=no | sed '/^ /d;s/^.. //' | xargs -n1 git rev-list -1 HEAD -- | uniq)\"; NUM_REVS=\"$(echo \"$REV\" | wc -l)\"; if [ $NUM_REVS -ne 1 ]; then >&2 echo Files in the index were not all last modified in the same commit; exit 1; fi; else REV=\"$(git rev-list -1 HEAD -- \"$1\")\"; shift; fi; git revise \"$REV\" \"$@\"; }; f"

smilingemoji2y ago

Seems a bit over the top when you could also just provide it as a command line argument to sqlite3 which is much more obvious than <<< and works in every shell.

  sqlite3 db.sqlite3 .dump

chrismorgan2y ago

Review what’s going on. Git gives the file name as the last argument, so you can’t pass .dump as the last argument like that without the contortion of an immediately-invoked function.

Izkata2y ago

"echo" is typically part of the shell, it doesn't run another executable, so I don't think that would save anything.

chrismorgan2y ago

Shh! Don’t spoil things! Picowatt-hours, I say. Picowatt-hours!

But seriously, although echo is typically a shell built-in and distinctly faster than /usr/bin/echo, it’s still much slower than <<<, presumably because it still has to set up a pipe and an extra… shall we say pseudoprocess.

Comparing behaviours for feeding text into `true` (typically a shell built-in, so that process spawn times doesn’t drown the signal):

  try() {
      echo -e "\e[32;1m$1\e[m"
      for (( run = 0; run < 5; run++ )); do
          time (for (( i = 0; i < 1000; i++ )); do
              $2
          done)
      done
  }

  a() { /usr/bin/echo .dump | true; }; try /usr/bin/echo a
  b() {          echo .dump | true; }; try echo          b
  c() {            <<<.dump   true; }; try '<<<'         c
  d() {                       true; }; try "no piping"   d

My best times of the five runs, under bash/zsh, expressed in time per iteration:

• /usr/bin/echo: 750μs/845μs

• echo: 469μs/371μs

• <<<: 11μs/31μs

• No piping: 3μs/10μs

So… yeah, on a very slightly older or slower machine than mine, using <<< may save you more than half a millisecond. That’s a much bigger difference than I expected—I was expecting it to be well under 200μs, maybe under 100μs, though the more I think about it the more I realise my expectation may have been unreasonable.

2 more replies

6LLvveMx2koXfwn2y ago· 4 in thread

Are people really adding their database dumps to git? Does that scale? Without this trick, what was the point?

upon_drumhead2y ago

Version control? I have a pile of static-ish SQLite databases for unit tests to run against. I plan on using this for when we modify those databases, it’s legible what the change is.

noirscape2y ago

Not scale, no, but git also tends to work pretty well for certain data projects; given git is just a VCS it can track any kind of file (to varying degrees of usefulness - git really doesn't like files that massively reorder themselves when the editor saves them like MSWord or certain FXML generators, it completely mucks up the diffs).

For hobby projects, I've absolutely committed an SQLITE database in the past to serve as an HTTP cache; I was relying on an external API with some pretty hefty usage limits but at the same time was also certain that the data I was requesting from that API would never change, so I could just keep all responses around and reuse them forever (and it was easier to cache the response than to write a bespoke serialization format for converting them). Yeah, it's technically a build artifact (because it's not code) but it's better than getting 429 errors on every new dev machine for an hour + it was just hobbyist code, nothing meant for professional use.

skadamat2y ago

Out of the box no, but you can solve the scaling git problem by improving the deduplication: https://xethub.com/assets/docs/faq/how-xet-deduplication-wor...

We're focused on this problem at XetHub.

- Llama2 git repo with ~600 GB of model files: https://xethub.com/XetHub/Llama2

- RedPajama git repo with 3.4 terabytes of data: https://xethub.com/XetHub/RedPajama-Data-1T

samox2y ago

The entire database, no. But a table with slowly changing dimensions, or a table that "should not" change, git is a simple solution for snapshotting.

Hakkin2y ago· 3 in thread

Alternatively: https://www.sqlite.org/sqldiff.html

klabb32y ago

That takes two files as input, don’t you have to checkout, copy etc manually?

Whereas (IIUC) in the article git runs a text dump on both dbs automatically during git diff, which is an insane UX improvement. (Given that you’re like me and didn’t know about this).

Hakkin2y ago

True, I suppose the utility is slightly different. The trick in the OP article is definitely neat.

rjmunro2y ago

You can probably configure git to use this tool when it diffs sqlite files.

upon_drumhead2y ago· 3 in thread

That’s a really cool trick! I’m going to have to see if I can apply it to other things at work as well. Maybe a image to ascii and then a diff would be useful…

klabb32y ago

Here’s an idea for the AI folks: ask it to explain the differences. Imagine something like like:

“The same screenshot of an iOS app, but the Subscribe button is clear blue and more prominent”

(Although it wouldn’t work directly, since it looks like git runs “text dump” on both independently and then compares the text naively).

bambax2y ago

If the process of image generation / description was fully reversible we could store image descriptions instead of a list of pixels...

But if one feeds an image description from chatGPT to Dall-e and back in a loop, how many steps does it take to revert to pure noise? (surely this has been tried? but I couldn't find it)

1 more reply

TikolaNesla2y ago

""""AI""""

samox2y ago· 2 in thread

Sharing a personal project to display git based CSV :D

https://app.data-drift.io/41231518/samox/local-datadrift-rep...

Not sure how it could be plugged on this stuff, but sharing anyway :D

hluska2y ago

When you hijack someone else’s thread with something unrelated, I find it impossible to trust you enough to get excited about what you did. Smiling faces aside, this is kind of disrespectful to the author.

Whereas with a show HN, I would focus on you and do my best to find something positive to say.

samox2y ago

I felt it was related (database + git is quite a narrow scope), and not concurrent to the post (I did not comment on the DoltgreSQL post for instance). But, I admit I was looking for some personal gratification. Sorry it bothered you.

bambax2y ago· 2 in thread

This is an excellent and very clever trick; however what it does is store the sql database in binary format, and use the dump only for the diff.

Would it not be simpler to only store the text dump each time? (The dump can be used trivially to rebuild the db should the need arise, and it's human readable).

So maybe a pre-commit or pre-staging hook would be even better for this use case?

Cthulhu_2y ago

It kinda depends on performance, a large database may take a while to rebuild depending on indices and the like. But then, the same can be said for the output of this dump command (and diffing the result). It's probably manageable for a gnucash database (1000 - 100000 record order of magnitude if I were to guess), but for anything larger the dump command will take ages.

bambax2y ago

If the database is huge you don't want to store every version as a whole blob in git...

I don't have experience with truly huge dbs in the gigabytes or terabytes, but up to, say, 250 Mb the dump feels very fast.

tomashubelbauer2y ago· 2 in thread

I was also exploring something like this a few years back but for Office files. This exact approach seemed like an absolute win to me, but I ended up not using it, because this won't work in the GitHub web UI. This won't be a deal-breaker to many, but people should be aware of it still. In the end I ended up doing this: https://github.com/TomasHubelbauer/modern-office-git-diff/

doubloon2y ago

does it works with xlsm? (excel macro enabled?)

tomashubelbauer2y ago

I am not sure the internal format of XLSM and how the macros are represented within it, but I would assume they are stored in the form of their source code as a plain text file within the ZIP archive that is the XLSM file and if that's the case, my tool would work with them just fine.

1 more reply

z29LiTp5qUC30n2y ago· 2 in thread

does no one know about _A tables anymore? This has been a solved thing since the 1970s

jsd19822y ago

Can you elaborate? I've been coding since the 90s and have never heard the term _A table before.

z29LiTp5qUC30n2y ago

sometimes called audit or journal tables.

Every time something is updated or deleted, the entire previous record is inserted into its corresponding _A table with who did it and when (and optionally for what transaction number)

so delete from foo results in an insert into foo_A before the delete occurs. so update to bar results in an insert into bar_A before the update occurs and rolling back a delete results in picking the old values out of the _A table and rolling back an update results in picking the old values out of the _A table and inserting the current values into the _A table before the updating the main table.

1 more reply

vasergen2y ago· 2 in thread

is similar approach possible for postgress and MySQL?

simonw2y ago

Not for this technique, because this relies on the fact that SQLite databases are a single file that can be checked into Git.

MySQL and PostgreSQL use a whole directory full of files. You could try storing that whole thing in Git and then tiring a custom diff command that can load those directories into a temporary database sever and dump out SQL for comparison, but it would be very slow and brittle if you could even get it to work at all.

Instead, a better strategy would be to dump your MySQL or PostgreSQL database to plain SQL and store that in your Git repo. Or use the trick in using here for my PostgreSQL database: https://news.ycombinator.com/item?id=38114171

infamia2y ago

> Or use the trick in using here for my PostgreSQL database: https://news.ycombinator.com/item?id=38114171

I wonder how this approach handles converting decimal fields, considering SQLite does not have a decimal field type and only has floats (i.e., you could potentially lose some precision during the conversion)?

1 more reply

danirod2y ago· 1 in thread

I used to do this trick a couple of years ago to diff the credentials file in Rails codebases, which is encrypted, to keep track of changes to passwords and tokens.

Worth to keep in mind that the text representation is virtual and thus the command is run on every operation like a git-diff or a git-show. Therefore, if it takes a while to convert to text, it will also slowdown Git and other tools depending on it.

sneakerblack2y ago

We do the exact same thing to keep track of some credentials we use sops[1] and AWS KMS to separate credentials by sensitivity, then use the git differ to view the diffs between the encrypted secrets

Definitely not best practice security-wise, but it works well

[1] https://github.com/getsops/sops

hiAndrewQuinn2y ago

I had no idea you could do this. Phenomenal!

mrpf1ster2y ago

It would be nice if we finally could have a database that we could use git on to track changes.

Dolt looks very promising in this area but only works with MySQL, would be nice to have a SQLite solution.

https://www.dolthub.com/

sesm2y ago

I wonder if Fossil (SCM used to develop SQLite and written by SQLite devs) has a similar feature built-in

foobarbecue2y ago

The tricky thing about keeping files in git that don't diff well (but can be converted into a representation that does) for me has always been lack of support for filter or hooks in the GitHub web interface. This is a frequent problem for me at work with big xml files.

1 more reply

bspammer2y ago

You can also use this trick to diff encrypted files, if you have an automatic way of decrypting them

emadda2y ago

You can also convert the data to JSON using sqlite_to_json

https://github.com/emadda/transform-x#clis

trashburger2y ago

The really neat part about this is that it still uses the original SQLite files, so it will work retroactively once you add the configuration to your repo. Great article!

smilingemoji2y ago

I love these sort of articles. No new tools are needed here, it shows you how to get the most out of the tools you already know and use. Awesome!

devnonymous2y ago

Ooh, that's a neat trick! The head fake, so to speak is this is a git trick rather than anything special to sqlite. Thanks for sharing.

nedt2y ago

But it's not really a diff no the database itself, just a diff of a full dump that you can use to rebuild the db, but not change an existing one. For example when you do a DELETE the diff does instead have an INSERT less in the dump, which is not exactly a database diff. Depending on the use case that might still be ok.

skadamat2y ago

This is a super neat trick! At work, we're scaling git to terabyte scale and I'm constantly on the search for new ways we can showcase diffs for different file types:

https://news.ycombinator.com/item?id=38112345

kackiel2y ago

Oh wow. This is indeed a very neat use of git config!

Although just storing the text output would probably be nicer? You could probably achieve similar with some git hooks?

setheron2y ago

Is the dump order preserving ? It technically doesn't have to be....

lfconsult2y ago

Awesome...! Thanks for sharing!

0x0732y ago

Is this only visual?

j / k navigate · click thread line to collapse

110 comments

84 comments · 29 top-level

redleader552y ago· 11 in thread

The filter[0] attribute with "smudge" and "clean" would be a better solution for storing SQLite into git.

[0] - https://git-scm.com/docs/gitattributes#_filter

chrismorgan2y ago

(This is for security reasons—you don’t want commands like clone and diff to be able to execute arbitrary code without your knowledge. Hence the attributes/config split.)

redleader552y ago

You are right, there is a bootstrap cost and potential security implications. In general, also, .gitattributes is pretty evil, but:

The correct solution to this is running the "smudge" and "clean" operations as a backup and restore operation, outside the repo, when starting and closing GnuCash, in this case:

```

#!/bin/bash

cat $1.sql > $1.sqlite3

gnucash $1.sqlite3

echo ".dump" | sqlite3 $1.sqlite3 > $1.sql

# optionally:

# git add $1.sql

# git commit -m "Update $(date) $1.sql"

```

1 more reply

JNRowe2y ago

The custom diff driver in the OP will not reflect changes to those values either obviously, but at least it won't lose them in the actual database file stored in the git repository.

¹ https://sqlite.org/pragma.html#pragma_application_id

² https://sqlite.org/pragma.html#pragma_user_version

OskarS2y ago

Is there no capability in SQLite to dump these things as well? Seems like a pretty big thing missing if that's the case, a thing I would expect SQLite to support pretty well.

1 more reply

bambax2y ago

Ah yes, I made a similar observation here https://news.ycombinator.com/item?id=38111408

It's better to store text in git IMHO.

rewmie2y ago

> It's better to store text in git IMHO.

https://git-scm.com/docs/gitattributes

2 more replies

michaelmior2y ago

This is what I was expecting. Store the dump in git but then have the binary in the working directory.

foobarbecue2y ago

Is there a web interface like GitHub that supports this? I wanted to use it with GitHub Enterprise a few years ago and found that it meant I couldn't see diffs properly in GitHub.

massysett2y ago

Why is this better?

sshine2y ago

To minimize the disk space to only the changes, you want to store the changes and not the complete, binary before/after images. By filtering, you reduce the stored object to the difference.

rjmunro2y ago

You won't be storing the sqlite file in git. So you won't be storing indexes or un-vacuumed data in git. You also won't be storing a database with some of it's rows locked or anything like that.

Git will be able to compress the text better, and compress between versions better.

Merging and resolving conflicts would have a chance of working.

simonw2y ago· 9 in thread

This approach works by storing the actual SQLite binary files in Git and then using a custom "diff" configuration to dump each file as SQL and compare the result.

It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.

Here’s an example diff: https://github.com/simonw/simonwillisonblog-backup/commit/72...

chlorion2y ago

I don't think there will be any huge difference between a text dump and binary formats here, git doesn't really care about the underlying data when it stores and packs it.

astrobe_2y ago

Seeing that many people are interested in this topic, I wonder in which cases they would however want to track that file in Git.

1 more reply

infamia2y ago

I'm curious, what is the reason you chose not to use pgdump, but instead opted to convert to to sqlite and then dump the DB using sqlite-diffable?

simonw2y ago

Mainly it's because all of my other projects are SQLite and this felt like a good opportunity to try out some of my own tooling against the main PostgreSQL database I still use for an active project.

scorpioxy2y ago

tlb2y ago

> It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.

Why is that? I would have expected git's delta compression to work well on sqlite files, since small table changes usually cause only a small number of page changes.

dunham2y ago

I would think so too. As a quick test I added chinook.db to a fresh repository, committed it, added a row to a table, committed again, and then did a git gc.

The resulting pack file, with two revisions of the database, is 329k. The raw database is 864k and gzips to 328k.

So it does look like git's delta encoding works with sqlite's blocks.

thisismyswamp2y ago

> storing binary files like that in Git isn't as space efficient as using a plain text format

Why? Because git stores deltas?

At a first glance I'd expect a binary file to be radically more compact than the plain text equivalent.

lubesGordi2y ago

Does this custom diff driver also help git do a merge of two dbs?

michaelmior2y ago· 6 in thread

tom_2y ago

michaelmior2y ago

1 more reply

KolmogorovComp2y ago

The usual workaround is to create a config file, check it and symlink it to .git/config.

chrismorgan2y ago

That’s not a usual workaround. That’s a crippled technique that doesn’t scale beyond single-user, single-branch usage, because .git/config includes things like the remotes and branches.

IshKebab2y ago

That sucks because then you can't have project and user-specific configs (e.g. setting your email address for a particular project).

2 more replies

liveoneggs2y ago

Even better - use an include/includeIf in your config

olvy02y ago· 4 in thread

Repeating myself from several days ago (https://news.ycombinator.com/item?id=37958006):

This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".

I wrote much more about it here:

https://news.ycombinator.com/item?id=25005993

scotty792y ago

Didn't find any. I keep encountering BerkleyDB but I don't know how good it is for that purpose.

bachmeier2y ago

You might want to look at tsv-utils, or a similar project: https://github.com/eBay/tsv-utils

For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)

o11c2y ago

HSQLDB does that, but 1. ew, Java, and 2. there are reasons people don't do that.

If the former is feasible for you (because incremental updates are rare), just use your favorite database's load/dump around every session, and do some fsync testing (or just keep full backups).

If the latter compromise is acceptable there might be something you can do with streaming logical replication on some DBs, but again there's a reason people generally don't.

zffr2y ago

For what use-cases do you need this?

Have you considered Fossil (link below)? it is a SCM tool built on top of sqlite.

I have not used it myself, but I would assume it stores file revisions as diffs, and since its all just sqlite you should be able to run arbitrary queries on it.

https://www2.fossil-scm.org/home/doc/trunk/www/index.wiki

1 more reply

chrismorgan2y ago· 4 in thread

Because it’s fun, here’s an alternative spelling of `echo .dump | sqlite3`:

  <<<.dump sqlite3

This uses two things that shell users are commonly unfamiliar with:

  f() { sqlite3 "$@" .dump; }; f

So when you diff mydb.sqlite, it runs `sh -c 'f() { sqlite3 "$@" .dump }; f mydb.sqlite'` or equivalent, which winds up executing `sqlite3 mydb.sqlite .dump`, as desired.

—⁂—

² Lists can be a better solution for this specific case, allowing you to redirect to the file only once for a whole bunch of commands:

  {
    <<<"Line one"
    some-command
    <<<"End of $thing"
  } > filename

³ My longest is thirteen lines, though half of them barely count as lines. The line from my ~/.config/git/config, within [alias]:

  # Revise into the commit that last changed File
  rf = "!f() { if [ $# -eq 0 ]; then REV=\"$(git status --porcelain --untracked-files=no | sed '/^ /d;s/^.. //' | xargs -n1 git rev-list -1 HEAD -- | uniq)\"; NUM_REVS=\"$(echo \"$REV\" | wc -l)\"; if [ $NUM_REVS -ne 1 ]; then >&2 echo Files in the index were not all last modified in the same commit; exit 1; fi; else REV=\"$(git rev-list -1 HEAD -- \"$1\")\"; shift; fi; git revise \"$REV\" \"$@\"; }; f"

smilingemoji2y ago

Seems a bit over the top when you could also just provide it as a command line argument to sqlite3 which is much more obvious than <<< and works in every shell.

  sqlite3 db.sqlite3 .dump

chrismorgan2y ago

Review what’s going on. Git gives the file name as the last argument, so you can’t pass .dump as the last argument like that without the contortion of an immediately-invoked function.

Izkata2y ago

"echo" is typically part of the shell, it doesn't run another executable, so I don't think that would save anything.

chrismorgan2y ago

Shh! Don’t spoil things! Picowatt-hours, I say. Picowatt-hours!

Comparing behaviours for feeding text into `true` (typically a shell built-in, so that process spawn times doesn’t drown the signal):

  try() {
      echo -e "\e[32;1m$1\e[m"
      for (( run = 0; run < 5; run++ )); do
          time (for (( i = 0; i < 1000; i++ )); do
              $2
          done)
      done
  }

  a() { /usr/bin/echo .dump | true; }; try /usr/bin/echo a
  b() {          echo .dump | true; }; try echo          b
  c() {            <<<.dump   true; }; try '<<<'         c
  d() {                       true; }; try "no piping"   d

My best times of the five runs, under bash/zsh, expressed in time per iteration:

• /usr/bin/echo: 750μs/845μs

• echo: 469μs/371μs

• <<<: 11μs/31μs

• No piping: 3μs/10μs

2 more replies

6LLvveMx2koXfwn2y ago· 4 in thread

Are people really adding their database dumps to git? Does that scale? Without this trick, what was the point?

upon_drumhead2y ago

Version control? I have a pile of static-ish SQLite databases for unit tests to run against. I plan on using this for when we modify those databases, it’s legible what the change is.

noirscape2y ago

skadamat2y ago

Out of the box no, but you can solve the scaling git problem by improving the deduplication: https://xethub.com/assets/docs/faq/how-xet-deduplication-wor...

We're focused on this problem at XetHub.

- Llama2 git repo with ~600 GB of model files: https://xethub.com/XetHub/Llama2

- RedPajama git repo with 3.4 terabytes of data: https://xethub.com/XetHub/RedPajama-Data-1T

samox2y ago

The entire database, no. But a table with slowly changing dimensions, or a table that "should not" change, git is a simple solution for snapshotting.

Hakkin2y ago· 3 in thread

Alternatively: https://www.sqlite.org/sqldiff.html

klabb32y ago

That takes two files as input, don’t you have to checkout, copy etc manually?

Whereas (IIUC) in the article git runs a text dump on both dbs automatically during git diff, which is an insane UX improvement. (Given that you’re like me and didn’t know about this).

Hakkin2y ago

True, I suppose the utility is slightly different. The trick in the OP article is definitely neat.

rjmunro2y ago

You can probably configure git to use this tool when it diffs sqlite files.

upon_drumhead2y ago· 3 in thread

That’s a really cool trick! I’m going to have to see if I can apply it to other things at work as well. Maybe a image to ascii and then a diff would be useful…

klabb32y ago

Here’s an idea for the AI folks: ask it to explain the differences. Imagine something like like:

“The same screenshot of an iOS app, but the Subscribe button is clear blue and more prominent”

(Although it wouldn’t work directly, since it looks like git runs “text dump” on both independently and then compares the text naively).

bambax2y ago

If the process of image generation / description was fully reversible we could store image descriptions instead of a list of pixels...

But if one feeds an image description from chatGPT to Dall-e and back in a loop, how many steps does it take to revert to pure noise? (surely this has been tried? but I couldn't find it)

1 more reply

TikolaNesla2y ago

""""AI""""

samox2y ago· 2 in thread

Sharing a personal project to display git based CSV :D

https://app.data-drift.io/41231518/samox/local-datadrift-rep...

Not sure how it could be plugged on this stuff, but sharing anyway :D

hluska2y ago

Whereas with a show HN, I would focus on you and do my best to find something positive to say.

samox2y ago

bambax2y ago· 2 in thread

This is an excellent and very clever trick; however what it does is store the sql database in binary format, and use the dump only for the diff.

Would it not be simpler to only store the text dump each time? (The dump can be used trivially to rebuild the db should the need arise, and it's human readable).

So maybe a pre-commit or pre-staging hook would be even better for this use case?

Cthulhu_2y ago

bambax2y ago

If the database is huge you don't want to store every version as a whole blob in git...

I don't have experience with truly huge dbs in the gigabytes or terabytes, but up to, say, 250 Mb the dump feels very fast.

tomashubelbauer2y ago· 2 in thread

doubloon2y ago

does it works with xlsm? (excel macro enabled?)

tomashubelbauer2y ago

1 more reply

z29LiTp5qUC30n2y ago· 2 in thread

does no one know about _A tables anymore? This has been a solved thing since the 1970s

jsd19822y ago

Can you elaborate? I've been coding since the 90s and have never heard the term _A table before.

z29LiTp5qUC30n2y ago

sometimes called audit or journal tables.

Every time something is updated or deleted, the entire previous record is inserted into its corresponding _A table with who did it and when (and optionally for what transaction number)

1 more reply

vasergen2y ago· 2 in thread

is similar approach possible for postgress and MySQL?

simonw2y ago

Not for this technique, because this relies on the fact that SQLite databases are a single file that can be checked into Git.

infamia2y ago

> Or use the trick in using here for my PostgreSQL database: https://news.ycombinator.com/item?id=38114171

1 more reply

danirod2y ago· 1 in thread

I used to do this trick a couple of years ago to diff the credentials file in Rails codebases, which is encrypted, to keep track of changes to passwords and tokens.

sneakerblack2y ago

We do the exact same thing to keep track of some credentials we use sops[1] and AWS KMS to separate credentials by sensitivity, then use the git differ to view the diffs between the encrypted secrets

Definitely not best practice security-wise, but it works well

[1] https://github.com/getsops/sops

hiAndrewQuinn2y ago

I had no idea you could do this. Phenomenal!

mrpf1ster2y ago

It would be nice if we finally could have a database that we could use git on to track changes.

Dolt looks very promising in this area but only works with MySQL, would be nice to have a SQLite solution.

https://www.dolthub.com/

sesm2y ago

I wonder if Fossil (SCM used to develop SQLite and written by SQLite devs) has a similar feature built-in

foobarbecue2y ago

1 more reply

bspammer2y ago

You can also use this trick to diff encrypted files, if you have an automatic way of decrypting them

emadda2y ago

You can also convert the data to JSON using sqlite_to_json

https://github.com/emadda/transform-x#clis

trashburger2y ago

The really neat part about this is that it still uses the original SQLite files, so it will work retroactively once you add the configuration to your repo. Great article!

smilingemoji2y ago

I love these sort of articles. No new tools are needed here, it shows you how to get the most out of the tools you already know and use. Awesome!

devnonymous2y ago

Ooh, that's a neat trick! The head fake, so to speak is this is a git trick rather than anything special to sqlite. Thanks for sharing.

nedt2y ago

skadamat2y ago

This is a super neat trick! At work, we're scaling git to terabyte scale and I'm constantly on the search for new ways we can showcase diffs for different file types:

https://news.ycombinator.com/item?id=38112345

kackiel2y ago

Oh wow. This is indeed a very neat use of git config!

Although just storing the text output would probably be nicer? You could probably achieve similar with some git hooks?

setheron2y ago

Is the dump order preserving ? It technically doesn't have to be....

lfconsult2y ago

Awesome...! Thanks for sharing!

0x0732y ago

Is this only visual?

j / k navigate · click thread line to collapse