(This is for security reasons—you don’t want commands like clone and diff to be able to execute arbitrary code without your knowledge. Hence the attributes/config split.)
Otherwise, you’ll find that your .sqlite file is SQL instead of a SQLite database, and if you try running stuff while it’s in that condition you’ll get “file is not a database” errors from SQLite.
1. don't store binary files in Git 2. don't run programs that keep a git file open for too long, otherwise clowntown. For example, you run git pull while GnuCash keeps the file open and then you loose all your updates when you save (also a problem with the code below, but fixable).
The correct solution to this is running the "smudge" and "clean" operations as a backup and restore operation, outside the repo, when starting and closing GnuCash, in this case:
```
#!/bin/bash
cat $1.sql > $1.sqlite3
gnucash $1.sqlite3
echo ".dump" | sqlite3 $1.sqlite3 > $1.sql
# optionally:
# git add $1.sql
# git commit -m "Update $(date) $1.sql"
```
The custom diff driver in the OP will not reflect changes to those values either obviously, but at least it won't lose them in the actual database file stored in the git repository.
It's better to store text in git IMHO.
I think the bias towards text files in Git is only a reflection of Git configuring its defaults to handle text files, such as the diff tool and also file type. If you add a custom file type and configure it to not handle it as text, which includes specifying git attributes to prevent it to update things like newline characters, then Git works just as well.
To minimize the disk space to only the changes, you want to store the changes and not the complete, binary before/after images. By filtering, you reduce the stored object to the difference.
Git will be able to compress the text better, and compress between versions better.
Merging and resolving conflicts would have a chance of working.
It's a neat trick, but storing binary files like that in Git isn't as space efficient as using a plain text format.
I built my own tooling to solve this problem: https://datasette.io/tools/sqlite-diffable - which outputs a “diffable” copy of the data in a SQLite database, precisely so you can store it in Git and look at the differences later.
I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-diffable).
Here’s an example diff: https://github.com/simonw/simonwillisonblog-backup/commit/72...
Maybe a text format that is sorted before saved would compress a lot better though, both with zlib compression and gits delta compression. You can't really sort a binary file and put it back together!
Obviously an SQLite DB cannot be created ex nihilo, so it is not a source file. Isn't it a bit dangerous to store the database file in the repo, and not the mean to generate it? If both are included in the repo, it then seems redundant (reproducible builds concerns, maybe?).
Seeing that many people are interested in this topic, I wonder in which cases they would however want to track that file in Git.
I'm curious, what is the reason you chose not to use pgdump, but instead opted to convert to to sqlite and then dump the DB using sqlite-diffable?
On a project I'm working on, I'd like to dump our Postgres schema into individual files for each object (i.e., one file for each table, function, stored proc, etc.), but haven't spent enough time to see if pgdump could actually do that. We're just outputting files by object type for now (one tables, function, and stored procs files). It looks like sqlite-diffable does something similar to what we're looking for (separate files for each object).
Why is that? I would have expected git's delta compression to work well on sqlite files, since small table changes usually cause only a small number of page changes.
The resulting pack file, with two revisions of the database, is 329k. The raw database is 864k and gzips to 328k.
So it does look like git's delta encoding works with sqlite's blocks.
Why? Because git stores deltas?
At a first glance I'd expect a binary file to be radically more compact than the plain text equivalent.
4 years ago I wrote an internal tool to serialize records of our hierarchical database to a bunch of hierarchical json files, which are then managed in a normal git repo. Each record is serialized to json, "root" entities in the hierarchy are serialized as to contain all their child entities. This is done as the user works - the database is updated and then the correct json file is read, updated and written back. The current SHA is always updated in the database, so there's a way to recognize users worked outside our app with git on the repo, we can detect that and automatically the database based on the diff. There's also an opposite sync.
This was/is a very specific solution to our very specific set of problems. So not applicable to the general problem of "versioning a database".
In fact users now complain that this duality of database <-> repo is confusing and we should ditch the database. Our users are unfortunately not very technical and sometimes require a lot of hand holding.
In theory they're correct but in practice this would mean ditching our ORM and rewriting one from scratch that will work on top of the json files. While keeping the same API outwards. Which is potentially a lot of work. Maybe that will happen, we haven't decided yet.
I wrote much more about it here:
Didn't find any. I keep encountering BerkleyDB but I don't know how good it is for that purpose.
For the SQL part, but maybe a lot heavier, you can use one of the projects listed on this page: https://github.com/multiprocessio/dsq (No longer maintained, but has links to lots of other projects)
The major problem with text formats is that it means either you have to rewrite the whole file on every change ... or else you have to append arbitrarily and be able to survive that (if you're committing to git you probably are capable of dealing with the storage, at least).
If the former is feasible for you (because incremental updates are rare), just use your favorite database's load/dump around every session, and do some fsync testing (or just keep full backups).
If the latter compromise is acceptable there might be something you can do with streaming logical replication on some DBs, but again there's a reason people generally don't.
Have you considered Fossil (link below)? it is a SCM tool built on top of sqlite.
I have not used it myself, but I would assume it stores file revisions as diffs, and since its all just sqlite you should be able to run arbitrary queries on it.
<<<.dump sqlite3
This avoids invoking echo unnecessarily. Every time you do it you might just save a picowatt-hour or two! (It shouldn’t be more expensive. But be sure to spend a few watt-hours benchmarking it rigorously.)This uses two things that shell users are commonly unfamiliar with:
• <<<: as `> filename` redirects stdout to the named file and `< filename` pipes the named file into stdin, `<<< word` pipes the word into stdin. (The whitespace is optional, and word follows normal argument splitting so you can do things like `<<<"Hello, world!"`.)
• Ordering: although most commonly written at the end of commands, redirections can be placed at the start as well. So if you do things like `2>/dev/null >&2` to suppress all output¹, that bit can go at the start or the end. I’ve become increasingly partial to putting redirections at the start of the line, especially in cases where I’m appending to a file for a while, because a leading `>>filename` supports visual alignment better than a trailing.²
And since we’re talking about different ways: as given, this is feeding the SQL to sqlite3 via stdin, but you can also pass it on the command line, after the filename. But things like this in Git (aliases, diff textconv, probably more) work by taking the string and appending the filename to the end, so you need a way of reordering the arguments. The solution is an immediately-invoked function:
f() { sqlite3 "$@" .dump; }; f
So when you diff mydb.sqlite, it runs `sh -c 'f() { sqlite3 "$@" .dump }; f mydb.sqlite'` or equivalent, which winds up executing `sqlite3 mydb.sqlite .dump`, as desired.I use this technique a number of times in my Git aliases, saving the bother of putting them in separate shell scripts somewhere where path management is a bother, at the cost of maintaining a one-liner with sometimes too many semicolons.³
—⁂—
¹ “Take stderr (2) and redirect it (>) to /dev/null, then take stdout (default/implicit, could also write 1 explicitly) and redirect it (>) to stderr (&2).” There are plenty of other ways of writing this!
² Lists can be a better solution for this specific case, allowing you to redirect to the file only once for a whole bunch of commands:
{
<<<"Line one"
some-command
<<<"End of $thing"
} > filename
³ My longest is thirteen lines, though half of them barely count as lines. The line from my ~/.config/git/config, within [alias]: # Revise into the commit that last changed File
rf = "!f() { if [ $# -eq 0 ]; then REV=\"$(git status --porcelain --untracked-files=no | sed '/^ /d;s/^.. //' | xargs -n1 git rev-list -1 HEAD -- | uniq)\"; NUM_REVS=\"$(echo \"$REV\" | wc -l)\"; if [ $NUM_REVS -ne 1 ]; then >&2 echo Files in the index were not all last modified in the same commit; exit 1; fi; else REV=\"$(git rev-list -1 HEAD -- \"$1\")\"; shift; fi; git revise \"$REV\" \"$@\"; }; f" sqlite3 db.sqlite3 .dumpBut seriously, although echo is typically a shell built-in and distinctly faster than /usr/bin/echo, it’s still much slower than <<<, presumably because it still has to set up a pipe and an extra… shall we say pseudoprocess.
Comparing behaviours for feeding text into `true` (typically a shell built-in, so that process spawn times doesn’t drown the signal):
try() {
echo -e "\e[32;1m$1\e[m"
for (( run = 0; run < 5; run++ )); do
time (for (( i = 0; i < 1000; i++ )); do
$2
done)
done
}
a() { /usr/bin/echo .dump | true; }; try /usr/bin/echo a
b() { echo .dump | true; }; try echo b
c() { <<<.dump true; }; try '<<<' c
d() { true; }; try "no piping" d
My best times of the five runs, under bash/zsh, expressed in time per iteration:• /usr/bin/echo: 750μs/845μs
• echo: 469μs/371μs
• <<<: 11μs/31μs
• No piping: 3μs/10μs
So… yeah, on a very slightly older or slower machine than mine, using <<< may save you more than half a millisecond. That’s a much bigger difference than I expected—I was expecting it to be well under 200μs, maybe under 100μs, though the more I think about it the more I realise my expectation may have been unreasonable.
For hobby projects, I've absolutely committed an SQLITE database in the past to serve as an HTTP cache; I was relying on an external API with some pretty hefty usage limits but at the same time was also certain that the data I was requesting from that API would never change, so I could just keep all responses around and reuse them forever (and it was easier to cache the response than to write a bespoke serialization format for converting them). Yeah, it's technically a build artifact (because it's not code) but it's better than getting 429 errors on every new dev machine for an hour + it was just hobbyist code, nothing meant for professional use.
We're focused on this problem at XetHub.
- Llama2 git repo with ~600 GB of model files: https://xethub.com/XetHub/Llama2
- RedPajama git repo with 3.4 terabytes of data: https://xethub.com/XetHub/RedPajama-Data-1T
Whereas (IIUC) in the article git runs a text dump on both dbs automatically during git diff, which is an insane UX improvement. (Given that you’re like me and didn’t know about this).
“The same screenshot of an iOS app, but the Subscribe button is clear blue and more prominent”
(Although it wouldn’t work directly, since it looks like git runs “text dump” on both independently and then compares the text naively).
But if one feeds an image description from chatGPT to Dall-e and back in a loop, how many steps does it take to revert to pure noise? (surely this has been tried? but I couldn't find it)
https://app.data-drift.io/41231518/samox/local-datadrift-rep...
Not sure how it could be plugged on this stuff, but sharing anyway :D
Whereas with a show HN, I would focus on you and do my best to find something positive to say.
Would it not be simpler to only store the text dump each time? (The dump can be used trivially to rebuild the db should the need arise, and it's human readable).
So maybe a pre-commit or pre-staging hook would be even better for this use case?
I don't have experience with truly huge dbs in the gigabytes or terabytes, but up to, say, 250 Mb the dump feels very fast.
Every time something is updated or deleted, the entire previous record is inserted into its corresponding _A table with who did it and when (and optionally for what transaction number)
so delete from foo results in an insert into foo_A before the delete occurs. so update to bar results in an insert into bar_A before the update occurs and rolling back a delete results in picking the old values out of the _A table and rolling back an update results in picking the old values out of the _A table and inserting the current values into the _A table before the updating the main table.
MySQL and PostgreSQL use a whole directory full of files. You could try storing that whole thing in Git and then tiring a custom diff command that can load those directories into a temporary database sever and dump out SQL for comparison, but it would be very slow and brittle if you could even get it to work at all.
Instead, a better strategy would be to dump your MySQL or PostgreSQL database to plain SQL and store that in your Git repo. Or use the trick in using here for my PostgreSQL database: https://news.ycombinator.com/item?id=38114171
I wonder how this approach handles converting decimal fields, considering SQLite does not have a decimal field type and only has floats (i.e., you could potentially lose some precision during the conversion)?
Worth to keep in mind that the text representation is virtual and thus the command is run on every operation like a git-diff or a git-show. Therefore, if it takes a while to convert to text, it will also slowdown Git and other tools depending on it.
Definitely not best practice security-wise, but it works well
Dolt looks very promising in this area but only works with MySQL, would be nice to have a SQLite solution.
Although just storing the text output would probably be nicer? You could probably achieve similar with some git hooks?