Show HN: Reduce SQLite database size by up to 80% with transparent compression (opens in new tab)

(github.com)

420 pointsphiresky3y ago90 comments

90 comments

70 comments · 17 top-level

avereveard3y ago· 16 in thread

I don't want to diss on the effort, I have an honest question: wouldn't this something that would be more appropriate to be done at the filesystem level?

marginalia_nu3y ago

Context-aware compression can be a lot of faster than blind filesystem level compression.

Some data, like plain text, JSON, or XML, compresses extremely well. You're almost certainly looking at compression rates of up to 90%. Dense binary formats, especially formats that are already compressed like most image formats won't compress barely at all (sometimes they can even get bigger), and attempting to compress them is an enormous waste of processing power.

zimpenfish3y ago

Tried this on my twitter database's `json` column (ie the JSON response from the API fetching a singular tweet) with a `dict_chooser` of '%Y-%W' to bundle weeks together and it took 985k rows down from 6.6G to 1.5G. Not as good as compressing the whole thing obviously but really handy for a queryable backup.

cmurf3y ago

At least in btrfs there's compress vs compress-force mount options. compress uses a cheap estimator of compressibility, whereas compress-force always passes data through the compression algorithm.

ZSTD has its own estimator to avoid doing unnecessary work, but I haven't found a comparison of these estimators (which is cheaper).

compress results in max extent size of 128K for compressed extents, and 128M for uncompressed data. Whereas compress-force results in 512K max extent size for uncompressed data. While there's a cost to tracking more extents, it might be less than the cost of frequent modification of large extents, particularly if snapshots are used. Of course there's no one answer, it'd be workload and hardware dependant. Hard drives have significant latency impact from non-contiguous extents. NVMe much less impact.

1 more reply

JulianWasTaken3y ago

Continuing the hypothetical without really knowing about today's filesystems, such a system could presumably calculate a moving per page, fd or process- compression factor and turn it down or up if it started seeing lots of incompressible data?

1 more reply

phireskyOP3y ago

It would be interesting to try something like this on the file system level, yes.

Basically you could heuristically find interesting files, then train a compression dictionary for e.g. each file or each directory. Then you'd compress 32kB blocks of each file with the corresponding dictionary. Note that this would be pretty different from how existing file system compression works (by splitting everything into blocks of e.g. 128kB and just compressing those normally).

I think it would be hard to find heuristics that work well for this method. The result would be pretty different than what I'm doing here with different tradeoffs. sqlite-zstd is able to use application-level information of which data is similar and will compress well together. You couldn't do that within the filesystem. It also works "in cooperation" with the file format of the database instead of against it - e.g. you probably wouldn't want to compress the inner B-tree nodes.

On the other hand, it would then work for any file format not just SQLite databases. E.g. a huge folder of small json files

Someone3y ago

No. If you did that, SQLite would have to (logically) decompress about your entire database on every query. For example, how would it read the database schema to determine whether “select foo from bar” is valid sql for the database? If it figures that out, how would it figure out there’s an appropriate index to use? How would it read that index?

Databases do a lot of seeking into database files and typically only read a small part of them. That’s not a scenario that’s a good fit with whole-file compression.

Compressing individual pages is a bit better, but I think this can still be better.

Your file system probably would cache the decompressed database file, but that would use memory, lots of it if your database is large.

formerly_proven3y ago

FS-level compression is usually limited to compressing relatively small blocks (think ~dozens of sectors), they don't normally compress a large file as a single solid block.

1 more reply

fnord1233y ago

Great idea. zfs can compress data on disk. But it's also useful to keep the data compressed in memory so you can fit more data in memory. There's a lot of cpu/memory tradeoffs so it's worth having the options to do it on the data.

It's also worth noting that compression on the application layer works way better on column stores than row stores.

ciupicri3y ago

> Btrfs also supports transparent file compression. There are three algorithms available: ZLIB, LZO and ZSTD (since v4.14). Basically, compression is on a file by file basis. You can have a single btrfs mount point that has some files that are uncompressed, some that are compressed with LZO, some with ZLIB, for instance (though you may not want it that way, it is supported).

https://btrfs.wiki.kernel.org/index.php/Compression

gred3y ago

> wouldn't this something that would be more appropriate to be done at the filesystem level?

Sometimes? You've gotten a ton of replies digging into the "why" and "when", and I only wanted to add that your question brought to mind this quote from the official SQLite website:

> "Think of SQLite not as a replacement for Oracle but as a replacement for fopen()"

https://www.sqlite.org/about.html

hexmiles3y ago

It depends on the data and application,

I use a similar approach for compressing data that go in a couple of database/store. Having control on when and how to compress data can significantly improve performance, also, it sometimes makes sense to keep compress data even in memory since it can allow you to make better use of it.

Also keep in mind that often you can't control the filesystem.

One use case for which I am working right now is that I have is storing highly repetitive data on a very slow flash storage, (that technically support compression, but it is slow and with not great compression ratio), I can significantly improve performance.

edit: grammar

xdfgh11123y ago

Yeah but your app doesn't have control over the filesystem. Usually an sqlite db is bundled with an application or created by it.

That said, it looks like you could get the same results by doing client side compression and decompression of data yourself, and storing compressed blobs in sqlite. It only compresses text column data, not entire pages.

Gehinnn3y ago

But it uses a shared zstd dictionary among rows!

The details are described here: https://phiresky.github.io/blog/2022/sqlite-zstd/

d1l3y ago

https://www.sqlite.org/vfs.html

antman3y ago

With this you can selectively compress per column

ajconway3y ago

This kind of solution should work fine event if the database is encrypted.

d1l3y ago· 9 in thread

I am happy for the author, but want to warn the credulous that this is a very strange way to approach compression with SQLite. The much more common way is to use SQLite vfs:

https://www.sqlite.org/vfs.html

Briefly search for SQLite compressed vfs and you will find many extensions. Some are quite mature.

I get the impression that the author doesn't know about the vfs, as it's not even mentioned in the readme. The clickbait title similarly seems amateurish. Caveat emptor.

phireskyOP3y ago

I did actually previously write a VFS, though it did something else entirely:

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

You're right that a comparison to compression VFS is completely missing. I knew about their existence but I have to admit I didn't look too closely.

Note that the main "novelty" here is training a shared / partitioned compression dictionary specifically to take advantage of redundancy that wouldn't appear within only one row or even within a database page / block . The compression happens at the row level and can additionally use application knowledge - you couldn't do that in the VFS level. For example, you can have separate compression dictionaries per different columns and per groups of rows with some commonality.

I'll have to compare to a compression vfs (do you have a favorite?) and see if maybe these two methods can even be combined.

Edit: I see that https://github.com/mlin/sqlite_zstd_vfs does actually also train dictionaries. It's still at the database-page level so can't take application knowledge into account or compress only parts of the data, but that's still pretty smart then.

kevingadd3y ago

How does this compare with modern compression algorithms like Brotli that do context modeling, etc? I've found that they manage to aggressively compress types of data you wouldn't expect to compress well, to the point that investing energy into doing compression yourself doesn't provide big returns anymore. The downside is that codecs like Brotli tend to compress very slowly, but I can imagine being able to do a setup where you only compress old rows so it would be cool to see an experiment with just compressing rows or columns and comparing the sizes with your method.

1 more reply

skerit3y ago

Let's not water down the meaning of the word "clickbait". I have no problem with this title.

skrebbel3y ago

> amateurish

There's no need to be this dismissive.

867-53093y ago

welcome to HN

1 more reply

Dowwie3y ago

To call it strange or unusual is fine but it isn't a compelling argument why not to use it. Is it better or worse than a vfs approach and why?

kevingadd3y ago

Taking an amateurish or fancy or experimental approach to compressing a sqlite database is a great way to end up with data loss. Doesn't make it a worthless or "bad" project but it's worthy of a warning.

d1l3y ago

Why not read the sqlite teams own explanation:

https://www.sqlite.org/zipvfs/doc/trunk/www/howitworks.wiki

zimpenfish3y ago

> Briefly search for SQLite compressed vfs and you will find many extensions.

I found 3 with a brief search - CEVFS, sqlite_zstd_vfs, and ZIPVFS (not free, therefore discounting.)

Neither CEVFS nor sqlite_zstd_vfs support WAL mode (which means, e.g., no litestream replication).

Can you recommend a mature one that does support WAL mode?

wmanley3y ago· 8 in thread

This is really neat. I like how the compression can be applied to specific columns rather than whole rows - or whole pages. I could imagine storing JSON from an HTTP request here and using Generated Columns[1] to extract the bits my application is interested in.

I also like how simple and powerful the custom partitioning is.

[1]: https://www.sqlite.org/gencol.html

klysm3y ago

I’ve been following that pattern in PostgreSQL and love it. It cleanly decouples the ingest and keeps the stack simple.

masklinn3y ago

Do you mean gencols or compression?

1 more reply

KingOfCoders3y ago

Can you elaborate?

2 more replies

wmanley3y ago

On that subject I've often thought of writing a combination decompressor/JSON parser that can parse the contents of the dictionary before decompression for faster JSON parsing and lower memory usage.

aabbcc12413y ago

Maybe the compression algorithm used in https://github.com/beenotung/compress-json can inspire you.

1 more reply

gcr3y ago

a lot of document key/value databases that store arbitrary JSON tend to work this way (for example Cassandra does this, IIRC). some will even automatically build indices for you (though hand-tuning is usually better)

gcr3y ago

wait, are you saying that the columns of view tables could be compressed? that doesn't make sense to me cause views don't use extra space to begin with....?

storing raw JSON in sqlite means the entire blob would be compressed and live in a single column, right?

wmanley3y ago

You'd compress the JSON, and leave the data that you extract from the JSON uncompressed using Generated Columns[1]. That way you can save the original JSON data, but still have fast querying, etc.

It would look something like this:

    CREATE TABLE t1(
        json TEXT,
        data INT GENERATED ALWAYS AS (json_extract(body, '$.d')) STORED
    );

If all you need is fast querying you could make `data` a `VIRTUAL` column and set up an index on it.

You can read more about this kind of technique here: https://dgl.cx/2020/06/sqlite-json-support

[1]: https://www.sqlite.org/gencol.html

aabbcc12413y ago· 4 in thread

I'm using sqlite on zfs and btrfs, with transparent compression supported by the file system, the query speed is very good and compression rate also very good. (database below 3GB uncompressed, 22% after compressed)

cmurf3y ago

Did you enable WAL? Any casual comparison in performance between WAL and no-WAL?

aabbcc12413y ago

I enable WAL. It performs better than the default (which is to keep a rollback journal)

usgroup3y ago

What do you think about zfs vs btrfs for your use case? Why do you use both?

aabbcc12413y ago

I used zfs because I heard it first. But I realized zfs partition cannot shrink in size, one have to copy it to another smaller partition if they have enough spare volume.

So I experiment with btrfs on another device and both are working well after some daily usage.

Joel_Mckay3y ago· 4 in thread

sudo apt-get install archivemount

archivemount mydbimage.tgz /mnt/mydbimage -o big_writes

// -o readonly disable write support

// -o nosave do not save changes upon unmount.

//open/cache /mnt/mydbimage/mydbimage.db

//flush

fusermount -u /mnt/mydbimage

//No LGPL/Rust/cargo dependency issues...

//as SQLite's best use-case is embedding ;)

xyzzy_plugh3y ago

I understand why you suggest this, even though archivemount is LGPL, but I digress.

FUSE is cool and all but given it's "userspace" and we're just reading a single file here, I don't find this compelling versus reading from a block device in-process.

TFA is compelling because it's compression over specific data sets without being in the hot path for every operation. Putting a sqlite.db in a tarball is... problematic. There is no random access, so if your database doesn't fit in memory/cache, performance will be abysmal.

Joel_Mckay3y ago

LGPL is not great for porting, and linux block caching is far from the performance bottleneck if you know what you are doing.

I think we can agree it depends on the use-case. =)

marginalia_nu3y ago

Eh, tar is probably not an amazing format for this, since it's designed for tape drives, and has ridiculously slow random access.

Joel_Mckay3y ago

archivemount supports numerous formats as it is using libarchive.

My point was it doesn't contaminate the parent projects license.

quietbritishjim3y ago· 3 in thread

One option not listed (under "Other options for compressing data in a database") but probably closest to what this solution does under the hood, is to do row-interdependent compression at application level.

That is, while saving rows to the database, pick out every (say) 10th row, and compress it by itself but keep the compression state in memory. Then for the next 9 rows, compress that row based on that dictionary, and a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compress and decompress.) A bit like keyframes in video compression.

You can potentially get better results this way than a generic solution like the article, because you might have some application knowledge about what rows are likely to be similar to others. But it obviously requires application level code (e.g. fetching a logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you need to remove or update the rows.

[1] https://docs.python.org/3/library/zlib.html#zlib.compressobj

phireskyOP3y ago

> You can potentially get better results this way than a generic solution like the article because you might have some application knowledge about what rows are likely to be similar to others

The linked project actually does exactly this - you can basically specify an SQL expression (such as strftime(date, '%Y-%m')) that is then used to group rows together in order to decide what to train dictionaries on. What you're describing as picking every 10th row would be the expression (id/10). So you can use application knowledge, you just don't have to.

The difference is that in my project the dictionary is stored separately so not affected by row deletions, and the trained dictionary is based on all samples not just one of them. Since the dictionary is stored separately it's only worth it if you combine at least maybe 100-1000 rows under one dictionary.

> Python's zlib module, you'd use the zdict parameter

Yes, the training feature of zstd is basically a better version of keeping another sample of data around and using it as a prefix. In fact you can use a data sample instead of a trained dictionary as the `-D` parameter of the zstd command line, zstd just has the additional feature of reducing a set of samples down to the most essential (data-saving) parts.

quietbritishjim3y ago

Thanks for making this project, it looks fantastic! I might end up using it if I can figure out how to from Python.

I did mention that this is what you're doing, I just thought it wasn't clear from the blog post. But I did miss that you can use an SQL expression to choose the grouping. This is ideal for the application I'm thinking of, which a "message type" field that corresponds pretty closely with which rows are similar.

1 more reply

goldenkey3y ago

Your explanation is great. I like the key frame codec analogy. Like you said though, this really belongs at the DB layer. An application level solution would be much more than a nightmare.

Plugin/extension/module development is severely underused in our field. Extensibility, if possible, is often the best way to handle edge cases or increments in a dependency, without forking it.

See "Benefits of Being an Extension to PostgreSQL": https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...

Some great software that is extensible in brilliant ways through plugins, that comes to mind, is:

postgres: https://www.postgresql.org/docs/current/external-extensions....

nginx: https://www.nginx.com/resources/wiki/modules/

sqlite: https://www.sqlite.org/loadext.html

redis: https://redis.io/docs/modules/

babel: https://babeljs.io/docs/en/plugins/

josephernest3y ago· 3 in thread

Let's say I have a web visits analytics log database, with a column "user-agent" and a column "url" that have most of the time the same values (99.9% of rows have one of 100 typical values).

Would this work well with compression here?

ie. from 50 bytes on average for these columns... to a few bytes only?

Scaevolus3y ago

Yes, if you use the "dictionary" option it will compute build a buffer of common strings and use that during compression, so your user-agent columns will largely become references to the dictionary. Make sure you set compact to true! See this for more information: https://facebook.github.io/zstd/#small-data

josephernest3y ago

Will this mode require a separate file along the data.db Sqlite file, to store the compression dictionary?

From your last link:

> The result of this training is stored in a file called "dictionary", which must be loaded before compression and decompression.

Also, what happens if a column has been compressed based on 100 typical values, and then later we insert thousands of new rows with 500 new frequent values. Does the compression dictionary get automatically updated? Then do old rows need to be rewritten?

PS what is compact mode?

1 more reply

hnarn3y ago

Probably yes, but presumably you could achieve this by compressing the data on the file system level as well (for example with ZFS).

cryptonector3y ago· 2 in thread

The image near the top of the readme implies that compressing the whole file means not having random access, which is indeed true, but it also implies that there's nothing between whole file compression and row/column level compression, and that is not true. You could compress pages, kinda like ZFS compresses blocks, and still maintain random access, and get pretty close to whole-file compression.

dataangel3y ago

That requires changing your filesystem to a compressing one though, and you lose the compression when you copy the file to another filesystem. It does have the advantage that the page cache could have the uncompressed copy though.

cryptonector3y ago

No, the database itself can compress in the same way as ZFS. Now, don't layer compression on compression, otherwise it should work.

yababa_y3y ago· 2 in thread

This makes me wonder how sqlite augmented with blosc or some similar faster-than-memory compressor might behave. Size and speed wins?

alecco3y ago

AFAIK Blosc is just a container. Not a good fit for row-level compression.

yababa_y3y ago

the vfs isn’t row level, though!

mansoor_3y ago· 1 in thread

What does "transparent" mean in this context?

phireskyOP3y ago

Transparent to the application - basically the SQL queries (select, insert, update) in your code base don't have to change at all.

This would be in contrast to e.g. compressing the data in your code before inserting it into the database, then decompressing after you load it.

logicallee3y ago· 1 in thread

This mentions something I've been wondering about:

>Depending on the data, this can reduce the size of the database by 80% while keeping performance mostly the same (or even improving it, since the data to be read from disk is smaller).

"Even improving it" is a bold claim!

Modern CPU's are often hungry for data, aren't they? While being incredibly fast and mostly idle?

Is reading a smaller compressed file and decompressing it in the CPU's hyperfast cache before use faster than reading it uncompressed?

Maybe you can speed up an entire server, even one with lots of free space and no need for compression, by adding full disk compression anyway!

Unless the CPU's are at 100% utilization this could always be used. Since the CPU will then use the data, it might be fast enough to uncompress it at home faster than the bus can keep dropping it off at file read speeds.

This chart I found that goes up to 2000 suggests cache access had an increasing performance gap with even RAM bandwidth, to say nothing of SSD's[1]

It is from here[2]

(However, compressing a full disk can only help if there are extra CPU cycles.

For activity spikes when the CPU doesn't have extra cycles or the cache for it, both compressed and uncompressed versions could be available, at most doubling disk usage, so that the uncompressed file could be read instead.

This can ensure the CPU isn't burdened with a decompression task, since this only helps if the fastest caches are used for it and they might have better things to do than uncompress files.

However, whenever the CPU has idle cores and lots of cache, which is most of the time, it could request the compressed version. If it is busy it could ask for the original version which saves the decompression work.)

Can this help solve the problem of waiting for the disk?

[1] http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DR...

[2] https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu...

outworlder3y ago

> Maybe you can speed up an entire server, even one with lots of free space and no need for compression, by adding full disk compression anyway!

Yes. Although not all filesystems support compression. One of the filesystems that does support and this is regularly benchmarked, is ZFS. It is also interesting that while you are definitely using more CPU to read a compressed stream, if that causes you to finish your task earlier, then it can be a net gain even from a CPU usage standpoint.

eg.

https://www.servethehome.com/the-case-for-using-zfs-compress...

leobg3y ago

This looks great. How would I apply this retroactively? I.e. tell SQLite to compress column x in an existing database, say, of type TEXT?

andrewmcwatters3y ago

Oh, interesting. I scrape GitHub's APIs to generate user and repositories statistics data, then compress it with ZStandard to allow the databases to fit within GitHub's file size restrictions.[1] So, this is very relevant to me.

It's pleasantly surprising how much data you can fit within a ZStandard-compressed SQLite database. The two technologies fit very well together.

[1]: https://github.com/andrewmcwattersandco/github-statistics

malkia3y ago

SQLite allows you to override the low-level paging routines, wonder if it won't work by just having a different pager (though maybe not that efficient compression, unless the pages are big (64kb)).

AtNightWeCode3y ago

LZ77+ ftw. I am a heavy user of Zstd. I think Brotli might be a better choice here. Or at least for my use-cases.

password43213y ago

Awesome!

Hopefully this will encourage SQLite to support compression natively in some fashion.

hda1113y ago

I love the idea. Can I use it with Django projects somehow?

j / k navigate · click thread line to collapse

90 comments

70 comments · 17 top-level

avereveard3y ago· 16 in thread

I don't want to diss on the effort, I have an honest question: wouldn't this something that would be more appropriate to be done at the filesystem level?

marginalia_nu3y ago

Context-aware compression can be a lot of faster than blind filesystem level compression.

zimpenfish3y ago

cmurf3y ago

At least in btrfs there's compress vs compress-force mount options. compress uses a cheap estimator of compressibility, whereas compress-force always passes data through the compression algorithm.

ZSTD has its own estimator to avoid doing unnecessary work, but I haven't found a comparison of these estimators (which is cheaper).

1 more reply

JulianWasTaken3y ago

1 more reply

phireskyOP3y ago

It would be interesting to try something like this on the file system level, yes.

On the other hand, it would then work for any file format not just SQLite databases. E.g. a huge folder of small json files

Someone3y ago

Databases do a lot of seeking into database files and typically only read a small part of them. That’s not a scenario that’s a good fit with whole-file compression.

Compressing individual pages is a bit better, but I think this can still be better.

Your file system probably would cache the decompressed database file, but that would use memory, lots of it if your database is large.

formerly_proven3y ago

FS-level compression is usually limited to compressing relatively small blocks (think ~dozens of sectors), they don't normally compress a large file as a single solid block.

1 more reply

fnord1233y ago

It's also worth noting that compression on the application layer works way better on column stores than row stores.

ciupicri3y ago

https://btrfs.wiki.kernel.org/index.php/Compression

gred3y ago

> wouldn't this something that would be more appropriate to be done at the filesystem level?

Sometimes? You've gotten a ton of replies digging into the "why" and "when", and I only wanted to add that your question brought to mind this quote from the official SQLite website:

> "Think of SQLite not as a replacement for Oracle but as a replacement for fopen()"

https://www.sqlite.org/about.html

hexmiles3y ago

It depends on the data and application,

Also keep in mind that often you can't control the filesystem.

edit: grammar

xdfgh11123y ago

Yeah but your app doesn't have control over the filesystem. Usually an sqlite db is bundled with an application or created by it.

Gehinnn3y ago

But it uses a shared zstd dictionary among rows!

The details are described here: https://phiresky.github.io/blog/2022/sqlite-zstd/

d1l3y ago

https://www.sqlite.org/vfs.html

antman3y ago

With this you can selectively compress per column

ajconway3y ago

This kind of solution should work fine event if the database is encrypted.

d1l3y ago· 9 in thread

I am happy for the author, but want to warn the credulous that this is a very strange way to approach compression with SQLite. The much more common way is to use SQLite vfs:

https://www.sqlite.org/vfs.html

Briefly search for SQLite compressed vfs and you will find many extensions. Some are quite mature.

I get the impression that the author doesn't know about the vfs, as it's not even mentioned in the readme. The clickbait title similarly seems amateurish. Caveat emptor.

phireskyOP3y ago

I did actually previously write a VFS, though it did something else entirely:

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

You're right that a comparison to compression VFS is completely missing. I knew about their existence but I have to admit I didn't look too closely.

I'll have to compare to a compression vfs (do you have a favorite?) and see if maybe these two methods can even be combined.

kevingadd3y ago

1 more reply

skerit3y ago

Let's not water down the meaning of the word "clickbait". I have no problem with this title.

skrebbel3y ago

> amateurish

There's no need to be this dismissive.

867-53093y ago

welcome to HN

1 more reply

Dowwie3y ago

To call it strange or unusual is fine but it isn't a compelling argument why not to use it. Is it better or worse than a vfs approach and why?

kevingadd3y ago

d1l3y ago

Why not read the sqlite teams own explanation:

https://www.sqlite.org/zipvfs/doc/trunk/www/howitworks.wiki

zimpenfish3y ago

> Briefly search for SQLite compressed vfs and you will find many extensions.

I found 3 with a brief search - CEVFS, sqlite_zstd_vfs, and ZIPVFS (not free, therefore discounting.)

Neither CEVFS nor sqlite_zstd_vfs support WAL mode (which means, e.g., no litestream replication).

Can you recommend a mature one that does support WAL mode?

wmanley3y ago· 8 in thread

I also like how simple and powerful the custom partitioning is.

[1]: https://www.sqlite.org/gencol.html

klysm3y ago

I’ve been following that pattern in PostgreSQL and love it. It cleanly decouples the ingest and keeps the stack simple.

masklinn3y ago

Do you mean gencols or compression?

1 more reply

KingOfCoders3y ago

Can you elaborate?

2 more replies

wmanley3y ago

On that subject I've often thought of writing a combination decompressor/JSON parser that can parse the contents of the dictionary before decompression for faster JSON parsing and lower memory usage.

aabbcc12413y ago

Maybe the compression algorithm used in https://github.com/beenotung/compress-json can inspire you.

1 more reply

gcr3y ago

wait, are you saying that the columns of view tables could be compressed? that doesn't make sense to me cause views don't use extra space to begin with....?

storing raw JSON in sqlite means the entire blob would be compressed and live in a single column, right?

wmanley3y ago

You'd compress the JSON, and leave the data that you extract from the JSON uncompressed using Generated Columns[1]. That way you can save the original JSON data, but still have fast querying, etc.

It would look something like this:

    CREATE TABLE t1(
        json TEXT,
        data INT GENERATED ALWAYS AS (json_extract(body, '$.d')) STORED
    );

If all you need is fast querying you could make `data` a `VIRTUAL` column and set up an index on it.

You can read more about this kind of technique here: https://dgl.cx/2020/06/sqlite-json-support

[1]: https://www.sqlite.org/gencol.html

aabbcc12413y ago· 4 in thread

cmurf3y ago

Did you enable WAL? Any casual comparison in performance between WAL and no-WAL?

aabbcc12413y ago

I enable WAL. It performs better than the default (which is to keep a rollback journal)

usgroup3y ago

What do you think about zfs vs btrfs for your use case? Why do you use both?

aabbcc12413y ago

I used zfs because I heard it first. But I realized zfs partition cannot shrink in size, one have to copy it to another smaller partition if they have enough spare volume.

So I experiment with btrfs on another device and both are working well after some daily usage.

Joel_Mckay3y ago· 4 in thread

sudo apt-get install archivemount

archivemount mydbimage.tgz /mnt/mydbimage -o big_writes

// -o readonly disable write support

// -o nosave do not save changes upon unmount.

//open/cache /mnt/mydbimage/mydbimage.db

//flush

fusermount -u /mnt/mydbimage

//No LGPL/Rust/cargo dependency issues...

//as SQLite's best use-case is embedding ;)

xyzzy_plugh3y ago

I understand why you suggest this, even though archivemount is LGPL, but I digress.

FUSE is cool and all but given it's "userspace" and we're just reading a single file here, I don't find this compelling versus reading from a block device in-process.

Joel_Mckay3y ago

LGPL is not great for porting, and linux block caching is far from the performance bottleneck if you know what you are doing.

I think we can agree it depends on the use-case. =)

marginalia_nu3y ago

Eh, tar is probably not an amazing format for this, since it's designed for tape drives, and has ridiculously slow random access.

Joel_Mckay3y ago

archivemount supports numerous formats as it is using libarchive.

My point was it doesn't contaminate the parent projects license.

quietbritishjim3y ago· 3 in thread

[1] https://docs.python.org/3/library/zlib.html#zlib.compressobj

phireskyOP3y ago

> You can potentially get better results this way than a generic solution like the article because you might have some application knowledge about what rows are likely to be similar to others

> Python's zlib module, you'd use the zdict parameter

quietbritishjim3y ago

Thanks for making this project, it looks fantastic! I might end up using it if I can figure out how to from Python.

1 more reply

goldenkey3y ago

Your explanation is great. I like the key frame codec analogy. Like you said though, this really belongs at the DB layer. An application level solution would be much more than a nightmare.

Plugin/extension/module development is severely underused in our field. Extensibility, if possible, is often the best way to handle edge cases or increments in a dependency, without forking it.

See "Benefits of Being an Extension to PostgreSQL": https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...

Some great software that is extensible in brilliant ways through plugins, that comes to mind, is:

postgres: https://www.postgresql.org/docs/current/external-extensions....

nginx: https://www.nginx.com/resources/wiki/modules/

sqlite: https://www.sqlite.org/loadext.html

redis: https://redis.io/docs/modules/

babel: https://babeljs.io/docs/en/plugins/

josephernest3y ago· 3 in thread

Let's say I have a web visits analytics log database, with a column "user-agent" and a column "url" that have most of the time the same values (99.9% of rows have one of 100 typical values).

Would this work well with compression here?

ie. from 50 bytes on average for these columns... to a few bytes only?

Scaevolus3y ago

josephernest3y ago

Will this mode require a separate file along the data.db Sqlite file, to store the compression dictionary?

From your last link:

> The result of this training is stored in a file called "dictionary", which must be loaded before compression and decompression.

PS what is compact mode?

1 more reply

hnarn3y ago

Probably yes, but presumably you could achieve this by compressing the data on the file system level as well (for example with ZFS).

cryptonector3y ago· 2 in thread

dataangel3y ago

cryptonector3y ago

No, the database itself can compress in the same way as ZFS. Now, don't layer compression on compression, otherwise it should work.

yababa_y3y ago· 2 in thread

This makes me wonder how sqlite augmented with blosc or some similar faster-than-memory compressor might behave. Size and speed wins?

alecco3y ago

AFAIK Blosc is just a container. Not a good fit for row-level compression.

yababa_y3y ago

the vfs isn’t row level, though!

mansoor_3y ago· 1 in thread

What does "transparent" mean in this context?

phireskyOP3y ago

Transparent to the application - basically the SQL queries (select, insert, update) in your code base don't have to change at all.

This would be in contrast to e.g. compressing the data in your code before inserting it into the database, then decompressing after you load it.

logicallee3y ago· 1 in thread

This mentions something I've been wondering about:

>Depending on the data, this can reduce the size of the database by 80% while keeping performance mostly the same (or even improving it, since the data to be read from disk is smaller).

"Even improving it" is a bold claim!

Modern CPU's are often hungry for data, aren't they? While being incredibly fast and mostly idle?

Is reading a smaller compressed file and decompressing it in the CPU's hyperfast cache before use faster than reading it uncompressed?

Maybe you can speed up an entire server, even one with lots of free space and no need for compression, by adding full disk compression anyway!

This chart I found that goes up to 2000 suggests cache access had an increasing performance gap with even RAM bandwidth, to say nothing of SSD's[1]

It is from here[2]

(However, compressing a full disk can only help if there are extra CPU cycles.

This can ensure the CPU isn't burdened with a decompression task, since this only helps if the fastest caches are used for it and they might have better things to do than uncompress files.

Can this help solve the problem of waiting for the disk?

[1] http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DR...

[2] https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu...

outworlder3y ago

> Maybe you can speed up an entire server, even one with lots of free space and no need for compression, by adding full disk compression anyway!

eg.

https://www.servethehome.com/the-case-for-using-zfs-compress...

leobg3y ago

This looks great. How would I apply this retroactively? I.e. tell SQLite to compress column x in an existing database, say, of type TEXT?

andrewmcwatters3y ago

It's pleasantly surprising how much data you can fit within a ZStandard-compressed SQLite database. The two technologies fit very well together.

[1]: https://github.com/andrewmcwattersandco/github-statistics

malkia3y ago

SQLite allows you to override the low-level paging routines, wonder if it won't work by just having a different pager (though maybe not that efficient compression, unless the pages are big (64kb)).

AtNightWeCode3y ago

LZ77+ ftw. I am a heavy user of Zstd. I think Brotli might be a better choice here. Or at least for my use-cases.

password43213y ago

Awesome!

Hopefully this will encourage SQLite to support compression natively in some fashion.

hda1113y ago

I love the idea. Can I use it with Django projects somehow?

j / k navigate · click thread line to collapse