CRFS: Container Registry Filesystem

(github.com)

104 points · helper · 7y ago · 21 comments

18 comments · 8 top-level

koolba · 7y ago · 5 in thread

> Fortunately, we can fix the fact that tar.gz files are unindexed and unseekable, while still making the file a valid tar.gz file by taking advantage of the fact that two gzip streams can be concatenated and still be a valid gzip stream. So you can just make a tar file where each tar entry is its own gzip stream.

I'm surprised nobody came up with this idea till now. It's brilliantly simple.
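
As a rough illustration of the quoted idea (not CRFS's actual code, and skipping the tar layer entirely), here is a minimal Python sketch. The helper names `pack_members` and `read_one` and the in-memory index are assumptions made for the example; the only property it relies on is that concatenated gzip members still form a valid gzip stream.

```python
import gzip

def pack_members(files):
    """Compress each payload as its own gzip member and concatenate them.

    Returns (blob, index), where index maps name -> (offset, length) of
    that file's gzip member inside blob. Hypothetical helper.
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        member = gzip.compress(data)
        index[name] = (len(blob), len(member))
        blob += member
    return bytes(blob), index

def read_one(blob, index, name):
    """Decompress a single file without touching the other members."""
    offset, length = index[name]
    return gzip.decompress(blob[offset:offset + length])

files = {"a.txt": b"hello " * 100, "b.txt": b"world " * 100}
blob, index = pack_members(files)

# An ordinary gzip reader still sees one valid stream with all the data.
assert gzip.decompress(blob) == b"".join(files.values())
# A reader that knows the offsets can seek straight to one file.
assert read_one(blob, index, "b.txt") == files["b.txt"]
```

In a real image the offsets would have to come from somewhere, such as an index stored alongside the data; the dict here just stands in for that.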

_wmd · 7y ago

> This makes images a few percent larger

Teehee. The method is not new at all: compressors like xzip do this out of the box, and what they're doing is almost exactly how ZIP files work.

The trouble with discarding state on every file is that it really hurts performance with small files, or when using anything like a modern codec, which gzip/deflate is not. Gzip maintains a 32 KB dictionary, which is quite easy to exceed with contemporary data, but with a modern compressor (like lzma2) losing that window will absolutely devastate ratios.

The usual solution is so-called 'solid' compression, where the uncompressed input is partitioned into blocks spanning file boundaries. It configurably trades seek efficiency for reliably preserving compressor context -- including allowing seeking within files. Their format could be modified to support this while retaining backwards compatibility as good as the current method. What they have is already pretty much solid compression, except it only chunks large files. This is basically a weird special case of a simpler and more general design everyone else uses.
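
To make the tradeoff concrete, here is a small Python sketch comparing per-file members against blocks that span file boundaries. The sample data, the 4096-byte block size, and the function names are all made up for illustration, and it uses gzip throughout rather than a modern codec, so it only shows the direction of the effect.

```python
import gzip

# Hypothetical sample data: many small, similar files (e.g. package metadata).
files = [
    ('{"name": "pkg%d", "version": "1.0.%d", "license": "MIT"}\n' % (i, i)).encode()
    for i in range(200)
]

def per_file_size(files):
    """Each file is its own gzip member: no shared context, one header each."""
    return sum(len(gzip.compress(f)) for f in files)

def solid_blocks(files, block_size):
    """Concatenate the files, cut the result into fixed-size blocks that may
    span file boundaries, and compress each block as its own gzip member."""
    data = b"".join(files)
    return [gzip.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

data = b"".join(files)
whole = len(gzip.compress(data))         # one stream, not seekable
blocks = solid_blocks(files, 4096)       # seekable at block granularity
blocked = sum(len(b) for b in blocks)
separate = per_file_size(files)          # seekable at file granularity

print("whole:", whole, "blocked:", blocked, "per-file:", separate)

# The concatenated blocks are still one valid gzip stream.
assert gzip.decompress(b"".join(blocks)) == data
```

With lots of tiny, similar files the per-file total should come out well above the blocked one, since every member pays for its own header and starts with an empty window. Block size is the knob: larger blocks keep more context but mean more data to decompress per seek.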

Finally on the compatibility angle, the end of stream is visible at an API level, so this isn't going to be 100% perfect. I'd expect one or more obscure implementations (maybe Windows apps? Java?) to potentially break
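
For what "visible at an API level" can look like, here is a short Python sketch (Python chosen only as an example; it says nothing about which implementations actually break). The high-level `gzip.decompress` reads every member, while a single `zlib` decompress object stops at the first end-of-stream and leaves the rest for the caller.

```python
import gzip
import zlib

blob = gzip.compress(b"first entry\n") + gzip.compress(b"second entry\n")

# High-level helper: reads every member and returns all the data.
assert gzip.decompress(blob) == b"first entry\nsecond entry\n"

# Lower-level API: one decompress object stops at the first end-of-stream.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
first = d.decompress(blob)
assert first == b"first entry\n"
assert d.eof

# The remaining members sit in unused_data; the caller has to keep going.
rest = gzip.decompress(d.unused_data)
assert rest == b"second entry\n"
```

A reader written against the lower-level style that never looks at the leftover bytes would silently return only the first entry.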

xyzzy_plugh · 7y ago

Well, for one, it's not obviously useful in traditional applications:

> This makes images a few percent larger (due to more gzip headers and loss of compression context between files), but it's plenty acceptable.

Compressing an entire image is generally great. Compressing all of the individual files in an image is generally not great.

bradfitz · 7y ago

Maybe not great, but like I said: acceptable.

About 7.6% bigger: https://github.com/golang/build/commit/8a5a4d227f08eb1d889fa...

CSDude · 7y ago

Compressing individual files would probably result in a bigger file, though of course that depends mostly on the content.

david_ar · 7y ago

dict uses something similar:

https://linux.die.net/man/1/dictzip

tsurkoprt · 7y ago · 3 in thread

Why not use www.lucidlink.com directly? Same result, but read/write.

helper (OP) · 7y ago

Because it doesn't solve the problem the author is trying to solve? The goal of this is to be able to produce backwards-compatible tar.gz files that can be served from a Docker registry and can also be streamed on demand instead of downloaded up front.

If they just wanted an S3/GCS FUSE filesystem, there are plenty of open source options out there.

tsurkoprt · 7y ago

Yeah, it does. One just needs to read instead of commenting reactively.

helper (OP) · 7y ago

What do you want me to read?

catern · 7y ago · 2 in thread

In the introduction:

> Currently, however, starting a container in many environments requires doing a pull operation from a container registry to read the entire container image from the registry and write the entire container image to the local machine's disk. It's pretty silly (and wasteful) that a read operation becomes a write operation.

What's silly is to claim that this is the problem. Any read is going to be a write operation, at multiple levels, thanks to systems of transparent caching: to a nearby CDN, to local disk, to local memory, to your CPU cache, etc. These are optimizations; they aren't making your container startup any slower.

The real problem, which this tool indeed helps to solve, is that reading the entire image must complete before you're able to start further processes which read specific parts of the image. It has nothing to do with "reads causing writes".

bradfitz · 7y ago

Hi, author here.

The unnecessary writes I care about are to my cloud VM's small block device, which is I/O limited. The best way to not wait for those is to not do the writes in the first place.

catern · 7y ago

If you don't want those writes to a block device to happen, then you could store your images in a tmpfs instead.

bradfitz · 7y ago

Author here.

I just moved this to https://github.com/google/crfs if people want to track that repo instead of Go's build system (which is relatively boring for most people).

toomuchtodo · 7y ago

This is very cool! I've been waiting to see someone enable tar.gz files to be seekable so they could be object bundles stored in remote blob storage systems that a client could mount and seek through on demand by byte range (so you could treat data in a similar fashion to containers, or like a Mac DMG file that had an open standard for remote mounting).

maxmcd · 7y ago

Conceivably this could be leveraged to allow Docker for Mac to only push deltas to the build virtual machine when running docker build, correct?

Currently docker build compresses everything in the working directory on every build. This is fine for building images for deploy/upload but is annoying for a local dev situation where you're frequently rebuilding.

Seems like it wouldn't be too hard to write an alternate docker build that checks a previously built "Stargz" and just sends the additional files? (There would be some complexity here reassembling a valid tar within hyperkit).

I might be missing something here, and I might be misplacing the bottleneck during the build, but every time I'm annoyed by this problem, it seems part of the issue is the single fat tar that needs to be created every time.

edit: this strategy could also work with docker-machine building on remote machines

fulafel · 7y ago

If the bottleneck of pulling was eliminated by this, it means the test runs didn't need to access most of the image, right? I wonder what this says about carrying unnecessary stuff or test coverage. Especially since the base distro layers were probably cached.

Edit: "For isolation and other reasons, we run all our containers in a single-use fresh VMs." So they had no caching for the base layers unless those were primed in the VM image?

whalesalad · 7y ago

TaaS – Tar as a Service.
