Git partial clone lets you fetch only the large file you need

(about.gitlab.com)

229 points · moyer · 6y ago · 86 comments

54 comments · 14 top-level

derefr · 6y ago · 8 in thread

Has anyone used Git submodules to isolate large binary assets into their own repos? Seems like the obvious solution to me. You already get fine-grained control over which submodules you initialize. And, unlike Git LFS, it might be something you’re already using for other reasons.

jniedrauer · 6y ago

Using submodules requires that everyone on your team has at least a vague idea of what's going on and how not to foot-gun themselves. That's hard enough with git itself. I don't think I've ever seen submodules used without becoming a major pain point.

hinkley · 6y ago

That is a straight up nonstarter.

Someone was trying to talk me into git subtrees though...

matheusmoreira · 6y ago

The problem with git submodules is they can't be used like a hyperlink to another repository. Updating the submodule requires updating the superproject as well. The new commits are invisible to the superproject until that is done.

It'd be great if they worked like Python's editable package installations.

sjburt · 6y ago

Then the state of the superproject would depend on when the checkout occurred. That would be disastrous for consistency, you’d be unable to replicate a checkout later or elsewhere. The state of a repo after a checkout should only depend on the commit that was checked out.

ComputerGuru · 6y ago

They can now, with the new-ish submodule update/init --remote. But the problem with submodules is that you cannot do a shallow fetch (depth 1) because most hosts won’t serve unadvertised refs.

jfkebwjsbx · 6y ago

Submodules are almost always the wrong answer. If you need to version huge files, use Git LFS.

Ididntdothis · 6y ago

I have tried submodules but it’s way too easy to shoot yourself in the foot. Not very sustainable in a team with different levels of git knowledge.

lettergram · 6y ago

I’ve done that. Especially if you want specific versions of data to build ML models, this makes a nice audit log for reproducibility.

vvanders · 6y ago · 7 in thread

Also known as workspace views in P4.

It's interesting to see the wheel reinvented. We used to run a 500 GB art sync/200 GB code sync with a ~2 TB back-end repo back when I was in gamedev. P4 also has proper locking; it really is the right tool if you've got large assets that need to be coordinated and versioned.

Only downside of course is that it isn't free.

dijit · 6y ago

> Only downside of course is that it isn't free.

Another downside is that it consumes insane resources (our servers are in the dozens of TiB of RAM, with huge NVMe-based storage arrays directly attached).

Another downside is that you have to maintain connection to p4 to do any VCS operations (stashing included).

Another downside is that branches are very "expensive" (often taking days) and are impossible to reconcile. We never re-merge to MAIN.

jasondclinton · 6y ago

This kind of comment isn't helpful. Of course, there have been ways to copy large files around since there were networks. What's new in this protocol enhancement is that this works within the context of a Merkle tree-based technology (upon which all DVCS's are based). To use your analogy, yes this is a wheel but it's built with rubber instead of wood and iron.

vvanders · 6y ago

I guess I should have expanded more.

DVCS is in direct opposition to workflows that include binary files (yes, I'm aware that git lfs has locking; it's also centrally orchestrated) because almost no binary format can be merged.

We were using P4 ~15 years ago for these workflows and rather than understanding what made them work people are just rediscovering the same problems that have already been solved.

My guess is that we'll next see a solution that dynamically caches most downloaded files in a geographic friendly way, heck we may even call it "P4Proxy".

I've seen so much FUD around how git is the "one true workflow" because other solutions "don't scale" when they don't understand the constraints that certain workflows impose. Git/DVCS is great for a lot of things but sometimes you should use the right tool for the job rather than hack something together.

[Edit] These reasons are exactly why you see Unreal supporting P4/SVN out of the box[1] and no mention of git.

[1] https://docs.unrealengine.com/en-US/Engine/UI/SourceControl/...

01100011 · 6y ago

I love P4 for just working, but I absolutely can't stand the limited shelving ability. I ended up writing a helper program that lets me shuffle local changes off to a git repo just so I could manage working on several overlapping changelists. Perforce would be so much more usable if they would include this sort of basic functionality right out of the box. The thing git gets right is that you often need to juggle several threads of change at the same time, and those threads may have complex branching as you try out different approaches and combine the best pieces at the end.

jfkebwjsbx · 6y ago

P4 was great for its time if you could pay for it, but it is definitely not a competitor anymore.

Git LFS has been just fine for multiterabyte repositories for years.

w0m · 6y ago

p4 is still great; the fact that there are workarounds to make git usable on similar workloads doesn't take away p4's inherent advantages. They're two very different tools.

jayd16 · 6y ago

P4 is fine on paper. I just wish the client didn't crash and the server didn't lock up as often as it does. P4V is a mess.

piliberto · 6y ago · 6 in thread

> One reason projects with large binary files don't use Git is because, when a Git repository is cloned, Git will download every version of every file in the repository.

Wrong? There's a --depth option for the git fetch command which allows the user to specify how many commits they want to fetch from the repository.

ComputerGuru · 6y ago

depth is broken; it cannot be used for submodules/recursive submodules dependably because most hosts will refuse to serve unadvertised refs. We learned this the hard way. Or maybe it is submodules that are broken. Or git itself.

chrismorgan · 6y ago

Depth, submodules and multiple work trees are all half-baked features that work fine up to a point, then start falling over frantically—most notably if you try to use them together.

shaklee3 · 6y ago

depth just lets you control the amount of history. It will not let you exclude files in the most recent commit that you don't want. So while that statement was not accurate, it's not what this feature is intended for.
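
To make the distinction concrete, here is a minimal Python sketch that only builds the two `git clone` command lines and does not run them. The helper name `clone_command`, the example URL, and the 1 MiB limit are made up for illustration; `--depth` is the shallow-clone option and `--filter=blob:limit=<size>` is a partial-clone filter, which only works if the server allows filtering.

```python
def clone_command(url, dest, depth=None, blob_filter=None):
    """Build (but do not run) a `git clone` argument list."""
    cmd = ["git", "clone"]
    if depth is not None:
        # Shallow clone: truncates history, but every file in the
        # fetched commits still comes down.
        cmd.append(f"--depth={depth}")
    if blob_filter is not None:
        # Partial clone: blobs matching the filter are left on the
        # server until something actually needs them.
        cmd.append(f"--filter={blob_filter}")
    cmd += [url, dest]
    return cmd


# Hypothetical repository URL, for illustration only.
URL = "https://example.com/big-assets.git"

shallow = clone_command(URL, "repo-shallow", depth=1)
partial = clone_command(URL, "repo-partial", blob_filter="blob:limit=1m")

print(" ".join(shallow))
print(" ".join(partial))
```

The two options are independent, so in principle they can be combined, but that interaction is not something this sketch tries to demonstrate.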

sewer_bird · 6y ago

Yes, but 95% of devs, even fairly talented ones, don't really know how to use Git.

colonwqbang · 6y ago

Author seems to be a manager, not necessarily a dev.

phaemon · 6y ago

Git is fundamentally very simple. Any dev who doesn't understand exactly how it works is not even remotely talented.

danbolt · 6y ago · 4 in thread

In the AAA games industry git has been a bit slower on the uptake (although that’s changing quickly) as large warehouses of data are often required (eg: version history of video files, 3D audio, music, etc.). It’s nice to see git have more options for this sort of thing.

Keverw · 6y ago

Surprised this new idea doesn’t support object storage. Sounds like Git LFS would still be the right way to go for repos with assets for games like meshes, sounds, etc.

However, I’ve heard many studios use Perforce instead. Not being open source is a downside to some, but I don’t really know too much about it personally.

Then, if you're working with a lot of non-code files, it sounds like some solutions have locking. I guess two people couldn't edit the same Blender or PSD file at the same time and then merge them later on.

Kinda wouldn’t surprise me if some companies actually run multiple version control systems. Code on one system, game assets on another.

danbolt · 6y ago

I think in terms of game production, software licensing usually isn’t the largest cost center for a project. Proprietary software isn’t as much of a concern, given that games traditionally are “shipped” and then completed. (Note that this changes as games become more online service-based, with live operations rather than a specific release date and a “final” copy sent for production; the internet has changed things a lot.)

You’re more right than you think about multiple versioning systems, although keeping synchronized becomes an issue. Perforce is a bit of a boon for management, as they get a GUI for versioning across a multidisciplinary team.

jfkebwjsbx · 6y ago

Git LFS has been a thing for years, though.

danbolt · 6y ago

You’re absolutely right, but larger developers and publishers have been slower to adopt.

Historically, P4’s GUI/model has also been more intuitive than git for non-programming roles to learn and use, so a team with wide skills can ramp up quickly with a unified toolset. A less-technical manager gets a GUI that has versioning across changes from a multidisciplinary team. You can probably guess what inertia that has in a space with higher turnover compared to other industries.

As mentioned, things are changing though. git and GitHub have become a mainstay and are what new programmers likely learn in schools. This has a trickle effect on new projects with smaller teams and results in more investment into git setups. I use git in a AAA context at work, and it’s not uncommon to find sentiments from more seasoned game programmers on git that are similar to HN comments about the latest fad in web frameworks.

beagle3 · 6y ago · 3 in thread

There is one more piece to the puzzle to make git perfect for every use case I can think of: store large files as a list of blobs broken down by some rolling hash à la rsync/borg/bup.

That would e.g. make it reasonable to check in virtual machine images or iso images into a repository. Extra storage (and by extension, network bandwidth) would be proportional to change size.

git has delta compression for text as an optimization but it’s not used on big binary files and is not even online (only on making a pack). This would provide it online for large files.

Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 extension.

pas · 6y ago

Do ISOs and other large blob types support only partial (block) modification? Wouldn't all subsequent blocks change too?

beagle3 · 6y ago

Sometimes they do - e.g. if you replace a file in the ISO that is the same size up to block alignment, which is common when e.g. editing a text file or recompiling an executable with a minor change. They almost always do when it's a VM image representing a disk - only some blocks change every write.

However, with self synchronizing hashes of the kind used by rsync, bup, and borg, it doesn't matter - you could have a 1TB file, delete a single byte at position 100 - and you only need to store or transfer one new block (with average size 8KB for rsync, configurable for borg) if you already have a copy of the version before the change.

It's somewhat comparable with diff/patch but not exactly; it's worse in that change granularity is only specified on average; it's better in that it works well on binary files, does not require a specific reference diff (can reference all previous history), and efficiently supports reordering as well as small changes - if you divide a 4000 line text file into four 1000-line sections and reorder them 1,2,3,4 -> 3,1,4,2 you will find the diff/patch to be as long as a new copy, whereas a self synchronizing hash decomposition will hardly take any space for the reordered file given the original.
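
As a rough illustration of the "delete one byte, store one new block" claim, here is a minimal Python sketch of content-defined chunking. It is not how rsync, bup, borg, or git actually do it: the polynomial rolling hash, the 48-byte window, the roughly 2 KiB average chunk size, and the names `split_chunks` and `new_chunks` are all assumptions chosen to keep the example short.

```python
import hashlib
import random

WINDOW = 48            # bytes of context the rolling hash covers (assumed value)
MASK = (1 << 11) - 1   # cut when the low 11 bits are zero: roughly 2 KiB chunks
BASE = 257
MOD = (1 << 61) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def split_chunks(data):
    """Cut data wherever the hash of the last WINDOW bytes hits the mask."""
    pieces, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            pieces.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        pieces.append(data[start:])
    return pieces


def new_chunks(stored, data):
    """Chunks of data whose SHA-256 digest is not already in the store."""
    return [c for c in split_chunks(data)
            if hashlib.sha256(c).digest() not in stored]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(200_000))
edited = original[:100] + original[101:]   # delete one byte at position 100

store = {hashlib.sha256(c).digest() for c in split_chunks(original)}
extra = new_chunks(store, edited)
print(len(split_chunks(original)), "chunks originally;",
      len(extra), "new chunk(s) after the edit")
```

Because each cut depends only on the last `WINDOW` bytes, the cut points line up with the old ones again shortly after the deleted byte, so only the chunk or chunks around the edit need to be stored.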

adrianmonk · 6y ago

It really depends on the type of file. ("Other large blob types" is a rather broad category.)

One obvious example where you could have a lot of common blocks (even following the offset where a change was made) is zip files. The zip format basically compresses each file individually and then concatenates all that together.

Let's say you have a build and it packages the results up as a big zip file. (Java builds often do this. A jar is a special type of zip file.) If you change a few source files and rebuild, and if your build is deterministic (and/or incremental), then the new zip file will contain a lot of the same stuff as the previous version. And if your zip archiver is deterministic (pretty safe assumption), it should produce a zip file that is mostly the same sequences of bytes as the previous zip file, even if there are changed files in the middle.
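
A small Python sketch of that zip property, using only the standard `zipfile` module: two archives are built that differ in one middle member, and the raw bytes of the untouched members are compared. The file names, contents, fixed timestamp, and the helpers `build_zip` and `member_record` are invented for the example, and the record-length arithmetic assumes no extra field and no data descriptor, which should hold for archives written this way.

```python
import io
import zipfile

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # fixed timestamp so headers are reproducible


def build_zip(members):
    """Write members (name -> bytes) into an in-memory, deflated zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def member_record(archive, name):
    """Return the raw local header plus compressed bytes of one member."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    start = info.header_offset
    # 30-byte fixed local header, then the file name, then the compressed
    # data (assumes no extra field and no data descriptor).
    end = start + 30 + len(name.encode()) + info.compress_size
    return archive[start:end]


base = {
    "a.txt": b"alpha\n" * 1000,
    "b.txt": b"beta\n" * 1000,
    "c.txt": b"gamma\n" * 1000,
}
changed = {**base, "b.txt": b"beta, edited\n" * 1000}  # only the middle file differs

v1, v2 = build_zip(base), build_zip(changed)
for name in base:
    same = member_record(v1, name) == member_record(v2, name)
    print(name, "identical bytes" if same else "changed")
```

The archives as a whole still differ, since the changed member and the central directory at the end are rewritten, but the records for the untouched files are byte-for-byte the same, which is what a block-level deduplicator would pick up on.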

If you write a .tar.gz archive, then one change in the middle will throw everything off from that point on because it compresses the whole archive instead of individual files. In theory a binary diff can work around this by first undoing the gzip that was done to create each large blob, then doing a binary diff on that, and then arranging to be able to recreate what gzip did. Obviously that's messy.

Of course, not every file is an archive. Some are filesystems. But any writable filesystem (notably not including ISOs) that is capable of being used on a hard disk will of necessity not rewrite everything. If it did, changing one file on a filesystem would take hours because the rest of the partition would have to be rewritten.

Another obvious type of big blob is multimedia. I don't know a lot of specifics, but I would think file formats meant for editors would keep changes localized for reducing IO (for example, so that changes in a non-linear video editor don't need to write a giant file), but formats meant for export and delivery might change the whole file since they're aiming for small size.

itroot · 6y ago · 3 in thread

Also, --reference (or --shared) is a good parameter to speed up cloning (for builds, for example) if you have your repository cached in some other place. I was using it a long time ago when I was working on a system that required cloning 20-40 repos to build. This approach decreased clone timings by an order of magnitude.

mikepurvis · 6y ago

Do you actually need clones in that scenario? I worked on a build system that grabbed source from several hundred repos at the starting point, and it turned out to be way faster to just grab it all as tarballs with aria2c.

madsbuch · 6y ago

Grabbing the tarball from where? To the best of my knowledge, tarball export is not a part of git, but something git hosts provide.

Git is a distributed VCS, and we should support keeping it that way.

andrewshadura · 6y ago

Careful, with extra large repositories it actually slows down the cloning while, obviously, significantly reducing the space usage.

nikivi · 6y ago · 3 in thread

Is it possible given a git repo (hosted on say GitHub) to only 'clone' (download) certain files from it? Without `.git`

fizixer · 6y ago

I believe you're looking for the 'working tree' only. You could do the following:

git archive --remote=<your-URL> HEAD | tar -t

source: https://stackoverflow.com/questions/3946538

calvinlh · 6y ago

If you only want a subset of the repo's files, you can use GitHub's Subversion interface: https://stackoverflow.com/a/18194523

bspammer · 6y ago

Short answer is, not easily: https://stackoverflow.com/a/14610427

You can get the most recent tree for a repository (no history, just the current state of the repo) with `git clone --depth=1`. That's often good enough for slow connections.

microtherion · 6y ago · 2 in thread

That seems quite useful, though Git LFS mostly does the job.

One of my biggest remaining pain points is resumable clone/fetch. I find it near impossible to clone large repos (or fetch if there were lots of new commits) over a slow, unstable link, so almost always I end up cloning a copy to a machine closer to the repo, and rsyncing it over to my machine.

hinkley · 6y ago

What’s your take on this line?

> Partial Clone is a new feature of Git that replaces Git LFS and makes working with very large repositories better by teaching Git how to work without downloading every file.

microtherion · 6y ago

I believe partial clone makes the situation a little better, but it's not nearly as good as resumable cloning, because you have to partition your repo in advance.

jniedrauer · 6y ago · 2 in thread

This could actually be a really good solution to the maximum supported size of a Go module. If you place a go.mod in the root of your repo, then every file in the repo becomes part of the module. There's also a hardcoded maximum size for a module: 500M. Problem is, I've got 1G+ of vendored assets in one of my repos. I had to trick Go into thinking that the vendored assets were a different Go module[0]. Go would have to add support for this, but it would be a pretty elegant solution to the problem.

[0]: https://github.com/golang/go/issues/37724

lima · 6y ago

That does sound like a "you're holding it wrong" issue. As one of the Go team members pointed out, defining a separate module is not a hack, but the intended way of doing it.

How would a partial checkout help?

jniedrauer · 6y ago

Go modules are built around git, unlike many other languages' package systems. That means you don't get to pick and choose what goes into them. Imagine if you had to put an empty package.json in every (non-node) directory of your git repo to exclude it from an NPM package, or an install.py in every (non-python) directory to exclude it from a PyPI package. Multi-language repos would get ridiculous pretty quickly.

shaklee3 · 6y ago · 1 in thread

This is great. We use git lfs extensively, and one of the biggest complaints we have is that users have to clone 7GB of data just to get the source files. There's a workaround: you don't enter your username and password for the lfs repo and let it time out, but that's a kluge.

elephantum · 6y ago

There’s an option for that: GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY

vicosity · 6y ago · 1 in thread

I'm still unconvinced. Will this provide a user-friendly approach to managing design assets?

madsbuch · 6y ago

My impression is that it will use the normal git experience for managing design assets. I.e., with this there should be no need for additional tooling. If it works, that would be so great!

scarecrow112 · 6y ago

This is interesting and could be a savior for machine learning (ML) engineering teams. In a typical ML workflow, there are three main entities to be managed: (1) code, (2) data, and (3) models. Systems like Data Version Control (DVC) [1] are useful for versioning 2 and 3. DVC improves on usability by residing inside the project's main git repo while maintaining versions of the data/models that reside on a remote. With Git partial clone, it seems like the gap between 1 and 2/3 could be reduced further.

[1] - https://dvc.org/

krupan · 6y ago

I started a project recently and for the first time ever I've wanted to keep large files in my repo. I looked into git LFS and was disappointed to learn that it requires either third-party hosting or setting up a git LFS server myself. I looked into git annex and it seems decent. This, once it is ready for prime time, will hopefully be even better.

smitty1e · 6y ago

In AWS, it's worth considering putting those large files in an S3 bucket.
