Someone was trying to talk me into git subtrees though...
It'd be great if they worked like Python's editable package installations.
It's interesting to see the wheel reinvented. We used to run a 500 GB art sync and a 200 GB code sync against a ~2 TB back-end repo back when I was in gamedev. P4 also has proper locking; it really is the right tool if you've got large assets that need to be coordinated and versioned.
Only downside of course is that it isn't free.
Another downside is that it consumes insane resources (our servers have dozens of TiB of RAM, with huge NVMe-based storage arrays directly attached).
Another downside is that you have to maintain a connection to the P4 server to do any VCS operations (stashing included).
Another downside is that branches are very "expensive" (often taking days) and are impossible to reconcile. We never re-merge to MAIN.
DVCS is in direct opposition to workflows that include binary files (yes, I'm aware that git lfs has locking; it's also centrally orchestrated) because almost no binary format can be merged.
We were using P4 ~15 years ago for these workflows and rather than understanding what made them work people are just rediscovering the same problems that have already been solved.
My guess is that we'll next see a solution that dynamically caches most downloaded files in a geographically friendly way; heck, we may even call it "P4Proxy".
I've seen so much FUD about how git is the "one true workflow" because other solutions "don't scale", coming from people who don't understand the constraints that certain workflows impose. Git/DVCS is great for a lot of things, but sometimes you should use the right tool for the job rather than hack something together.
[Edit] These reasons are exactly why you see Unreal supporting P4/SVN out of the box[1] and no mention of git.
[1] https://docs.unrealengine.com/en-US/Engine/UI/SourceControl/...
Git LFS has been just fine for multiterabyte repositories for years.
Wrong? There's a `--depth` option for the `git fetch` command which lets the user specify how many commits they want to fetch from the repository.
However, I’ve heard many studios use Perforce instead. Not being open source is a downside to some, but I don’t really know too much about it personally.
Then if you're working with a lot of non-code files, it sounds like some solutions have locking. I guess two people couldn't edit the same Blender or PSD file at the same time and then merge them later on.
Kinda wouldn’t surprise me if some companies actually run multiple version control systems. Code on one system, game assets on another.
You’re more right than you think about multiple versioning systems, although keeping them synchronized becomes an issue. Perforce is a bit of a boon for management, as they get a GUI for versioning across a multidisciplinary team.
Historically, P4’s GUI and model have also been easier than git for non-programming roles to learn and use, so a team with wide skills can ramp up quickly with a unified toolset. You can probably guess what inertia that has in a space with higher turnover than other industries.
As mentioned, things are changing though. git and GitHub have become a mainstay and are what new programmers likely learn in schools. This has a trickle effect on new projects with smaller teams and results in more investment into git setups. I use git in a AAA context at work, and it’s not uncommon to find sentiments from more seasoned game programmers on git that are similar to HN comments about the latest fad in web frameworks.
That would e.g. make it reasonable to check in virtual machine images or iso images into a repository. Extra storage (and by extension, network bandwidth) would be proportional to change size.
git has delta compression for text as an optimization, but it’s not used on big binary files and it isn't even online (it only happens when making a pack). This would provide it online for large files.
Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 extension.
However, with self-synchronizing hashes of the kind used by rsync, bup, and borg, it doesn't matter: you could have a 1 TB file, delete a single byte at position 100, and you'd only need to store or transfer one new block (with average size 8KB for rsync, configurable for borg) if you already have a copy of the version before the change.
It's somewhat comparable with diff/patch but not exactly. It's worse in that change granularity is only specified on average. It's better in that it works well on binary files, does not require a specific reference diff (it can reference all previous history), and efficiently supports reordering as well as small changes: if you divide a 4000-line text file into four 1000-line sections and reorder them 1,2,3,4 -> 3,1,4,2, you will find the diff/patch to be as long as a new copy, whereas a self-synchronizing hash decomposition will hardly take any space for the reordered file given the original.
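For anyone who wants to poke at the idea, here is a minimal sketch of content-defined chunking in Python. It is not the actual algorithm from rsync, bup, or borg: it swaps in a plain polynomial rolling hash, and the window size, modulus, and ~1 KiB average chunk size are all illustrative assumptions. It cuts a chunk wherever the hash of the last few bytes matches a bit pattern, then counts how many bytes land in chunks that were not already stored for the original.

```python
import hashlib
import random

WINDOW = 48                  # trailing bytes that decide a boundary (assumed value)
BASE = 257
MOD = (1 << 31) - 1          # prime modulus for the polynomial rolling hash
MASK = (1 << 10) - 1         # cut when the low 10 bits are all ones: ~1 KiB average
POW = pow(BASE, WINDOW, MOD)

def split(data):
    """Cut data wherever the hash of the last WINDOW bytes matches MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            # Remove the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * POW) % MOD
        if (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def unseen_bytes(old, new):
    """Bytes of `new` that sit in chunks not already stored for `old`."""
    known = {hashlib.sha256(c).digest() for c in split(old)}
    return sum(len(c) for c in split(new)
               if hashlib.sha256(c).digest() not in known)

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(256 * 1024))

# Delete a single byte at offset 100.
deleted = original[:100] + original[101:]

# Reorder four equal sections: 1,2,3,4 -> 3,1,4,2.
q = len(original) // 4
s1, s2, s3, s4 = (original[i * q:(i + 1) * q] for i in range(4))
reordered = s3 + s1 + s4 + s2

print(unseen_bytes(original, deleted), "of", len(deleted), "bytes new after deletion")
print(unseen_bytes(original, reordered), "of", len(reordered), "bytes new after reorder")
```

Because a boundary depends only on the preceding 48 bytes, the cut points line up again shortly after an edit, so only the chunks touching the edit (or the section seams) need to be stored again.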
One obvious example where you could have a lot of common blocks (even following the offset where a change was made) is zip files. The zip format basically compresses each file individually and then concatenates all that together.
Let's say you have a build and it packages the results up as a big zip file. (Java builds often do this. A jar is a special type of zip file.) If you change a few source files and rebuild, and if your build is deterministic (and/or incremental), then the new zip file will contain a lot of the same stuff as the previous version. And if your zip archiver is deterministic (pretty safe assumption), it should produce a zip file that is mostly the same sequences of bytes as the previous zip file, even if there are changed files in the middle.
If you write a .tar.gz archive, then one change in the middle will throw everything off from that point on, because it compresses the whole archive instead of individual files. In theory a binary diff can work around this by first undoing the gzip that was done to create each large blob, then doing a binary diff on that, and then arranging to be able to recreate what gzip did. Obviously that's messy.
Of course, not every file is an archive. Some are filesystems. But any writable filesystem (notably not including ISOs) that is capable of being used on a hard disk will of necessity not rewrite everything. If it did, changing one file on a filesystem would take hours because the rest of the partition would have to be rewritten.
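A rough way to see the zip vs. tar.gz difference, sketched in Python with the standard zipfile, tarfile, and gzip modules. The file names and contents are made up, and timestamps are pinned so the archiver is deterministic. It builds each archive twice, with one member in the middle changed, and checks whether the stored record for an untouched member survives byte for byte. How much of the tar.gz stream survives depends on the zlib build, so that number is only printed, not promised.

```python
import gzip
import io
import struct
import tarfile
import zipfile

def make_zip(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            # Fixed timestamp so identical input gives identical bytes.
            info = zipfile.ZipInfo(name, date_time=(2020, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()

def make_targz(files):
    buf = io.BytesIO()
    # mtime=0 keeps the gzip header identical between builds.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            for name, data in files.items():
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def zip_record(raw, name):
    """Local header plus compressed payload of one member, as stored."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        info = zf.getinfo(name)
    off = info.header_offset
    # The ZIP local header is 30 fixed bytes; name/extra lengths sit at offset 26.
    name_len, extra_len = struct.unpack("<HH", raw[off + 26:off + 30])
    return raw[off:off + 30 + name_len + extra_len + info.compress_size]

def common_prefix(x, y):
    n = 0
    while n < min(len(x), len(y)) and x[n] == y[n]:
        n += 1
    return n

def text(tag, lines):
    return "".join(f"{tag} line {i}: {i * i}\n" for i in range(lines)).encode()

before = {"a.txt": text("a", 2000), "b.txt": text("b", 2000), "c.txt": text("c", 2000)}
after = {**before, "b.txt": before["b.txt"] + b"one more line\n"}  # edit the middle file

zip_a, zip_b = make_zip(before), make_zip(after)
tgz_a, tgz_b = make_targz(before), make_targz(after)

print("c.txt record reused verbatim in zip:", zip_record(zip_a, "c.txt") in zip_b)
print("tar.gz common prefix:", common_prefix(tgz_a, tgz_b), "of", len(tgz_a), "bytes")
print("c.txt intact once gzip is undone:", before["c.txt"] in gzip.decompress(tgz_b))
```

The last line is the "undo the gzip first" workaround in miniature: the shared content is still there, it is just hidden behind the whole-archive compression.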
Another obvious type of big blob is multimedia. I don't know a lot of specifics, but I would think file formats meant for editors would keep changes localized for reducing IO (for example, so that changes in a non-linear video editor don't need to write a giant file), but formats meant for export and delivery might change the whole file since they're aiming for small size.
Git is a distributed VCS, and we should support keeping it that way.
git archive --remote=<your-URL> HEAD | tar -t
You can get the most recent tree for a repository (no history, just the current state of the repo) with `git clone --depth=1`. That's often good enough for slow connections.
One of my biggest remaining pain points is resumable clone/fetch. I find it near impossible to clone large repos (or fetch if there were lots of new commits) over a slow, unstable link, so almost always I end up cloning a copy to a machine closer to the repo, and rsyncing it over to my machine.
> Partial Clone is a new feature of Git that replaces Git LFS and makes working with very large repositories better by teaching Git how to work without downloading every file.
How would a partial checkout help?
[1] - https://dvc.org/