The author is using Microsoft's Git fork; they added this new command just this summer: https://github.com/microsoft/git/pull/667
It more or less replaces the --full-name-hash option (again, a very good cover letter that explains the differences and the pros and cons of each very well!)
What's up with folks in Europe that they can't clone a big repo when others can? Also, it sounds like they still won't be able to clone until the change is implemented on the server side?
> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo
The sentence seems to be cut off.
Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.
I read that as an anecdote; a more complete sentence would be: "We had a case where someone from Europe couldn't clone the whole repo onto his laptop for a journey across Europe, because his disk was full at the time. He has since cleared up the disk and is now able to clone the repo."
I don't think it points to a larger issue with Europe not being able to handle 180GB files... I surely hope not.
Every once in a while, my router used to go crazy with what seemed like packet loss (a memory issue, I think).
Normal websites would become super slow for any pc or phone in the house.
But git… git would fail to clone anything that wasn't really small.
My fix was to unplug the modem and router and plug back in. :)
It took a long time to discover that the router was reporting packet loss, that the slowness the browsers were experiencing had to do with retries, and that git just crapped out.
Eventually, whenever git started misbehaving, I restarted the router to fix it.
And now I have a new router. :)
After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it, because some of my coworkers with packet loss were getting request timeouts that npm didn't handle well at all. npm of that era didn't bother to check bytes received against Content-Length, and would then cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~v8, when the refactoring work first started paying dividends.
Thankfully, we do our work almost entirely in shallow clones inside codespaces, so it's not a big deal. I hope the problems presented in the 1JS repo in this blog post are causing a similar size blowout in our repo and can be fixed.
They might be in a country with underdeveloped internet infrastructure, e.g. Germany))
The explanation probably got lost among all the gifs, but the last 16 chars here are different:

> was actually only checking the last 16 characters of a filename

> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!
(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)
The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.
Indeed, the blog's example doesn't literally collide, because the shared suffix of foo/CHANGELOG.md and bar/CHANGELOG.md (/CHANGELOG.md) is only 13 chars; you have to imagine paths with a longer common suffix. That part is fixed by the --full-name-hash option: you then compare the full path instead of just the last 16 bytes.
Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix: you're still giving terrible inputs to the compressor and compensating by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path-walk API and/or --full-name-hash are the real interesting parts here =)
No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:
https://github.com/git-for-windows/git/pull/5157/commits/d5c...
Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.
The grouping algorithm puts less weight on each character the further it is from the right-side of the name:
hash = (hash >> 2) + (c << 24)
Hash is 32 bits. Each 8-bit char (from the full path) is in turn added to the 8 most significant bits of the hash, after shifting the previous hash bits right by two (which is why only the final 16 chars affect the final hash). Look at what happens in practice: https://go.dev/play/p/JQpdUGXdQs7
Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.
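In case the playground link rots, here's the same idea inline. packNameHash below follows the shift-and-add loop quoted above (the real pack_name_hash() in git also skips whitespace characters, which I've omitted): everything left of the last 16 characters is shifted completely out of the 32-bit hash, so two paths that share a long suffix end up with nearly identical values.

```go
package main

import "fmt"

// packNameHash mimics git's pack_name_hash() loop: each byte lands in the
// top 8 bits, then gets shifted right by two per subsequent byte, so only
// roughly the last 16 characters of the path influence the result, and
// the further left a character sits, the fewer of its bits survive.
func packNameHash(name string) uint32 {
	var hash uint32
	for i := 0; i < len(name); i++ {
		hash = (hash >> 2) + uint32(name[i])<<24
	}
	return hash
}

func main() {
	// Different prefixes, identical 13-char suffix: the hashes agree in
	// everything but the lowest bits, so git groups the files together.
	a := packNameHash("aaa/CHANGELOG.md")
	z := packNameHash("zzz/CHANGELOG.md")
	fmt.Printf("%08x\n%08x\n", a, z) // 83a3d01a / 83a3d023: only the low byte differs
}
```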
If we interpret it that way, that also explains why the path-walk solution solves the problem.
But if it were really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem would be a lot more common, at least in monorepos.
The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.
Both of these are less than half of the initial size. It would be great if there were a way to get GitHub to run these, and even greater if people started hosting stuff in a way that gives them control over this...
https://github.blog/author/dstolee/
See also his website:
Kudos to Derrick, I learnt so much from those!
> Retroactively, once the file is there though, it's semi stuck in history.
Arguably, the fix for that is to run filter-branch, remove the offending binary, teach and get everyone setup to use git-lfs for binaries, force push, and help everyone get their workstation to a good place.
Far from ideal, but better than having a large not-even-used file in git.
As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.
IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.
If you really dig down into why we code the way we do, about half of the "best practices" in software development are heavily influenced by merge conflicts, if not primarily motivated by avoiding them.
If I group like functions together in a large file, then I (probably) won’t conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we’ll conflict. As long as one of us does the right thing everything is fine.
I've been watching all the recent GitMerge talks put up by GitButler and following the monorepo / scaling developments - lots of great things being put out there by Microsoft, Github, and Gitlab.
I'd like to understand this last 16 char vs full path check issue better. How does this fit in with delta compression, pack indexes, multi-pack indexes etc ... ?
Officer, I'd like to report a murder committed in a side note!
> We work in a very large Javascript monorepo at Microsoft we colloquially call 1JS.
I used to call it office.com... Teams is the worst offender there. Even a website with a cryptominer on it runs faster than that junk.

Collaborative editing between a web app, two mobile apps and a desktop app with 30 years of backwards compatibility, and it pretty much just works. No wonder that took a lot of JavaScript!
We've totally given up on any kind of collaborative document editing because it's too frustrating, or we use Notion instead, where, for all its faults, at least the basic stuff, like loading a bloody file, works...
I beg to differ. Last time I had to use PowerPoint (granted, that was ~3 years ago), math on the slides broke when you touched it with a client that wasn't of the same type as the one that initially put it there. So you would need to use either the web app or the desktop app to edit it, but you couldn't switch between them. Since we were working on the slides with multiple people you also never knew what you had to use if someone else wrote that part initially.
To the point where they quickly found the flaws in JS for large codebases and came up with TypeScript, I think. It makes sense that TS came out of the Office for the web project.
Just a note: OMR (the Office monorepo) is a different (and actually much larger) monorepo than 1JS (which is big on its own).
To be fair I suspect a lot of the bloat in both originates from the amount of home grown tooling.
What is it about Europe that makes it more difficult? That internet in Europe isn't as good? Actually, I have heard that some primary schools in Europe lack internet. My grandson's elementary school in rural California (population <10k) had internet as far back as 1998.
First of all, "internet in Europe" makes close to zero sense to argue about. The article just uses it as a shortcut to avoid listing countries.
I live in a country in "Europe" where I have 10Gbps full-duplex and pay $50/month.
The issue is that some countries have telecom lobbies which are still milking their copper networks. Then the "competition committees" in most of these countries actually work AGAINST the benefit of the public: they don't allow a single company to start offering fiber, because that would be a competitive advantage. So the whole system is kind of in a deadlock. To unblock it, at least two telecoms have to agree to release fiber deals together. That has happened in some countries.
//Confused swede with 10G fiber all over the place. Writing from literally the countryside next to nowhere.
This sort of thing has been a problem on every project I've worked on that's involved people in America. (I'm in the UK.) Throughput is inconsistent, latency is inconsistent, and long-running downloads aren't reliable. Perhaps I'm over-simplifying, but I always figured the problem was fairly obvious: it's a lot of miles from America to Europe, west coast America especially, a lot of them are underwater, and you're sharing the conduit with everybody else in Europe. There are many ways for packets to get lost (or get held up long enough to count), and frankly it's quite surprising more of them don't.
(The usual thing with Perforce is to leave it running overnight/weekend with a retry count of 1 million. I'm not sure what you'd do with Git, though; it seems to do the whole transfer as one big non-retryable lump. There must be something, though.)
I'm from the Netherlands where over 90% of households now have fiber connections, for example. Here in Berlin it's very hard to get that. They are starting to roll it out in some areas but it's taking very long and each building has to then get connected, which is up to the building owners.
Some EU is still suffering from Telekom copper barons.
FWIW every school I've seen (and I recently toured a bunch looking at them for my kids to start at) all had the internet and the kids were using iPads etc for various things.
Anecdotally, my secondary school (11-18y in UK) in rural Hertfordshire was online in the 1995 region. It was via, I think, a 14.4 modem, and there actually wasn't that much useful material for kids then, to be honest. I remember looking at the "non-professional style" NASA website, for instance (the current one is obviously quite fancy in comparison, but it used to be very rustic and at some obscure domain). CD-based encyclopedias were all the rage around that time instead, IIRC - Encarta et al.
* In America, peering between ISPs is great, but the last-mile connection is terrible
* In Europe, the last-mile connection is great, but peering between the ISPs is terrible (ISPs are at war with each other). Often you could massively improve performance by renting a VPS in the correct city and routing your traffic manually.
> I have heard that some primary schools in Europe lack internet.
Maybe they lack internet but teach their pupils how to write "its".
What aspects of Azure DevOps are hell to you?
Hampering productivity:
- Review messages get sent out before the review is actually finished. They should only be sent once the reviewer has finished the work.
- Code reviews are implemented in a terrible way compared to GitHub or GitLab.
- Re-requesting a review once you have implemented the proposed changes? Takes a single click on GitHub, but cannot be done in Azure DevOps. I need to e.g. send a Slack message to the reviewer, or remove and re-add them as a reviewer.
- Knowing which line of code a reviewer was giving feedback on? Not possible after the PR gets updated, because the reviewer's feedback sticks to the original line number, which might now contain something entirely different.
- Reviewing the commit messages in a PR takes way too many clicks (PR -> Commits -> Commit -> Details). This causes people not to review commit messages, letting bad commit messages pass and thus making it harder for future developers to figure out why something got implemented the way it did.
- Comments on a specific commit are not shown in that commit's PR
- Unreliable servers. E.g. "remote: TF401035: The object '<snip>' does not exist.\nfatal: the remote end hung up unexpectedly" happens too often on git fetch. Usually works on a 2nd try.
- Interprets IPv6 addresses in commit messages as emoji. E.g. fc00::6:100:0:0 becomes fc00::60:0.
- Can not cancel a stage before it actually has started (Wasting time, cycles)
- Terrible diffs (can not give a public example)
- Network issues. E.g. checkouts that should take a few seconds take 15+ minutes (can not give a public example)
- Step "checkout": changes the working folder for the following steps (shitty docs, shitty behaviour)
- The documentation reads as if its creators get paid by the number of words, not for being useful. GitHub, for example, has actually useful documentation.
- PRs always open on "Show everything" instead of "Active comments" (what I want); this resets on every reload.
- Tab width is hardcoded (?) to display as 4 chars, but we want 8 (Zephyr)
- Re-running a pipeline run (manually) does not retain the resources selected in the last run
Security:
- DevOps does not support modern SSH keys; one has to use RSA keys (https://developercommunity.visualstudio.com/t/support-non-rs...). It took them multiple years to even allow RSA signature algorithms that are not deprecated by OpenSSH due to security concerns (https://devblogs.microsoft.com/devops/ssh-rsa-deprecation/), yet there is still no support for modern algos. This also rules out the use of hardware tokens, e.g. YubiKeys.
Azure DevOps is dying, so things will not get better:
- New, useful features get implemented by Microsoft for GitHub, but not for DevOps. E.g. https://devblogs.microsoft.com/devops/static-web-app-pr-work...
- "Nearly everyone who works on AzDevOps today became a GitHub employee last year or was hired directly by GitHub since then." (Reddit, https://www.reddit.com/r/azuredevops/comments/nvyuvp/comment...)
- Looking at Azure DevOps Released Features (https://learn.microsoft.com/en-us/azure/devops/release-notes...) it is quite obvious how much things have slowed down since e.g. 2019.
Lastly - their support is ridiculously bad.
Even the hounds of hell may benefit from dogfooding.
Unrecognized 100x programmer somewhere lol
Much, much smaller, of course. A Raspberry Pi had died and I was trying to recover some projects that had not been pushed to GitHub for a while.
Holy crap. A few small JavaScript projects, with perhaps 20 or 30 code files and a few thousand lines of code, a couple of tens of KBs of actual code at most, had tens of gigabytes of data in the .git/ folder. Insane.
In the end I killed the recovery of the entire home dir and had to manually select folders to avoid accidentally trying to recover a .git/ dir, as it was taking forever on a poorly SD card that was already in a bad way, and I did not want to finally kill it for good by salvaging countless gigabytes of git trash.
- When multiple files in the repo have paths with the same trailing 16 characters, git may compute deltas against the wrong file, mixing those files up. Here, multiple CHANGELOG.md files got mixed up.
- So if those files are big and change often, you end up with massive deltas and inflated repo size.
- There's a new git option (in the Microsoft git fork for now) and a config setting to use the full file path when grouping files for those deltas, which fixes the issue when pushing and when locally repacking the repo:
```
git repack -adf --path-walk
git config --global pack.usePathWalk true
```
- According to a screenshot, Chromium repacked in this way shrinks from 100GB to 22GB.
- However, AFAIU, until GitHub enables it by default, clones from GitHub of such repos will still be inflated.
Also, thank you for the TLDR!
Fixing an existing repository requires a full repack, and for a repository as big as Chromium that still takes more than half a day (56,000 seconds is about 15.5 hours); even if that's an improvement over the previous 3 days, it's a lot of compute.
From my experience of previous attempts, getting GitHub to run a full repack with harsh settings is extremely difficult (possibly because their infrastructure relies on more loosely packed repositories); I tried to get that for $dayjob's primary repository, whose initial checkout had gotten pretty large, and got nowhere.
As of right now, said repository is ~9.5GB on disk on initial clone (full, not partial, excluding working copy). Locally running `repack -adf --window 250` brings it down to ~1.5GB, at the cost of a few hours of CPU.
The repository does have some of the attributes described in TFA, so I'm definitely looking forward to trying these changes out.
Wait, what? Has MS forked git?
These are not forks-going-their-own-way forks.
This surely cannot be correct. Even the title of the linked article doesn't use "shranked". What?