"Most developers access Piper through a system called Clients in the Cloud, or CitC, which consists of a cloud-based storage backend and a Linux-only FUSE13 file system. Developers see their workspaces as directories in the file system, including their changes overlaid on top of the full Piper repository. CitC supports code browsing and normal Unix tools with no need to clone or sync state locally. Developers can browse and edit files anywhere across the Piper repository, and only modified files are stored in their workspace. This structure means CitC workspaces typically consume only a small amount of storage (an average workspace has fewer than 10 files) while presenting a seamless view of the entire Piper codebase to the developer."
This is a very powerful model when dealing with large code bases, as it solves the issue of downloading all the code to each client. Kudos to Microsoft for open sourcing it, and under the MIT license no less.
ClearCase did what a lot of enterprise companies needed at the time and, most importantly, it created hooks that were mostly too difficult to remove. Once you create deep integration with ClearCase, you are very much committed to using it long term.
I ruminated about ccase/git elsewhere in this thread: https://news.ycombinator.com/item?id=13560108
Google's dev infra is pretty amazing and it's at least a decade ahead of anything else I've seen. Every single ex-Googler misses it quite a bit.
I dunno. I don't miss the 1-minute incremental builds. (Maybe they've improved since I left, though.)
BTW Forge is not just the test runner, but the thing that runs all build tasks, farmed out to all servers. Blaze interprets the build language and does dependency tree analysis but then hands off the tasks to Forge. Blaze has been (partially) open sourced: https://bazel.build/
This may be naive but why not recreate it as an open source project?
https://www.reddit.com/r/programming/comments/5rtlk0/git_vir...
One of the core differences between Windows and Linux is process creation. It's slower - relatively - on Windows. Since Git is largely implemented as many Bash scripts that run as separate processes, the performance is slower on Windows. We’re working with the git community to move more of these scripts to native cross-platform components written in C, like we did with interactive rebase. This will make Git faster for all systems, including a big boost to performance on Windows.
Sad. Rather than fix the root problem they rewrite the product in a less-agile language and require everyone to run opaque binaries.
They probably even think they're doing a good thing.
I still think we need something better than Git, though. It brought some very cool ideas and the inner workings are reasonably understandable, but the UI is atrociously complicated. And yes, dealing with large files is a very sore point.
I'd love to see a second attempt at a distributed version control system.
But I applaud MS's initiative. Git's got a lot of traction and mind share already, and they'd probably be heavily criticized if they tried to invent their own thing, even if it was open sourced. It will take a long time for them to overcome their embrace, extend and extinguish history.
Note that Google and Facebook ran into the same problems Microsoft did, and their solution was to use Mercurial and build similar systems on top of it. Microsoft could've done that too, but instead decided to improve Git, which deserves some commendation. I'd rather Git and hg both got better rather than one "taking over".
They didn't improve git, they only made this for themselves and for their product users. Git doesn't restrict you to a single operating system.
I've assumed Microsoft have been making all this stuff all along, but keeping it internal then throwing it away on the probably false assumption that every bit of it is some sort of competitive advantage. I think they're coming around to the idea that at least appearing constructive and helpful to the developer community will help with trying to hire good developers.
For example one of the goals is to always allow you to switch branches. Stash and stash pop would happen automatically and it would even work if you're in the middle of a merge.
[Reinventing the Git Interface][1] was written almost 3 years ago now and yet to my knowledge nobody's implemented anything quite like that yet.
Out of curiosity, why a whole new attempt? Personally, I'd prefer the approach of "making our current tools better."
Until 1997, forking a project was considered a tragedy. I think things have improved since then :-).
> I'd love to see a second attempt at a distributed version control system.
Git wasn't the first, and even then had several contemporaries at 2nd gen. The story of git is a good case-study for people interested in group dynamics.
It's like a whole'nother company after they got rid of Steve Ballmer.
Linus himself admitted that he isn't good at UI. Anyway, I think git just wasn't designed to be used directly, but via another UI. For example, I use it within Visual Studio Code, and that covers about 90 percent of use cases, and then Git Extensions can take care of almost everything else. Sometimes the CLI is needed, though.
I've still not yet seen a stand-alone GUI for Git that is better than the one that ships with Git Extensions, though.
I'll be watching this to see if Microsoft can break the logjam. By open sourcing the client and protocol, there is potential...
Other attempts:
* https://github.com/blog/1986-announcing-git-large-file-stora...
* https://confluence.atlassian.com/bitbucketserver/git-large-f...
Article on GitHub’s implementation and issues (2015): https://medium.com/@megastep/github-s-large-file-storage-is-...
It is open source, licensed under GPLv3. [not proprietary]
Written in Haskell. [cool aid]
Currently has 1200+ stars on GitHub and has been part of at least Ubuntu (http://packages.ubuntu.com/search?keywords=git-annex) since 12.04. [shows something for support and adoption]
edit: Link to GitHub https://github.com/joeyh/git-annex -- thanks dgellow
http://git-annex.branchable.com/forum/Storing_git_repos_in_g...
Pros of git-annex:
- it is conceptually very simple: it uses symlinks, instead of ad-hoc pointer files, a virtual file system, etc., as the pointers to the actual blob files;
- you can add support for any backend storage you want. As long as it supports basic CRUD operations, git-annex can have it as a remote;
- you can quickly clone a huge repo by cloning just the metadata of the repo (--no-content in git-annex) and downloading the necessary files on demand;
And many other things that no other attempt even considers having, like client-side encryption, location tracking, etc.
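For anyone who hasn't seen the symlink trick, a rough sketch of the idea in Python follows. This is not git-annex's real on-disk layout or key format: the `.store` directory, the bare SHA-256 key, and the function names `add`, `is_present` and `drop` are all assumptions made up for the example. It only shows why a symlink is enough of a pointer: the link survives in the tree even when the content it points to is not present locally.

```python
import hashlib
import os
import tempfile
from pathlib import Path

STORE = ".store"  # hypothetical directory name, not git-annex's actual layout


def add(repo: Path, name: str) -> str:
    """Move a file's content into the store and leave a symlink in its place."""
    path = repo / name
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    store = repo / STORE
    store.mkdir(exist_ok=True)
    blob = store / key
    if blob.exists():
        path.unlink()        # identical content is already stored
    else:
        path.rename(blob)
    # Relative target, so the link keeps working if the repo directory moves.
    path.symlink_to(os.path.relpath(blob, path.parent))
    return key


def is_present(repo: Path, name: str) -> bool:
    """True if the content behind the pointer is available locally."""
    return (repo / name).exists()   # exists() follows the symlink


def drop(repo: Path, name: str) -> None:
    """Delete the local content but keep the pointer in the tree."""
    path = repo / name
    (path.parent / os.readlink(path)).unlink()


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "video.bin").write_bytes(b"\x00" * 4096)
    add(repo, "video.bin")
    print(is_present(repo, "video.bin"))   # content is still local
    drop(repo, "video.bin")
    print(is_present(repo, "video.bin"))   # pointer remains, content is gone
```

A metadata-only clone is then just a tree full of such links whose targets have not been fetched yet.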
The other half is that almost all of the binary formats can't be merged and so you need a mechanism to lock them to prevent people from wiping out other people's changes. Unfortunately that runs pretty much counter to the idea of a DVCS.
Don't their artists and designers use version control too? Maybe they just have one such person per team, or each person owns one file, or something like that. Hard to say.
Maybe it's like how I used to work on teams that never used branches - you have various problems that you figure there's probably a solution for, but there's never time to (a) figure out what the solution looks like, (b) shift the whole team over to a brand new workflow and set of tools, and (c) clean up the inevitable mess. So you just work around the problems the same way you always have - because at least that's a known quantity.
This way you can look at both and resolve the conflict.
I remember that years ago Facebook said it had this problem. A lot of the comments centered on the idea that you could change your codebase to fit what git can do. I'm glad there's another option now.
https://code.facebook.com/posts/218678814984400/scaling-merc...
As well, the Mercurial team does quarterly sprints (I believe), and Google is hosting the next one[1].
At Splunk we had the same problem: our source code was stored in a centralized VCS (Perforce), but we wanted to switch to git. And not only because we really wanted to use git, but to simplify our development process, mainly because of the much easier branching model (lightweight branching is also available in Perforce, but to get it we would still have needed to do some upgrades on our servers). We also had the problem that at the beginning we had a very large working tree. I don't think it was 200-300GB, I believe it was 10x less, but it still required 4-5 seconds for git status. This was not acceptable for us, so we reworked our source code and release builds to split them into several git repos, making sure that git status would take no more than a fraction of a second.
My point is: use the right tools for the right jobs. 4-5 seconds for git status is still a huge problem; I would prefer to use a centralized VCS instead if that means I don't have to wait 5 seconds for each git status invocation.
How many of them have you used? I've used a couple, to interact with large code bases on the rough order of 300GB. In my experience they don't work very well, because you have to be hygienic about the commands you run or some part of your Git state gets out of sync with some part of your state for the other source control system. So I gave up on those, and I use something similar to Microsoft's solution at work on a daily basis. It's a real pleasure by comparison, and in spite of that I still call myself a Git fan (about 10 years of heavy Git use now). At work the code base is monolithic and everyone commits directly to trunk (at a ridiculous rate, too).
I've heard horror stories about back when people had to do partial checkouts of the source code, and I'm glad that the tooling I use is better.
The idea of breaking up a repository merely because it is too large reminds me of the story of the drunkard looking for his keys under the streetlights. The right tools for the right job: sometimes you change the job to match the tools, and sometimes you change the tools to match the job.
It sounds like they answered that:
> In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.
Source will still be distributed among the developers that touch it. Seems like a decent compromise.
> just to give cool kids access to cool tools
Yes. DVCS with huge code bases, large binary objects and large teams is hardly the optimal approach. But the "cool kids" are just used to using what they use. And now they can pretend to do it even when they have to be always connected, because the files are virtual and remain on the server until really used.
If Microsoft is giving the solution to the "cool kids," no reason to complain about the fact that Microsoft is willing to care for them.
And if you'd ask the "cool kids" why they need git at all for such scenarios, have fun with the amount of arguments you'll get. Why this one "needs" vi and another "Emacs" etc. The same reasons. You'll find the arguments also in the comments here. Including mentions of Mercurial, the competition, just like "vi or Emacs". Because. Don't ask.
And no, as far as I understand, Google doesn't primarily "use Mercurial", they use something called Piper, and before they used a customized Perforce just like Microsoft did.
https://www.wired.com/2015/09/google-2-billion-lines-codeand...
"Piper spans about 85 terabytes of data" "and Google’s 25,000 engineers make about 45,000 commits (changes) to the repository each day. That’s some serious activity. While the Linux open source operating spans 15 million lines of code across 40,000 software files, Google engineers modify 15 million lines of code across 250,000 files each week."
Sure, GVFS downloads files only when first read; but maybe it keeps them cached? Maybe you can still work on them and commit changes after you get offline? At least in principle, nothing prevents that.
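To show what "fetch on first read, then keep it cached" would look like in principle, here is a small Python sketch. It says nothing about what GVFS actually does; `LazyFileCache`, its `online` flag and the `fetch` callable (standing in for a network request) are all hypothetical names for illustration.

```python
class LazyFileCache:
    """Fetch a file the first time it is read, then serve it from a local cache.

    Hypothetical illustration; not GVFS's actual behavior.
    """

    def __init__(self, fetch):
        self._fetch = fetch    # callable(path) -> bytes, stands in for the server
        self._cache = {}       # path -> contents already downloaded
        self.online = True

    def read(self, path):
        if path not in self._cache:
            if not self.online:
                raise ConnectionError(f"{path} was never fetched and we are offline")
            self._cache[path] = self._fetch(path)
        return self._cache[path]


requests = []

def fake_server(path):
    # Pretend network call; records every request so we can count them.
    requests.append(path)
    return f"contents of {path}".encode()

files = LazyFileCache(fake_server)
files.read("kernel/sched.c")    # first read goes to the server
files.online = False
files.read("kernel/sched.c")    # second read works offline, from the cache
print(requests)
```

Under that model you can keep working offline on anything you have already touched; only files you never opened are out of reach.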
That being said, you can see more and more people getting off the "Microsoft is evil" train. It's super slow and every bone headed thing that Microsoft does resets the needle for lots of people.
I've always been surprised how much sympathy a company like IBM or Intel gets on HN. They both sue people over patents. They both contribute to non-free software. They were early backers of Linux, though, and that is what people care about superficially.
So, I'm very, very, very sorry that I can't hear their words over the noise of their actions; and in the light of this, I eye each new gift-bearing Redmondian with suspicion.
I don't know if "a lot" is the right qualifier. Solitary repos of millions of files have scalability problems even outside the source control system (I mean: how long does it take your workstation to build that 3.5 million-file windows tree?)
A full Android system tree is roughly the same size and works fine with git via a small layer of indirection (the repo tool) to pull from multiple repositories. A complete linux distro is much larger still, and likewise didn't need to rework its tooling beyond putting a small layer of indirection between the upstream repository and the build system.
Honestly I'd say this GVFS gadget (which I'll fully admit is pretty cool) exists because Microsoft misapplied their source control regime.
All they did is create a caching layer.
How many people have that problem, really?
An easy lower bound is 10s of thousands of engineers: developers at several large tech companies (e.g. MS, Facebook, Google, ?). If you deal with graphics, audio assets, etc., the binary-blob type of data, the case is central.
Lacking support for large binary blobs is, like, THE #1 reason that an engineer might have to use an alternative.
All you need is several hundred engineers and your monorepo becomes unwieldy for git to handle.
The biggest PITA with clearcase was keeping their lousy MVFS kernel module in sync with ever-advancing linux distros.
I really liked Clearcase in 1999, it was an incredible advancement over other offerings then. MVFS was like "yeah! this is how I'd design a sweet revision control system. Transparent revision access according to a ranked set of rules, read-only files until checked out." But with global collaborators, multi-site was too complex IMO. And overall, clearcase was so different from other revision control systems that training people on it was a headache. Performance for dynamic views would suffer for elements whose vtrees took a lot of branches. Derived objects no longer made sense -- just too slow. Local disk was cheap now, it got bigger much faster than object files.
> However, we also have a handful of teams with repos of unusual size! ... You can see that in action when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.
This seems like a way-out-there use case, but it's good to know that there's other solutions. I'd be tempted to partition the codebase by decades or something.
Clearcase also suffered, at least in my experience, from a clumsy and ugly merging process and deeply unintuitive command set which meant everyone who "used clearcase" actually tended to use some terrible homegrown wrapper scripts.
Still, considering it was the last remaining vestige of the Apollo Domain OS, not bad.
X Windows
NeWS - https://en.m.wikipedia.org/wiki/NeWS
Remember the mess on usenet?
comp.windows.x.motif
comp.windows.news - not news about Microsoft Windows
I can imagine people at a forum:
"Hey, GVFS isn't working for me. It crashes with error -504 when I try to mount /nfs/company_data".
Try guessing which GVFS that is.
Our internal libraries need to be compatible with the Core Runtime, so we have to have them target .NET Standard, which is compatible w/ the full .NET Framework or .NET Core. To target .NET Standard, you need the .NET Core SDK/CLI which includes the `dotnet` tool, which is almost never clarified as "the SDK/CLI" in documentation or in talks, but usually just ".NET Core".
Another minor annoyance: to build a .NET Standard-compatible library, you reference the "NETStandard.Library" NuGet package. Makes a fair amount of sense, but is hard to talk about.
If you're running on Windows and want a smaller server footprint, you can use Windows Server Nano, which requires your apps to target .NET Core Runtime (not .NET Full Framework). Note that this requirement is not true for Windows Server Core. -_-
I later found out I could have looked for "ActiveX" and found similar results.
They have a product in Azure named simply DocumentDB. I don't think "used to" is necessarily the best tense here (:
What could you possibly mean by that? The .com TLD was introduced in 1985, with microsoft.com registered already in 1991. Microsoft COM was created in 1993. (Of course, "the Internet" in any sense of the word predates all of this.)
They just came to the conclusion that GNOME's product is no threat and that they can just claim the name. Smaller companies [1] tried that before.
[1]: https://www.groupon.com/blog/cities/groupon-launches-gnome
I mean, git itself did this in the beginning.
Microsoft under Nadella has made me not hate Microsoft again, and that's a tall order because I'm over 40. This is an impressive move, and if they effectively execute all the bits that are possible here, this is just some great work.
(Oh, and I can't even use the word nix now as a catch all for all the POSIX/ POSIX(like) OSs because of nixOS.)
I think going forward, we just have to accept name collision.
GNOME Virtual Filesystem is first search result for "gvfs". Even if you use bing!
In this case, they're both called GVFS AND three of the letters have the same meaning, and they both do relatively similar things.
Even the tooling and the output of `mount` are bound to be incredibly confusing.
I seem to recall that Microsoft has previously used a custom Perforce "fork" for their larger code bases (Windows, Server, Office, etc.).
A custom filesystem is indeed the correct approach, and one that git itself should have probably supported long ago. In fact, there should really only be one "repo" per machine, name-spaced branches, and multiple mountpoints a la `git worktree`. In other words there should be a system daemon managing a single global object store.
I wonder/hope IPFS can benefit from this implementation on Windows, where FUSE isn't an option.
Microsoft's fork contains 67,522 commits. The official Git repo contains 45,810. It appears the bulk of the work started in 2010, with significant ramp up of development in 2015.
https://gitsense.com/mgit-vs-git/history.png
Looks like Microsoft only really introduced about 100 more new files.
https://gitsense.com/mgit-vs-git/files.png
Microsoft's repo contains 1712 contributors. Git's repo contains 1685 contributors. So it looks like 20-30 employees worked on Microsoft's fork.
https://gitsense.com/mgit-vs-git/mgit-contributors.png https://gitsense.com/mgit-vs-git/git-contributors.png
Basically most operations in git are O(modified files); however, there are a few that are O(working tree size). For example, checkout and status were mentioned by the article. However, these operations can be made O(modified files) if git doesn't have to scan the working tree for changes.
So pretty much I would be all over this if:
- It worked locally.
- It worked on Linux.
Maybe I'll see how it's implemented and see if I could add the features required. I'm really excited for the future of this project.
Also, it would have been interesting if the article mentioned whether they tried other approaches taken by facebook (mercurial afaik) or google.
Sounds like they've almost solved the secrets of the fire swamp!
For one, it's not really distributed if you're only downloading when you need that specific file.
But that doesn't change the merits of this at all, I think.
Our whole codebase is 800MB.
Otherwise, I hope you replaced your sysadmin.
This solves the next scaling problem of avoiding managing the whole working tree. (without requiring narrow clones which have significant downsides)
I'd assume this GVFS would work hand in hand with Git LFS for the use case of large files.
How on Earth can anybody work like that?
I'd have thought you may as well ditch git at that point, since nobody's going to be using it as a tool, surely?
git commit -m 'Add today'\''s work - night all!' && git push; shutdown
Since it looks like they are still migrating, I don't think a lot of people actually did work like that. Maybe just a couple of times to figure out how long it would actually take. Or maybe those who really use it are actually doing shallow clones, which would probably take much less time. Actually, shallow clone is nice but doesn't seem to be very well known. I use it often if I know I won't ever need the full history anyway. Also great for shaving time off CI builds.
I think when the powers that be said that whole thing about geniuses and clutter, they were specifically talking about their living spaces and not their work...
It was slow to do 'git status' and other common commands. Restarting the RoR app was also slow. I put the repo on a RAM disk, which made the whole experience at least a few times faster.
Since it was all in a VM that I rarely restarted, I didn't have to recreate the files on the RAM disk all that often. I was syncing changes to the persistent disk with rsync running periodically.
Okay, so this is a networking issue. Or is it a stick everything in the same branch issue?
Whatever the reason here the issue is pure size vs. network pipe, pure and simple. Hum, when can I get a laptop with a 10GBaseT interface?
One of the issues with the way they are doing this (only grab files when needed) is you cannot really work offline anymore.
I could definitely be wrong, but this sounds a lot like monolith vs microservices to me.
Interestingly, however, most of their "open source" efforts (.NET, C#, and related) are all on GitHub rather than their own hosted offerings: CodePlex (which is basically dead) or "Visual Studio Team Services".
Disclosure: I'm a PM on VSTS/TFS, and I own part of version control.
Microsoft is moving to Git and we use Team Services / TFS as our Git server for all private repositories. GitHub is only used for OSS since that's where the OSS community is.
The whole repo is needed for every developer, i.e. it's not possible to do a sparse checkout, but there are many gigs of old versions of small binaries that I would prefer to keep only on the server until I need them (which is never).
https://github.com/Microsoft/gvfs
"GVFS requires Windows 10 Anniversary Update or later."
I haven't touched Windows in quite a while, so I can't really make a claim either way.
In any case, when C++/WinRT gets feature parity, I imagine it will eventually be deprecated, depending which one gets more developer love.
Here is another virtual filesystem with the exact same name: https://wiki.gnome.org/Projects/gvfs
Debian package for it: https://packages.debian.org/jessie/gvfs
Using a vfs allows you to track which files have changed so that these operations no longer need to scan. Now they are O(changed files) which is generally small.
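Here's a toy Python sketch of that difference. It is not git's or GVFS's implementation; `TrackedTree` and its methods are hypothetical, and dicts stand in for the working tree and the base revision. One status method compares every file against the base, the other only looks at paths that the (pretend) file system layer saw being written.

```python
class TrackedTree:
    """Working tree that records writes, so status need not scan every file.

    Hypothetical illustration; not git's or GVFS's actual code.
    """

    def __init__(self, committed):
        self._committed = dict(committed)   # path -> contents at the base revision
        self._working = dict(committed)     # path -> current contents
        self._dirty = set()                 # paths written since the base revision

    def write(self, path, contents):
        # A vfs sees every write, so it can note the path as it happens.
        self._working[path] = contents
        self._dirty.add(path)

    def status_by_scan(self):
        """Compare every file against the base: O(working tree size)."""
        paths = self._working.keys() | self._committed.keys()
        return sorted(p for p in paths
                      if self._working.get(p) != self._committed.get(p))

    def status_by_tracking(self):
        """Check only paths that were written: O(changed files)."""
        return sorted(p for p in self._dirty
                      if self._working.get(p) != self._committed.get(p))


tree = TrackedTree({f"src/file{i}.c": "original" for i in range(100_000)})
tree.write("src/file42.c", "patched")
tree.write("src/new.c", "brand new")
print(tree.status_by_scan() == tree.status_by_tracking())
print(tree.status_by_tracking())
```

Both give the same answer, but the tracked version touches two entries instead of a hundred thousand.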
Now IPFS has a vfs, but it is just a simple read/write interface. This vfs needs slightly more logic to do things like change the base revision and track changes.
For tracking changes (i.e. mutable data) you can use IPNS and create a signed commit history. This will be built on IPFS eventually so it's only a matter of time.
It wasn't an option a couple years ago, but submodules work fine now. With a little bit of scripting to wrap common uses, they're practically pain-free.
The problem with these companies is that developers aren't making technical decisions, it's executives who know nothing about computer science. That's why Windows 10 is such a mess with spyware and adware.
Now they have some FOSS advocate who doesn't really know anything about software or VCS but saw that an internal problem they were trying to solve was making their code base work with git. So he decided it would be really cool for Microsoft's image to develop an open source extension of git, instead of actually solving the underlying problems (because he didn't recognize them). Now he's probably got a promotion at Microsoft for "fixing" their problem with git.
So are the DVCS converging to Git and Git only?
Our shop uses Mercurial because of its Python basis, and the amount of time and effort it takes to master Git makes me draw strong and uncomfortable parallels to emacs.
By "the world" you mean, the HN/SV/startup crowd of cool-kids who feel the need to use whatever is popular without regard for how appropriate it is?