1. The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression. The next deploy took our service down. Monorepos mean loss of isolation of dependencies between services, which is absolutely necessary for the stability of mission-critical business services.
2. It encourages poor API contracts because it lets anyone import any code in any service arbitrarily. Shared functionality should be exposed as a standalone library with a clear, well-defined interface boundary. There are entire packaging ecosystems like npmjs and pypi for exactly this purpose.
3. It encourages a ton of code churn with very low signal. I see at least one PR every week to code owned by my team that changes some trivial configuration, library call, or build directive, simply because some shared config or code changed in another part of the repo and now the entire repo needs to be migrated in lockstep for things to compile.
I've read this paper, as well as watched the talk on this topic, and am absolutely stunned that these problems are not magnified by 100x at Google scale. Perhaps it's simply organizational inertia that prevents them from trying a more reasonable solution.
1) This is solved by 2 interlocking concepts: comprehensive tests & pre-submit checks of those tests. Upgrading a version shouldn’t break anything because any breaking changes should be dealt with in the same change as the version bump.
2) Google’s monorepo allows for visibility restrictions and publicly-visible build targets are not common & reserved for truly public interfaces & packages.
3) “Code churn” is a very uncharitable description of day-to-day maintenance of an active codebase.
Google has invested heavily in infrastructural systems to facilitate the maintenance and execution of tests & code at scale. Monorepos are an organizational design choice which may not work for other teams. It does work at Google.
Does this mean that some things will never get updated, as the effort required is impossibly high?
> 3) “Code churn” is a very uncharitable description of day-to-day maintenance of an active codebase.
Also implicit in the discussion is the fact that Google and other big tech companies conduct performance reviews based on "impact" rather than arbitrary metrics like "number of PRs/LOCs per month". This provides a check on spending too much engineer time on maintenance PRs, since they have no (or very little) impact on your performance rating.
> The single version dependencies are asinine. We are migrating to
> a monorepo at work, and someone bumped the version of an open
> source JS package that introduced a regression.
There's no requirement to have single versions of dependencies in a monorepo. Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
> It encourages poor API contracts because it lets anyone import any
> code in any service arbitrarily.
Not true at Google, and I would argue that if you have a repository that allows arbitrary cross-module dependencies then it's not really a monorepo. It's just an extremely large single-project repo with poor structure. The defining feature of a monorepo is that it contains multiple unrelated projects. At Google, this principle was so important that Blaze/Bazel has built-in support for controlling cross-package dependencies.
> I see at least one PR every week [...] because some shared config
> or code changed in another part of the repo and now the entire repo
> needs to be migrated in lockstep for things to compile.
That really doesn't sound like a monorepo to me. If all the code has to be migrated "in lockstep", then that implies a single PR might change code across different parts of the company. At which point it's not independent projects in a monorepo, it's (merely) a single giant project.
[0] Or allowed -- I last worked there in 2017.
I second your point about monorepo versus ball of mud. They are so different. And managing all of this is about social/culture, less science-y. If you don't have good culture around maintenance, well then, yeah, duh, it will fall apart pretty quickly. It sounds like Google spends crazy money to develop tools to enforce the culture. Hats off.
This prevents situations where "Gmail" ends up bundling 4 different, mildly incompatible versions of MySQL or whatever, and the aggravation that would cause. Or worse, in c++ you get ODR violations due to a function being used from two versions of the same library.
You can see this some with how obnoxious Guava was, back in the day. It seems a sane strategy where you can deprecate things quickly by getting all callers to migrate. This is fantastic for the cases where it works. But, it is mind numbingly frustrating in the cases where it doesn't. Worse, it is the kind of work that burns out employees and causes them to not care about the product you are trying to make. "What did you do last month?" "I managed to roll out an upgrade that had no bearing on what we do."
The third party documentation is public, one-version policies exist but they are exemptions.
Sure, but this is unsustainable. If service Foo depends on myjslib v3.0.0, but service Bar needs to pull in myjslib v3.1.0, in order to make sure Foo is entirely unchanged, you'd have to add a new dependency @myjslib_v3_1_0 used only by Bar. After two years you'd have 10 unique dependencies for 10 versions of myjslib in the monorepo.
At this point you've basically replicated the dependency semantics of a multi-repo world to a monorepo, with extra cruft. This problem is already implicitly solved in a multi-repo world because each service simply declares its own dependencies.
When you're pinning on old versions of software it quickly turns into a depsolving mess.
Software developers have difficulty figuring out which version of code is actually being deployed and used.
When dealing with major version bumps and semver pins around different repositories that creates a massive amount of make-work and configuration churn, and creates entire FTE roles practically dedicated to that job (or else grinds away at the time available for devs to do actual work and not just bump pins and deal with depsolving).
In any successful team which is using many dozens of repos, there's probably one dev running around like fucking nuts making sure everything is up to date and in sync who is keeping the whole thing going. If they leave because they're not getting career advancement then the pain is going to get surfaced.
The ability to pin also creates and encourages tech debt and encourages stale library code with security vulnerabilities. All that pinning flexibility is engineering to make tech debt really easy to start generating and to push all that maintenance into the future.
How would multi-repo change this? A dependency updated, and code broke, and the new version was broken—but you update dependencies in multi-repo anyway, and deployments can be broken anyway. I don’t see how multi-repo mitigates this.
> It encourages poor API contracts because it lets anyone import any code in any service arbitrarily.
This has nothing at all to do with monorepos. Google’s own software is built with a tool called Bazel, and Meta has something similar called Buck. These tools let you build the same kind of fine-grained boundaries that you would expect from packaged libraries. In fact, I’d say that the boundaries and API contracts are better when you use tools like Bazel or Buck—instead of just being stuck with something like a private/public distinction, you basically have the freedom to define ACLs on your packages. This is often way too much power for common use cases but it is nice to have it around when you need it, and it’s very easy to work with.
A common way to use this—suppose you have a service. The service code is private, you can’t depend on it. The client library is public, you can import it. The client library may have some internal code which has an ACL so it can only be imported from the client library front-end.
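To make the service/client layout above concrete, here is a minimal Python sketch of an ACL-style visibility check. It is only a toy model of the idea, not how Bazel or Buck implement it: the `Package` class, the `"public"` marker, and all package names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """A hypothetical package with dependencies and an allow-list of importers."""
    name: str
    deps: list[str] = field(default_factory=list)
    # Who may depend on this package: "public", or explicit package names.
    # An empty set means the package is private.
    visibility: set[str] = field(default_factory=set)


def visibility_violations(packages: dict[str, Package]) -> list[tuple[str, str]]:
    """Return (importer, dependency) pairs that the dependency's ACL does not allow."""
    violations = []
    for pkg in packages.values():
        for dep_name in pkg.deps:
            dep = packages[dep_name]
            allowed = "public" in dep.visibility or pkg.name in dep.visibility
            if not allowed:
                violations.append((pkg.name, dep_name))
    return violations


repo = {
    # Service code is private: nobody may depend on it.
    "service": Package("service"),
    # Internal client code may only be imported by the client front-end.
    "client_internal": Package("client_internal", visibility={"client"}),
    # The client library itself is public.
    "client": Package("client", deps=["client_internal"], visibility={"public"}),
    "good_app": Package("good_app", deps=["client"]),
    "bad_app": Package("bad_app", deps=["service", "client_internal"]),
}

for importer, dep in visibility_violations(repo):
    print(f"{importer} may not depend on {dep}")
```

In this toy repo only `bad_app` is flagged, once for reaching into the private service and once for importing the client's internals directly.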
Here’s how we updated services—first add new functionality to the service. Then make the corresponding changes to the client. Finally, push any changes downstream. The service may have to work with multiple versions of the client library at any time, so you have to test with old client libraries. But we also have a “build horizon”—binaries older than some threshold, like 90 days or 180 days or something, are not permitted in production. Because of the build horizon, we know that we only have to support versions of the client library made within the last 90 or 180 days or whatever.
This is for services with “thick clients”—you could cut out the client library and just make RPCs directly, if that was appropriate for your service.
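The build horizon described above boils down to a date cutoff. A rough Python sketch, assuming a hypothetical mapping from client build names to build dates (the names, dates, and the `supported_client_builds` helper are made up for illustration):

```python
from datetime import date, timedelta


def supported_client_builds(
    builds: dict[str, date], today: date, horizon_days: int = 90
) -> list[str]:
    """Return the builds still inside the horizon, i.e. the ones the service must support."""
    cutoff = today - timedelta(days=horizon_days)
    return sorted(name for name, built in builds.items() if built >= cutoff)


# Hypothetical client library builds and the day each was produced.
client_builds = {
    "client-a": date(2023, 1, 5),
    "client-b": date(2023, 3, 1),
    "client-c": date(2023, 4, 20),
}

print(supported_client_builds(client_builds, today=date(2023, 5, 1)))
```

Anything built before the cutoff is not allowed in production, so the service only needs to be tested against the builds this returns.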
> It encourages a ton of code churn with very low signal.
The places I worked at that had monorepos, you might filter out the automated code changes there to do automated migrations to new APIs. One PR per week sounds pretty manageable, when spread across a team.
Then again, I’ve also worked at places where I had a high meeting load, and barely enough time to get my work done, so maybe one PR per week is burdensome if you are scheduled to death in meetings.
In a multi-repo world, I control the repo for my own service. For a business-critical service in maintenance mode (with no active feature development), there's no reason for me to upgrade the dependencies. Code changes are the #1 cause of incidents; why fix something that isn't broken?
We would have avoided this problem had we not migrated to the monorepo simply because, well, we would have never pulled in the dependency upgrade in the first place.
> In fact, I’d say that the boundaries and API contracts are better when you use tools like Bazel or Buck
I'm familiar with both of these tools, and I agree with this point. However, you are making an implicit assumption that 1. the monorepo in question is built with a tool like Bazel that can enforce code visibility, and 2. that there exists a team or group of volunteers to maintain such a build system across the entire repo. I suspect both of these are not true for the vast majority of codebases outside of FAANG.
> The places I worked at that had monorepos, you might filter out the automated code changes there to do automated migrations to new APIs
Sure, this solves a logistical problem, but not the underlying technical problem of low-signal PRs. I would argue that doing this is an antipattern because it desensitizes service owners from reviewing PRs.
If your organisation can’t work effectively within a monorepo then you should absolutely address the problem, either by fixing the problematic behaviour or by switching away from a monorepo. The problem isn’t monorepos, the problem is monorepos in your organisation.
IMO, it's more of a development paradigm than a mere technology. You cannot simply use a monorepo in isolation, since its trade-offs are strongly coupled with many other tools and workflows. For this reason, I usually don't recommend migrating to a monorepo unless there's strong organizational-level support.
Is it the convention for monorepos to all share the same dependencies? Does monorepo imply monolith? Surely one could have dependencies per "service", for example a python app with its own pipfile per directory.
Perhaps that might be the default case, but the build system has a visibility system[1] that means that you can carefully control who depends on what parts of your code.
Separately, while some might build against your code directly, a lot of code just gets built into services, and then folk write their code against your published API, i.e. your protobuf specification.
My point is that in reality, we use what best matches our knowledge, experience and perception and prioritisation of the problems. I, for one, believe that a monorepo is dangerous for small teams because it encourages coupling - not only do I believe it, but I saw it with my own eyes. It also creates unnecessary dependency chains. Monorepos contribute to a fallacy that every dependent on an object must be immediately updated or tech debt happens. But that's not even remotely a given.
In any case, companies like Google and Amazon have more than enough resources to deal systematically with the problems of a monorepo. I'm sure they have entire teams whose job it is to fix problems in the VCS. But for small teams I remain unconvinced that it is a good idea. We shouldn't even be trying to do the things the big guys do, unless we want to spend all our time working on the tools instead of our businesses.
One problem is that a lot of developers at big companies code business logic into their modules/dependencies... So whenever the business domain requirements change, they need to update many dependencies... Sometimes they depend on each other and so it's like a tangled web of dependencies which need to be constantly updated whenever requirements change.
Instead of trying to design modules properly to avoid everything becoming a giant tangled web, they prefer to just facilitate it with a monorepo which makes it easier to create and work with the mess (until the point when nobody can make sense of it anymore)... But for sure, this approach introduces vulnerabilities into the system. I don't know how most of the internet still functions.
You're doing it wrong.
The point of monorepo is that if someone breaks something, it breaks right away, at build time, not at deployment time.
You're not really using a monorepo.
For instance, in multi-repo environments I've often seen this pattern: own some code, bump an internal dependency to a new version, see it break, ask the person maintaining it what's up, realize this case wasn't taken into account, and go back and forth a few times before finding an agreement.
On the other hand, in mono-repo environments it's usually more difficult to introduce wide changes, as you face all the consequences immediately, but the difficulty is mainly a technical/engineering one rather than a social one, and the outcome is better than the series of compromises made left and right after a big multi-repo change.
Compare that with hundreds of tiny repos, each with their own little dependency system. Testing a version bump across the board before mainlining it is much more involved and you are more likely to hit stuff in production which should have been caught in test.
The other two points sound more like cultural issues, which may touch on branch strategies, code review, and what's expected of a developer. Those mostly cultural issues that overlap with technical ones are hard in a way that repository strategy isn't.
I don't believe this is true, except in the short term. Unless the writing party is guaranteeing you forward compatibility, your consuming code will break when you update.
This is (almost) the only reason API contracts are worth having; the reason doesn't go away just because you can technically see all the code.
1, 2 and 3: Use separate dependencies for each package, so this doesn't happen. Use e.g. GitHub Actions or another CI/CD file filtering wisely: if a file is needed by two packages, tests for both packages need to run whenever it's changed, before merging, in addition to the usual end-to-end tests. Set up alerting for vulnerable dependencies and make sure to upgrade them everywhere they occur.
2: Also have some guidelines on that and enforce it either automatically or manually in PRs.
1. Have some concept of visibility restriction e.g. Go language has internal package.
2. Ensure that every single package has a command to build the code.
3. Ensure that CI builds all the packages that changed or are impacted by the change in a given pull request.
These three steps are mostly sufficient in having a monorepo. What you get in return is high code consistency and code visibility for the whole team.
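Step 3 above is the part that usually needs a bit of tooling. As a rough sketch, assuming you can extract a package-level dependency map from your build files, the set of packages CI has to build is the changed packages plus everything that transitively depends on them. The `affected_packages` helper and the package names below are hypothetical.

```python
from collections import defaultdict, deque


def affected_packages(deps: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return the changed packages plus every package that transitively depends on them."""
    # Invert the graph: for each package, who depends on it directly?
    reverse = defaultdict(set)
    for pkg, pkg_deps in deps.items():
        for dep in pkg_deps:
            reverse[dep].add(pkg)

    # Breadth-first walk over reverse dependencies.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical repo layout: package -> direct dependencies.
deps = {
    "lib_core": set(),
    "lib_auth": {"lib_core"},
    "svc_api": {"lib_auth"},
    "svc_batch": {"lib_core"},
    "svc_docs": set(),
}

print(sorted(affected_packages(deps, {"lib_auth"})))
```

A change to `lib_auth` only rebuilds `lib_auth` and `svc_api`, while a change to `lib_core` rebuilds everything except `svc_docs`.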
2) Private/public/internal modifiers
3) Independent builds/project in a monorepo
A good article to reference when this topic gets raised: http://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-yo...
In the limit, there are only two options:
1. All code lives one repo
2. Every function/class/entity lives in its own repo
with a third state in between:
3. You accept code duplication
This compromise state where some code duplication is (maybe implicitly) acceptable is what most people have in mind with a poly-repo.
The problem though is that (3) is not a stable equilibrium. Most engineers have such a kneejerk reaction against code duplication that (3) is practically untenable. Even if your engineers are more reasonable, (3) style compromise means they constantly have to decide "should this code from package A be duplicated in package B, or split off into a new smaller package C, which A and B depend on". People will never agree on the right answer, which generates discussion and wastes engineering time. In my experience, the trend is almost never to combine repos, but always to generate more and more repos.
The limiting case of a mono repo (which is basically its natural state) is far more palatable than the limiting case of poly-repo.
I really don't see how that would work for most companies in practice. Most of the off the shelf tooling used by companies with hundreds or thousands of developers assumes working with polyrepos. It's good we're seeing simpler alternatives to Bazel but that's just one piece of the puzzle.
all of the stuff that you can’t do easily yet (vfs for repo, remote builds) just isn’t relevant enough at this scale.
I guess with a 1K engineering company you can afford a substantial build team.
Honestly their systems are almost identical. Amazon just creates a monotonically increasing watermark outside the “repo”. Google uses “the repo” to create the monotonically increasing watermark.
Otherwise, Google calls it “merge into g3” Amazon calls it “merge into live”.
Amazon has the extra vocabulary of VersionSets/Packages/Build files. Google has all the same concepts, but just calls them Dependencies/Folders/Build files.
Amazon’s workflows are “git-like”, Google is migrating to “git-like” workflows (but has a lot of unnecessary vocabulary around getting there - Piper/Fig/Workspace/etc).
I really can’t tell if the specific difference between “mono-repo” or “multi-repo” makes much practical difference to the devs working on either system.
“Merging to live” builds and tests all packages that depend on the update.
So for example, building the new JDK to live will build and test all Java packages in previous live, all of them need to pass their package’s tests, only then will the JDK update be “committed into live”.
The only difference is that Google runs all the presubmits / “dry run to live checks” in the CL workflow. Amazon runs them post CL in the “merge VersionSet” workflow.
Every week our pipeline would get stuck and some poor college grad would spend a few days poking around at Brazil trying to get it to build. It usually took 3 commits to find a working pattern. The easy path was always to pin all the indirect dependencies you relied on, but that was brittle and it’d inevitably break until another engineer wiped the whole list of pins out and discovered it built. Then the cycle repeats. I worked on very old services that had years of history. I’ve often discovered that packages had listed dependencies that went unused, but no one spent time pruning them, even when they were the broken dependency.
At Google, I have no memory of ever tinkering with dependency issues outside of library visibility changes.
Amazon pipelines and versionsets and all that are impressive engineering feats, but I think a version-set was a solution to a problem of their own creation.
It’s then “merged into g3” from that workspace.
Not sure how deployments and CD work at google but I think the picture is different at google for unit tests, integ tests etc. Amazon teams have more control over their own codebase and development practices whereas, based on what I know, google has standardized many parts of their development process.
> Access to the whole codebase encourages extensive code sharing and reuse [...]
Doesn't this strategy result in a great risk of massive code leaks from rogue employees? Even if read accesses are logged and the culprit found, it's too late once it's been published.
But I found this discussion on HN.
I think building something that scales for one big repo is just a completely different problem than making it scale for a lot of small repos.
And after the layoffs, it's pretty clear that no matter how hard you work, you can get fired so what's the point in dedicating your career to something like this?
There you go, PDF free version.
How would you handle this situation as an IC? As a manager of one of the teams? As a skip-level manager of both teams?
As a budding IC on the team that wants the upgrade, you may want to go fix up the other team’s code for them so you can bring them along with the upgrade. Realistically, the further you get from Google’s level of engineering discipline and skill the more likely you are to encounter the following in the needs-1.22 codebase:
- horrible code that is hard to understand and therefore hard to refactor
- code with no tests, making it risky to refactor
- the team that wrote it have all left or been fired and no one is available to help understand it
- they are a remote team with no social relationship to you who interact entirely online, in writing, in the style of an aggressive subreddit mod
- deeply entrenched factions mean that even if you offer them a patch they will default refuse it because who are you to work on their codebase and they don’t need the upgraded numpy so why should they waste resources on reviewing something they don’t want
- misguided adherence to status enhancing terms like “audit” and “compliance” mean jobsworth ICs refuse to even look at your patch because someone somewhere once heard a friend of a friend whose company failed SOC2 because engineer from floor X made a change to code owned by floor Y and it went against policy
All of these social problems are real ones I have encountered and if you have solved these then you’re probably already happily in a monorepo. If instead you work in an org full of teams pointing guns at each other in a fight to the death to stop any kind of cross org collaboration from sullying the purity of the tribal system then know this: it gets better, and if you build the right social connections then the technical efficiency of having your monobusiness executing its monomission inside a monorepo is within reach!
*bug
Wait, that's an average of nearly 30 new files per commit. Not 30 files changed per commit, but whatever changes are happening to existing files, plus 30 brand new files. For every single commit.
Although...
> The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, [...]
I'm not quite sure what this is saying.
Is it saying that if `main` contains 1,000 files, and then someone creates a branch called `release`, then the repo now contains 2,000 files? And if someone then deletes 500 files from `main` in the next commit, the repo still contains 2,000 files, not 1,500?
If that's the case, why not just call every different version of every file in the repo a different file? If I have a new repo and in the first commit I create a single 100-line file called `foo.c`, and then I change one line of `foo.c` for the second commit, do I now have a repo with two files?
I mean, if you look at the plumbing for e.g. `git`, yes, the repo is storing two file objects for the repo history. But I don't think I've ever seen someone discuss the Linux git repo and talk about the total number of file objects in the repo object store. And when the linked paper itself mentions Linux, it says "The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files" - and in that case it's definitely not talking about the total number of file objects in the store.
I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository, or if it even means the same thing consistently. I'm not sure it's using the most obvious interpretation, but I can't understand why it would pick a non-obvious interpretation. Especially if it's not going to explain what it means, let alone explain why it chose one meaning over another.
> The total number of files also includes source files copied into release branches
I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories. they are not used very much.
> files that are deleted at the latest revision
so it means "one billion files have existed in the history repo, some are currently deleted".
> I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository,
seems pretty clear - a source code repo has lots of files. at the most recent revision, some exist, some were deleted in some past revision. more will be added (and deleted) in later revisions.
it's very much not the same model as git.
hope that clears things up.
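For what it's worth, the two readings being argued about can be shown with a toy model. This is only a sketch of one interpretation of the metric (every path that has ever existed, including branch copies and deleted files), not a claim about how the paper actually computed its number; the paths and the `count_files` helper are made up.

```python
def count_files(revisions: list[set[str]]) -> tuple[int, int]:
    """Return (files at the latest revision, distinct paths that ever existed)."""
    ever_existed = set().union(*revisions)
    at_head = revisions[-1] if revisions else set()
    return len(at_head), len(ever_existed)


# Each revision is modelled as the set of paths present at that point.
history = [
    {"main/a.c", "main/b.c"},                                # initial commit
    {"main/a.c", "main/b.c", "release/a.c", "release/b.c"},  # release branch copies the files
    {"main/a.c", "release/a.c", "release/b.c"},              # main/b.c deleted at head
]

at_head, ever_existed = count_files(history)
print(f"at head: {at_head}, ever existed: {ever_existed}")
```

Under the "ever existed" reading the branch copies count as their own files and the deleted file never drops out of the total, which is why that number can only grow.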
It certainly feels that way :-)
> > The total number of files also includes source files copied into release branches
> I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories.
Still not sure I see the distinction. Surely "sparse" or "not sparse" is an implementation detail. If I create a new branch in git, the files that are unchanged from its parent branch share the same storage, but the files that have changed use their own storage.
> so it means "one billion files have existed in the history repo, some are currently deleted".
I guess I'm struggling to understand what the point of this metric is? I get why "Total number of commits", "Total storage size of repo in GB/TB/PB", "Number of files in current head/main/trunk", or even "total number of distinct file revisions in repo history", could be useful metrics.
But why "number of files (including ones that have been deleted)"? What can we do with this number?
> hope that clears things up.
It's helping. Thanks.
>How does it work?
Different projects are in different folders instead of different repos.
>I assume they're built separately and pushing code to one doesn't affect the other.
Yes, building or testing something only builds its dependencies.
However, products such as gmail, youtube, search (frontend, mobile, server, infra, etc), photos, maps, play, translate and literally thousands of other internal and external products and projects are all in the same repo.
The software is built daily, and everyone must be on the same version of every library.
Under the hood there are a bunch of repos, and there are exceptions, but it largely operates as a monorepo.
This is sometimes a problem for open source dependencies, though, as there isn't always anyone whose job it is to keep them up to date. Some amount of NIH syndrome is because reinventing the wheel can be less work than integrating an existing wheel that was designed for a different vehicle with different specs.
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=22019827 - Jan 2020 (121 comments)
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=17605371 - July 2018 (281 comments)
Why Google stores billions of lines of code in a single repository (2016) - https://news.ycombinator.com/item?id=15889148 - Dec 2017 (298 comments)
Why Google Stores Billions of Lines of Code in a Single Repository - https://news.ycombinator.com/item?id=11991479 - June 2016 (218 comments)
Of particular note is that they published this many years after it had been shipped to their internal customers. This was not some position paper about "why we focus on ai" after not shipping any of their "breakthroughs".
Still, I have recently hit a major issue with the fact that Git (and other common version control sw) doesn't have per-directory ACLs.
Has anyone dealt with this issue? Which VCS / configuration have you adopted?