Why Google Stores Billions of Lines of Code in a Single Repository (2016)

(cacm.acm.org)

478 points | bwag | 7y ago | 281 comments

I feel terrible for anyone who sees this and thinks, “ah! I should move to a monorepo!” I’ve seen it several times, and the thing they all seem to overlook is that Google has THOUSANDS of hours of effort put into the tooling for their monorepo. Slapping lots of projects into a single git repo without investing in tooling will not be a pleasant experience.

wirrbel7y ago

Same line of thinking, just different conclusions.

I feel terrible for anyone trying to run a company with open-source style independent repos. On a popular github project, you have MANY potential contributors that will tell you if a PR, or a release candidate break API compatibility, etc. There are thousands of hours in open source dedicated to fixing integration issues due to the (unavoidable) poly-repo situation.

Monorepos in companies are relatively simple. You need to dedicate some effort in your CI and CD infrastructure, but you'll win magnitudes by avoiding integration issues. Enough tooling is out there already to make it easy on you.

Monorepos' biggest problem in an org is the funding, as integration topics are often deprioritized by management, and "we spend 10k per year on monorepo engineering" for some reason is a tough sell for orgs, who seem to prefer to "spend 5k for each of the 5 teams so that they maintain their own CD ways and struggle with integration, which incurs another 20k that just is not explicitly labeled as such".

Developer team dynamics also play a role. I have observed the pattern now multiple times (N=3):

* Developers have a monolithic repo that has accumulated a few odd corners over time.

* The feeling builds up that this monolithic repo needs to be modularized.

* It is split up into libraries (or microservices). This is kind of painful, but feels liberating at first (now finally John does not break my builds anymore).

* Folks realize: John doesn't break my builds anymore, but now I need to wait for integration on the test system to learn if he broke my code, and sometimes I only learn it in production.

* People start posting blog posts on monorepos.

That pattern takes 2-3 years to play out, but I have seen it on every job I worked.

zuppy7y ago

We have shared components between some of our projects, I'm not sure how monorepo will fit in here. For integration there are a lot of available build tools and repo management apps. For us it solves the problem of having no dependency between versions of the same library used between multiple products.

paulddraper7y ago

> It is split up into libraries (or microservices)

It's a frequent problem to conflate organization/modularization with lifecycle/version management.

You can have a well-organized codebase just as easily in a monorepo.

That's a separate question from managing the lifecycle of the code. (What is released and when? What tests are run? What process approves a change?)

marmaduke7y ago

working as dev with academic teams, I usually use many repos for "damage control" as git-ignorant scientists will dump irrelevant files into a repo.

with that in mind, is a monorepo a universally good approach, or is it more dependent on good behavior of team members than a polyrepo?

jakoblorz7y ago

I don't think that there is the one size fits all solution especially if you can't expect basic knowledge about git

wirrbel7y ago

A monorepo requires a good Continuous Integration infrastructure if it is supposed to work. Unless those small repos are unit tested, you will not benefit from a monorepo.

Suppose your projects share a utility library `lib_a`. In a polyrepo situation, your projects will probably use it in different versions, which means coordination effort is necessary to get everyone on the latest release. The monorepo would enable the developers of `lib_a` to get feedback directly from the downstream test suites on whether their changes break user code, so they can introduce changes less intrusively up front. They can also roll out security-relevant changes much more easily. The monorepo will make the projects more homogeneous, which facilitates integration and operations (there are exceptions of course).

paulddraper7y ago

Is there frequent code reuse? If so, monorepos are really nice. If not, separate repos make more sense.

diminoten7y ago

I'm honestly getting a little tired of the repetition this cycle causes. Monorepos are Wrong, "too many" repos are also Wrong, and everyone needs to realize that strawmanning the opposing side is really what wastes our time, not broken builds or waiting on dependencies to build.

Monorepo people ignore the learned lessons of those who came before us, and are trying to drag their teams back into a simpler time that, while nice, does not exist anymore. If you use any dependencies at all, you don't live in a monorepo world, and lying to yourself and your coworkers will only leave you confused and angry that your expectations are constantly not being met.

The solution isn't to split every single component into its own repo, but pretending like that's what anyone rational is proposing is not working with the best form of the argument. It's not always completely clear how to split up a growing codebase, but to claim that it's not usually worth splitting up is Wrong.

shados7y ago

Both monorepo or "micro repo" end up falling apart at scale without some devops work involved. Either will work if you only have a few dozen projects. Neither will work once you hit 10s of millions of lines of code.

But people seem to forget that it wasn't that long ago that git didn't exist, making multiple repos was a pain in the butt. Managing multiple repos locally was hell. Monorepos were the norm.

Then as the state of version control ramped up, and making repos became easy, and having so much code in one repo had performance issues (overnight CVS/SourceSafe/SVN pull on your first day at work anyone? Branches that take hours to create?), people started making repos per project. The micro-service fad made that a no-brainer.

Now, for companies like Facebook and Google, or really any company that wrote code before the modern days and has a non-trivial amount of it, switching was not exactly a simple matter. So they just poured their energy into making the monorepo work. They're not the only ones to do it either (though not everyone has to do it at Google, Facebook or Microsoft scale, obviously, so it's a bit easier for most). And so it works. And then people forget how to make distributed repos work and claim things like "omg I have to make 1 PR per repo when making breaking changes!", as if it was a big deal or it wasn't a solved problem.

fcarraldo7y ago

People also seem to forget that "Monorepo" or (many) "Microrepos" is not a binary choice.

You can have both tiny repositories which do a single thing and large repositories that consist of many projects. It's totally cool to have both, assuming your team can be trusted to make the appropriate choices as they create new projects.

> And then people forget how to make distributed repos work and claim things like "omg I have to make 1 PR per repo when making breaking changes!", as if it was a big deal or it wasn't a solved problem.

Is this a solved problem? I typically do make one PR per repo to resolve breaking changes, though it's certainly not a big deal. Still, if there's an easier way, I'd love to hear about it!

shados7y ago

> Is this a solved problem

I don't mean that it's magical, just that it's not particularly sorcery. Instead of making a breaking change, add new method, deprecate old method. Update projects, then get rid of old deprecated method. Because they're distinct you can do this one by one so some project can reap the benefits without having to wait until all the problems are solved.
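The add/deprecate/remove cycle described here can be sketched in a few lines. This is only an illustration: `fetch_user` and `fetch_user_v2` are hypothetical names, not from any real library, and the only real API used is Python's standard `warnings` module.

```python
import warnings


def fetch_user_v2(user_id: int, *, include_email: bool = False) -> dict:
    """New API (hypothetical): the replacement callers should migrate to."""
    user = {"id": user_id, "name": f"user-{user_id}"}
    if include_email:
        user["email"] = f"user-{user_id}@example.com"
    return user


def fetch_user(user_id: int) -> dict:
    """Old API (hypothetical): kept as a thin wrapper until all callers migrate."""
    warnings.warn(
        "fetch_user() is deprecated; use fetch_user_v2() instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return fetch_user_v2(user_id)
```

Each downstream project can switch to `fetch_user_v2` on its own schedule; once no caller triggers the warning anymore, `fetch_user` can be deleted in a later release.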

Some people in this thread act like it's freakin' impossible. Avoiding breaking changes in APIs or proper deprecation strategies is an art everyone developing software should know: sooner or later they'll have to contribute to an open source project or have to make a more complicated breaking change or SOMETHING and will have to deal with it. Even if they use a monorepo. And when it happens you don't want it to be the first time anyone deals with it.

ec1096857y ago

Microsoft went from multiple smaller repos for windows to one large one: https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

hobls7y ago

Absolutely! At some point you must invest in your tools. (Early, in my opinion.) I think the clarification I’d offer is that in the age of GitHub the “standard” model is multiple repos, so you’re actually giving up some tooling if you just shove everything in a single repo.

(I’m also not sure I’d generally categorize tools work as “dev ops,” though I can certainly see how they end up intertwined.)

lclarkmichalek7y ago

I've seen hardly any tools to manage dependencies across multiple repos. Modifying multiple repos at the same time isn't an issue I see many resources devoted to, and managing those cross repo versions is almost never done well. In comparison, both buck and bazel offer pretty mature monorepo management tooling. On the VCS front, you can take native git/HG a long way.

AmericanChopper7y ago

>Both monorepo or "micro repo" end up falling apart at scale without some devops work involved

Wouldn’t any project fall apart without devops work?

shados7y ago

yeah, that's my point.

adamrt7y ago

We moved to a monorepo about 2 years ago and it has been nothing but success for us.

We have quite a few projects but only 4 major applications. Maybe it is that a few of our projects intertwine a bit so making spanning changes in separate repositories was a pain. Doing separate PRs, etc. Now changes are more atomic. Our entire infrastructure can be brought up in development with a single docker-compose file and all development apps are communicating with each other. I don't think we've had any issues that I can recall.

We are a reasonably small team though, so maybe that is part of it.

Tloewald7y ago

My previous job had a monorepo for the website and backend (but not the mobile apps) which was insane to work with (as a coder I had a dedicated 128 core box to work on, some engineers had more than one, less intense engineers shared one) and a substantial amount of my time was spent just finding code. I guess most engineers just end up working in some nook, so searching code constantly becomes less of an issue (it never did for me), but the code / debug cycle was dreadful.

I should add that a huge amount was invested in tooling. We had an in-house IDE with debug tools that could step through serverside code. We had a highly optimized code search tool. We had modified a major version control system so it could handle our codebase. (Indeed we picked our version control system because we needed to fork it and the other major version control system was less amenable to our PRs.)

My current job we have a micro service architecture and lots of small, focused repos. Each repo is self-documented. Anyone can checkout and build anything. We don’t need obscene dev servers. We have not hugely invested in tools or workflow.

Client apps are unavoidably larger repos than the services apps.

Based on my personal experience, I think monorepos are nuts.

ebikelaw7y ago

You seem to have conflated a bunch of different things and confused them with a monolithic repository. There's no reason why a monorepo requires you to have a gigantic development box ... you identify and compile only the transitive dependencies of your target, not every line of code in the repo.
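To make the "only the transitive dependencies of your target" point concrete, here is a rough sketch. The target labels and the `DEPS` table are made up for illustration (loosely Bazel-style); a real build tool would derive this graph from its build files rather than a hand-written dict.

```python
def transitive_deps(target, deps):
    """Return the target plus everything it depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return seen


# Hypothetical repo contents: target -> direct dependencies.
DEPS = {
    "//web:app": ["//lib:auth", "//lib:http"],
    "//lib:auth": ["//lib:crypto"],
    "//lib:http": [],
    "//lib:crypto": [],
    "//mobile:app": ["//lib:http", "//lib:ui"],
    "//lib:ui": [],
}

to_build = transitive_deps("//web:app", DEPS)
```

Building `//web:app` touches four of the six targets here; `//mobile:app` and `//lib:ui` are never compiled, no matter how large the rest of the repo grows.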

mcny7y ago

I recently saw a question on quora asking whether the free food at Google boosted productivity. The reply that seemed strange at that time was that being able to focus on their job and not having to do a bunch of stuff is what boosted productivity at Google.

I get some perspective from the comments above. There is seemingly an army of engineers at Google that keeps the monorepo functioning. I was at a meeting about bazel and angular. I thought I'd ask how they do things at Google. To my surprise, the presenter said he is not at liberty to discuss how things work at Google. I guess it wasn't so surprising in hindsight. I mean what would I do with that information, right? It would be way too overkill for my tiny crud application.

Too7y ago

How is this

   company
     /ProjectA
       /.git
     /ProjectB
       /.git

easier to browse than this?

   company
     /.git
     /ProjectA
     /ProjectB

You still need to find the project A repo if you don't use monorepos. And even if you do use monorepos, not everything has to be one monolithic build bogging down your IDE; you can still have microservices with the code for each hosted in the same repo. You seem to conflate monorepo with lots of other things.

artursapek7y ago

We had this same experience with a large golang project, consisting of about 8 individual services. Switching to a monorepo made it incredibly easy to make changes to common code and inter-service communication. Huge breath of fresh air.

hobls7y ago

A single team is really helpful. Where I’ve seen it get particularly unhelpful is with multiple teams. I’m also not opposed to the concept, I just think it requires work to do correctly.

bedros7y ago

how do you create branches in mono repo?

for example I want to use branch rev5 from project A and rev3 from project B

how do I do that in a mono repo? I could not do it in HG, but I'm not sure about Git

aidos7y ago

Depends on the system. When I used to manage SVN we would branch independent projects and then releases would be a snapshot of each into the server section of the repo. Those were then pulled down to their respective machines.

In SVN a branch is simply a convention. You copy (almost zero cost) things around into your branches directory

tom_7y ago

Branch rev5 of project A's folder into a folder in your project's folder, branch rev3 of project B's folder into a folder in your project's folder. Get your project to refer to its own copy, rather than the shared copies you branch from. (This isn't quite like git submodules, but that's probably the closest thing git has.)

You do the usual 3-way merge thing to push your changes upstream, or pull upstream changes into your copy. As with git, the VCS tracks which revision of upstream your copy is up to date with, which is how it determines the base for the 3-way merge.

Tloewald7y ago

Can’t you create a branch and merge the two branches you are interested in into that?

dgsb7y ago

Just curious how small is small ? How many kloc ? How many people ?

mr_tristan7y ago

I sense that Google invests much more in its infrastructure than most companies make in revenue.

I've worked with monorepos, and I'd be loath to recommend them as well; the combination of culture shift and tooling it takes to keep a monorepo system running makes most CD processes you see today look like child's play.

There is a lot of very good free software that supports most of the open source approach to CD these days; but very, very little freely available monorepo tooling. Just check out https://github.com/korfuri/awesome-monorepo - it's a quick read. I haven't found many other notably superior compilations. Compared with available OSS workflows and tooling, it's rather sparse, filled with bespoke approaches everywhere.

ryancox7y ago

Agreed about the lack of monorepo tooling. There's just not that much out there. A couple of other links I didn't see in the awesome-monorepo:

- https://github.com/facebookexperimental/mononoke - I hear this is a real thing and not a science fair project

- https://github.com/bors-ng/bors-ng - Needed in a monorepo to handle high arrival rate of commits / merges

golangnews7y ago

What problems specifically did you see? Was this because the repo was too large?

I understand at google scale you'd need lots of tooling, but why at the smaller scale of merging a dozen small repos?

mr_tristan7y ago

The biggest problems are always cultural. Most monorepo workflows really reinforce constant integration, and once you have separate teams with separate managers, I've always witnessed constant conflict that ended up trying to establish spheres of control. It's bizarre - but it's something I've seen at pretty much every place I've worked at.

With all that integration, your single CI toolchain is front and center since everyone's success or failure is tied to it. While projects like bazel exist, how many developers do you know who work with bazel every day? I know nobody who does. And most want documented IDE support and ease of use, not some optimal CI workflow. I've found gradle to be OK, but even that kind of pushes everyone toward using Jetbrains tooling. In the end, almost all real monorepos have significant custom CI tooling that wires together different toolchains, and they may have to maintain custom tooling for use on developer machines. And that custom tooling can get expensive to maintain as the project scales up.

013a7y ago

I call this "Google Imposter Syndrome". Because Google (insert Facebook, Apple, Amazon, etc) has success with Monorepos (insert gRPC, Go, Kubernetes, React/Native, etc), it must be a great idea, we should do it. You see this everywhere. Also known as an Appeal to Authority.

My personal opinion: very few companies will hit a point where sheer volume of code or code changes makes a monorepo unwieldy. Code volume is a Google-problem. But every company will have problems with Github/Gitlab/whatever tooling with multiple repos; coordinating merges/deploys across multiple projects, managing issues, context switches between them, etc. And every company will also have problems with CI/CD in a monorepo.

Point being... there are problems with both, and there are benefits to both. I don't think one is right or wrong. I personally feel that solving the problems inherent to monorepos, at average scale, is easier than solving the problems inherent to distributed repos. The monorepo problems are generally internal technical, whereas the distributed repo problems are generally people-related and tooling outside of your control.

kqr7y ago

Someone at some point said "Google may not be successful for the interview practises they use; they're big enough that they could very well be successful despite the interview practises they use."

It stuck with me, and is applicable to so many things. Including, maybe, this?

kungtotte7y ago

Another question is just the sheer scale of the FAANG companies, making things work at that scale is likely to be counterintuitive sometimes.

I just looked it up, Facebook has 2.2 billion users monthly. That's almost a third of the entire planet.

Shit that makes sense for them won't make sense for 99% of everyone else.

rubenbe7y ago

I've seen multiple companies struggling with maintaining interdependencies between multiple repos. It often results in an expensive custom solution. As a general guideline I'd say "when in doubt, put the code in a single repo"

mrweasel7y ago

Most people/companies aren't Google, but as you say they assume that if one or more of the tech giants (or other very public tech companies) are doing something, then it must be good.

Often the focus is extremely weird. When people noticed that WhatsApp only employed something like 45 engineers, most assumed that it was because they used Erlang and FreeBSD. The thought that maybe their success was due to hiring the very best engineers and paying accordingly is less attractive.

Monorepos are just another item on the heap of things that may be a good idea, but it depends.

3minus17y ago

also known as "cargo culting"

pcwalton7y ago

There's also the fact that monorepos have issues when you don't have one organization responsible for all the code. The Linux kernel and NetHack don't live in the same repository for good reason.

dekhn7y ago

I dunno, the BSD distribution included a wide gamut of games along with the kernel source in the same tree. In fact, NetHack is derived from Hack which itself is derived from Rogue, which was distributed within BSD. And BSD represented a cross-organization responsibility (see the history of AT&T and BSD).

trasz7y ago

Wasn't it because it was the same group of people who worked on both? And when it ceased to make sense, the games were split off - the only remaining ones in FreeBSD are things like banner(6) or pom(6).

akvadrako7y ago

It does seem like the Linux model scales better than the BSD model.

pcwalton7y ago

Fine, replace NetHack with Quake 3. :)

mnm17y ago

Slapping a whole bunch of projects into multiple repos with dependencies isn't a pleasant experience either. What is the solution then? I certainly don't want to host my own npm/composer/maven/clojars repos or even use those dependency managers to manage my own code which constantly changes and relies on multiple libraries both on the backend and frontend. I've tried this and, at least with a small team of two, it's not a pleasant experience at all. So how can I solve this problem? Cause the monorepo is very enticing after dealing with multiple repos and multiple dependencies pulled through dependency managers that clearly do not do well with dependencies that are constantly in flux.

shakna7y ago

Submodules? [0]

Easy to use, cutting edge updates.

[0] https://git-scm.com/book/en/v2/Git-Tools-Submodules

[1] https://www.mercurial-scm.org/wiki/Subrepository

glandium7y ago

Submodules are cutting edge and have cutting edges. The user experience on some corner cases can be painful. Example: if you happen to have unrelated conflicts when you rebase some patch across a submodule update, you're most likely going to end up committing a reversal of the submodule update.

EnderMB7y ago

I've seen this a few times in the .NET world, mainly as a carry-over from Subversion when we had moved to Mercurial and git.

Some mad genius in a company will write a fuck-ton of helper classes and utilities that take the heavy lifting out of everything remotely hard, to the point where you almost never need to touch a third-party API for a CMS, email send service, or cloud-hosting provider. Instead of supplying these as private NuGet packages to be installed into an application, they sit in solutions in their entirety, in case they are needed. That application then goes to a new developer team, and they have zero idea why there are millions of lines of code and dozens of projects for a basic website that doesn't really seem to do anything.

It's a nice idea, but it has resulted in some very tightly coupled applications. I remember one time where a new developer changed some code in one of the utilities that handled multi-language support, and for some reason our logs reported that the emails were broken.

rco87867y ago

Are you suggesting that there’s a solution to managing large amounts of code that doesn’t involve large amounts of tooling?

hobls7y ago

There’s an important distinction between lots of code and lots of projects. I agree; if you have a ton of code you’d better invest in tooling. But if you just have several normally sized projects, a monorepo can make your life much more difficult than simply using several repos.

rco87867y ago

Sure. You could also replace the term “monorepo” with “separate repos” and your statement would be just as valid. Either way you go has pros and cons.

hobs7y ago

heh, thousands, it's probably at least an OOM greater, if not two.

hobls7y ago

Hah, you know I started with millions and then did some fuzzy math and started debating team sizes (since I know google often doesn’t have giant teams), ended up somewhere in the hundreds of thousands, and rounded to thousands. But I made it all caps so you know it’s the SERIOUS kind of thousands. :P

waterhouse7y ago

My brain sees OOM and thinks "out of memory", which might be applicable too.

nine_k7y ago

Most folks who consider a monorepo don't have billions of lines of code, and often not even millions.

Linux kernel is a monorepo.

threeseed7y ago

Linux kernel is one functional piece of work though.

Imagine if we combined KDE, Gnome, Linux Kernel, ZFS etc all in the one monorepo.

dmoy7y ago

And gnucash, libreoffice, a couple copies of android, three other things that forked the linux kernel, and then all of apache to boot.

nine_k7y ago

Where there are obvious and pronounced functional boundaries, often backed by administrative boundaries, separate repos make total sense.

Otherwise, it's an optimization; see "premature optimization" for cautions.

acomjean7y ago

But what are the alternatives to the monorepo in git?

All the ways of splitting code up and deploying multiple git repos for one project seem terrible.

tiglionabbit7y ago

Fun fact. I asked Facebook why they built their monorepo on Mercurial instead of Git. They said there were scaling issues in Git that made it unusable for large repos and the Git maintainers would not work with them to fix these issues. However, they were able to work with Mercurial to make it capable of holding their entire company in one repo.

favorited7y ago

Someone from FB did a really cringe-inducing presentation a few years ago about how "X can't handle our scale" (I think the predicate was iOS, but they went into IDEs and SCM systems). They had to pull the video and slides because it was so bad.

yehosef7y ago

Here's the article (or one of them) https://code.fb.com/core-data/scaling-mercurial-at-facebook/

reificator7y ago

> multiple git repos for one project

If it's one project, it's not a monorepo. It's a repo.

acomjean7y ago

We wanted to have a "common" subsystem that was common across projects. Being able to add and work on the common area and new projects at the same time was important. Pushing the common area back and being able to deploy to the older projects and test was important.

This seems difficult in git.

There are "submodules" and "subtrees" but none seemed particularly great and as far as I could tell each came with a bunch of caveats.

I'll admit my Git skills aren't great, but I've used a variety of source control and tried to suss out the best way to deal with a small team.

We ended up using "git subrepo" which is an add on thing I don't love, but it works.

part of the motivation is "common" and "project 2" are to be open sourced, but "project 1" which also uses "common" isn't.

forrestthewoods7y ago

> what are the alternatives to the monorepo in git?

A monorepo in Perforce!

threeseed7y ago

I don't understand why people are against this. You can have per repo branches/tags, the history is clean and relevant, it's easy to triage breakages, easy for different apps to have different versions of code etc. Plus for CI/CD it's trivial to just have one Jenkins job per repo as well and simple Git commit triggers.

The entire programming world revolves around libraries and yet when it comes to our own code we are afraid of them ? Strange.

flukus7y ago

> All the ways of splitting code up and deploying multiple git repos for one project seem terrible.

Of course they are, git isn't the tool for this. You don't want multiple repos for a single project, you want one per project (this is not a monorepo). If there are things like code common to multiple projects then they are their own project with their own repo and release schedule, releases go into some sort of package manager (even if it's just the file system) and the projects depending on that common code update as they go.

ebikelaw7y ago

See, this is where your argument broke down for me. Once you’ve decided there is some library of common code, and assuming you factor out that code into another repo, you’ve just lost your ability to easily make breaking changes to the common code, which is something trivially easy to do in a monorepo. Why would you want that? It seems to me that if you have multiple projects sharing a base of common code then a monorepo is clearly superior.

keerthiko7y ago

What I can advise against is repo partitioning prematurely. I have been on multiple teams that have thought "Oh this will be a common library for all our projects" or "this is a sample project" or "this is the android version and this is the iOS version" and split projects up into different repos, only to wind up with crazy dependencies between repos which have fallen out of sync or require another repo to be on a specific branch/hash to work correctly, causing all kinds of chaos. Split your repos by dependencies, and once your system architecture is kind of fleshed out. Just use branches on the same repo until then.

maxpert7y ago

Spot on! I've seen org wide mono repos at Microsoft and they had their custom tooling and build systems built on top of SourceDepot.

robaato7y ago

Which is just rebadged Perforce :)

georgewfraser7y ago

I kinda disagree, we’re a dev team of 30, 3.5 years in, 150k lines of code and we’ve always had a monorepo. We had to switch from maven to bazel after about 2 years because test times got out of control; bazel has been about 50% more annoying than maven but the incremental builds work perfectly.

timkrueger7y ago

Interesting. Have you written anything about that migration?

georgewfraser7y ago

No, though that would be a good blog post. We tried to make multi-module maven work for a while, eventually gave up and wrote some scripts that would convert maven to bazel, using many assumptions that applied only to our particular case. We did the cutover in one day but kept maven around for a couple weeks in case we decided to bail on bazel. It worked out; we even found CircleCI works great. I would say the weak link in the bazel ecosystem is the IntelliJ plugin, which is very functional but also very slow.

ma2rten7y ago

Google used Perforce for a very long time before they built their own version control system.

w_t_payne7y ago

Tooling is required for coordinating configuration management on multiple repositories too.

Also, why isn't such tooling available as open source? I'm trying to do my bit, but we could do with more effort being put into this, somehow.

foota7y ago

Really probably closer to millions of hours.

baybal27y ago

Dumping all code in a single repo, even for a 30 man development shop, was really tough. Doing so for a company of a few thousand must be truly crazy.

I advise Google to replace the person in their internal IT who came up with that idea.

flukus7y ago

> and the thing they all seem to overlook is that Google has THOUSANDS of hours of effort put into the tooling for their monorepo

That's a huge understatement. They haven't just slapped a few scripts on top of git/svn, they've created their own proprietary scm to manage all of this. They've thrown more at this beast than most companies will throw at their actual product.

I'm also not convinced they haven't reinvented individual repositories inside this monorepo. It sounds like you can create "branches" of just your code and share them with other people without committing to the trunk; this is essentially an individual repository that will be auto deployed when you merge to master.

joshuamorton7y ago

Your last paragraph doesn't sound like anything at Google. Most engineers will never use branches at all, and even fewer will use branches that merge into trunk (instead of away from it).

flukus7y ago

There is a set of code changes locally and those changes are bundled off to the test server to run the full test suite? That's a branch.

Now let's say I break a project sharing this code and because I'm not an expert in all 2 billion LoC and 3000 projects google is running I need to enlist some help in fixing what I broke. Presumably there is a way for the developers on that downstream project to pull in my change set? That's a shared branch.

Now assuming I can get all of these planets aligned correctly I'm going to need to take this set of changes and put it into the master version aren't I? That's merging my branch into trunk.

jgibson7y ago

Is it just me, or are a lot of people here conflating source control management and dependency management? The two don't have to be combined. For example, if you have Python Project X that depends on Python Project Y, you can either A) have them in different scm repos, with a requirements.txt link to a server that hosts the wheel artifact, B) have them in the same repo and refer to each other from source, or C) have them in the same repository, but still have Project X list its dependency on Project Y in a requirements.txt file at a particular version. With the last option, you get the benefit of mono-repo tooling (easier search, versioning, etc) but you can control your own dependencies if you want.

edit: I do have one question though, does Google's internal tool handle permissions on a granular basis?

edejong7y ago

The key here is reverse dependency management. “If I change X, what would this change influence?”.

This can be achieved better with a single repo than with multiple repos, because the (dependency) graph is complete.
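
A rough sketch of that question in Python, assuming a toy dependency graph (the target names and the `DEPS` table are made up for illustration). With the whole graph in one place, "what depends on X?" becomes a plain traversal of the inverted edges:

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: each target lists what it depends on.
DEPS = {
    "lib_a": [],
    "lib_b": ["lib_a"],
    "lib_c": ["lib_a"],
    "app_x": ["lib_b"],
    "app_y": ["lib_c", "lib_b"],
    "app_z": [],
}


def reverse_graph(deps):
    """Invert 'depends on' edges into 'is depended on by' edges."""
    rdeps = defaultdict(set)
    for target, requirements in deps.items():
        for requirement in requirements:
            rdeps[requirement].add(target)
    return rdeps


def affected_by(changed, deps):
    """Return every target that transitively depends on `changed`."""
    rdeps = reverse_graph(deps)
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in rdeps.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(affected_by("lib_a", DEPS)))
```

In a multi-repo setup the same traversal is only as good as whatever part of the graph you can actually see.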

thomaslee7y ago

Exactly this. Or at least it's a way this can be achieved, assuming solid testing & some tooling in the mix.

For folks unfamiliar with it, the issue is something like:

1. You find a bug in a library A.

2. Libraries B, C and D depend on A.

3. B, C and D in turn are used by various applications.

How do you fix a bug in A? Well, "normal" workflow would be something like: fix the bug in A, submit a PR, wait for a CI build, get the PR signed off, merge, wait for another CI build, cut a release of A. Bump versions in B, C and D, submit PRs, get them signed off, CI builds, cut a release of each. Now find all users of B, C and D, submit PRs, get them signed off, CI builds, cut more releases ...

Now imagine the same problem where dependency chains are a lot more than three levels deep. Then throw in a rat's nest of interdependencies so it's not some nice clean tree but some sprawling graph. Hundreds/thousands of repos owned by dozens/hundreds of teams.

See where this is going? A small change can take hours and hours just to make a fix. Remember this pain applies to every change you might need to make in any shared dependency. Bug fixes become a headache. Large-scale refactors are right out. Every project pays for earlier bad decisions. And all this ignores version incompatibilities because folks don't stay on the latest & greatest versions of things. Productivity grinds to a halt.

It's easy to think "oh, well that's just bad engineering", but there's more to it than that I think. It seems like most companies die young/small/simple & existing dependency management tooling doesn't really lend itself well to fast-paced internal change at scale.

So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this. Folks like Netflix stuck it out with the multi-repo thing, but lean on tooling [0] to automate some of the version bumping silliness. I think most companies that hit this problem just give up on sharing any meaningful amount of code & build silos at the organizational/process level. Each approach has its own pros & cons.

Again, it's easy to underestimate the pain when the company is young & able to move quickly. Once upon a time I was on the other side of this argument, arguing against a monorepo -- but now here I am effectively arguing the opposition's point. :)

[0] https://github.com/nebula-plugins/gradle-dependency-lock-plu...

bluejekyll7y ago

> So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this.

I think you’re retroactively claiming that Google actively anticipated this in their choice at the beginning of using Perforce as an SCM. They may believe that it’s still the best option for them, but as I understand it, to make it work they bought a license to the Perforce source code, forked it, and practically rewrote it.

Here’s a tech talk Linus gave at Google in 2007: https://youtu.be/4XpnKHJAok8

My theory (I wonder if someone can confirm this), is that Google was under pressure at that point with team size and Perforce’s limitations. Git would have been an entirely different direction had they chosen to ditch p4 and instead use git. What would have happened in the Git space earlier if that had happened? Fun to think about... but maybe Go would have had a package manager earlier ;)

rkangel7y ago

There's a subtler, and potentially more important thing that can crop up with your scenario:

Library A realises that its interface could be improved, but the improved version would not be backwards compatible. In the best case scenario, with semver, there is a cost to this change. Users have to bump versions and rewrite code, maybe the maintainer of Library A has to keep 2 versions of a function to ease the pain for users. It may just be that B, C and D trust A less because the interface keeps changing. All this can mean an unconscious pressure to not change and improve interfaces, and adds pain when they do.

Doing it in a monorepo can mean that the developers of A can just go around and fix all the calls if they want to make the change, allowing for greater freedom to fix issues with interfaces between modules. And that is really important in large complex systems with interdependent pieces.

Too7y ago

This is my biggest gripe in discussions like this as well, dependency management and source control are two completely different things. It should be convenient to use one to find the other but they should not necessarily be 1-1 coupled together with each other.

1. A single repo should be able to produce multiple artifacts.

2. It should be possible to use multiple repos to produce one artifact.

3. It should be possible to have revisions in your source control that don't build.

4. It should be possible to produce artifacts that depend on things not even stored in a repo, think build environment or cryptographic keys etc. An increase in version number could simply be an exchange of the keys.

joshuamorton7y ago

Number three I disagree with. Bisection depends on build (and test) always working on trunk.
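
To make the bisection point concrete, here is a minimal sketch of the binary search involved. The commit names and the `is_good` predicate are hypothetical; the key assumption is spelled out in the docstring: every commit in the range has to be buildable and testable, otherwise `is_good` has no answer for it.

```python
def first_bad_commit(commits, is_good):
    """Binary-search an ordered commit list for the first failing commit.

    Assumes commits[0] is good, commits[-1] is bad, and that every commit
    in between can be built and tested.
    """
    low, high = 0, len(commits) - 1  # commits[low] is good, commits[high] is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(commits[mid]):
            low = mid
        else:
            high = mid
    return commits[high]


# Hypothetical history: the regression landed in commit "c5".
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
print(first_bad_commit(history, lambda c: int(c[1:]) < 5))
```

A revision that doesn't build can't be classified as good or bad, so it either has to be skipped or it stalls the search.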

justicezyx7y ago

Single repo is one design that coherently addresses source control management and dependency management.

The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.

MaxBarraclough7y ago

> The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.

By that do you mean it's one way of doing it, or that it's the only way?

Seems clear to me that it's not the only way. For instance .Net code tends to be Git for the project source + NuGet for external dependencies. It works pretty well.

justicezyx7y ago

It's one way. There isn't any problem that can only be solved in one way.

evfanknitram7y ago

I don't know what this means.

How is "single repo" a "design" and how does this design dictate dependency management?

Yes, if you have a single repo then that would be a single source of data for building your stuff. That seems redundant.

justicezyx7y ago

See Bazel: you have the deps manifested as source-controlled data, then you can build everything as deterministically as possible.

Then you can manage dependency as part of the normal source control process.

adrianN7y ago

A single repo makes it a bit tricky to use some library in version A for project X and version B for project Y.

paulddraper7y ago

Correct.

You can consider that a bad thing or a good thing.

Most languages' package composition (C/C++, Java, Python, Ruby) doesn't permit running multiple versions at runtime. The single-version policy is one way of addressing dependency hell.
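
For Python specifically, a small illustration of why that is (this only shows the default import machinery, not what any particular packaging tool does): modules are cached by name, so a second import of the same name hands back the same object rather than a second copy.

```python
import importlib
import sys

import json as first_import

# A second import of the same name does not load a second copy; it returns
# the module object already cached in sys.modules.
second_import = importlib.import_module("json")

print(first_import is second_import)
print(sys.modules["json"] is first_import)
```

Since the cache is keyed by module name, two versions of the same package would be competing for a single slot.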

bunderbunder7y ago

I think that's actually a good thing. Allowing different projects to use different versions of a 3rd-party package may be convenient for developers in the short term, but it creates bigger problems in the long term.

Nihilartikel7y ago

If I recall, in Google's build system, a dependency in the source tree can be referenced at a commit ID, so you can actually have a dependency on an earlier version of artifacts in source control.

gberger7y ago

Yes, Google's internal tool handles permissions based on directory owners.

They use the same OWNERS-file model as in the Chromium project [1], the only difference being the tooling (Chromium is git, google3 is ... its own Perforce-based thing).

[1] https://chromium.googlesource.com/chromium/src/+/lkcr/docs/c...
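
As a very simplified sketch of the directory-owners idea (not the real OWNERS format or lookup rules, which are richer): assume each directory may carry a list of owners, and a file is governed by the nearest enclosing directory that has one. The `OWNERS_FILES` table and all names in it are hypothetical.

```python
import posixpath

# Hypothetical in-memory view of OWNERS files: directory -> list of owners.
OWNERS_FILES = {
    "": ["root-admins"],
    "search": ["alice"],
    "search/ranking": ["bob", "carol"],
}


def owners_for(path, owners_files):
    """Return owners from the nearest enclosing directory that lists any."""
    directory = posixpath.dirname(path)
    while True:
        if directory in owners_files:
            return owners_files[directory]
        if directory == "":
            return []  # reached the repo root without finding owners
        directory = posixpath.dirname(directory)


print(owners_for("search/ranking/model.py", OWNERS_FILES))
print(owners_for("search/index/build.py", OWNERS_FILES))
print(owners_for("maps/tiles.py", OWNERS_FILES))
```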

maccard7y ago

I can't comment specifically on Google's tool, but I know it's based on Perforce. Perforce does have granular permissions - https://www.perforce.com/perforce/r15.1/manuals/p4sag/chapte...

paulddraper7y ago

> The two don't have to be combined.

They do have to be combined in some way, at least to be reproducible. Your requirements.txt example is one way of combining version control + dependencies: give code an explicit version and depend on it elsewhere by that version.

Google has chosen to combine them in a different way, where every commit of a library implicitly produces a new version, and all downstream projects use that.

> googles internal tool handle permissions on a granular basis?

Not sure what you mean... its build tool handles package visibility (https://docs.bazel.build/versions/master/be/common-definitio...). Its version control tool handles edit permissions (https://github.com/bkeepers/OWNERS).

fps_doug7y ago

It is very tempting to believe that a monorepo will solve all your dependency issues. If you have a project that's say pure python consisting of a client app, a server app, and then a dozen libs, that might actually be true, since you force everyone to always have the latest version of everything, and always be running the latest version. Given a somewhat sane code base and smart IDE, refactoring is really easy and updates everything atomically.

In reality you often have different components, some written in different languages. At a certain size, not everyone has the whole build environment set up, some people might be working with older binaries, and now it's just as easy to have version mismatches, structural incompatibilities, etc. So you need strong tooling and an integration process to go along with your monorepo. The repo alone doesn't solve all your problems.

bananarepdev7y ago

Maybe this is a reflection of modern tools using the version control system to store built artifacts, like npm and "Go get" do. Anyway, depending on the programming language, you can have a monorepo and still bind your modules with artifact dependency, not necessarily depending on the code itself.

senozhatsky7y ago

Well, it's not so uncommon. For instance, OpenBSD, NetBSD repos are sort of monolithic. And, believe it or not, there are some advantages. For instance, let's take a look at OpenBSD 5.5 [0] release notes:

> OpenBSD is year 2038 ready and will run well

> beyond Tue Jan 19 03:14:07 2038 UTC

OpenBSD 5.5 was released on May 1, 2014, while Linux is still "not quite there yet" y2038-wise. y2038 is a very complex issue, though it may look simple: time_t and clock_t should be 64-bit. This requires changes on both the kernel side -- new syscall interfaces [stat()], new structure layouts [struct stat], new sizeof()-s, etc. -- and the user space side. This basically means ABI breakage: newer kernels will not be able to run older user space binaries. So how did OpenBSD handle that? The reason why the y2038 problem looked so simple to OpenBSD was a "monolithic repository". It's a self-contained system, with the kernel and user space built together out of a single repository. OpenBSD folks changed both user space and kernel space in "one shot".
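
For anyone wondering where that date in the release notes comes from, here is a small Python check (just the arithmetic, nothing OpenBSD-specific): a signed 32-bit time_t counts seconds since 1970 and tops out at 2**31 - 1.

```python
import struct
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit time_t can hold

# The last second representable in a signed 32-bit time_t.
last_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00

# One second later no longer fits in 32 signed bits...
try:
    struct.pack("<i", INT32_MAX + 1)
    fits_32 = True
except struct.error:
    fits_32 = False
print(fits_32)

# ...but fits comfortably in a signed 64-bit field, at twice the size.
print(len(struct.pack("<q", INT32_MAX + 1)))
```

The size change in the last line is the reason structure layouts, and therefore the ABI, have to change.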

IOW, a monolithic repository makes some things easier:

a) make a dramatic change to A

b) rebuild the world

c) see what's broken, patch it

d) while there are regressions or build breakages, goto (b)

e) commit everything

[0] http://www.openbsd.org/55.html?hn

[UPDATE: fixed spelling errors... umm, some of them]

-ss

glandium7y ago

The reason why y2038 problem looked so simple to OpenBSD has little to do with "monolithic repository" and everything to do with "happy to break kernel ABI compatibility". You're saying as much yourself.

Monolithic repository might have been a tool that helped enforce it, but that's not what made it happen. It's the decision that ABI could be broken that did.

And that's also why it hasn't happened in Linux yet. Even if there was a monorepo containing all the open source and free software in the world (or at least, say, that you can find in common distros), the fact that there's a contract to never break the ABI makes it simply hard to do.

senozhatsky7y ago

> Monolithic repository might have been a tool that helped enforce it, but that's not what made it happen. It's the decision that ABI could be broken that did.

Well, there are probably some subtle details which I'm missing, and maybe you are totally right.

The way it looks to me is as follows: They are "happy to break kernel ABI compatibility" because the repository is monolithic - they break ABI, they immediately fix user space apps.

E.g. OpenBSD time_t 64-bit commit: https://marc.info/?l=openbsd-cvs&m=137637321205010&w=2

They patched the kernel:

	 sys/kern       : kern_clock.c kern_descrip.c kern_event.c
	                 kern_exit.c kern_resource.c kern_subr.c 
	                 kern_synch.c kern_time.c sys_generic.c 
	                 syscalls.conf syscalls.master vfs_getcwd.c 
	                 vfs_syscalls.c vfs_vops.c

and fixed broken user space at the same time:

...

	 sys/msdosfs    : msdosfs_vnops.c
	 sys/netinet6   : in6.c nd6.c
	 sys/nfs        : nfs_serv.c nfs_subs.c nfs_vnops.c xdr_subs.h
	 sys/ntfs       : ntfs_vnops.c
	 sys/sys        : _time.h _types.h dirent.h event.h resource.h
	                 shm.h siginfo.h stat.h sysctl.h time.h types.h 
	                 vnode.h 
	sys/ufs/ext2fs : ext2fs_lookup.c 
	sys/ufs/ufs    : ufs_vnops.c

...

There is no "transitional" stage, when the kernel is already patched, but no user space apps are ready for those changes yet. It all happens at once.

-ss

flukus7y ago

> There is no "transitional" stage, when the kernel is already patched, but no user space apps are ready for those changes yet. It all happens at once.

What about third party apps? It's not a fully self contained system, there are binaries out there running on OpenBSD that the OpenBSD devs have never heard of, and they were broken by the change.

dguest7y ago

This is a very good point.

I work in an organization that just switched to monolithic and it's been going very well, with hundreds of active developers and millions of lines of code. But our developers are students or academics. As many as half don't understand the concept of an ABI. So the monorepo works quite well for us because rebuilding from scratch is something we do multiple times a day.

perlgeek7y ago

... and all the third-party software that was compiled for older versions of OpenBSD is now also broken by default.

The problem is that this approach only works if it is really a self-contained system. But OpenBSD isn't: it's a basis to run software, potentially third-party software. It can't be a closed Universe and still be useful at the same time.

ChrisCinelli7y ago

Managing dependencies and versions across repos is a pain. Refactoring is quite hard when your code is spread across repos, given the tree of dependencies.

Unfortunately Git checks out all the code, including history, at once, and it does not scale to big codebases.

The approach that Facebook chose with Mercurial seems a good compromise ( https://code.fb.com/core-data/scaling-mercurial-at-facebook/ )

jsolson7y ago

As mentioned in the post (which is from 2016), Google has also been experimenting with Mercurial as a frontend (in collaboration with "contributors from other companies that value the monolithic source model"). As an avid user of that experiment at Google, it seems to be going very well.

ChrisCinelli7y ago

I am not surprised. code.google.com used Mercurial. But I am still curious. Is Mercurial the frontend of Piper, or could it live independently? What is open sourced and what is not?

kajecounterhack7y ago

Mercurial is used as a frontend to piper in that experiment. It doesn't live independently. Piper isn't open source.

csdreamer77y ago

Doesn't the Git Virtual File system that Microsoft is contributing to Git take care of this?

https://blogs.msdn.microsoft.com/devops/2017/02/03/announcin...

Edit: don't just down vote. If you have a problem with my comment, tell me why.

justincormack7y ago

This currently only works on Windows, although they are planning OSX and Linux ports.

csdreamer77y ago

That is important. My link was back in 2017. Do you know of any ETA for GVFS getting on Linux?

bluedino7y ago

>> Unfortunately Git checkout all the code, including history, at once and it does not scale to big codebases

A shallow clone can be helpful in cases like this

justinjlynn7y ago

Shallow clones have some significant limitations in terms of the operations that can be performed on the resultant shallow image. The limitations used to be quite severe, but many of the initial ones have been lifted with improved implementations. Shallow clones are still tricky to do deep in the tree, but they are a potential option for huge repos.

antt7y ago

Git works very well when the code is distributed. Which funnily enough is in the name. That we are using git as a centralized repository is a case of "Why do I need a screwdriver when I have a hammer?".

shub7y ago

There's nothing about git that requires you to use it like the kernel does. Centralized version control is just a special case of decentralized, if you're using git. You still get the benefits of your repo being a peer of the master repo, like local branches.

EpicEng7y ago

> Managing dependencies and versions across repos is a pain

whack7y ago

Maybe I'm not cool enough to understand this, but I don't see the draw for monorepos. Imagine if you're a tool owner, and you want to make a change that presents significant improvements for 99.9% of people, but causes significant problems for 0.1% of your users. In a versioned world, you can release your change as a new version, and allow your users to self-select if/when/how they want to migrate to the new version. But in a monorepo, you have to either trample over the 0.1%, or let the 0.1% hold everyone else hostage.

Conversely, imagine if you're using some tools developed by a far off team within the company. Every time the tooling team decides to make a change, it will immediately and irrevocably propagate into your stack, whether you like it or not.

If you were at a startup and had a production critical project, would you hardcode specific versions for all your dependencies, and carefully test everything before moving to newer versions? Or would you just set everything to LATEST and hope that none of your dependencies decide to break you the next day? Working with a monorepo is essentially like the latter.
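
A toy illustration of the difference, with a made-up version list and a hypothetical `resolve` helper (not any real package manager's API): the pinned consumer is unaffected by a new upstream release, while the LATEST consumer picks it up immediately.

```python
# Hypothetical published versions of one dependency, as (major, minor, patch).
PUBLISHED = [(1, 4, 0), (1, 4, 1), (1, 5, 0)]


def resolve(published, pin=None):
    """Return the pinned version if given, otherwise the newest one."""
    if pin is not None:
        if pin not in published:
            raise LookupError(f"version {pin} was never published")
        return pin
    return max(published)


before = (resolve(PUBLISHED, pin=(1, 4, 1)), resolve(PUBLISHED))

# Upstream ships a new major release overnight.
PUBLISHED.append((2, 0, 0))

after = (resolve(PUBLISHED, pin=(1, 4, 1)), resolve(PUBLISHED))
print(before)
print(after)
```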

anyfoo7y ago

I've worked both at Google (only as an intern, though) and at other very very big companies with gargantuan code bases. At that scale, with software that is constantly in flux, pretty much the last thing you want is having to keep compatibility between several versions of a component. It's bad enough if you have to do it for external reasons, but if the only reason is so that "others in the company have a choice" then... no, just no.

You might think this ought to be trivial by having clear API contracts, but that's a) not how things work in practice if all code is effectively owned by the same, overarching entity and, more importantly, b) now you have an enormous effort to transition between incompatible API revisions instead of just being able to do lockstep changes, for no real gain.

Even if you manage to pull that off (again, for what benefit?), it will bite you that 1.324.2564 behaves subtly different from 1.324.5234 even though the intent was just to add a new option and they otherwise ought to have no extensional changes in behavior.

whack7y ago

It took me a while to figure out that you're disagreeing with me, because your last paragraph is a perfect example of why monorepos are so dangerous.

Imagine a tooling team on a different continent that makes some changes this afternoon. Like you said, their intent is just to add a new option, and it ought to have no extensional changes in behavior, but it still ends up behaving subtly different. The next morning, all your services end up broken as a result.

In a versioned world, you can still freeze your dependency at 1.324.5234, and migrate only when you want to, and when you're feeling confident about it.

In a monorepo world, you don't have a choice. You've been forcefully migrated as soon as the tooling team decides to make the change on their end. They had the best of intentions, but that doesn't always translate to a good outcome.

FWIW, I'm currently working at a large famous company that uses a monorepo. Color me not-impressed. I do think that having a single repository for an entire team/project is a good idea. Hundreds of different projects and teams who've never seen one another? Not so much.

anyfoo7y ago

> In a versioned world, you can still freeze your dependency at 1.324.5234, and migrate only when you want to, and when you're feeling confident about it.

The correct course of action is to either reverse/fix the code change to the library you depend on, or if your code is clearly using the library wrong and can be easily fixed, to do that. Not to let the whole ecosystem slowly spiral out of control.

Either way, the point is that it will force the issue to be resolved, quickly, and the code base to move forward.

The tools/libraries you depend on are themselves dependent on other libraries and tools. They may have done changes that are necessary to continue working, which you are not picking up if you stay behind. They will do IPC and RPC and always rely on their infrastructure being current.

>In a monorepo world, you don't have a choice. You've been forcefully migrated

Yes, and that's good, because:

> and migrate only when you want to, and when you're feeling confident about it.

... does not help in moving the code forward.

If your change will break others, you need to coordinate with those others so that the transition happens gracefully, not let them live on what amounts to unsupported (and slowly more incompatible) code.

Eridrus7y ago

> The next morning, all your services end up broken as a result.

I mean, this is the argument for having good integration tests.

At some point someone has to figure out if the new code will break a system; if you don't have good integration tests you're basically left eyeballing the changes, and sure eyeballing changes can work fine in small teams, but at some point you need good tests.

I did some work on a ranking system last year, and by definition there's no way to roll that out incrementally, because, well, it is the central component deciding what thing to do/show, and you have to change the world at once, there is literally no other option. So you need good ways of evaluating these wide reaching changes.

"If you liked it then you shoulda put a test on it" :)

shub7y ago

> Like you said, their intent is just to add a new option, and it ought to have no extensional changes in behavior, but it still ends up behaving subtly different. The next morning, all your services end up broken as a result.

Someone makes a commit to library code and production magically breaks? How does that happen?

perlgeek7y ago

I thought that Google deploys new versions gradually (first to 1% of users, and if that doesn't show errors, to more and more). Which implies that there are at least two versions of an application or service running.

How does that work when you don't keep APIs stable, at least at the service boundaries?

crazygringo7y ago

> But in a monorepo, you have to either trample over the 0.1%, or let the 0.1% hold everyone else hostage.

Nope. In a monorepo (like at Google), you're responsible for not breaking anyone else's code, as evidenced by their tests still passing.

So you never trample over the 0.1%. Instead you fix your code, or you fix their code for them -- which was probably due to your own bugs or undefined behavior in the first place. Or else you don't push.

And if you break their code because they didn't have tests? That's their problem, and better for them to learn their lesson sooner than later, because they're breaking engineering standards that they've been told since the day they joined. A monorepo depends, fundamentally, on all code having complete test coverage.

u801e7y ago

> So you never trample over the 0.1%. Instead you fix your code, or you fix their code for them -- which was probably due to your own bugs or undefined behavior in the first place. Or else you don't push.

Given the size of a monorepo, is it possible to run the entire test suite in one's development environment, or do they have another endpoint to push to so that tests run on a dedicated server?

summerlight7y ago

Google has a CI infrastructure which runs most of the affected tests for each commit (which they call a "CL") on thousands of machines in parallel. Even for Google, though, running the entire test suite every time is prohibitively expensive, so they have a way to merge multiple CLs and run them in a single batch every 3 hours, which is useful for testing a CL that may affect hundreds of thousands of build/test targets. If you're interested, this paper may give you an idea of how Google does testing.

https://static.googleusercontent.com/media/research.google.c...

YokoZar7y ago

> Given the size of a monrepo, is it possible to run the entire test suite in one's development environment, or do they have another endpoint to push to to run tests on a dedicated server?

Eventually you hit a point where you need systems to run the tests for you. Making this work is part of the investment in infrastructure and tooling you need to do as a big serious company.

int_19h7y ago

> A monorepo depends, fundamentally, on all code having complete test coverage.

Covering every single line of code still doesn't mean that you have complete behavioral coverage, unless your tests somehow run for all possible inputs. In practice, there will still be holes, not because someone was negligent, but because they missed a corner case specific to some state.
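
A tiny made-up example of the gap: one test executes every line of the function, yet an input nobody thought about still blows up.

```python
def average(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# This single test executes every line of average(), so line coverage is 100%.
assert average([2, 4, 6]) == 4

# The empty-list corner case was never exercised and still fails.
try:
    average([])
    corner_case_ok = True
except ZeroDivisionError:
    corner_case_ok = False
print(corner_case_ok)
```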

perfunctory7y ago

> Working with a monorepo is essentially like the latter.

Not really. In the dependencies analogy the author of the dependency has no way to test the dependee(s). While with monorepo this is exactly what you do, "the tooling team" will "carefully test everything" before "propagate into your stack" (and it doesn't have to be irrevocable).

whack7y ago

In practice, at any medium/large organization, the tooling team doesn't know your system, and its nuances, nearly well enough to "carefully test everything".

Having a solid automated test suite does help. But I personally would like to be in control of when my project updates its dependencies, instead of being forced to always pull everything from LATEST.

joshuamorton7y ago

You are, by having tests.

At Google the contract is essentially infrastructure teams (and generally, your dependencies) will not break your unit tests (or will contact you well in advance to handle changes). But if you don't have a test, they might. They don't have to carefully test everything. You do. And if you don't, breakages are entirely your responsibility, because you didn't have a test for them.

perfunctory7y ago

> the tooling team doesn't know your system

they don't know it because they don't use monorepo. monorepo makes "solid automated tests" easier since basically there is only one version to test. The instinct against pulling everything from LATEST, developed in the traditional world, is perfectly understandable. However in monorepo "your" project is also tooling-team's project. "being forced" becomes "being helped". It's shared responsibility.

perfunctory7y ago

>In a versioned world, you can release your change as a new version, and allow your users to self-select

Repeat this process multiple times and you end up with configuration/settings hell. Been there done that. It's not black and white but "trampling over the 0.1%" could be a sensible business/architectural decision. For example how do you imagine "google maps" users selecting when/how to migrate?

whack7y ago

I was referring only to static dependencies, like Guava for example. Static dependencies don't require ongoing "upkeep", so you really should allow your users to use an older version of your library/code, if that's what they really want to do.

When it comes to live services that you're actually running on a daily basis, like Google maps, forcing users to migrate makes a lot more sense.

perfunctory7y ago

I see. I guess static dependencies are not as prone to ongoing upkeep as services, but they are not totally immune. Think Heartbleed.

googlemike7y ago

What do you define as a static dependency?

andrewfong7y ago

Not saying this is how Google does it, but a monorepo doesn't prevent you from having multiple versions of the same dependency. Ideally, with a monorepo, you could update 99% of your sub-packages to the latest version while still leaving the remaining 1% alone.

rossjudson7y ago

There are a few exceptions to the "one version" rule; the monorepo and the build system support multiple versions just fine. It's just that we don't want them.

IMAYousaf7y ago

Hello.

What sort of tooling differences would one expect for a monorepo vs. multiple repos?

Is that a factor of something intrinsic about having one big repo, or is that a factor of the scale of the type of organization that Google is?

Thanks.

makecheck7y ago

This is clearly detrimental to external projects such as Go packaging, since their own developers will never be looking at dependency problems in the same way as outside groups.

Monorepo also bugs me because there will always be some external package you need, and invariably it’s almost impossible to integrate due to years of colleagues making internal-only things assume everything imaginable about the structure and behavior of the monorepo. There will be problems not handled, etc. and it leads to a lot of NIH development because it’s almost easier in the end.

Also, it just feels risky from an engineering perspective: if your repository or tools have any upper limits, it seems like you will inevitably find them with a humongous repo. And that will be Break The Company Day because your entire process is essentially set up for monorepo and no one will have any idea how to work without it.

topspin7y ago

> This is clearly detrimental to external projects such as Go packaging

Indeed. Google's monorepo means the largest cohort of Go programmers in the world are mostly indifferent to composing packages in the usual (cpan/maven/composer/npm/nuget/cargo/swift/pip/rubygems/bower/etc) manner. Non-Google Go programmers have been left to schlep around with marginal solutions for years, although in the last few months we begin to see progress here[1]. This was the #1 discouragement I experienced when experimenting with Go.

Google's monorepo may be wonderful from Google's perspective but I don't think it's been a win for Go.

* yes I know some of these are also build systems and provide many other capabilities, some of which are arguably detrimental. Versioned, packaged, signed dependencies and thus repeatable build artifacts is the point.

[1] https://github.com/golang/go/issues/24301

robaato7y ago

What about Android and 800-1,000 git repos?!

Have seen the pain trying to manage that across larger teams (e.g. thousands of devs) - and no the "repo" tool is not sufficient.

nwlieb7y ago

I'm very curious, what pain did you see with the repo.py tool?

tzhenghao7y ago

Having worked at different companies adopting both monorepo and the multiple repos approach, I find monorepo a better normalizer at scale in consolidating all "software" that runs the company.

Just like what many commenters here have mentioned, the monorepo approach is a forcing function on keeping compatibility issues at bay.

What you don't want is to end up in a situation where teams reinvent their own wheels instead of building on top of existing code, and at scale, I think the multiple repo approach tends to breed such codebase smell. [1] I'm sure 8000 repos is living hell for most organizations.

[1] - https://www.youtube.com/watch?v=kb-m2fasdDY

shiift7y ago

I really liked that talk! Lots of relevant information and I can definitely relate, working at Amazon. I wouldn't say that we are hurt by all of the same problems (we have solutions that work very well for some of them), but we are definitely aware of them.

mlthoughts20187y ago

One of my former managers had worked a long time at Google and was present for the advent of Google’s in-house tooling developed around their monorepo.

His account was that it was basically accidental, at first resulting from short term fire drills, and then creating a snowball effect where the momentum of keeping things in the Perforce monorepo and building tooling around it just happened to be the local optimum, and nobody was interested in slowing down or assessing a better way.

He personally thought working with the monorepo was horrible, and in the company where I worked with him, we had dozens of isolated project repos in Git, and used packaging to deploy dependencies. His view, at least, was that the development experience and reliability of this approach was vastly better than Google’s approach, which practically required hiring amazing candidates just to have a hope of a smooth development experience for everyone else.

I laugh cynically to myself about this any time I ever hear anyone comment as if Google’s monorepo or tooling are models of success. It was an accidental, path-dependent kludge on top of Perforce, and there is really no reason to believe it’s a good idea, certainly not the mere fact that Google uses this approach.

gefh7y ago

Do you wonder whether he is a reliable narrator?

mlthoughts20187y ago

I don’t, but it’s fair to ask. He was unequivocally the best senior manager I’ve worked with. Extremely technically smart but skilled at letting people under him work autonomously, good communicator, cared a lot about pushing best practices past bureaucratic barriers.

His description of Google made it seem like it had the same dysfunction every place has. And the monorepo was a totally mundane, garden variety eyesore kind of in-house framework that you’ll find anywhere.

I think he recognized the usefulness of just working with it and picking battles. He was just dumbfounded that any outsider would see the monorepo project and think it possibly had any relevance for anyone else. It was just a Google-history-specific frankenstein sort of thing that got wrangled with tooling later. The supposed benefits are all just retrofitted on.

haglin7y ago

Google's handling of their source code makes me wanna work there.

I don't like distributed version control systems with hundreds of repositories spread out. It makes management more complicated. I understand this is a minority view, but that is my experience. It was easier to work in a single Perforce repository than hundreds of Git or Mercurial repos.

djur7y ago

Distributed vs. centralized VCS has very little directly to do with many vs. monolithic repos. After all, git was originally developed for a project with a large monolithic repo. Distributed VCS and many small repos got popular around the same time, but that's partly coincidental (microservice architectures getting popular, npm community preferring extremely small libraries) and partly because of GitHub making it very cheap in money/time to have many git repos.

a-dub7y ago

It should be noted that the monolithic model is somewhat encouraged by the client mapping system in Perforce, which was Google's first version control system so it is unclear to me if this was deliberate or just a side effect of the best VCS of the time.

I also still have doubts around the value of a monorepo, in the article they claim it's valuable because you get:

Unified versioning, one source of truth;

Extensive code sharing and reuse;

Simplified dependency management;

Atomic changes;

Large-scale refactoring;

Collaboration across teams;

Flexible team boundaries and code ownership; and

Code visibility and clear tree structure providing implicit team namespacing.

With the exception of the niceness of atomic changes for large scale refactoring, I don't really see how the rest are better supported by throwing everything into one, rather than having a bunch of little repos and a little custom tooling to keep them in sync.

malkia7y ago

A monotonically increasing CL number is also useful. You can mark quite a lot of things with it: not only binary releases, but other developments too (configuration files, etc.). In the end your binary "version" consists of a main base CL plus cherry-picked individual CLs, rather than a branch with these fixes. I guess one can encode this with git/hg too, using SHA hashes, but that becomes much bigger in terms of information, and harder for a human to handle.

I guess it's not a very strong point, but using CL numbers (I'm working with Perforce mostly these days) makes things easier. And having one CL number monotonically increasing across all your source code is even better: you can reference things more easily. Just type cl/123456 and your browser can turn it into a link. Among many other not so obvious benefits...

lpghatguy7y ago

Most popular Git frontends (GitHub and GitLab too, I believe) let you link to commits with just the first 5-6 characters of the hash. I don't think that's much different to remember than a Perforce CL number.

malkia7y ago

To me the issue is when mentally trying to work with these numbers, P4 & G4's numbers increment, so I can tell which one came before the other - I can't do this with hashes. I'm sure I can get used to the other way, but this cannot easily be ignored.
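A minimal sketch of the ordering point, assuming a made-up linear history. The CL numbers and the short hashes here are illustrative stand-ins (real Git hashes are computed from the full commit object, not from a message string). With sequential numbers the order can be read off the identifiers themselves; with hashes it has to be looked up in the history.

```python
import hashlib

# Hypothetical linear history, oldest change first.
HISTORY = ["fix typo", "add cache", "remove cache"]

# Sequential change numbers, in the style of Perforce CLs (values are made up).
CL_NUMBERS = [123456 + i for i in range(len(HISTORY))]

# Stand-in short hashes: opaque identifiers with no inherent order.
HASHES = [hashlib.sha1(msg.encode()).hexdigest()[:7] for msg in HISTORY]


def cl_is_before(a, b):
    """Order is visible in the numbers themselves."""
    return a < b


def hash_is_before(a, b, ordered_hashes):
    """Order must be recovered by consulting the recorded history."""
    return ordered_hashes.index(a) < ordered_hashes.index(b)


print(cl_is_before(CL_NUMBERS[0], CL_NUMBERS[2]))
print(hash_is_before(HASHES[0], HASHES[2], HASHES))
```

Both calls answer the same question, but only the second needs the history list as an extra argument.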

techbio7y ago

Previous thread:

https://news.ycombinator.com/item?id=11991479

ridiculous_fish7y ago

> Google's monolithic software repository, which is used by 95% of its software developers worldwide, meets the definition of an ultra-large-scale system, providing evidence the single-source repository model can be scaled successfully

This 95% number is the most surprising part of the article. That implies that the sum of engineers working on Android + Chrome + ChromeOS + all the Google X stuff + long tail of smaller non-google3 projects (Chromecast, etc) constitute only 5% of their engineers. Is e.g. Android really that small?

dlubarov7y ago

They must have meant that 95% of Google engineers use the monorepo in some capacity, even if the majority of their work is done in a different repo.

hyperpape7y ago

I don’t know how to parse the number, but 5% of a billion still leaves 50 million lines of code, or three Linux kernels worth.

dlp2117y ago

I think your interpretation is incorrect. A better way to think of this is that those 5% of people work exclusively on those projects. I'd be very surprised to learn that only 5% of Google engineers work on those projects.

Too7y ago

That 95% is most likely more figurative than fact.

stevesimmons7y ago

My company has a 50m LOC Python codebase in a monorepo. It works really well, given the rate of change of thousands of developers globally. That is only possible because of the significant investment in devtools, testing and the deployment infrastructure.

Here is "Python at Massive Scale", my talk about it at PyData London earlier this year:

https://youtu.be/ZYD9yyMh9Hk

timkrueger7y ago

We have worked with a monorepo since September 2017. I wrote about the migration:

https://timkrueger.me/a-maven-git-monorepo/

Our developers like it, because they can use 'mkdir' to create a new component, search through the complete codebase with 'grep' and navigate with 'cd'.

jamesmiller57y ago

I wish more developers knew of the wonderful "repo" tool[0] developed by the Android devs which allows a monorepo _perspective_ of many git repositories. Breakdown of the repo tool and example manifest files http://blog.udinic.com/2014/05/24/aosp-part-1-get-the-code-u...

[0] https://source.android.com/setup/develop/repo

wrayjustin7y ago

> includes approximately one billion files

...

> including approximately two billion lines of code

_also_

> in nine million unique source files

I should insert a joke about how well the system would do if each source file contained more than two lines of code.

But seriously, this summary could use some work.

rpcastagna7y ago

Binary files (arbitrary example: images used for golden screenshots in tests) have no line counts and are likely skewing the numbers here -- in the way you're (logically) looking to interpret them at least.

From a system design perspective, being able to handle a large number of files regardless of type is an interesting challenge, as is being able to handle a large number of highly indexed text files. All three of those statistics seem potentially interesting for different audiences that might read this paper.

tsycho7y ago

It's not just devops that you need to pull off a large monorepo; the other big thing is a strong testing culture. You have to be able to rely on unit tests from across the code base being a sufficient indicator of whether your commit is good. AND a presubmit process that can compute which parts of the monorepo get affected by your diff, and run tests against them automatically before committing your diff.

Google not only has the above but also has a strong pre-submission code review process which catches large classes of bugs in advance.
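A rough sketch of the "compute which parts of the monorepo get affected by your diff" step, assuming a hypothetical dependency map (the target names and the `:test` suffix convention are made up for illustration, and this is not a description of Google's actual presubmit). The idea is to invert the dependency graph and walk outward from the changed targets.

```python
from collections import defaultdict, deque

# Hypothetical build graph: target -> targets it depends on directly.
DEPS = {
    "//lib/strings": [],
    "//lib/http": ["//lib/strings"],
    "//app/search": ["//lib/http"],
    "//app/mail": ["//lib/strings"],
    "//app/maps": [],
    "//lib/strings:test": ["//lib/strings"],
    "//lib/http:test": ["//lib/http"],
    "//app/search:test": ["//app/search"],
    "//app/mail:test": ["//app/mail"],
    "//app/maps:test": ["//app/maps"],
}


def affected_targets(deps, changed):
    """Return every target that transitively depends on a changed target."""
    # Invert the graph: dependency -> targets that use it directly.
    reverse = defaultdict(set)
    for target, direct in deps.items():
        for dep in direct:
            reverse[dep].add(target)

    # Breadth-first walk over reverse edges, starting from the changed targets.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def affected_tests(deps, changed):
    """Keep only the test targets among the affected ones."""
    return sorted(t for t in affected_targets(deps, changed) if t.endswith(":test"))


print(affected_tests(DEPS, ["//lib/http"]))
# prints ['//app/search:test', '//lib/http:test']
```

A change to `//lib/http` triggers its own test and the search test, while the mail and maps tests are left alone.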

malkia7y ago

Here is the video (with Rachel Potvin), predating the article by some months: https://www.youtube.com/watch?v=W71BTkUbdqE

vbezhenar7y ago

I've used a monorepo for a few small related projects and it worked just fine for me. It's much easier to make related changes across several projects.

joe_fishfish7y ago

This is probably a stupid question, but I couldn't find an answer. Does this mean Google keeps all of its different products in all their different languages and environments in one repo? So like, Android lives in the same repo as Gmail, which is the same repo as all the Waymo code and the Google search engine code as well? That seems insane to me.

krackers7y ago

Android & chromium are kept outside the monorepo

p-schultz7y ago

Yes, exactly. The self-driving car is in there too.

growse7y ago

Why does that seem insane?

paulddraper7y ago

Version controlled repositories are like business offices.

You can have your entire company in one location, or the entire company in separate locations. The most important thing is the logical rather than physical organization: team structure, executive leadership, inter-org dependencies, etc. You can achieve autonomy and good structure with or without separate locations.

A single location reduces barriers, but at some point multiple locations can solve physical and logistical challenges. General rule of thumb is to own and operate office space in as few locations as possible, but at some point you have to take drastic measures one way or another.

(Notice that Google had to invent their own proprietary version control system just for their monorepo. And not even Google actually uses a single repo as the source of truth: e.g. Chromium and Android.)

paulie_a7y ago

I'm sure that properly organized it's okay, but from what I've seen it's mediocre at best; especially with legacy/technical debt, it's a huge mistake.

Start breaking that repo apart, because it probably isn't very/hopefully depending on the debt that exists.

hayleox7y ago

One of the big advantages of the monorepo is actually that it prevents technical debt from accumulating. If a change somewhere else breaks your code, you can't put off dealing with it -- you are forced to fix the issue immediately.

erik_seaberg7y ago

Tech debt is a useful tool and we shouldn't have zero tolerance. I can see wanting to deprecate old versions promptly, but I can't see instantly deprecating every old version with no workaround for mitigating emergencies.

paulie_a7y ago

That makes a lot of sense and I definitely like the idea of that. Unfortunately, unless you either spend a tremendous amount of effort in a legacy system to make that a reality, or start with a new green field, it's not realistic day to day.

the_arun7y ago

Seems like Google uses its own custom Source Control & tools - https://www.quora.com/What-version-control-system-does-Googl....

carapace7y ago

https://en.wikipedia.org/wiki/Conway%27s_law

> "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."

Interestingly, in light of the above adage, this massive repo is organized (if that's the word for it) like a bazaar or flea market. (Rather than like a phone book https://en.wikipedia.org/wiki/Yellow_pages )

alexeiz7y ago

> Trunk-based development. ... is beneficial in part because it avoids the painful merges that often occur when it is time to reconcile long-lived branches. Development on branches is unusual and not well supported at Google, though branches are typically used for releases.

This sounds like the SVN model to me where branches are cumbersome and therefore they are very rare. After getting used to the Git branching model where branches are free and merges are painless, it would be very hard to go back to the old development model without branches.

jbergknoff7y ago

How does CI work with a monorepo? Do you always have to run all the tests and build all the artifacts? Or are there nice ways to say "just build this part of the repo"?

dekhn7y ago

You specify targets. Just like using bazel: bazel build //tensorflow/blah/....

I maintain a small part of the monorepo, and it's really nice to be able say "Run every test that transitively depends on numpy with my uncommitted changes", so you can know if your changes break anybody who uses numpy when you update the version.

Personally I think it would be neat if there was an external "virtual monorepo" that integrated as-close-to-head of all software projects (starting at the root, that's things like absl and icu, with the tails being complex projects like tensorflow), and constantly ran CI to update the base versions of things. Every time I move to the open source world, I basically have to recompile the world from scratch and it's a ton of work.

dlubarov7y ago

It's flexible; presubmit tests can be configured per-directory. There's also an option to run all tests of packages that could be affected by a change based on the Blaze dependency graph.

If you're making changes to a package with tons of dependencies such as Guava, for a risky change you might want to run all affected tests, but for a minor change you might want to run just the standard unit tests. As a compromise, there's also an option to run a random sample of affected tests.
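The "run everything" versus "run a random sample" choice could look roughly like the sketch below. The function name, the mode strings, and the default sample size are all assumptions for illustration, not the real presubmit options.

```python
import random


def pick_tests(affected, mode, sample_size=3, seed=None):
    """Choose which of the affected tests to run.

    mode "all" runs every affected test; mode "sample" runs a random
    subset of at most sample_size tests.
    """
    if mode not in ("all", "sample"):
        raise ValueError(f"unknown mode: {mode!r}")
    tests = sorted(affected)
    if mode == "all" or len(tests) <= sample_size:
        return tests
    # A seeded generator keeps a given presubmit run reproducible.
    rng = random.Random(seed)
    return sorted(rng.sample(tests, sample_size))


# Hypothetical list of tests affected by a change to a widely used package.
AFFECTED = [f"//pkg{i}:test" for i in range(10)]

print(len(pick_tests(AFFECTED, "all")))
print(len(pick_tests(AFFECTED, "sample", sample_size=3, seed=1)))
```

Sampling trades coverage for load: a risky change still warrants the full set.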

FartyMcFarter7y ago

For safe-looking changes, it's OK to only run a subset of the tests (usually including the tests that directly test the changed library).

For changes that are more likely to break distant code, you can run all tests (perhaps bundling together several changes in order not to overload the system).

Alternatively you can take the risk of breaking tests post-submit... this is not very good citizenship, but in some cases it might be reasonable (when the risk is small).

ebikelaw7y ago

Dependencies are explicit, so the build tool (Bazel, to an approximation) computes the transitive closure of requirements of the desired target.

There are more details about testing at [1]

1: https://static.googleusercontent.com/media/research.google.c...
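As a sketch of what "transitive closure of requirements" means in practice, assuming a hypothetical and acyclic dependency map (this is not how Bazel is implemented, just the underlying graph idea): a depth-first walk collects everything the target needs, in an order where each dependency comes before its users.

```python
# Hypothetical build graph: target -> targets it depends on directly.
DEPS = {
    "//app/server": ["//lib/http", "//lib/log"],
    "//lib/http": ["//lib/strings", "//lib/log"],
    "//lib/log": ["//lib/strings"],
    "//lib/strings": [],
    "//app/unrelated": ["//lib/log"],
}


def build_order(deps, target):
    """Return the transitive closure of target, dependencies first.

    Assumes the graph has no cycles; a real build tool would detect
    and reject them.
    """
    order = []
    visited = set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for dep in deps.get(node, []):
            visit(dep)
        # Post-order: a node is emitted only after all of its dependencies.
        order.append(node)

    visit(target)
    return order


print(build_order(DEPS, "//app/server"))
# prints ['//lib/strings', '//lib/log', '//lib/http', '//app/server']
```

Only the four targets the server actually needs are built; `//app/unrelated` never enters the closure.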

nicodjimenez7y ago

I have slight experience with both monorepos and smaller repos and I think they can both work. The advantage of smaller repos is that they force different components to expose well designed APIs. Bigger repos make sense for products and embedded software; smaller repos make sense for platforms built up of small services communicating over the internet.

djur7y ago

Smaller repos force different components to expose APIs, but I don't think it forces or even encourages the APIs to be well designed. In some cases, having work spread across multiple repos can impede iterative development, meaning that you risk half-assed or, uh, two-and-a-half-assed implementations.

Also, when someone's asking for review for a change that encompasses, say, a change to a service, a change to a client library for that service, and a change to 2-3 other services that use that client library, I know that I cringe a little when suggesting a change, knowing that to implement it is going to require a commit on all of these different repos, waiting for CI to run on each one, etc. I try to only use that impulse to counter the urge to bikeshed, but the temptation is there.

jorblumesea7y ago

Is this really relevant for anyone except for "google scale" companies? For most teams, managing 30-40 services backed by git repos isn't a huge task and doesn't cause many problems.

Is there mature tooling that helps teams manage this, or is this proprietary google magic tooling?

fastball7y ago

Most teams can probably get by with much fewer than 30-40 services. Unless you have 30-40 groups within your team.

jorblumesea7y ago

Even if they had that, managing the contract between and in a small group isn't super difficult.

testcross7y ago

I don't understand why gitlab/github/bitbucket don't provide better tools for monorepos. This is a pretty trendy topic, but there are absolutely no tools helping with access control, good CI, ...

malkia7y ago

What's missing in these is cross-referencing, which is not possible without a somewhat established BUILD system (caps "pun intended"), e.g. something like bazel/build, then a source code indexer, etc.

This becomes very critical for doing reviews, since it allows you to "trace" things without running them, apart from many other things. For example large scale refactorings looking for usages of functions, and other examples like it.

Why can't github/gitlab/etc. do it? Because there could hardly be one encompassing BUILD system to correctly generate this index.

testcross7y ago

They can create a standard file format that has to be generated by the build system. github is in a pretty powerful position. They can create even a shitty version of it and people will follow.

I've been thinking about a tool like this for a long time. A way to attach to each commit not only the diff in the code, but also the list of places affected by the changes (usages of functions that are modified for example). Then during review we wouldn't have only a stupid diff. We would have a list of places to check to be sure that the changes make sense in the context of the project.
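A toy version of that "list of places affected" idea, using Python's standard `ast` module on a single made-up source string. It only finds direct calls by plain name, which is a big simplification: a real indexer would need build information to resolve imports, methods, and other languages.

```python
import ast

# Made-up source file; imagine parse() is the function modified in the commit.
SOURCE = '''
def parse(text):
    return text.strip()

def load(path):
    return parse(open(path).read())

def main():
    data = load("config.txt")
    print(parse(data))
'''


def call_sites(source, changed_functions):
    """Return (line, name) for every direct call to a changed function."""
    tree = ast.parse(source)
    sites = []
    for node in ast.walk(tree):
        # Only plain-name calls like parse(...); attribute calls are skipped.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in changed_functions:
                sites.append((node.lineno, node.func.id))
    return sorted(sites)


print(call_sites(SOURCE, {"parse"}))
# prints [(6, 'parse'), (10, 'parse')]
```

A reviewer would get lines 6 and 10 as places to check alongside the diff itself.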

malkia7y ago

Even if they can, it's one thing to index your own source files every night, and another to index a much bigger amount plus massive numbers of branches, clones, etc. (I'm talking about github). It's not practical, as there is no clear way to say which branch (from git) must be indexed (obviously not all); there is no encompassing "standard" saying so.

That by itself is another BIG PLUS for mono-repo (and "mono"-rules) - things are done one (opinionated) way, trunk based development - but thus giving you things that you won't be able to have normally.

Now, indexing source files is not an easy or cheap task; it's basically a huge MapReduce done over several hours (just guessing), so there must be a reason for this to be done.

IloveHN847y ago

The giant monorepo works only if you're using SVN, with Git it would be tremendous

therealmarv7y ago

Unless you change git like Microsoft did: https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

HNNewer7y ago

sure, but GitHub / GitLab / Bitbucket don't offer it

axaxs7y ago

Sorry, but as someone who has been in orgs that do both, mono repo is a mistake. Constant needs to pull unrelated changes before pushing, pipelines requiring to grab the whole repo for dependencies, etc. I understand the arguments for mono repo, but I never think they outweigh the cons.

robaato7y ago

Well those are issues around having a git mono repo - where the repo is the unit of change - you get it or you don't.

With mono repos such as SVN or Perforce you just work on whatever subset you want.

prepend7y ago

I love these articles. Is there a wiki or collection of detailed descriptions of large company tech practices that isn’t marketing blargh?

I read years ago about Google data ingest, locator process but neglected to bookmark so now can’t find the reference.

mlinksva7y ago

Me too. I don't know of a collection, but others can be found at https://ai.google/research/pubs/ https://research.fb.com/publications/ https://www.microsoft.com/en-us/research/search/?q&content-t... and similar (though only a small fraction give hints about at scale practices, and those would be neat to collect in one place).

Closely related to this post: just noticed a 2018 case study on Advantages and Disadvantages of a Monolithic Repository https://ai.google/research/pubs/pub47040

gervase7y ago

Should probably have a [2016] tag.

guessmyname7y ago

Indeed, but to be fair, the information in the article is based on several research papers from 2011 [1].

And I am 100% sure the idea of having a monolithic project is several years older than that.

I am grateful that the article is re-posted in multiple websites, because just the other day I was in an interview and, while doing my coding challenge, overheard the conversation of a young computer science graduate and another interviewer. The interviewer asked him to explain what was a monolithic repository and the benefits. This guy had no idea what the interviewer was talking about and right there I realized that what many of us take for granted terminology-wise in the IT world, will certainly be a foreign language to young students who are just entering the work force.

[1] http://info.perforce.com/rs/perforce/images/GoogleWhitePaper...

emmelaich7y ago

(2016)

tflinton7y ago

A repo including configuration and data.

How about we stop considering google an engineering leader and just a search leader?

tflinton7y ago

A repo with configuration, secrets and data?

Can we stop considering google an engineering leader and just a search algorithm leader?

curtis7y ago

I think monorepos make a lot of sense when you're talking about millions of lines of code. I'm not at all sure they make sense when you're talking about billions.

gravypod7y ago

I don't think the number of lines matters. I think the interconnection of your code matters. If you have 2 sets of services that are completely uncoupled, then having two monorepos for those two deployments makes sense. If you can guarantee atomic changes across all services that interconnect, you have the benefits monorepos give you.

mason557y ago

Isn’t this only true if you’re doing full CI? Otherwise I could update my service and you can update yours to work with mine but unless we coordinate deployments you still have to worry about interface mismatches. I guess the alternative is you can just never (for a loose definition of never) make breaking changes to an interface. You can only enhance or create a new version.

ebikelaw7y ago

Even at google this is true. There are naturally multiple monorepos :) For example the Linux kernel devs have their own. This makes sense since the kernel-user interface is strongly defined.

jldugger7y ago

Well, this particular monorepo has two billion LoC. But it's not a git monorepo, which matters significantly.

fizixer7y ago

I don't care about that. For me this is incomprehensible:

Why the eff does Google have billions of lines of code in their repo?

I hope they are not counting revisions (e.g., if a single 1-million-line project has 100 revisions, that's 1 million lines, not 100 million).

I have heard that they do count generated code (so it's not all handwritten code). In that case again, I have two things to say:

- that's a bad metric. I could overnight generate a billion lines of code with each line a printf of number_to_word of numbers from 1 to a billion. They want to measure the size of the repo? They should tell us the gigabytes, terabytes etc. But when it's lines of code, it's cheezy and childish to blow up the measure by including lines of generated code.

- But more importantly, I hope the generated code is 90% or more of that repository. Because any less than that would mean that Google engineers have handwritten 100 million or more lines of code throughout the lifetime of the company, in which case I have to ask: what bloated mess do you have on your hands? I thought you guys were the top engineers of the world.

hobls7y ago

wirrbel7y ago

Same line of thinking, just different conclusions.

Developer team dynamics also play a role. I have observed the pattern now multiple times (N=3):

That pattern takes 2-3 years to play out, but I have seen it on every job I worked.

zuppy7y ago

paulddraper7y ago

> It is split up into libraries (or microservices)

It's a frequent problem to conflate organization/modularization with lifecycle/version management.

You can have a well-organized codebase just as easily in a monorepo.

That's a separate question from managing the lifecycle of the code. (What is released and when? What tests are run? What process approves a change?)

marmaduke7y ago

working as dev with academic teams, I usually use many repos for "damage control" as git-ignorant scientists will dump irrelevant files into a repo.

with that in mind, is a monorepo a universally good approach, or is it more dependent on good behavior of team members than polyrepo?

jakoblorz7y ago

I don't think that there is the one size fits all solution especially if you can't expect basic knowledge about git

wirrbel7y ago

A monorepo requires a good Continuous Integration infrastructure if it is supposed to work. Unless those small repos are unit tested, you will not benefit from a monorepo.

paulddraper7y ago

Is there frequent code reuse? If so, monorepos are really nice. If not, separate repos make more sense.

diminoten7y ago

shados7y ago

But people seem to forget that it wasn't that long ago that git didn't exist, making multiple repos was a pain in the butt. Managing multiple repos locally was hell. Monorepos were the norm.

fcarraldo7y ago

People also seem to forget that "Monorepo" or (many) "Microrepos" is not a binary choice.

Is this a solved problem? I typically do make one PR per repo to resolve breaking changes, though it's certainly not a big deal. Still, if there's an easier way, I'd love to hear about it!

shados7y ago

> Is this a solved problem

ec1096857y ago

Microsoft went from multiple smaller repos for windows to one large one: https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

hobls7y ago

(I’m also not sure I’d generally categorize tools work as “dev ops,” though I can certainly see how they end up intertwined.)

lclarkmichalek7y ago

AmericanChopper7y ago

>Both monorepo or "micro repo" end up falling apart at scale without some devops work involved

Wouldn’t any project fall apart without devops work?

shados7y ago

yeah, that's my point.

adamrt7y ago

We moved to a monorepo about 2 years ago and it has been nothing but success for us.

We are a reasonably small team though, so maybe that is part of it.

Tloewald7y ago

Client apps are unavoidably larger repos than the services apps.

Based on my personal experience, I think monorepos are nuts.

ebikelaw7y ago

mcny7y ago

Too7y ago

How is this

   company
     /ProjectA
       /.git
     /ProjectB
       /.git

easier to browse than this?

   company
     /.git
     /ProjectA
     /ProjectB

artursapek7y ago

hobls7y ago

A single team is really helpful. Where I’ve seen it get particularly unhelpful is with multiple teams. I’m also not opposed to the concept, I just think it requires work to do correctly.

bedros7y ago

how do you create branches in mono repo?

for example I want to use branch rev5 from project A and rev3 from project B

how do I do that in a mono repo? I could not do it in HG, but I'm not sure about Git

aidos7y ago

In SVN a branch is simply a convention. You copy (almost zero cost) things around into your branches directory

tom_7y ago

Tloewald7y ago

Can’t you create a branch and merge the two branches you are interested in into that?

dgsb7y ago

Just curious how small is small ? How many kloc ? How many people ?

mr_tristan7y ago

I sense that Google invests much more in its infrastructure than most companies make in revenue.

ryancox7y ago

Agreed about the lack of monorepo tooling. There's just not that much out there. A couple of other links I didn't see in the awesome-monorepo:

- https://github.com/facebookexperimental/mononoke - I hear this is a real thing and not a science fair project

- https://github.com/bors-ng/bors-ng - Needed in a monorepo to handle high arrival rate of commits / merges

golangnews7y ago

What problems specifically did you see? Was this because the repo was too large?

I understand at google scale you'd need lots of tooling, but why at a smaller scale of merging a dozen small repos?

mr_tristan7y ago

013a7y ago

kqr7y ago

It stuck with me, and is applicable to so many things. Including, maybe, this?

kungtotte7y ago

Another question is just the sheer scale of the FAANG companies, making things work at that scale is likely to be counterintuitive sometimes.

I just looked it up, Facebook has 2.2 billion users monthly. That's almost a third of the entire planet.

Shit that makes sense for them won't make sense for 99% of everyone else.

rubenbe7y ago

mrweasel7y ago

Most people/companies aren't Google, but as you say they assume that if one or more of the tech giants (or other very public tech companies) are doing something, then it must be good.

Monorepos are just another item on the heap of things that may be a good idea, but it depends.

3minus17y ago

also known as "cargo culting"

pcwalton7y ago

There's also the fact that monorepos have issues when you don't have one organization responsible for all the code. The Linux kernel and NetHack don't live in the same repository for good reason.

dekhn7y ago

trasz7y ago

akvadrako7y ago

It does seem like the Linux model scales better than the BSD model.

pcwalton7y ago

Fine, replace NetHack with Quake 3. :)

mnm17y ago

shakna7y ago

Submodules? [0]

Easy to use, cutting edge updates.

[0] https://git-scm.com/book/en/v2/Git-Tools-Submodules

[1] https://www.mercurial-scm.org/wiki/Subrepository

glandium7y ago

EnderMB7y ago

I've seen this a few times in the .NET world, mainly as a carry-over from Subversion when we had moved to Mercurial and git.

rco87867y ago

Are you suggesting that there’s a solution to managing large amounts of code that doesn’t involve large amounts of tooling?

hobls7y ago

rco87867y ago

Sure. You could also replace the term “monorepo” with “separate repos” and your statement would be just as valid. Either way you go has pros and cons.

hobs7y ago

heh, thousands, it's probably at least an OOM greater, if not two.

hobls7y ago

waterhouse7y ago

My brain sees OOM and thinks "out of memory", which might be applicable too.

nine_k7y ago

Most folks who consider a monorepo don't have billions of lines of code, and often not even millions.

Linux kernel is a monorepo.

threeseed7y ago

Linux kernel is one functional piece of work though.

Imagine if we combined KDE, Gnome, Linux Kernel, ZFS etc all in the one monorepo.

dmoy7y ago

And gnucash, libreoffice, a couple copies of android, three other things that forked the linux kernel, and then all of apache to boot.

nine_k7y ago

Where there are obvious and pronounced functional boundaries, often backed by administrative boundaries, separate repos make total sense.

Otherwise, it's an optimization; see "premature optimization" for cautions.

acomjean7y ago

But what are the alternatives to the monorepo in git?

All the ways of splitting code up and deploying multiple git repos for one project seem terrible.

tiglionabbit7y ago

favorited7y ago

yehosef7y ago

Here's the article (or one of them) https://code.fb.com/core-data/scaling-mercurial-at-facebook/

reificator7y ago

> multiple git repos for one project

If it's one project, it's not a monorepo. It's a repo.

acomjean7y ago

This seems difficult in git.

There are "submodules" and "subtrees" but none seemed particularly great and as far as I could tell each came with a bunch of caveats.

I'll admit my Git skills aren't great, but I've used a variety of source control and tried to suss out the best way to deal with a small team.

We ended up using "git subrepo" which is an add on thing I don't love, but it works.

part of the motivation is "common" and "project 2" are to be open sourced, but "project 1" which also uses "common" isn't.

forrestthewoods7y ago

> what are the alternatives to the monorepo in git?

A monorepo in Perforce!

threeseed7y ago

The entire programming world revolves around libraries, and yet when it comes to our own code we are afraid of them? Strange.

flukus7y ago

> All the ways of splitting code up and deploying multiple git repos for one project seem terrible.

ebikelaw7y ago

keerthiko7y ago

maxpert7y ago

Spot on! I've seen org wide mono repos at Microsoft and they had their custom tooling and build systems built on top of SourceDepot.

robaato7y ago

Which is just rebadged Perforce :)

georgewfraser7y ago

timkrueger7y ago

Interesting. Have you written something about that migration?

georgewfraser7y ago

ma2rten7y ago

Google used Perforce for a very long time before they built their own version control system.

w_t_payne7y ago

Tooling is required for coordinating configuration management on multiple repositories too.

Also, why isn't such tooling available as open source? I'm trying to do my bit, but we could do with more effort being put into this, somehow.

foota7y ago

Really probably closer to millions of hours.

baybal27y ago

Dumping all code in a single repo, even for a 30-person development shop, was really tough. Doing so for a company of a few thousand must be truly crazy.

I advise Google to replace the person in their internal IT who came up with that idea.

flukus7y ago

> and the thing they all seem to overlook is that Google has THOUSANDS of hours of effort put into the tooling for their monorepo

joshuamorton7y ago

Your last paragraph doesn't sound like anything at Google. Most engineers will never use branches at all, and even fewer will use branches that merge into trunk (instead of away from it).

flukus7y ago

There is a set of code changes locally and those changes are bundled off to the test server to run the full test suite? That's a branch.

Now assuming I can get all of these planets aligned correctly I'm going to need to take this set of changes and put it into the master version aren't I? That's merging my branch into trunk.

jgibson7y ago

edit: I do have one question though, does Google's internal tool handle permissions on a granular basis?

edejong7y ago

The key here is reverse dependency management. “If I change X, what would influence this change?”.

This can be achieved with single repo better than multi-repo due to the completeness of the (dependency) graph.

thomaslee7y ago

Exactly this. Or at least it's a way this can be achieved, assuming solid testing & some tooling in the mix.

For folks unfamiliar with it, the issue is something like:

1. You find a bug in a library A.

2. Libraries B, C and D depend on A.

3. B, C and D in turn are used by various applications.
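The reverse-dependency question ("if I change A, what is affected?") can be sketched as a graph walk. This is a minimal illustration, not anyone's actual tooling: the graph contents and the names `DEPENDS_ON`, `reverse_edges` and `affected_by` are made up for the example, and the app names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: each key depends on the listed values.
DEPENDS_ON = {
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
    "app1": ["B", "C"],
    "app2": ["D"],
    "app3": [],  # does not use A at all
}


def reverse_edges(depends_on):
    """Invert the graph: map each node to the nodes that depend on it directly."""
    used_by = defaultdict(set)
    for node, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(node)
    return used_by


def affected_by(changed, depends_on):
    """Return everything that transitively depends on `changed` (breadth-first)."""
    used_by = reverse_edges(depends_on)
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(affected_by("A", DEPENDS_ON)))
```

With everything in one repository the `DEPENDS_ON` table can in principle be complete; across many repositories, some of those edges live in places the author of the fix cannot see.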

[0] https://github.com/nebula-plugins/gradle-dependency-lock-plu...

bluejekyll7y ago

> So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this.

Here’s a tech talk Linus gave at Google in 2007: https://youtu.be/4XpnKHJAok8

rkangel7y ago

There's a subtler, and potentially more important thing that can crop up with your scenario:

Too7y ago

joshuamorton7y ago

Number three I disagree with. Bisection depends on build (and test) always working on trunk.

justicezyx7y ago

Single repo is one design that coherently addresses source control management and dependency management.

The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.

MaxBarraclough7y ago

> The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.

By that do you mean it's one way of doing it, or that it's the only way?

Seems clear to me that it's not the only way. For instance .Net code tends to be Git for the project source + NuGet for external dependencies. It works pretty well.

justicezyx7y ago

It's one way. There isn't any problem that can only be solved in one way.

evfanknitram7y ago

I don't know what this means.

How is "single repo" a "design" and how does this design dictate dependency management?

Yes, if you have a single repo then that would be a single source of data for building your stuff. That seems redundant.

justicezyx7y ago

See Bazel: you have the deps manifested as source-controlled data, then you can build everything as deterministically as possible.

Then you can manage dependency as part of the normal source control process.

adrianN7y ago

A single repo makes it a bit tricky to use some library in version A for project X and version B for project Y.

paulddraper7y ago

Correct.

You can consider that a bad thing or a good thing.

Most languages' package composition (C/C++, Java, Python, Ruby) doesn't permit running multiple versions at runtime. The single-version policy is one way of addressing dependency hell.
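To make the single-version idea concrete, here is a small sketch that flags libraries pinned at more than one version across projects. The project names, library names and versions are invented for illustration, and `version_conflicts` is a hypothetical helper, not part of any real build tool.

```python
from collections import defaultdict

# Hypothetical pins: project -> {library: version}.
REQUIREMENTS = {
    "project_x": {"libfoo": "1.2", "libbar": "3.0"},
    "project_y": {"libfoo": "2.0", "libbar": "3.0"},
}


def version_conflicts(requirements):
    """Return {library: sorted versions} for libraries pinned at more than one version."""
    versions = defaultdict(set)
    for pins in requirements.values():
        for library, version in pins.items():
            versions[library].add(version)
    return {lib: sorted(found) for lib, found in versions.items() if len(found) > 1}


print(version_conflicts(REQUIREMENTS))
```

Under a single-version policy the result has to be empty before a change can land; in a multi-repo setup the same conflict tends to surface later, when something tries to link both projects together.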

bunderbunder7y ago

Nihilartikel7y ago

If I recall, in Google's build system, a dependency in the source tree can be referenced at a commit ID, so you can actually have a dependency on an earlier version of artifacts in source control.

gberger7y ago

Yes, Google's internal tool handles permissions based on directory owners.

They use the same OWNERS-file model as in the Chromium project [1], the only difference being the tooling (Chromium is git, google3 is ... its own Perforce-based thing).

[1] https://chromium.googlesource.com/chromium/src/+/lkcr/docs/c...
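A rough sketch of how directory-based ownership can be resolved: collect owners from the file's directory and every parent up to the root. This is a simplification under stated assumptions. The in-memory `OWNERS` table, the e-mail addresses and the helper names are hypothetical, and real OWNERS files support more directives than plain inheritance, which this ignores.

```python
from pathlib import PurePosixPath

# Hypothetical OWNERS contents, keyed by directory relative to the repo root.
OWNERS = {
    ".": ["root-admin@example.com"],
    "search": ["alice@example.com"],
    "search/indexer": ["bob@example.com"],
}


def owners_for(file_path, owners=OWNERS):
    """Collect owners from the file's directory up to the repo root, nearest first."""
    result = []
    directory = PurePosixPath(file_path).parent
    while True:
        key = str(directory)
        for owner in owners.get(key, []):
            if owner not in result:
                result.append(owner)
        # Stop at the relative root, or at "/" if an absolute path was given.
        if key == "." or directory.parent == directory:
            break
        directory = directory.parent
    return result


def may_approve(user, file_path, owners=OWNERS):
    """True if `user` owns the file's directory or any parent directory."""
    return user in owners_for(file_path, owners)


print(owners_for("search/indexer/main.cc"))
```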

maccard7y ago

I can't comment specifically on Google's tool, but I know it's based on perforce. perforce does have granular permissions - https://www.perforce.com/perforce/r15.1/manuals/p4sag/chapte...

paulddraper7y ago

> The two don't have to be combined.

Google has chosen to combine them in a different way, where every commit of a library implicitly produces a new version, and all downstream projects use that.

> Google's internal tool handle permissions on a granular basis?

fps_doug7y ago

bananarepdev7y ago

senozhatsky7y ago

> OpenBSD is year 2038 ready and will run well

> beyond Tue Jan 19 03:14:07 2038 UTC

IOW, a monolithic repository makes some things easier:

a) make a dramatic change to A

b) rebuild the world

c) see what's broken, patch it

d) while there are regressions or build breakages, goto (b)

e) commit everything
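Steps (a) to (e) can be sketched as a loop over a toy source tree. Everything here is an assumption made for illustration: the component names, the `DEPENDS_ON` table, and the rule that a build pass only reports a broken component once its own dependencies build again.

```python
# Hypothetical tree: component -> components it is built on top of.
DEPENDS_ON = {
    "kernel": [],
    "libc": ["kernel"],
    "ls": ["libc"],
    "date": ["libc"],
}


def rebuild_world(stale):
    """Return the stale components visible in this pass (those whose deps all build)."""
    return sorted(c for c in stale if not any(d in stale for d in DEPENDS_ON[c]))


def land_change(stale_after_change):
    """Iterate rebuild/patch until nothing is broken, then 'commit' once."""
    stale = set(stale_after_change)      # (a) fallout of the dramatic change
    patched = []
    rounds = 0
    while True:
        broken = rebuild_world(stale)    # (b) rebuild the world
        if not broken:                   # (d) loop ends when the build is clean
            break
        rounds += 1
        for component in broken:         # (c) patch what this pass revealed
            stale.discard(component)
            patched.append(component)
    return patched, rounds               # (e) everything goes in together


# A kernel time_t change that leaves libc and date needing fixes.
print(land_change({"libc", "date"}))
```

The point of the sketch is the single return at the end: no intermediate state is ever published in which the kernel is changed but `libc` or `date` is not.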

[0] http://www.openbsd.org/55.html?hn

[UPDATE: fixed spelling errors... umm, some of them]

-ss

glandium7y ago

Monolithic repository might have been a tool that helped enforce it, but that's not what made it happen. It's the decision that ABI could be broken that did.

senozhatsky7y ago

> Monolithic repository might have been a tool that helped enforce it, > but that's not what made it happen. It's the decision that ABI could > be broken that did.

Well, there are probably some subtle details which I'm missing, and maybe you are totally right.

The way it looks to me is as follows: They are "happy to break kernel ABI compatibility" because the repository is monolithic - they break ABI, they immediately fix user space apps.

E.g. OpenBSD time_t 64-bit commit: https://marc.info/?l=openbsd-cvs&m=137637321205010&w=2

They patched the kernel:

	 sys/kern       : kern_clock.c kern_descrip.c kern_event.c
	                 kern_exit.c kern_resource.c kern_subr.c 
	                 kern_synch.c kern_time.c sys_generic.c 
	                 syscalls.conf syscalls.master vfs_getcwd.c 
	                 vfs_syscalls.c vfs_vops.c

and fixed broken user space at the same time:

...

	 sys/msdosfs    : msdosfs_vnops.c
	 sys/netinet6   : in6.c nd6.c
	 sys/nfs        : nfs_serv.c nfs_subs.c nfs_vnops.c xdr_subs.h
	 sys/ntfs       : ntfs_vnops.c
	 sys/sys        : _time.h _types.h dirent.h event.h resource.h
	                 shm.h siginfo.h stat.h sysctl.h time.h types.h 
	                 vnode.h 
	sys/ufs/ext2fs : ext2fs_lookup.c 
	sys/ufs/ufs    : ufs_vnops.c

...

There is no "transitional" stage, when the kernel is already patched, but no user space apps are ready for those changes yet. It all happens at once.

-ss

flukus7y ago

> There is no "transitional" stage, when the kernel is already patched, but no user space apps are ready for those changes yet. It all happens at once.

What about third-party apps? It's not a fully self-contained system; there are binaries out there running on OpenBSD that the OpenBSD devs have never heard of, and they were broken by the change.

dguest7y ago

This is a very good point.

perlgeek7y ago

... and all the third-party software that was compiled for older versions of OpenBSD is now also broken by default.

ChrisCinelli7y ago

Managing dependencies and versions across repos is a pain. Refactoring across repos is quite hard when your code spreads across repos considering the tree of dependencies.

Unfortunately, Git checks out all the code, including history, at once, and it does not scale to big codebases.

The approach that Facebook chose with Mercurial seems a good compromise ( https://code.fb.com/core-data/scaling-mercurial-at-facebook/ )

jsolson7y ago

ChrisCinelli7y ago

I am not surprised. code.google.com used Mercurial. But I am still curious. Is Mercurial the frontend of Piper, or could it live independently? What is open sourced and what is not?

kajecounterhack7y ago

Mercurial is used as a frontend to piper in that experiment. It doesn't live independently. Piper isn't open source.

csdreamer77y ago

Doesn't the Git Virtual File system that Microsoft is contributing to Git take care of this?

https://blogs.msdn.microsoft.com/devops/2017/02/03/announcin...

Edit: don't just down vote. If you have a problem with my comment, tell me why.

justincormack7y ago

This currently only works on Windows, although they are planning OSX and Linux ports.

csdreamer77y ago

That is important. My link was back in 2017. Do you know of any ETA for GVFS getting on Linux?

bluedino7y ago

>> Unfortunately, Git checks out all the code, including history, at once, and it does not scale to big codebases

A shallow clone can be helpful in cases like this
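For reference, a shallow clone is requested with `git clone --depth <n>`. A minimal Python sketch, assuming `git` is installed and on `PATH`; the helper names and the example URL are placeholders, not a real repository.

```python
import subprocess


def shallow_clone_command(url, destination, depth=1):
    """Build the argv for a clone that fetches only the most recent `depth` commits."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return ["git", "clone", "--depth", str(depth), url, destination]


def shallow_clone(url, destination, depth=1):
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(shallow_clone_command(url, destination, depth), check=True)


# Placeholder URL, shown only to illustrate the resulting command line.
print(" ".join(shallow_clone_command("https://example.com/big-repo.git", "big-repo")))
```

Note that this only truncates history. The working tree at the tip is still checked out in full, so it helps with a long history more than with a very wide tree.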

justinjlynn7y ago

antt7y ago

shub7y ago

EpicEng7y ago

> Managing dependencies and versions across repos is a pain

whack7y ago

anyfoo7y ago

whack7y ago

It took me a while to figure out that you're disagreeing with me, because your last paragraph is a perfect example of why monorepos are so dangerous.

In a versioned world, you can still freeze your dependency at 1.324.5234, and migrate only when you want to, and when you're feeling confident about it.

anyfoo7y ago

> In a versioned world, you can still freeze your dependency at 1.324.5234, and migrate only when you want to, and when you're feeling confident about it.

Either way, the point is that it will force the issue to be resolved, quickly, and the code base to move forward.

>In a monorepo world, you don't have a choice. You've been forcefully migrated

Yes, and that's good, because:

> and migrate only when you want to, and when you're feeling confident about it.

... does not help in moving the code forward.

Eridrus7y ago

> The next morning, all your services end up broken as a result.

I mean, this is the argument for having good integration tests.

"If you liked it then you shoulda put a test on it" :)

shub7y ago

Someone makes a commit to library code and production magically breaks? How does that happen?

perlgeek7y ago

How does that work when you don't keep APIs stable, at least at the service boundaries?

crazygringo7y ago

> But in a monorepo, you have to either trample over the 0.1%, or let the 0.1% hold everyone else hostage.

Nope. In a monorepo (like at Google), you're responsible for not breaking anyone else's code, as evidenced by their tests still passing.

u801e7y ago

Given the size of a monorepo, is it possible to run the entire test suite in one's development environment, or do they have another endpoint to push to in order to run tests on a dedicated server?

summerlight7y ago

https://static.googleusercontent.com/media/research.google.c...

YokoZar7y ago

> Given the size of a monorepo, is it possible to run the entire test suite in one's development environment, or do they have another endpoint to push to in order to run tests on a dedicated server?

Eventually you hit a point where you need systems to run the tests for you. Making this work is part of the investment in infrastructure and tooling you need to do as a big serious company.

int_19h7y ago

> A monorepo depends, fundamentally, on all code having complete test coverage.

perfunctory7y ago

> Working with a monorepo is essentially like the latter.

whack7y ago

In practice, at any medium/large organization, the tooling team doesn't know your system, and its nuances, nearly well enough to "carefully test everything".

Having a solid automated test suite does help. But I personally would like to be in control of when my project updates its dependencies, instead of being forced to always pull everything from LATEST.

joshuamorton7y ago

You are, by having tests.

perfunctory7y ago

> the tooling team doesn't know your system

perfunctory7y ago

>In a versioned world, you can release your change as a new version, and allow your users to self-select

whack7y ago

When it comes to live services that you're actually running on a daily basis, like Google maps, forcing users to migrate makes a lot more sense.

perfunctory7y ago

I see. I guess static dependencies are not as prone to ongoing upkeep as services, but they are not totally immune. Think Heartbleed.

googlemike7y ago

What do you define as a static dependency?

andrewfong7y ago

rossjudson7y ago

There are a few exceptions to the "one version" rule; the monorepo and the build system support multiple versions just fine. It's just that we don't want them.

IMAYousaf7y ago

Hello.

What sort of tooling differences would one expect for a monorepo vs. multiple repos?

Is that a factor of something intrinsic about having one big repo, or is that a factor of the scale of the type of organization that Google is?

Thanks.

makecheck7y ago

This is clearly detrimental to external projects such as Go packaging, since their own developers will never be looking at dependency problems in the same way as outside groups.

topspin7y ago

> This is clearly detrimental to external projects such as Go packaging

Google's monorepo may be wonderful from Google's perspective but I don't think it's been a win for Go.

[1] https://github.com/golang/go/issues/24301

robaato7y ago

What about Android and 800-1,000 git repos?!

Have seen the pain trying to manage that across larger teams (e.g. thousands of devs) - and no the "repo" tool is not sufficient.

nwlieb7y ago

I'm very curious, what pain did you see with the repo.py tool?

tzhenghao7y ago

Having worked at different companies adopting both monorepo and the multiple repos approach, I find monorepo a better normalizer at scale in consolidating all "software" that runs the company.

Just like what many commenters here have mentioned, the monorepo approach is a forcing function on keeping compatibility issues at bay.

[1] - https://www.youtube.com/watch?v=kb-m2fasdDY

shiift7y ago

mlthoughts20187y ago

One of my former managers had worked a long time at Google and was present for the advent of Google’s in-house tooling developed around their monorepo.

gefh7y ago

Do you wonder whether he is a reliable narrator?

mlthoughts20187y ago

haglin7y ago

Google's handling of their source code makes me wanna work there.

djur7y ago

a-dub7y ago

I also still have doubts around the value of a monorepo, in the article they claim it's valuable because you get:

Unified versioning, one source of truth;

Extensive code sharing and reuse;

Simplified dependency management;

Atomic changes;

Large-scale refactoring;

Collaboration across teams;

Flexible team boundaries and code ownership; and

Code visibility and clear tree structure providing implicit team namespacing.

malkia7y ago

lpghatguy7y ago

malkia7y ago

techbio7y ago

Previous thread:

https://news.ycombinator.com/item?id=11991479

ridiculous_fish7y ago

dlubarov7y ago

They must have meant that 95% of Google engineers use the monorepo in some capacity, even if the majority of their work is done in a different repo.

hyperpape7y ago

I don’t know how to parse the number, but 5% of a billion still leaves 50 million lines of code, or three Linux kernels worth.

dlp2117y ago

Too7y ago

That 95% is most likely more figurative than fact.

stevesimmons7y ago

Here is "Python at Massive Scale", my talk about it at PyData London earlier this year:

https://youtu.be/ZYD9yyMh9Hk

timkrueger7y ago

We have worked with a monorepo since September 2017. I wrote about the migration:

https://timkrueger.me/a-maven-git-monorepo/

Our developers like it, because they can use 'mkdir' to create a new component, search through the complete codebase with 'grep' and navigate with 'cd'.

jamesmiller57y ago

[0] https://source.android.com/setup/develop/repo

wrayjustin7y ago

> includes approximately one billion files

...

> including approximately two billion lines of code

_also_

> in nine million unique source files

I should insert a joke about how well the system would do if each source file contained more than two lines of code.

But seriously, this summary could use some work.
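Spelling out the arithmetic behind the joke, using only the three approximate figures quoted above (the variable names are just for this example):

```python
# Approximate figures as quoted from the article's summary.
total_files = 1_000_000_000
source_files = 9_000_000
lines_of_code = 2_000_000_000

lines_per_file_overall = lines_of_code / total_files      # the "two lines per file" reading
lines_per_source_file = lines_of_code / source_files      # the more sensible reading
source_share_percent = 100 * source_files / total_files   # share of files that are source

print(lines_per_file_overall)
print(round(lines_per_source_file))
print(source_share_percent)
```

So the numbers are consistent if the billion files are mostly not source files: about 222 lines per source file, with source files making up under 1% of the total file count.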

rpcastagna7y ago

tsycho7y ago

Google not only has the above but also has a strong pre-submission code review process which catches large classes of bugs in advance.

malkia7y ago

Here is the video (with Rachel Potvin), predating the article by some months: https://www.youtube.com/watch?v=W71BTkUbdqE

vbezhenar7y ago

I've used a monorepo for a few small related projects and it worked just fine for me. Much easier to make related changes across several projects.

joe_fishfish7y ago

krackers7y ago

Android & chromium are kept outside the monorepo

p-schultz7y ago

Yes, exactly. The self-driving car is in there too.

growse7y ago

Why does that seem insane?

paulddraper7y ago

Version controlled repositories are like business offices.

paulie_a7y ago

I'm sure that, properly organized, it's okay, but from what I've seen it's mediocre at best; especially with legacy/technical debt it's a huge mistake.

Start breaking that repo apart, because it probably isn't very/hopefully depending on the debt that exists.

hayleox7y ago

erik_seaberg7y ago

paulie_a7y ago

the_arun7y ago

Seems like Google uses its own custom Source Control & tools - https://www.quora.com/What-version-control-system-does-Googl....

carapace7y ago

https://en.wikipedia.org/wiki/Conway%27s_law

> "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."

alexeiz7y ago

jbergknoff7y ago

How does CI work with a monorepo? Do you always have to run all the tests and build all the artifacts? Or are there nice ways to say "just build this part of the repo"?

dekhn7y ago

You specify targets. Just like using bazel: bazel build //tensorflow/blah/....

dlubarov7y ago

It's flexible; presubmit tests can be configured per-directory. There's also an option to run all tests of packages that could be affected by a change based on the Blaze dependency graph.

FartyMcFarter7y ago

For safe-looking changes, it's OK to only run a subset of the tests (usually including the tests that directly test the changed library).

For changes that are more likely to break distant code, you can run all tests (perhaps bundling together several changes in order not to overload the system).

Alternatively you can take the risk of breaking tests post-submit... this is not very good citizenship, but in some cases it might be reasonable (when the risk is small).

ebikelaw7y ago

Dependencies are explicit, so the build tool (Bazel, to an approximation) computes the transitive closure of requirements of the desired target.
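As a rough illustration of "transitive closure of requirements", here is a depth-first walk over an explicit dependency table that returns a target plus everything it needs, dependencies first. The labels and the `DEPS` table are hypothetical, and this is a sketch of the general idea, not of how Bazel is implemented.

```python
# Hypothetical targets: each label lists its direct dependencies.
DEPS = {
    "//app:server": ["//lib:http", "//lib:storage"],
    "//lib:http": ["//lib:base"],
    "//lib:storage": ["//lib:base"],
    "//lib:base": [],
    "//tools:unrelated": ["//lib:base"],
}


def transitive_closure(target, deps=DEPS):
    """Return `target` and all it requires, ordered so dependencies come first."""
    order = []
    visited = set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for dep in deps[node]:
            visit(dep)
        order.append(node)  # appended only after all of its dependencies

    visit(target)
    return order


print(transitive_closure("//app:server"))
```

Nothing outside the returned list needs to be built or tested for that target, which is why `//tools:unrelated` never shows up.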

There are more details about testing at [1]

1: https://static.googleusercontent.com/media/research.google.c...

nicodjimenez7y ago

djur7y ago

jorblumesea7y ago

Is this really relevant for anyone except for "google scale" companies? For most teams, managing 30-40 services backed by git repos isn't a huge task and doesn't cause many problems.

Is there mature tooling that helps teams manage this, or is this proprietary google magic tooling?

fastball7y ago

Most teams can probably get by with much fewer than 30-40 services. Unless you have 30-40 groups within your team.

jorblumesea7y ago

Even if they had that, managing the contract between and in a small group isn't super difficult.

testcross7y ago

I don't understand why GitLab/GitHub/Bitbucket don't provide better tools for monorepos. This is a pretty trendy topic, but there are absolutely no tools helping with access control, good CI, ...

malkia7y ago

What's missing in these is cross-referencing, which is not possible without a somewhat established BUILD system (caps pun intended), e.g. something like bazel/build, then a source code indexer, etc.

Why can't GitHub/GitLab/etc. do it? Because there could hardly be one encompassing BUILD system to generate this index correctly.

testcross7y ago

They can create a standard file format that has to be generated by the build system. GitHub is in a pretty powerful position. They can create even a shitty version of it and people will follow.

malkia7y ago

Now, indexing source files is not an easy or cheap task; it's basically a huge MapReduce done over several hours (just guessing), so there must be a reason for this to be done.

IloveHN847y ago

The giant monorepo works only if you're using SVN, with Git it would be tremendous

therealmarv7y ago

Unless you change git like Microsoft did: https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

HNNewer7y ago

sure, but GitHub / GitLab / Bitbucket don't offer it

axaxs7y ago

robaato7y ago

Well those are issues around having a git mono repo - where the repo is the unit of change - you get it or you don't.

With mono repos such as SVN or Perforce you just work on whatever subset you want.

prepend7y ago

I love these articles. Is there a wiki or collection of detailed descriptions of large company tech practices that isn’t marketing blargh?

I read years ago about Google data ingest, locator process but neglected to bookmark so now can’t find the reference.

mlinksva7y ago

Closely related to this post: just noticed a 2018 case study on Advantages and Disadvantages of a Monolithic Repository https://ai.google/research/pubs/pub47040

gervase7y ago

Should probably have a [2016] tag.

guessmyname7y ago

Indeed, but to be fair, the information in the article is based on several research papers from 2011 [1].

And I am 100% sure the idea of having a monolithic project is several years older than that.

[1] http://info.perforce.com/rs/perforce/images/GoogleWhitePaper...

emmelaich7y ago

(2016)

tflinton7y ago

A repo including configuration and data.

How about we stop considering google an engineering leader and just a search leader?

tflinton7y ago

A repo with configuration, secrets and data?

Can we stop considering google an engineering leader and just a search algorithm leader?

curtis7y ago

I think monorepos make a lot of sense when you're talking about millions of lines of code. I'm not at all sure they make sense when you're talking about billions.

gravypod7y ago

mason557y ago

ebikelaw7y ago

Even at google this is true. There are naturally multiple monorepos :) For example the Linux kernel devs have their own. This makes sense since the kernel-user interface is strongly defined.

jldugger7y ago

Well, this particular monorepo has two billion LoC. But it's not a git monorepo, which matters significantly.

fizixer7y ago

I don't care about that. For me this is incomprehensible:

Why the eff does Google have billions of lines of code in their repo?

I hope they are not counting revisions (e.g., if a single 1 million line project has 100 revisions, that's 1 million, not 100 million).

I have heard that they do count generated code (so it's not all handwritten code). In that case again, I have two things to say: