I feel terrible for anyone trying to run a company with open-source style independent repos. On a popular GitHub project, you have MANY potential contributors who will tell you if a PR or a release candidate breaks API compatibility, etc. There are thousands of hours in open source dedicated to fixing integration issues due to the (unavoidable) poly-repo situation.
Monorepos in companies are relatively simple. You need to dedicate some effort in your CI and CD infrastructure, but you'll win magnitudes by avoiding integration issues. Enough tooling is out there already to make it easy on you.
Monorepos' biggest problem in an org is the funding, as integration topics are often deprioritized by management, and "we spend 10k per year on monorepo engineering" for some reason is a tough sell for orgs, who seem to prefer to "spend 5k for each of the 5 teams so that they maintain their own CD ways and struggle integrating, which incurs another 20k that just is not explicitly labeled as such".
Developer team dynamics also play a role. I have observed the pattern now multiple times (N=3):
* Developers have a monolithic repo that has accumulated a few odd corners over time.
* The feeling builds up that this monolithic repo needs to be modularized.
* It is split up into libraries (or microservices). This is kind of painful, but feels liberating at first (now finally John does not break my builds anymore).
* Folks realize: John doesn't break my builds anymore, but now I need to wait for integration on the test system to learn if he broke my code, and sometimes I only learn it in production.
* People start posting blog posts on monorepos.
That pattern takes 2-3 years to play out, but I have seen it on every job I worked.
It's a frequent problem to conflate organization/modularization with lifecycle/version management.
You can have a well-organized codebase just as easily in a monorepo.
That's a separate question from managing the lifecycle of the code. (What is released and when? What tests are run? What process approves a change?)
With that in mind, is a monorepo a universally good approach, or is it more dependent on good behavior of team members than a polyrepo?
Monorepo people ignore the learned lessons of those who came before us, and are trying to drag their teams back into a simpler time that, while nice, does not exist anymore. If you use any dependencies at all, you don't live in a monorepo world, and lying to yourself and your coworkers will only leave you confused and angry that your expectations are constantly not being met.
The solution isn't to split every single component into its own repo, but pretending like that's what anyone rational is proposing is not working with the best form of the argument. It's not always completely clear how to split up a growing codebase, but to claim that it's not usually worth splitting up is Wrong.
But people seem to forget that it wasn't that long ago that git didn't exist, making multiple repos was a pain in the butt. Managing multiple repos locally was hell. Monorepos were the norm.
Then as the state of version control ramped up, and making repos became easy, and having so much code in one repo had performance issues (overnight CVS/SourceSafe/SVN pull on your first day at work anyone? Branches that take hours to create?), people started making repos per project. The micro-service fad made that a no-brainer.
Now, for companies like Facebook and Google, or really any company that wrote code before the modern days and has a non-trivial amount of it, switching was not exactly a simple matter. So they just poured their energy into making the monorepo work. They're not the only ones to do it either (though not everyone has to do it at Google, Facebook or Microsoft scale, obviously, so it's a bit easier for most). And so it works. And then people forget how to make distributed repos work and claim things like "omg I have to make 1 PR per repo when making breaking changes!", as if it was a big deal or it wasn't a solved problem.
You can have both tiny repositories which do a single thing and large repositories that consist of many projects. It's totally cool to have both, assuming your team can be trusted to make the appropriate choices as they create new projects.
> And then people forget how to make distributed repos work and claim things like "omg I have to make 1 PR per repo when making breaking changes!", as if it was a big deal or it wasn't a solved problem.
Is this a solved problem? I typically do make one PR per repo to resolve breaking changes, though it's certainly not a big deal. Still, if there's an easier way, I'd love to hear about it!
(I’m also not sure I’d generally categorize tools work as “dev ops,” though I can certainly see how they end up intertwined.)
Wouldn’t any project fall apart without devops work?
We have quite a few projects but only 4 major applications. Maybe it is that a few of our projects intertwine a bit so making spanning changes in separate repositories was a pain. Doing separate PRs, etc. Now changes are more atomic. Our entire infrastructure can be brought up in development with a single docker-compose file and all development apps are communicating with each other. I don't think we've had any issues that I can recall.
We are a reasonably small team though, so maybe that is part of it.
I should add that a huge amount was invested in tooling. We had an in-house IDE with debug tools that could step through serverside code. We had a highly optimized code search tool. We had modified a major version control system so it could handle our codebase. (Indeed we picked our version control system because we needed to fork it and the other major version control system was less amenable to our PRs.)
My current job we have a micro service architecture and lots of small, focused repos. Each repo is self-documented. Anyone can checkout and build anything. We don’t need obscene dev servers. We have not hugely invested in tools or workflow.
Client apps are unavoidably larger repos than the services apps.
Based on my personal experience, I think monorepos are nuts.
For example, I want to use branch rev5 from project A and rev3 from project B.
How do I do that in a monorepo? I could not do it in HG, but I'm not sure about Git.
I've worked with monorepos, and I'd be loath to recommend it as well; the combination of culture shift and tooling it takes to keep a monorepo system running makes most CD processes you see today look like child's play.
There is a lot of very good free software that supports most of the open source approach to CD these days; but very, very little freely available monorepo tooling. Just check out https://github.com/korfuri/awesome-monorepo - it's a quick read. I haven't found many other notably superior compilations. Compared with available OSS workflows and tooling, it's rather sparse, filled with bespoke approaches everywhere.
- https://github.com/facebookexperimental/mononoke - I hear this is a real thing and not a science fair project
- https://github.com/bors-ng/bors-ng - Needed in a monorepo to handle high arrival rate of commits / merges
I understand at Google scale you'd need lots of tooling, but why at the smaller scale of merging a dozen small repos?
My personal opinion: very few companies will hit a point where sheer volume of code or code changes makes a monorepo unwieldy. Code volume is a Google-problem. But every company will have problems with Github/Gitlab/whatever tooling with multiple repos; coordinating merges/deploys across multiple projects, managing issues, context switches between them, etc. And every company will also have problems with CI/CD in a monorepo.
Point being... there are problems with both, and there are benefits to both. I don't think one is right or wrong. I personally feel that solving the problems inherent to monorepos, at average scale, is easier than solving the problems inherent to distributed repos. The monorepo problems are generally internal technical, whereas the distributed repo problems are generally people-related and tooling outside of your control.
It stuck with me, and is applicable to so many things. Including, maybe, this?
Often the focus is extremely weird. When people noticed that WhatsApp only employed something like 45 engineers, most assumed that it was because they used Erlang and FreeBSD. The thought that maybe their success was due to hiring the very best engineers and paying accordingly is less attractive.
Monorepos are just another item on the heap of things that may be a good idea, but it depends.
Easy to use, cutting edge updates.
Some mad genius in a company will write a fuck-ton of helper classes and utilities that take the heavy lifting out of everything remotely hard, to the point where you almost never need to touch a third-party API for a CMS, email send service, or cloud-hosting provider. Instead of supplying these as private NuGet packages to be installed into an application, they sit in solutions in their entirety, in case they are needed. That application then goes to a new developer team, and they have zero idea why there are millions of lines of code and dozens of projects for a basic website that doesn't really seem to do anything.
It's a nice idea, but it has resulted in some very tightly coupled applications. I remember one time when a new developer changed some code in one of the utilities that handled multi-language support, and for some reason our logs reported that the emails were broken.
Linux kernel is a monorepo.
Imagine if we combined KDE, Gnome, Linux Kernel, ZFS etc all in the one monorepo.
All the ways of splitting code up and deploying multiple git repos for one project seem terrible.
If it's one project, it's not a monorepo. It's a repo.
A monorepo in Perforce!
The entire programming world revolves around libraries and yet when it comes to our own code we are afraid of them ? Strange.
Of course they are, git isn't the tool for this. You don't want multiple repos for a single project, you want one per project (this is not a monorepo). If there are things like code common to multiple projects then they are their own project with their own repo and release schedule, releases go into some sort of package manager (even if it's just the file system) and the projects depending on that common code update as they go.
Also, why isn't such tooling available as open source? I'm trying to do my bit, but we could do with more effort being put into this, somehow.
I advise Google to replace the person in their internal IT who came up with that idea.
That's a huge understatement. They haven't just slapped a few scripts on top of git/svn, they've created their own proprietary scm to manage all of this. They've thrown more at this beast than most companies will throw at their actual product.
I'm also not convinced they haven't reinvented individual repositories inside this monorepo. It sounds like you can create "branches" of just your code and share them with other people without committing to the trunk; this is essentially an individual repository that will be auto-deployed when you merge to master.
edit: I do have one question though, does Google's internal tool handle permissions on a granular basis?
This can be achieved with single repo better than multi-repo due to the completeness of the (dependency) graph.
For folks unfamiliar with it, the issue is something like:
1. You find a bug in a library A.
2. Libraries B, C and D depend on A.
3. B, C and D in turn are used by various applications.
How do you fix a bug in A? Well, "normal" workflow would be something like: fix the bug in A, submit a PR, wait for a CI build, get the PR signed off, merge, wait for another CI build, cut a release of A. Bump versions in B, C and D, submit PRs, get them signed off, CI builds, cut a release of each. Now find all users of B, C and D, submit PRs, get them signed off, CI builds, cut more releases ...
Now imagine the same problem where dependency chains are a lot more than three levels deep. Then throw in a rat's nest of interdependencies so it's not some nice clean tree but some sprawling graph. Hundreds/thousands of repos owned by dozens/hundreds of teams.
See where this is going? A small change can take hours and hours just to make a fix. Remember this pain applies to every change you might need to make in any shared dependency. Bug fixes become a headache. Large-scale refactors are right out. Every project pays for earlier bad decisions. And all this ignores version incompatibilities because folks don't stay on the latest & greatest versions of things. Productivity grinds to a halt.
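To make the cascade concrete, here is a minimal sketch that counts how many PR + CI + release cycles a single fix in A triggers. The graph mirrors the A/B/C/D example above; `app1` and `app2` are made-up applications, and the whole thing assumes every downstream package has to be re-released.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: each key depends on the listed packages.
DEPENDS_ON = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
    "app1": ["B", "C"],
    "app2": ["D"],
}


def release_order(depends_on, changed):
    """Return every package to re-release after `changed`, in a valid order."""
    # Invert the graph: package -> packages that depend on it directly.
    dependents = defaultdict(list)
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(pkg)

    # Collect everything downstream of the changed package.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for pkg in dependents[current]:
            if pkg not in affected:
                affected.add(pkg)
                queue.append(pkg)

    # Topological sort (Kahn's algorithm) over the affected set, so a package
    # is released only after all of its affected dependencies are.
    pending = {
        pkg: sum(1 for dep in depends_on[pkg] if dep in affected)
        for pkg in affected
    }
    ready = deque(sorted(pkg for pkg, count in pending.items() if count == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for pkg in sorted(dependents[current]):
            pending[pkg] -= 1
            if pending[pkg] == 0:
                ready.append(pkg)
    return order


if __name__ == "__main__":
    order = release_order(DEPENDS_ON, "A")
    print(order)
    print(f"{len(order)} PR + CI + release cycles for one fix in A")
```

Even in this toy graph one fix fans out into six sequential cycles, and each level has to wait for the one before it.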
It's easy to think "oh, well that's just bad engineering", but there's more to it than that I think. It seems like most companies die young/small/simple & existing dependency management tooling doesn't really lend itself well to fast-paced internal change at scale.
So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this. Folks like Netflix stuck it out with the multi-repo thing, but lean on tooling [0] to automate some of the version bumping silliness. I think most companies that hit this problem just give up on sharing any meaningful amount of code & build silos at the organizational/process level. Each approach has its own pros & cons.
Again, it's easy to underestimate the pain when the company is young & able to move quickly. Once upon a time I was on the other side of this argument, arguing against a monorepo -- but now here I am effectively arguing the opposition's point. :)
[0] https://github.com/nebula-plugins/gradle-dependency-lock-plu...
1. A single repo should be able to produce multiple artifacts. 2. It should be possible to use multiple repos to produce one artifact. 3. It should be possible to have revisions in your source control that don't build. 4. It should be possible to produce artifacts that depend on things not even stored in a repo, think build environment or cryptographic keys etc. An increase in version number could simply be an exchange of the keys.
The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.
By that do you mean it's one way of doing it, or that it's the only way?
Seems clear to me that it's not the only way. For instance .Net code tends to be Git for the project source + NuGet for external dependencies. It works pretty well.
How is "single repo" a "design" and how does this design dictate dependency management?
Yes, if you have a single repo then that would be a single source of data for building your stuff. That seems redundant.
They use the same OWNERS-file model as in the Chromium project [1], the only difference being the tooling (Chromium is git, google3 is ... its own Perforce-based thing).
[1] https://chromium.googlesource.com/chromium/src/+/lkcr/docs/c...
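For folks who haven't seen the OWNERS-file model, here is a rough sketch of the core idea: ownership is declared per directory and inherited by everything underneath. The directory names and people are made up, and this is deliberately simplified. Real OWNERS files support more (for example `set noparent` and `per-file` rules), which this ignores.

```python
from pathlib import PurePosixPath

# Hypothetical contents of OWNERS files, keyed by directory ("" is the repo root).
OWNERS = {
    "": {"root-admin"},
    "search": {"alice"},
    "search/index": {"bob"},
}


def owners_for(path, owners=OWNERS):
    """Collect everyone allowed to approve a change to `path`.

    Walks from the file's directory up to the repo root and unions the
    owners found along the way (simplified: no opt-out of inheritance).
    """
    result = set()
    directory = PurePosixPath(path).parent
    while True:
        key = "" if str(directory) == "." else str(directory)
        result |= owners.get(key, set())
        if key == "":
            return result
        directory = directory.parent


if __name__ == "__main__":
    print(sorted(owners_for("search/index/build.py")))
    print(sorted(owners_for("ads/server.py")))
```

The point is that edit permissions follow the directory tree rather than repo boundaries, so one repo can still have fine-grained ownership.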
They do have to be combined in some way, at least to be reproducible. Your requirements.txt example is one way of combining version control + dependencies: give code an explicit version and depend on it elsewhere by that version.
Google has chosen to combine them in a different way, where every commit of a library implicitly produces a new version, and all downstream projects use that.
> googles internal tool handle permissions on a granular basis?
Not sure what you mean... Its build tool handles package visibility (https://docs.bazel.build/versions/master/be/common-definitio...). Its version control tool handles edit permissions (https://github.com/bkeepers/OWNERS).
In reality you often have different components, some written in different languages, at a certain size, not everyone has all the build environment set up and might be working with older binaries, and now it's just as easy to have version mismatches, structural incompatibilities, etc. So you need a strong tooling and integration process to go along with your monorepo. The repo alone doesn't solve all your problems.
> OpenBSD is year 2038 ready and will run well
> beyond Tue Jan 19 03:14:07 2038 UTC
OpenBSD 5.5 was released on May 1, 2014. While Linux is still "not quite there yet" y2038-wise. y2038 is a very complex issue, while it may look simple - time_t and clock_t should be 64-bit. This requires changes both on the kernel -- new sys-calls interfaces [stat()], new structures layouts [struct stat], new sizeof()-s, etc. -- and the user space sides. This, basically, means ABI breakage: newer kernels will not be able to run older user space binaries. So how did OpenBSD handle that? The reason why y2038 problem looked so simple to OpenBSD was a "monolithic repository". It's a self-contained system, with the kernel and user space built together out of a single repository. OpenBSD folks changed both user space and kernel space in "one shot".
IOW, a monolithic repository makes some things easier:
a) make a dramatic change to A
b) rebuild the world
c) see what's broken, patch it
d) while there are regressions or build breakages, goto (b)
e) commit everything
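As a toy illustration of steps (b) to (e), here is a sketch of that loop. `build_world` and `patch` are hypothetical callables standing in for the real build and the manual fixes, and the little component tree is invented; this is not how OpenBSD's build actually works.

```python
def change_until_green(build_world, patch, max_rounds=20):
    """Run the rebuild/patch loop; return how many rebuilds it took."""
    for round_number in range(1, max_rounds + 1):
        broken = build_world()      # (b) rebuild the world
        if not broken:
            return round_number     # nothing broken: safe to (e) commit
        for target in broken:       # (c) patch what's broken
            patch(target)
        # (d) something was broken, so go around again
    raise RuntimeError("still broken after max_rounds; nothing committed")


def make_toy_tree():
    """Toy tree where a component only shows errors once its deps build."""
    tree = {"libc": [], "ls": ["libc"], "tar": ["libc"], "cron": ["libc", "ls"]}
    patched = set()

    def build_world():
        return sorted(
            name
            for name, deps in tree.items()
            if name not in patched and all(dep in patched for dep in deps)
        )

    def patch(name):
        patched.add(name)

    return build_world, patch


if __name__ == "__main__":
    rounds = change_until_green(*make_toy_tree())
    print(f"tree is green after {rounds} rebuilds")
```

The single commit at the end is what a monolithic repository makes possible: none of the intermediate broken states are ever visible to anyone else.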
[0] http://www.openbsd.org/55.html?hn
[UPDATE: fixed spelling errors... umm, some of them]
-ss
Monolithic repository might have been a tool that helped enforce it, but that's not what made it happen. It's the decision that ABI could be broken that did.
And that's also why it hasn't happened in Linux yet. Even if there was a monorepo containing all the open source and free software in the world (or at least, say, that you can find in common distros), the fact that there's a contract to never break the ABI makes it simply hard to do.
Well, there are probably some subtle details which I'm missing, and maybe you are totally right.
The way it looks to me is as follows: They are "happy to break kernel ABI compatibility" because the repository is monolithic - they break ABI, they immediately fix user space apps.
E.g. OpenBSD time_t 64-bit commit: https://marc.info/?l=openbsd-cvs&m=137637321205010&w=2
They patched the kernel:
sys/kern : kern_clock.c kern_descrip.c kern_event.c
kern_exit.c kern_resource.c kern_subr.c
kern_synch.c kern_time.c sys_generic.c
syscalls.conf syscalls.master vfs_getcwd.c
vfs_syscalls.c vfs_vops.c
and fixed broken user space at the same time:...
sys/msdosfs : msdosfs_vnops.c
sys/netinet6 : in6.c nd6.c
sys/nfs : nfs_serv.c nfs_subs.c nfs_vnops.c xdr_subs.h
sys/ntfs : ntfs_vnops.c
sys/sys : _time.h _types.h dirent.h event.h resource.h
shm.h siginfo.h stat.h sysctl.h time.h types.h
vnode.h
sys/ufs/ext2fs : ext2fs_lookup.c
sys/ufs/ufs : ufs_vnops.c
...There is no "transitional" stage, when the kernel is already patched, but no user space apps are ready for those changes yet. It all happens at once.
-ss
I work in an organization that just switched to monolithic and it's been going very well, with hundreds of active developers and millions of lines of code. But our developers are students or academics. As many as half don't understand the concept of an ABI. So the monorepo works quite well for us because rebuilding from scratch is something we do multiple times a day.
The problem is that this approach only works if it is really a self-contained system. But OpenBSD isn't: it's a basis to run software, potentially third-party software. It can't be a closed Universe and still be useful at the same time.
Unfortunately Git checks out all the code, including history, at once, and it does not scale to big codebases.
The approach that Facebook chose with Mercurial seems a good compromise ( https://code.fb.com/core-data/scaling-mercurial-at-facebook/ )
https://blogs.msdn.microsoft.com/devops/2017/02/03/announcin...
Edit: don't just down vote. If you have a problem with my comment, tell me why.
A shallow clone can be helpful in cases like this
Conversely, imagine if you're using some tools developed by a far off team within the company. Every time the tooling team decides to make a change, it will immediately and irrevocably propagate into your stack, whether you like it or not.
If you were at a startup and had a production critical project, would you hardcode specific versions for all your dependencies, and carefully test everything before moving to newer versions? Or would you just set everything to LATEST and hope that none of your dependencies decide to break you the next day? Working with a monorepo is essentially like the latter.
You might think this ought to be trivial by having clear API contracts, but that's a) not how things work in practice if all code is effectively owned by the same, overarching entity and, more importantly, b) now you have an enormous effort to transition between incompatible API revisions instead of just being able to do lockstep changes, for no real gain.
Even if you manage to pull that off (again, for what benefit?), it will bite you that 1.324.2564 behaves subtly different from 1.324.5234 even though the intent was just to add a new option and they otherwise ought to have no extensional changes in behavior.
Imagine a tooling team on a different continent that makes some changes this afternoon. Like you said, their intent is just to add a new option, and it ought to have no extensional changes in behavior, but it still ends up behaving subtly different. The next morning, all your services end up broken as a result.
In a versioned world, you can still freeze your dependency at 1.324.5234, and migrate only when you want to, and when you're feeling confident about it.
In a monorepo world, you don't have a choice. You've been forcefully migrated as soon as the tooling team decides to make the change on their end. They had the best of intentions, but that doesn't always translate to a good outcome.
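A tiny sketch of the difference, using the version numbers from above. The `resolve` helper and the `1.325.0` release are made up for illustration; real package managers do a lot more than this.

```python
def parse(version):
    """Turn '1.324.5234' into (1, 324, 5234) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(available, pin=None):
    """Pick the version a build will use.

    With a pin, the build stays on that exact version until someone edits
    the pin. Without one, it floats to whatever is newest right now.
    """
    if pin is not None:
        if pin not in available:
            raise LookupError(f"pinned version {pin} is not published")
        return pin
    return max(available, key=parse)


if __name__ == "__main__":
    published = ["1.324.2564", "1.324.5234"]
    print(resolve(published, pin="1.324.5234"))  # pinned
    print(resolve(published))                    # floating

    # Overnight, the tooling team publishes a new (hypothetical) release.
    published.append("1.325.0")
    print(resolve(published, pin="1.324.5234"))  # unchanged
    print(resolve(published))                    # silently moved
```

The pinned build only changes when its owner changes the pin; the floating one changes whenever anyone upstream publishes.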
FWIW, I'm currently working at a large famous company that uses a monorepo. Color me not-impressed. I do think that having a single repository for an entire team/project is a good idea. Hundreds of different projects and teams who've never seen one another? Not so much.
How does that work when you don't keep APIs stable, at least at the service boundaries?
Nope. In a monorepo (like at Google), you're responsible for not breaking anyone else's code, as evidenced by their tests still passing.
So you never trample over the 0.1%. Instead you fix your code, or you fix their code for them -- which was probably due to your own bugs or undefined behavior in the first place. Or else you don't push.
And if you break their code because they didn't have tests? That's their problem, and better for them to learn their lesson sooner than later, because they're breaking engineering standards that they've been told since the day they joined. A monorepo depends, fundamentally, on all code having complete test coverage.
Given the size of a monorepo, is it possible to run the entire test suite in one's development environment, or do they have another endpoint to push to in order to run tests on a dedicated server?
Covering every single line of code still doesn't mean that you have complete behavioral coverage, unless your tests somehow run for all possible inputs. In practice, there will still be holes, not because someone was negligent, but because they missed a corner case specific to some state.
Not really. In the dependencies analogy the author of the dependency has no way to test the dependee(s). While with monorepo this is exactly what you do, "the tooling team" will "carefully test everything" before "propagate into your stack" (and it doesn't have to be irrevocable).
Having a solid automated test suite does help. But I personally would like to be in control of when my project updates its dependencies, instead of being forced to always pull everything from LATEST.
Repeat this process multiple times and you end up with configuration/settings hell. Been there done that. It's not black and white but "trampling over the 0.1%" could be a sensible business/architectural decision. For example how do you imagine "google maps" users selecting when/how to migrate?
When it comes to live services that you're actually running on a daily basis, like Google maps, forcing users to migrate makes a lot more sense.
What sort of tooling differences would one expect for a monorepo vs. multiple repos?
Is that a factor of something intrinsic about having one big repo, or is that a factor of the scale of the type of organization that Google is?
Thanks.
Monorepo also bugs me because there will always be some external package you need, and invariably it’s almost impossible to integrate due to years of colleagues making internal-only things assume everything imaginable about the structure and behavior of the monorepo. There will be problems not handled, etc. and it leads to a lot of NIH development because it’s almost easier in the end.
Also, it just feels risky from an engineering perspective: if your repository or tools have any upper limits, it seems like you will inevitably find them with a humongous repo. And that will be Break The Company Day because your entire process is essentially set up for monorepo and no one will have any idea how to work without it.
Indeed. Google's monorepo means the largest cohort of Go programmers in the world are mostly indifferent to composing packages in the usual (cpan/maven/composer/npm/nuget/cargo/swift/pip/rubygems/bower/etc) manner. Non-Google Go programmers have been left to schlep around with marginal solutions for years, although in the last few months we begin to see progress here[1]. This was the #1 discouragement I experienced when experimenting with Go.
Google's monorepo may be wonderful from Google's perspective but I don't think it's been a win for Go.
* yes I know some of these are also build systems and provide many other capabilities, some of which are arguably detrimental. Versioned, packaged, signed dependencies and thus repeatable build artifacts is the point.
Just like what many commenters here have mentioned, the monorepo approach is a forcing function on keeping compatibility issues at bay.
What you don't want is to end up in a situation where teams reinvent their own wheels instead of building on top of existing code, and at scale, I think the multiple repo approach tends to breed such codebase smell. [1] I'm sure 8000 repos is living hell for most organizations.
His account was that it was basically accidental, at first resulting from short term fire drills, and then creating a snowball effect where the momentum of keeping things in the Perforce monorepo and building tooling around it just happened to be the local optimum, and nobody was interested in slowing down or assessing a better way.
He personally thought working with the monorepo was horrible, and in the company where I worked with him, we had dozens of isolated project repos in Git, and used packaging to deploy dependencies. His view, at least, was that the development experience and reliability of this approach was vastly better than Google’s approach, which practically required hiring amazing candidates just to have a hope of a smooth development experience for everyone else.
I laugh cynically to myself about this any time I ever hear anyone comment as if Google’s monorepo or tooling are models of success. It was an accidental, path-dependent kludge on top of Perforce, and there is really no reason to believe it’s a good idea, certainly not the mere fact that Google uses this approach.
His description of Google made it seem like it had the same dysfunction every place has. And the monorepo was a totally mundane, garden variety eyesore kind of in-house framework that you’ll find anywhere.
I think he recognized the usefulness of just working with it and picking battles. He was just dumbfounded that any outsider would see the monorepo project and think it possibly had any relevance for anyone else. It was just a Google-history-specific frankenstein sort of thing that got wrangled with tooling later. The supposed benefits are all just retrofitted on.
I don't like distributed version control systems with hundreds of repositories spread out. It makes management more complicated. I understand this is a minority view, but that is my experience. It was easier to work in a single Perforce repository than hundreds of Git or Mercurial repos.
I also still have doubts around the value of a monorepo, in the article they claim it's valuable because you get:
Unified versioning, one source of truth;
Extensive code sharing and reuse;
Simplified dependency management;
Atomic changes;
Large-scale refactoring;
Collaboration across teams;
Flexible team boundaries and code ownership; and
Code visibility and clear tree structure providing implicit team namespacing.
With the exception of the niceness of atomic changes for large scale refactoring, I don't really see how the rest are better supported by throwing everything into one, rather than having a bunch of little repos and a little custom tooling to keep them in sync.
I guess it's not a very strong point, but using CL numbers (I'm working with Perforce mostly these days) makes things easier. And having one CL number monotonically increasing across all source code is even better: you can reference things more easily, just type cl/123456 and your browser can turn it into a link. Among many other not so obvious benefits...
This 95% number is the most surprising part of the article. That implies that the sum of engineers working on Android + Chrome + ChromeOS + all the Google X stuff + long tail of smaller non-google3 projects (Chromecast, etc) constitute only 5% of their engineers. Is e.g. Android really that small?
Here is "Python at Massive Scale", my talk about it at PyData London earlier this year:
https://timkrueger.me/a-maven-git-monorepo/
Our developers like it, because they can use 'mkdir' to create a new component, search through the complete codebase with 'grep' and navigate with 'cd'.
...
> including approximately two billion lines of code
_also_
> in nine million unique source files
I should insert a joke about how well the system would do if each source file contained more than two lines of code.
But seriously, this summary could use some work.
From a system design perspective, being able to handle a large number of files regardless of type is an interesting challenge, as is being able to handle a large number of highly indexed text files. All three of those statistics seem potentially interesting for different audiences that might read this paper.
Google not only has the above but also has a strong pre-submission code review process which catches large classes of bugs in advance.
You can have your entire company in one location, or the entire company in separate locations. The most important thing is the logical rather than physical organization: team structure, executive leadership, inter-org dependencies, etc. You can achieve autonomy and good structure with or without separate locations.
A single location reduces barriers, but at some point multiple locations can solve physical and logistical challenges. General rule of thumb is to own and operate office space in as few locations as possible, but at some point you have to take drastic measures one way or another.
(Notice that Google had to invent their own proprietary version control system just for their monorepo. And not even Google actually uses a single repo as the source of truth: e.g. Chromium and Android.)
Start breaking that repo apart, because it probably isn't very/hopefully depending on the debt that exists.
> "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."
Interestingly, in light of the above adage, this massive repo is organized (if that's the word for it) like a bazaar or flea market. (Rather than like a phone book https://en.wikipedia.org/wiki/Yellow_pages )
This sounds like the SVN model to me where branches are cumbersome and therefore they are very rare. After getting used to the Git branching model where branches are free and merges are painless, it would be very hard to go back to the old development model without branches.
I maintain a small part of the monorepo, and it's really nice to be able to say "Run every test that transitively depends on numpy with my uncommitted changes", so you can know if your changes break anybody who uses numpy when you update the version.
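To make the "run every test that transitively depends on numpy" idea concrete, here is a minimal sketch of how such a query can be answered from a dependency graph. The target names and the `DEPS` table are made up for illustration, and this is not a description of Google's internal tooling; it just shows the reverse-dependency walk that any build system with an explicit dependency graph could do.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: target -> targets it depends on directly.
DEPS = {
    "//third_party/numpy": [],
    "//math/stats": ["//third_party/numpy"],
    "//math/stats_test": ["//math/stats"],
    "//ml/model": ["//math/stats"],
    "//ml/model_test": ["//ml/model"],
    "//web/server": [],
    "//web/server_test": ["//web/server"],
}


def reverse_graph(deps):
    """Invert the graph: target -> targets that depend on it directly."""
    rdeps = defaultdict(set)
    for target, direct in deps.items():
        for dep in direct:
            rdeps[dep].add(target)
    return rdeps


def affected_tests(deps, changed):
    """Return every *_test target that transitively depends on `changed`."""
    rdeps = reverse_graph(deps)
    seen = {changed}
    queue = deque([changed])
    while queue:  # breadth-first walk over reverse dependencies
        node = queue.popleft()
        for dependant in rdeps.get(node, ()):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    # Assumption for this sketch: test targets are recognizable by name.
    return sorted(t for t in seen if t.endswith("_test"))


print(affected_tests(DEPS, "//third_party/numpy"))
# → ['//math/stats_test', '//ml/model_test']
```

The web server test is not selected because nothing on its dependency path touches numpy. The hard part in practice is not this walk but having one graph that covers the whole codebase, which is what a monorepo with a single build system gives you.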
Personally I think it would be neat if there was an external "virtual monorepo" that integrated as-close-to-head of all software projects (starting at the root, that's things like absl and icu, with the tails being complex projects like tensorflow), and constantly ran CI to update the base versions of things. Every time I move to the open source world, I basically have to recompile the world from scratch and it's a ton of work.
If you're making changes to a package with tons of dependencies such as Guava, for a risky change you might want to run all affected tests, but for a minor change you might want to run just the standard unit tests. As a compromise, there's also an option to run a random sample of affected tests.
For changes that are more likely to break distant code, you can run all tests (perhaps bundling together several changes in order not to overload the system).
Alternatively you can take the risk of breaking tests post-submit... this is not very good citizenship, but in some cases it might be reasonable (when the risk is small).
There are more details about testing at [1]
1: https://static.googleusercontent.com/media/research.google.c...
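The three options described above (all affected tests, just the package's own unit tests, or a random sample of affected tests) can be sketched roughly like this. The function name, the strategy names and the target labels are all hypothetical, chosen only to illustrate the trade-off; the real presubmit system is certainly more involved.

```python
import random


def select_tests(affected, strategy, package=None, sample_size=10, seed=None):
    """Pick which of the affected test targets to run before submitting.

    strategy:
      "all"    - every affected test (for risky changes)
      "unit"   - only tests that live in the changed package itself
      "sample" - a random sample of the affected tests (a compromise)
    """
    pool = sorted(affected)
    if strategy == "all":
        return pool
    if strategy == "unit":
        if package is None:
            raise ValueError("strategy 'unit' needs the changed package")
        return [t for t in pool if t.startswith(package + ":")]
    if strategy == "sample":
        rng = random.Random(seed)  # a seed makes the sample reproducible
        k = min(sample_size, len(pool))
        return sorted(rng.sample(pool, k))
    raise ValueError(f"unknown strategy: {strategy!r}")


# Made-up labels: two tests in the library's own package plus many dependants.
affected = ["//guava:collect_test", "//guava:io_test"] + [
    f"//app{i}:main_test" for i in range(500)
]

print(len(select_tests(affected, "all")))                     # 502
print(select_tests(affected, "unit", package="//guava"))      # the two guava tests
print(len(select_tests(affected, "sample", sample_size=25)))  # 25
```

The sample does not prove anything about the tests you skipped, it only lowers the odds of a surprise after submit, so it fits the "minor change to a widely used package" case rather than the risky one.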
Also, when someone's asking for review for a change that encompasses, say, a change to a service, a change to a client library for that service, and a change to 2-3 other services that use that client library, I know that I cringe a little when suggesting a change, knowing that to implement it is going to require a commit on all of these different repos, waiting for CI to run on each one, etc. I try to only use that impulse to counter the urge to bikeshed, but the temptation is there.
Is there mature tooling that helps teams manage this, or is this proprietary google magic tooling?
This becomes very critical for doing reviews, since it allows you to "trace" things without running them. It also helps with many other things, for example large-scale refactorings that need to find every usage of a function.
Why can't GitHub/GitLab/etc. do it? Because there could hardly be one all-encompassing BUILD system to generate this index correctly.
I've been thinking about a tool like this for a long time. A way to attach to each commit not only the diff in the code, but also the list of places affected by the changes (usages of functions that are modified for example). Then during review we wouldn't have only a stupid diff. We would have a list of places to check to be sure that the changes make sense in the context of the project.
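A very rough sketch of that idea for a single Python file, using only the standard library `ast` module: given the names of the functions a commit modified, list the call sites a reviewer should look at. The helper name `call_sites` and the example source are invented for illustration. Matching by bare name is a crude approximation (it cannot tell two functions with the same name apart), and a real tool would need a proper cross-reference index built from the whole project.

```python
import ast


def call_sites(source, changed_functions, filename="<memory>"):
    """Return (filename, line, name) for each call to a changed function."""
    tree = ast.parse(source, filename=filename)
    sites = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # plain call: parse(...)
            name = func.id
        elif isinstance(func, ast.Attribute):  # method or module call: x.parse(...)
            name = func.attr
        else:
            continue
        if name in changed_functions:
            sites.append((filename, node.lineno, name))
    return sorted(sites)


EXAMPLE = """\
def parse(text):
    return text.split()

def load(path):
    return parse(open(path).read())

def main():
    print(load("a.txt"))
"""

# Suppose the commit under review modified parse() and load().
for site in call_sites(EXAMPLE, {"parse", "load"}):
    print(site)
# → ('<memory>', 5, 'parse')
# → ('<memory>', 8, 'load')
```

Attached to a review, a list like this would point the reviewer at `load` and `main` even though the diff itself only touched `parse`.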
With mono repos such as SVN or Perforce you just work on whatever subset you want.
I read years ago about Google's data ingest and locator process, but neglected to bookmark it, so now I can’t find the reference.
Closely related to this post: just noticed a 2018 case study on Advantages and Disadvantages of a Monolithic Repository https://ai.google/research/pubs/pub47040
And I am 100% sure the idea of having a monolithic project is several years older than that.
I am grateful that the article is re-posted on multiple websites, because just the other day I was in an interview and, while doing my coding challenge, overheard the conversation between a young computer science graduate and another interviewer. The interviewer asked him to explain what a monolithic repository was and what its benefits were. This guy had no idea what the interviewer was talking about, and right there I realized that the terminology many of us take for granted in the IT world will certainly be a foreign language to young students who are just entering the workforce.
[1] http://info.perforce.com/rs/perforce/images/GoogleWhitePaper...
How about we stop considering google an engineering leader and just a search leader?
Can we stop considering google an engineering leader and just a search algorithm leader?
Why the eff does Google have billions of lines of code in their repo?
I hope they are not counting revisions (e.g., if a single 1-million-line project has 100 revisions, that's 1 million lines, not 100 million).
I have heard that they do count generated code (so it's not all handwritten code). In that case again, I have two things to say:
- that's a bad metric. I could overnight generate a billion lines of code with each line a printf of number_to_word of numbers from 1 to a billion. They want to measure the size of the repo? They should tell us the gigabytes, terabytes etc. But when it's lines of code, it's cheesy and childish to blow up the measure by including lines of generated code.
- But more importantly, I hope the generated code is 90% or more of that repository. Because any less than that would mean that Google engineers have handwritten 100 million or more lines of code throughout the lifetime of the company, in which case I have to ask: what bloated mess do you have on your hands? I thought you guys were the top engineers of the world.