I have a new project that is in a somewhat similar space of wrapping compilers: https://github.com/sourcefrog/cargo-mutants, a mutation testing tool for Rust.
https://github.com/icecc/icecream - another option that does what distcc does, but aimed at a somewhat different use case.
https://ccache.dev/ - a similar idea but provides caching of build outputs instead of distributing builds. You can use it together with distcc to achieve even better performance.
I know this probably won't help with your current project, but you should think of your compiler as an exotic virtual machine: your code is the input program, and the output executable is the output. Just like with a "real" CPU, there are ways to write a program that are fast, and ways to write a program that are slow.
To continue the analogy: if you have to sort a list, use `qsort()`, not bubble sort.
So, for C/C++ we can order the "cost" of various language features, from most-expensive to least-expensive:
1. Deeply nested header-only (templated/inline) "libraries";
2. Function overloading (especially with templates);
3. Classes;
4. Functions & type definitions; and,
5. Macros & data.
That means if you were to look at my code base, you'd see lots and lots of "table-driven" code, where I've encoded huge swathes of business logic as structured arrays of integers, and even more as macros that generate such tables. This code compiles at ~100 kloc/s.

We don't use function overloading: in one place where we removed it, compile times dropped from 70 hours to 20 seconds. Function overloading requires the compiler to walk a list of candidate functions, perform ADL, and then decide which is best; a function that is "just C-like" requires only a hash lookup. The difference is about a factor of 10,000 in speed. You can do "pretend" function overloading by using a template plus a switch statement, and letting template instantiation sort things out for you.
The last thing is we pretty much never allow "project" header files to include each other. More importantly, templated types must be instantiated once, in one C++ file, and then `extern`ed. This gives all the benefit of a template (write once, reuse), with none of the holy-crap-we're-parsing-this-again issues.
You can set this via config; by default ccache is paranoid about being correct. But you can tweak it with things like setting a base directory. This is great for me: I'm the only user, but I compile things in, say, `/home/josh/dev/foo` and `/home/josh/dev/bar`, and the cache is shared between them. (See https://ccache.dev/manual/latest.html for all the wonderful knobs you can turn and tweak.)
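Concretely, the knob described above is `base_dir` (the `CCACHE_BASEDIR` environment variable); this is a sketch using the commenter's example paths, and the cache size is my own illustrative value:

```shell
# Rewrites absolute paths below this prefix to relative ones, so identical
# code compiled in /home/josh/dev/foo and /home/josh/dev/bar shares hits.
export CCACHE_BASEDIR=/home/josh/dev
export CCACHE_MAXSIZE=20G   # example size, not from the thread
```

The same settings can go in `ccache.conf` as `base_dir = ...` and `max_size = ...`.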
Fantastic tool, the compression with zstd is fantastic as well.
I played with distcc (as I have a "homelab" with a couple of higher-end consumer desktops), but found it not worth my time as compiling locally was faster. I'm sure with much bigger code bases (like yours) this would be great. The reason I used it is that Arch Linux makes it super easy to use with makepkg (their build tool script that helps to build packages).
* use make -j to run multiple tasks at once
* replace¹ gcc with a script that picked a host and ran the task on that²
This had many limitations that distcc lists it doesn't have (the machines had to have the same everything installed, the input and output locations for gcc had to be on a shared filesystem mounted in the same place on each host, it only parallelised the gcc part not linking or anything else, and it broke if any make steps had side-effects like making environment changes, etc.) and was a little fragile (it was just a quick hack…), but it worked surprisingly well overall. For small projects like our Uni work it didn't make much difference over make -j on its own³ but for building some larger projects like libraries we were calling that were not part of the department's standard Linux build, it could significantly reduce build times. On the machines we had back then a 25% improvement⁴ could mean quite a saving in wall-clock time.
One issue we ran into using it was that some makefiles were not well constructed in terms of dependency ordering, so would break⁵ with make -j even without my hack layered on top. They worked reliably for sequential processing which is presumably the only way the original author used them.
----
[1] well, not actually replace, but arrange for my script to be ahead of the real thing in the search path
[2] via rlogin IIRC, this predates SSH (or at least my knowledge of it)
[3] despite the single-core CPUs we had at the time, task concurrency still helped on a single host as IO bottlenecks meant that single core was less aggressively used without -j2 or -j3
[4] we sometimes measured noticeably more than that, but it depended on how parallelisable the build actually was, and the shared IO resource was also shared by many other tasks over the network so that was a key bottleneck
[5] sometimes intermittently, so it could be difficult to diagnose
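A minimal sketch of that PATH-shadowing wrapper, with hypothetical host names, ssh standing in for rlogin, and hash-based host picking as my own illustration of "picked a host":

```shell
#!/bin/sh
# A script named "gcc", placed ahead of the real compiler in $PATH.
# Assumes sources and outputs live on a filesystem mounted identically
# on every host, exactly as noted above.

HOSTS="build1 build2 build3"   # hypothetical host names

pick_host() {
  # Deterministically spread jobs: hash the argument string with cksum
  # and index into the host list, so the same file lands on the same host.
  set -- $HOSTS
  idx=$(( $(printf '%s' "$ARGS" | cksum | cut -d' ' -f1) % $# ))
  shift "$idx"
  printf '%s\n' "$1"
}

ARGS="$*"
# Left commented so the sketch is inert; this is where the dispatch happens:
# exec ssh "$(pick_host)" "cd $PWD && /usr/bin/gcc $ARGS"
```

Because the host choice is a pure function of the arguments, repeated builds of the same file reuse the same machine's disk cache.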
That remains true to this day of even makefiles from some really big name projects. Huge pet peeve of mine and I’ve filed PRs with fixes many times.
But after a few weeks every Gentoo box in the house started crashing regularly. It took me a while to figure out what was going on: one of the slower machines had developed a single-bit memory error and was sharing corrupted .so files with all other machines.
Reproducible builds are also a big win here.
Had to jump through various hoops and find creative ways around problems (laptop too old to boot from USB, all my modern machines lack optical drives. And so on) but I did eventually break in.
It’s the compiler that needs to line up for that. But my recommendation is to install sccache which will figure it out for you.
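sccache can wrap rustc as well as C/C++ compilers; for the Rust case the hookup per its docs is roughly:

```shell
cargo install sccache          # or install via your package manager
export RUSTC_WRAPPER=sccache   # cargo now invokes sccache around rustc
cargo build
sccache --show-stats           # inspect cache hits and misses
```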
Even without distributed builds, clearmake could re-use .o files built by other people if the input dependencies were identical. On a large multi-person project this meant that you would typically only need to build a very small percentage of the code base and the rest would be "winked in".
If you wanted to force a full build, you could farm it out across a dozen machines and get circa 10x speedup.
Clearcase lost the edge with the arrival of cheaper disk and multi-core CPUs. I'd say it set the gold standard for version control and dependency tracking and nothing today comes close to it.
Not to mention the absolutely terrible design for handling merge conflicts (punt to a human if more than one person touched a file? seriously???)
On the other hand Clearcase dynamic views were pretty awesome. You just needed to edit your view config spec and the view FS would render the right file versions immediately. No checkout required. There was even view extended naming to let you open multiple versions of the same file directly from any editor.
As for merging, Clearcase set the gold standard for three-way automatic merges and conflict resolution that wasn't matched until git came along. It's still superior in one important way - Clearcase had versioned directories as well as files, so you could commit a file move and someone else's changes to the same "element" in the original file location would be merged correctly after the move. No heuristics, just versioned elements with versioned directory tree entries. Backporting fixes was a breeze, even with a significant refactor in between.
Git more or less wins hands down today because it is fast, distributed and works on change sets. But something with git's storage model, a multi version file system and Clearcase's directory versioning would be an awesome VCS.
> I'd say it set the gold standard for version control and dependency tracking and nothing today comes close to it.
In my 2005-2011 experience with Clearcase, it was slow and required dedicated developers just to manage the versions, and I'm so happy its version control model has been an evolutionary dead-end in the greater developer community. The MVFS is an attractive trap. Giving people the ability to trivially access older versions means you've just outsourced the job of keeping everything working together to some poor SOB. It was very much an "enough rope to hang yourself" kind of design.
As I said, it was slow, because MVFS. The recommended solution from Clearcase/IBM was to break up the source tree into different volumes (or whatever Clearcase's "repo"-analogue was named), which just increased the pain of keeping things in sync.
Additionally, it was an "ask-permission" design for modifying files, where you had to checkout a file before being able to modify it, and you couldn't checkout if someone else had, which added a ton of painful overhead.
I'll grant that my company/group didn't know what they were doing, but following IBM/Clearcase's guidance was not a good idea.
These days, I use Clearcase as a warning to the younger generation.
To this day, I find Clearcase's way of doing things is the better way to do version control. Git, in comparison, kind of feels alien and I could never really get the same type of comfort on it.
Nowadays most of it has been ported to Java, assuming the efforts that were ongoing when I left finally managed to migrate everything away from C++, Perl and CORBA.
Also supports creating project files that are compatible with Xcode and Visual Studio, so you can just build from those IDEs, plus a pretty flexible dependency-based build file format that can accommodate any kind of dependency.
Having distributed builds for MSVC with free software sounded very promising, though.
> distcc is not itself a compiler, but rather a front-end to the GNU C/C++ compiler (gcc), or another compiler of your choice. All the regular gcc options and features work as normal.
So `make` is also a compiler?
I think the end result was that the transfer time at 10 Mbps speeds made it take longer than it would have just compiling locally, but it was neat to see it work!
edit: just remembered I wrote up a blog post about it when I worked there
https://pspdfkit.com/blog/2017/crazy-fast-builds-using-distc...
Fun times!
1) set up passwordless SSH,
and
2) use GNU Parallel: https://www.gnu.org/software/parallel/
GNU Parallel is super flexible and very useful.
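Concretely, once the SSH keys are in place, something like this distributes the compiles (host names are hypothetical; `--trc` transfers each input, returns the named output, and cleans up the remote copies, and `:` in `-S` means "also use localhost"):

```shell
# Compile every .c file across two remote hosts plus the local machine.
parallel --trc {.}.o -S build1,build2,: gcc -c -o {.}.o {} ::: *.c
```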
> But unlike distcc, Icecream uses a central server that dynamically schedules the compile jobs to the fastest free server.
So using distcc means that all computers using it must be trusted. And that means that using it on "all developers computers to share the load" is good for performance but bad for security.
I'm not sure whether distcc affects reproducible builds?
You could, in any case, have tighter controls on the release builds, which would be done on a CI machine before signing.
(Back when I used distcc we didn't distribute across the dev machines, we had an entire build farm of two racks of 2U servers!)
I would not generalise so quickly. Is every computer on the internet compromised if one is compromised?
No. It depends heavily on the trust between those machines and whether they share similar services with critical vulnerabilities. Only then might they be compromised together.
But the world has evolved, and not everyone bases their entire trust and security thinking on "no outside internet connection, we are fine" anymore.
That might've been true 10 years ago, and still is in some cases, but I wouldn't assume it now; far more things run over encryption, for example, or sit behind a firewall.
When the team is bigger or when security is more important, it's important to have a build system where you're confident that no one can subvert the output, and that includes ensuring that very few people can control machines running distccd.
Another way in which it shows its age is that, by default, there are only netblock based restrictions on clients, and connections are over unencrypted TCP by default (last time I looked), although there is an option to use SSH (or I guess Tailscale or similar.)
Granted, if your distcc host gets compromised, the compiled output shouldn't be trusted until the server can be reprovisioned.
So:
/opt/ourcompany/dev/bin/cc
absolutely *needs* to be the same on all machines involved or you risk very hard to spot issues.
For that reason, either use the exact same distribution for everyone (and then run /usr/bin/cc, but watch out for alternatives!) or roll out your own toolchain, but make sure to put its version in the path.
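One cheap way to enforce that: hash the compiler binary on every host and insist on a single distinct hash. The check itself is testable locally; the collection loop with example host names is commented out:

```shell
#!/bin/sh
# Reads "hash  path" lines (one per host) on stdin; succeeds only if
# every host reports a byte-identical compiler binary.
toolchains_match() {
  [ "$(awk '{print $1}' | sort -u | wc -l | tr -d ' ')" -le 1 ]
}

# Example collection loop (illustrative; any transport works):
# for h in build1 build2 build3; do
#   ssh "$h" sha256sum /opt/ourcompany/dev/bin/cc
# done | toolchains_match || echo "toolchain mismatch!"
```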
https://github.com/Overv/outrun
Since I was just going down this rabbit hole recently, I kind of wonder if it's possible to put the filesystem on something more like the BitTorrent protocol, so that things like the libraries/compilers/headers used during compilation don't all need to come from the main PC. It probably wouldn't be useful until you reached a stupid number of computers and started hitting the limits of the Ethernet wire, but for something silly that can run on a Pi cluster it would be a fun project.
Bandwidth and latency will have some effect on how well it scales, but on a reasonably high-bandwidth, low-RTT connection it's probably still faster than purely local compilation. Sending work over the network takes time but not much CPU, so the local CPU is freed up to do other work.
So all in all, I wouldn't say it works outside of a well-controlled cluster on a LAN.
On top of that, the APIs Bazel uses to communicate with the remote execution environment are standardized and adopted by other build tools, with multiple server implementations to match. Looking at https://github.com/bazelbuild/remote-apis/#clients, you can see big players are involved: Meta, Twitter, the Chromium project, Bloomberg, while there is commercial support for some server implementations.
Finally, on top of C/C++, Bazel also supports remote compilation / remote test execution for Go, Java, Rust, JS/TS, etc., which matters a lot for many enterprise users.
Disclaimer: I work for https://www.buildbuddy.io/ which provides one of the remote execution server implementations, and I am a contributor to Bazel.
But it has protobuf interfaces (IIRC), so a distributed build farm would implement the gRPC endpoints, and then you tell Bazel on the command line (or via .bazelrc) the address of the build farm it can use.
There are a couple of projects that implement the distributed/gRPC part; the main one is https://github.com/bazelbuild/bazel-buildfarm
Yes, that's the main feature of Bazel. And it caches the generated files. Although you could theoretically do the caching using ccache with distcc.
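The documented way to chain the two is ccache's `CCACHE_PREFIX`: ccache checks its cache first and only hands misses to distcc (host list below is an example; `distcc -j` prints the recommended parallelism for the host list):

```shell
export CCACHE_PREFIX=distcc
export DISTCC_HOSTS="localhost/4 build1/8 build2/8"
make -j"$(distcc -j)" CC="ccache gcc"
```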
“The client is invoked as a wrapper around the compiler by Make. Because distcc is invoked in place of gcc, it needs to understand every pattern of command line invocations. If the arguments are such that the compilation can be run remotely, distcc forms two new sets of arguments, one to run the preprocessor locally, and one to run the compiler remotely. If the arguments are not understood by distcc, it takes the safe default of running the command locally. Options that read or write additional local files such as assembly listings or profiler tables are run locally”
Scalability:
“Reports from users indicate distcc is nearly linearly scalable for small numbers of CPUs. Compiling across three identical machines is typically 2.5 to 2.8 times faster than local compilation. Builds across sixteen machines have been reported at over ten times faster than a local build. These numbers include the overhead of distcc and Make, and the time for non-parallel or non-distributed tasks”
> Does performance scale linearly with the number of worker nodes
Yes, for small N.
Overall scaling was limited by how much "make -j" the source machine could cope with.
> distcc sends the complete preprocessed source code across the network for each job, so all it requires of the volunteer machines is that they be running the distccd daemon, and that they have an appropriate compiler installed.
So all the "environment" is on the source machine and just a bare compiler is required on the remote machines for compilation.
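A minimal setup reflects that split (subnet and host names below are examples):

```shell
# On each volunteer machine: start the daemon, allowing the LAN
distccd --daemon --allow 192.168.1.0/24

# On the client: list the volunteers and wrap the compiler
export DISTCC_HOSTS="localhost build1 build2"
make -j8 CC="distcc gcc"
```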
> distcc distributes work from a client machine to any number of volunteer machines. (Since this is free software, we prefer the term volunteer to the slaves that are used by other systems.)

> distcc consists of a client program and a server. The client analyzes the command run, and for jobs that can be distributed it chooses a host, runs the preprocessor, sends the request across the network and reports the results. The server accepts and handles requests containing command lines and source code and responds with object code or error messages.
I assume that for template-heavy C++, that could easily be hundreds of gigabytes for 1 gigabyte of C++ code to compile...?
If you're working from a laptop, surely the wifi connection will by far be the bottleneck?
One day I'll get back to that rewrite of ccontrol using modern distcc's features...
For a later hack in the same vein, check out https://github.com/sourcefrog/cargo-mutants
A problem not shared by icecc.
Insert XKCD "Compiling..." meme :)