> ABSTRACT
> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.
No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.
This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.
QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.
Well, it seems we are back in an era with really expensive memory.
“An era of really expensive memory”. That sounds familiar…
"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"
(But thanks for the good explanation)
aiui this is what exec does, the problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).
> No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.
Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().
I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.
Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.
Traditionally Windows applications that create processes all the time come from UNIX heritage.
Contrary to UNIX, Windows NT was designed with a threads-first mentality from the get-go.
While on UNIX they were added after the fact, and to this day there are gotchas when mixing POSIX threads with signals, fork and exec.
Both systems are implemented using threads as the execution context, but on Unix, history means that you fork+exec most of the time, resulting in two tasks that no longer share memory. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.
Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher-level abstractions like pthreads.
You're right that POSIX semantics get tangled when using threads.
https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)
Hard to come up with an optimization that is equally efficient and elegant
I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.
[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.
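For anyone who wants to see the shape of it, here is a minimal sketch of the zygote pattern in Python. The function names (`start_zygote`, `_serve`, `run_via_zygote`) and the line-based JSON protocol over pipes are made up for illustration; a real zygote would also deal with concurrency, stdio routing and error reporting.

```python
import json
import os
import sys

def start_zygote():
    """Fork a small helper early. Returns (zygote_pid, request_writer, reply_reader)."""
    req_r, req_w = os.pipe()   # parent -> zygote: one JSON argv per line
    rep_r, rep_w = os.pipe()   # zygote -> parent: one exit code per line
    pid = os.fork()
    if pid == 0:
        try:
            os.close(req_w)    # otherwise the zygote would never see EOF
            os.close(rep_r)
            _serve(req_r, rep_w)
        finally:
            os._exit(0)        # never fall back into the parent's code
    os.close(req_r)
    os.close(rep_w)
    return pid, os.fdopen(req_w, "w"), os.fdopen(rep_r, "r")

def _serve(req_fd, rep_fd):
    """Zygote loop: fork+exec each requested command and report its exit code."""
    with os.fdopen(req_fd, "r") as requests, os.fdopen(rep_fd, "w") as replies:
        for line in requests:
            argv = json.loads(line)
            child = os.fork()          # cheap as long as the zygote stays small
            if child == 0:
                try:
                    os.execvp(argv[0], argv)
                finally:
                    os._exit(127)      # only reached if exec failed
            _, status = os.waitpid(child, 0)
            replies.write(f"{os.waitstatus_to_exitcode(status)}\n")
            replies.flush()

def run_via_zygote(request_writer, reply_reader, argv):
    """Ask the zygote to run argv and wait for the exit code it reports."""
    request_writer.write(json.dumps(argv) + "\n")
    request_writer.flush()
    return int(reply_reader.readline())
```

The point is that `start_zygote()` runs once at startup, while the process is still small; every later spawn forks from the zygote rather than from the large parent. Closing the request writer makes the zygote exit.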
Yes, zygote pattern makes it easy to make fork() into bottleneck - it requires a lot more discipline and low level tricks (linker scripts, compiler-specific extensions, custom sections, low level dependencies on pagesize that get "fun" on ARM servers).
If you don't, you might wake up with fork() causing latency issues.
Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.
Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.
Any kind of replacement should aim for the same conceptual simplicity and power. Sadly, I fear that the people driving development nowadays are more interested in building unbreakable walled gardens for advertising or app stores, or in squeezing out some small gain in the cloud. I am more interested in general computing on the user side.
* https://jdebp.uk/FGA/bernstein-on-ttys/cttys.html
Interestingly, on MS/PC/DR-DOS file descriptor 3 was stdaux and file descriptor 4 was stdprn.
The Windows approach may be correct, but it suffers in performance from the POSIX perspective.
I have heard that WSL1 improves this.
If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.
For example
pidfd_t ps = spawn(); // creates a process stopped (kernel does this anyway by default)
setuid(ps, 33);
capset(ps, ...);
socket(ps, ...);
mmap(ps, ...);
process_vm_writev(ps, ...);
exec(ps, ...);
signal(ps, SIGCONT);
// error handling elided
I guess this is partly me being critical of the usual syscall APIs for not thinking about "what if I want to do this to another process I have access to", but... it also makes things like thread safety reasonably doable with fork. I do agree, though, that something like CreateProcess, which takes a gazillion parameters, doesn't really make for the greatest of userspace APIs.
But how often would one actually need this? And what are the semantics? Refer arguments (e.g. file descriptors) to the current process or the other one? How are cross-permissions handled? It seems a lot of complexity...
Someone proposed a ptrace_syscall which could achieve the same thing.
Well, the idea is that it'd probably be close to the default API for spawning processes (and could even be the bedrock for posix_spawn and friends in libc (and potentially even "simple" fork cases[1])). fork/clone would be the special case
In most cases, most programs don't need special setup. Something like `ptrace_syscall` would also work for this and would be probably the way to do it with the backwards compat limitations of nowadays
Ptrace-ability seems to be how permissions for this sort of thing are generally handled (see also procfs, process_vm_writev, ptrace, etc). The complication is mostly around setuid programs, but you could either special-case execve to imply SIGCONT for setuid binaries or have execve always imply SIGCONT.
[1]: Probably would be rare for a compiler to optimize it though
In an alternative world where fork+exec never existed, a lot of those "usual APIs" would probably have had an explicit pid argument that let you modify process configuration from a different process. (This is how Fuchsia works, for example.) There's a lot of benefit to this world: the most obvious is that you don't have to magic up some IPC system just to report configuration errors, but there's actually a good amount of utility in being able to have a manager process that is tweaking attributes of its children (e.g., debuggers would love it).
Unfortunately, the opposite is true, when the parent process is multi-threaded. In the child process, only one thread exists (the thread returning from fork()), but the memory is an exact copy of the parent's. As a result, the child may inherit locks (resident in memory) that are in acquired state, but have no owner threads -- the threads that are responsible for eventually releasing those locks in the child's copy of the process memory do not exist in the child. If the single thread in the child process (returning from fork()) attempts to take such a lock (before exec), it deadlocks. This is why POSIX says that only async-signal-safe functions may be called in a child process, between fork and exec. And then, for example, "malloc" is not such a function (at least per POSIX), so the fork-to-exec environment in the child process is an extremely uncomfortable one. You've got to preallocate everything in the parent, can't report errors to stderr, etc.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/f...
https://pubs.opengroup.org/onlinepubs/9799919799/functions/V...
The fork(2) Linux manual page spells out the same restriction.
https://man7.org/linux/man-pages/man2/fork.2.html
https://man7.org/linux/man-pages/man7/signal-safety.7.html
"pthread_atfork" exists, but is effectively unusable.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/p...
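The orphaned-lock problem described above is easy to reproduce. The following is a small Python sketch, not C, so the details differ from the pthread case, but the mechanism is the same: the lock's "held" state is copied into the child while the thread that holds it is not. The function name `child_sees_orphaned_lock` is made up for illustration.

```python
import os
import threading
import warnings

def child_sees_orphaned_lock():
    """Fork while another thread holds a lock; return True if the child cannot take it."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()       # tell the main thread the lock is now taken
            release.wait()   # keep holding it until told to let go

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    with warnings.catch_warnings():
        # Python 3.12+ warns about exactly this: fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the holder thread does not exist here, but the lock is still marked held.
        got_it = lock.acquire(timeout=0.5)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    t.join()
    return os.waitstatus_to_exitcode(status) == 1
```

Without the timeout the child would simply hang, which is the deadlock the comment describes. Swap the explicit lock for one buried inside malloc or stdio and you have the reason for the async-signal-safe rule.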
I think one problem is that it is already how it is; making an entirely new operating system (that is not Linux, not GNU, and not POSIX) would solve it, but that is not the case here, so it would need to be done as it is.
One possibility would be a new function that creates a new empty child process, but the parent process specifies what system calls the child process executes, and can stop if specifying that exec or exit is (successfully) called by the child process, or if the parent process gives it the program memory to execute directly instead of using a file (since that use is also useful). The new function can still have some of the clone flags available. (I don't actually know how much better it would work.)
There are other possibilities as well.
The existing methods can also remain available for when they are helpful, but functions such as popen might be changed to use the new method.
POSIX says nothing much about vfork() anymore. It was a mistake removing it. Zealots failed to understand that vfork() >> fork(). https://news.ycombinator.com/item?id=30502392
Quick, what's the highest-numbered open file descriptor in your program?
This gets even worse if you have multiple threads running. Without looking it up, what is the state of all the various synchronization primitives in a forked process?
It shares way too much, and there are huge use cases where that is really, really bad.
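On the file descriptor question: on Linux you can at least ask the kernel. This is a rough sketch that assumes `/proc/self/fd` is available (so Linux only); the helper names `open_fds` and `inherited_by_exec` are made up.

```python
import os

def open_fds():
    """Return the sorted list of file descriptors currently open in this process."""
    fds = []
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            os.fstat(fd)   # the fd listdir used for the directory is closed by now
        except OSError:
            continue
        fds.append(fd)
    return sorted(fds)

def inherited_by_exec():
    """Subset of open fds that would survive an exec (close-on-exec not set)."""
    return [fd for fd in open_fds() if os.get_inheritable(fd)]

if __name__ == "__main__":
    print("open:", open_fds())
    print("would leak into an exec'd child:", inherited_by_exec())
```

A forked child gets all of `open_fds()`; only the second list survives the exec. Python marks the descriptors it creates close-on-exec by default, but C libraries loaded into the same process are under no such obligation.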
Isn't that what posix_spawn is for?
And of course that has already been done. On NetBSD, posix_spawn() is a fully-fledged system call and much of the work is done in kernel mode.
* https://blog.netbsd.org/tnf/entry/posix_spawn_syscall_added
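For reference, this is roughly what using posix_spawn looks like from the caller's side, shown here through Python's `os.posix_spawn` wrapper. It is a sketch: `spawn_capture` is a made-up helper, and the only setup it does is a single dup2 file action to capture stdout.

```python
import os
import sys

def spawn_capture(argv):
    """Run argv[0] via posix_spawn with stdout sent to a pipe; return (exit_code, output)."""
    r, w = os.pipe()
    # Child setup is described as data instead of code run between fork and exec.
    # Both pipe ends are close-on-exec in Python, so only the dup'ed fd 1 survives.
    actions = [(os.POSIX_SPAWN_DUP2, w, 1)]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=actions)
    os.close(w)                      # parent keeps only the read end
    with os.fdopen(r, "rb") as f:
        output = f.read()            # returns once the child closes its stdout
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output

if __name__ == "__main__":
    print(spawn_capture([sys.executable, "-c", "print('hi')"]))
```

Everything the child needs is stated up front, which is also the limitation: anything not expressible as a file action or spawn attribute cannot be done this way.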
This is just an example of I don't even know how many things a modern-day process will share from its parent.
By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".
I mean maybe this has been optimized for already and I don't know what I'm talking about but maybe someone with more knowledge about the kernel knows? Is this something we simply can't optimize for because of security implications?
Editing to add: this deduplication is one of the greatest upsides to dynamic linking. Common libs like libgcc and libc only have to exist in memory once and can stay in CPU caches, whereas if they were statically linked into every binary, each binary would have a copy of that library that wouldn't be shared with anything else and you'd waste a lot of memory.
> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.
Relevant stack overflow answer: https://stackoverflow.com/questions/61950951/linux-shared-li...
Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.
In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.
It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.
Even back in 2012 this blog post showed the high cost of this operation: https://redis.io/blog/testing-fork-time-on-awsxen-infrastruc...
On an m2.xlarge using ~25GB of RAM, fork() took 5.67 seconds. That's a long pause when Redis clients typically experience single-digit msec latency for most operations. Yes, that's only the time needed to copy the page table. It's surprising they don't mention huge pages, it seems like it would be a key consideration here.
No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.
For the intended audience of such a paper this is base knowledge.
I guess it depends on how sensitive your application is to main thread pauses.
In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.
It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.
It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.
Unless I've missed something (which is totally possible; this is not an area of OS design I've spent much time on).
I am curious about what the best way to handle the example in the article of one process spawning many git subprocesses is. Surely it just doesn't make sense to repeatedly start git from scratch in the course of a long-running parent operation. What's the low cost abstraction for the same result, though?
Otherwise you need multiple steps to create a process, fill it with something to run, and arrange for it to execute. Or like Win32 you permanently smush them together with other layers, like filesystems and object loaders and linkers.
Attempting to justify clone/exec as a reasonable design is just Stockholm syndrome.
The only thing I want to inherit from the parent process is its cwd and environment variables, even those are often overridden. The rest can easily be passed explicitly through other channels like pipes or command line arguments.
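As a sketch of what "pass everything explicitly" can look like today, here is a Python example where the child gets a working directory and environment chosen by the caller and no extra file descriptors. `run_isolated` is a made-up helper name.

```python
import os
import subprocess
import sys

def run_isolated(argv, cwd, env):
    """Run argv with an explicit cwd and environment, no stdin, and no extra fds."""
    return subprocess.run(
        argv,
        cwd=cwd,                    # explicit working directory
        env=env,                    # explicit environment; nothing is merged in
        stdin=subprocess.DEVNULL,   # do not share the parent's stdin
        capture_output=True,        # stdout/stderr come back over pipes
        close_fds=True,             # drop every fd above 2
        check=False,
    )

if __name__ == "__main__":
    result = run_isolated(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd="/",
        env={"PATH": os.defpath},
    )
    print(result.returncode, result.stdout)
```

This only covers the state that is easy to name. Things like the umask, signal dispositions and resource limits are still inherited implicitly.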
Back to the example from the article: it makes no sense that a git subprocess forked from a web server needs to inherit any process state from the web server.
If I use a library, I also need to start using threads and invent some core synchronization mechanism. I am essentially reinventing a small scheduler, when I already get this from the OS for free. Also, any crash in the third-party code will crash the whole program, and the third-party code has access to the whole address space. When invoking a process you also get a standardized API implemented by the OS.
I can recall just one program that's intentionally not implemented as a library, but I think people have since built a library on top of it:
https://dechifro.org/dcraw/#:~:text=Why%20don%27t%20you%20im...
If you contrast that with win32, where you optionally pack a bunch of initial values into a struct, win32 is a much more narrow, less pleasant, less freeform interface, where it is harder to introduce more features.
But I think there is already posix_spawn to imitate that philosophy on Unix-like OSs.
What do you mean underestimated? You can do anything between fork and exec; there are no limitations.
I do use threaded code. It's significantly harder to write and reason about. (45 years in to a CS career, ageing out)
You have to be clever to do better than clever people. Clever people bootstrapped me into fork()/exec() and I know my limits.
Yes, we’re not the one paying for Linux development, but its subsystems are so complicated for general purpose computing. Like fitting formula 1 car parts onto a camry.
Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.
This is not exactly fixed since you can vary the amount of memory each page maps with things like hugepages and the same process can run with different page sizes.
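To put rough numbers on that linear relationship, here is a back-of-the-envelope calculation in Python. The assumptions are mine: 8 bytes per leaf page table entry (as on x86-64), the ~25GB figure from the Redis post treated as 25 GiB, and upper-level tables ignored.

```python
GIB = 1 << 30
PTE_BYTES = 8  # assumption: x86-64 style 8-byte entries

def leaf_entries(resident_bytes, page_size):
    """Number of leaf page table entries needed to map resident_bytes (rounded up)."""
    return -(-resident_bytes // page_size)

if __name__ == "__main__":
    resident = 25 * GIB
    for label, page_size in [("4 KiB", 4 << 10), ("2 MiB", 2 << 20)]:
        n = leaf_entries(resident, page_size)
        mib = n * PTE_BYTES / (1 << 20)
        print(f"{label} pages: {n:,} entries, about {mib:.1f} MiB of leaf tables")
    # 4 KiB pages: 6,553,600 entries, about 50.0 MiB of leaf tables
    # 2 MiB pages: 12,800 entries, about 0.1 MiB of leaf tables
```

So even with copy-on-write, a fork of that process has on the order of 50 MiB of page tables to walk and copy with 4 KiB pages, and about 512 times less with 2 MiB huge pages.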
My idea is that we could make a new syscall, for example "spawn", that creates a new empty process, loads some lightweight "loader" into it, and passes it arbitrary configuration data. The loader configures the process and exec()'s the main program. This avoids forking the memory and keeps existing APIs, but still requires forking file descriptors and other things.
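A user-space approximation of this idea can be sketched with what exists today. In the Python sketch below, `os.posix_spawn` stands in for the hypothetical "spawn" syscall and a tiny inline script stands in for the loader (a whole Python interpreter is obviously not lightweight). The config keys `cwd`, `umask`, `env` and `argv` are made up for illustration.

```python
import json
import os
import sys

# The "loader": reads its configuration, applies it to its own process, then execs the target.
LOADER = r"""
import json, os, sys
cfg = json.loads(sys.argv[1])
os.chdir(cfg.get("cwd", "."))
os.umask(cfg.get("umask", 0o022))
os.execve(cfg["argv"][0], cfg["argv"], cfg.get("env", {}))
"""

def spawn_with_loader(cfg):
    """Start the loader in a fresh process, hand it cfg, and return the target's exit code."""
    loader_argv = [sys.executable, "-c", LOADER, json.dumps(cfg)]
    pid = os.posix_spawn(sys.executable, loader_argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

if __name__ == "__main__":
    check = "import os, sys; sys.exit(0 if os.getcwd() == '/' else 1)"
    print(spawn_with_loader({"argv": [sys.executable, "-c", check], "cwd": "/"}))
```

The configuration code runs inside the new process, so no new cross-process variants of chdir, umask and friends are needed, and the parent's memory is never duplicated by the caller.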
(Sorry if you weren't joking) but yes, posix_spawn() has been a thing, and in glibc fork() is just a wrapper around clone()
Not exactly the OP's idea, but fork/exec really is legacy.
For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.
Especially since I suspect that is by far the more common case. I suspect “I want a clone of me” is relatively rarely used at this point.
Every couple of years, someone claims they have "the solution" implying everyone else who came before them didn't know what they were doing.
I.e. a year that starts with 20, not 19.
That, and even those clone-without-pagetable-copy improvements leave a lot of slowness on the table. Being able to skip even disable-able functionality intended for fork would simplify code. Also, for programs that launch the same subprocess many times, a better API might allow caching away some of the pre-entrypoint initialization of exec.
- address space
- memory objects
- threads
Mix and match. A Task (process) is not a primitive, but a composite object combining address space with one or more threads. How you fill the address space with actual memory objects is up to you. Map from disk or COW your own address space...have fun!
https://developer.apple.com/library/archive/documentation/Da...