Moving beyond fork() + exec()

(lwn.net)

359 points · jwilk · 19d ago · 345 comments

164 comments · 27 top-level

rom1v19d ago· 35 in thread

Related to the discussion: "A fork() in the road": https://www.microsoft.com/en-us/research/wp-content/uploads/...

> ABSTRACT

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.

> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.

Animats19d ago

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.

This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.

QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.

afiori18d ago

This comment starts with a no, but agrees with the parent...

not_a_bijection18d ago

I think fork() is more of a PDP-7 mistake than a PDP-11 mistake. On the original UNIX system, memory was so limited that the only sane partitioning was to write the running program's memory image to disk, then reuse the running image as the child. An immediate consequence is the UNIX I/O model, where disk I/O is always synchronous (can't swap processes while waiting for disk I/O because swapping processes requires disk I/O). Anyway, as soon as the UNIX group got a PDP-11, the model broke down, because they had enough memory for multiple processes, but fork() didn't allow them to run concurrently, because their first PDP-11 didn't have an MMU. So they whined until they got one with an MMU instead of fixing their broken design.

bluepuma7718d ago

> It was needed back in the era of really expensive memory.

Well, it seems we are back in an era with really expensive memory.

BobbyTables218d ago

The QNX approach is also pretty much how the dynamic linker loads shared libraries today in Linux.

“An era of really expensive memory”. That sounds familiar…

vanviegen18d ago

I think GP was saying that in QNX the spawning process was responsible for dynamically linking its child process before running it. With Linux, I think it's the spawned process taking care of its own dynamic linking.

lukan19d ago

It is almost as if you agree with the authors ..

"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"

(But thanks for the good explanation)

duped18d ago

> It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.

aiui this is what exec does, the problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).

cryptonector18d ago

> > The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

> No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.

Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().

dcrazy18d ago

Don’t pretty much all OSes implement process startup in userspace? On macOS, the kernel creates a process with an image of dyld and points it at dyld_start, which actually takes care of parsing the Mach-O header. I assumed ld.so does the same job on Linux.

purkka18d ago

Nope, the kernel can load static ELF binaries. ld.so is only needed for dynamically linked binaries, and in fact many Go applications (for example, as they're statically linked) ship as containers with nothing but the single binary.

fc417fc80218d ago

Yes, it can all be done in userspace. When the "fork in the road" paper came up a while back someone linked to an example. https://grugq.github.io/docs/ul_exec.txt

derriz18d ago

But why is having a pair of separate independent operations, fork and exec, required to achieve this? A single fexec call could be implemented to work in the way you describe, no?

dooglius17d ago

Fork isn't necessary for this, you could just exec directly?

cryptonector18d ago

Cygwin's fork() is similar to what you describe for QNX.

anarazel19d ago

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.

mort9619d ago

The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.

Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.

pjmlp19d ago

Because that OS's best practice is to use threads.

Traditionally, Windows applications that create processes all the time come from a UNIX heritage.

Contrary to UNIX, Windows NT was designed with a threads-first mentality from the get-go.

While on UNIX they were added after the fact, and to this day there are gotchas when mixing POSIX threads with signals, fork and exec.

PaulDavisThe1st19d ago

A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that no longer share memory. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.

Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher-level abstractions like pthreads.

You're right that POSIX semantics get tangled when using threads.

nine_k18d ago

POSIX threads having problems with signals is, imho, mostly the problem with signals in general. They are pretty poorly designed: https://lwn.net/Articles/414618/

sunshowers19d ago

The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.

knome19d ago

the only difference between a thread and a process on linux is how many structures they share. the function is identical.

zozbot23419d ago

Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.

aseipp19d ago

I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++

nvme0n1p119d ago

That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.

aseipp19d ago

This paper is great and I also really like one of its references [29] as it goes into some more subtle parts of scalable interfaces, including fork. It's a gem IMO: The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors https://people.csail.mit.edu/nickolai/papers/clements-sc.pdf

omoikane19d ago

Discussion at the time:

https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)

jwilkOP19d ago

Discussed also in 2021: https://news.ycombinator.com/item?id=29709802 (16 comments)

pizlonator19d ago

Fork is marvelous for the zygote pattern

Hard to come up with an optimization that is equally efficient and elegant

toast019d ago

The zygote pattern[1] is a great optimization to deal with the cost of forking, but IMHO, being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.

I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.

[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.
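
The zygote arrangement in footnote [1] can be sketched roughly as follows. This is a minimal illustration in Python rather than anything posted in the thread: the names `start_zygote` and `spawn_via_zygote` are hypothetical, the one-byte request protocol is an assumption made for brevity, and the worker simply exits with the requested status where a real zygote would exec or run application code.

```python
import os


def start_zygote():
    """Fork a small helper early, before the application grows.

    Returns (zygote_pid, request_fd, reply_fd). Hypothetical helper.
    """
    req_r, req_w = os.pipe()
    rep_r, rep_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Zygote: keep only its own pipe ends and serve fork requests.
        os.close(req_w)
        os.close(rep_r)
        while True:
            request = os.read(req_r, 1)
            if not request:           # parent closed the request pipe
                os._exit(0)
            worker = os.fork()        # forks the small zygote, not the big parent
            if worker == 0:
                os._exit(request[0])  # stand-in for exec() or real work
            _, status = os.waitpid(worker, 0)
            os.write(rep_w, bytes([os.WEXITSTATUS(status)]))
    # Parent: keep the opposite pipe ends.
    os.close(req_r)
    os.close(rep_w)
    return pid, req_w, rep_r


def spawn_via_zygote(request_fd, reply_fd, exit_code):
    """Ask the zygote to fork a worker; return the worker's exit status."""
    os.write(request_fd, bytes([exit_code]))
    return os.read(reply_fd, 1)[0]
```

The parent only ever writes a byte and reads a byte, so its own size and open descriptors never enter the cost of the later forks; that is the whole point of the pattern, and also why it has to be set up at startup.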

vlovich12319d ago

The paper explicitly covers this: various memory COW/snapshot mechanisms are probably faster and safer than the zygote pattern. As it stands, getting the zygote pattern correct and safe is something you have to plan for upfront. You can’t retrofit it, which is why the paper mentions it has poor composability. Also, the advantages of the zygote pattern can be overstated: the memory-sharing benefit is minimal because it has to happen so early, and modern OSes already transparently CoW duplicate pages in the background.

p_l18d ago

And so easy to make into a bottleneck.

Yes, the zygote pattern makes it easy to make fork() into a bottleneck - it requires a lot more discipline and low-level tricks (linker scripts, compiler-specific extensions, custom sections, low-level dependencies on page size that get "fun" on ARM servers).

If you don't, you might wake up with fork() causing latency issues.

cyberax18d ago

Unless you want to create a thread in your zygote. Then it breaks down.

Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.

cryptonector18d ago

Ah, my one time on the HN front-page: Fork() is evil; vfork() is goodness; afork() would be better; clone() is stupid (https://news.ycombinator.com/item?id=30502392).

up2isomorphism18d ago

Not sure if fork is outdated or not, but people calling it a “hack” obviously have pretty bad engineering taste.

uecker19d ago· 26 in thread

The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs. Every attempt to replace it with a combined call that I have seen so far seemed fundamentally poorer because it needs to add all configuration options as parameters to the call and then do this in a way that you can extend later and that does not become a mess.
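
As a rough illustration of "configuration after the fork using the usual APIs", here is a minimal Python sketch. The helper name `run_configured` and its parameters are hypothetical; the calls themselves (`os.fork`, `os.chdir`, `os.dup2`, `os.execve`) are thin wrappers over the corresponding system calls. It assumes a single-threaded parent.

```python
import os
import sys


def run_configured(argv, cwd, extra_env):
    """Fork, configure the child with ordinary calls, then exec.

    Returns (exit_status, captured_stdout). Hypothetical helper.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Everything here runs in the child and uses the usual APIs.
            os.chdir(cwd)                    # working directory
            os.dup2(w, 1)                    # stdout now goes to the pipe
            env = dict(os.environ, **extra_env)
            os.execve(argv[0], argv, env)    # only returns on failure
        finally:
            os._exit(127)                    # never fall back into parent code
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), b"".join(chunks)


if __name__ == "__main__":
    child_code = "import os; print(os.getcwd(), os.environ['DEMO'])"
    print(run_configured([sys.executable, "-c", child_code], "/", {"DEMO": "1"}))
```

Nothing about the working directory, the redirect, or the environment had to be encoded as a parameter of the process-creation call itself, which is the property being praised here.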

amluto19d ago

I have the entirely opposite opinion. IMO a big mistake of the UNIXy model is that so much state is preserved across the creation of a process. For example, there are APIs to have a specific thing be fd number 4 so you can run a program and have it find that thing at fd 4. This is weird.

Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.

uecker19d ago

Well, a lot of the power of the UNIX shell comes from this and I see this as a major advantage over Windows. So no, I do not think Windows got it right.

Any kind of replacement should aim for the same conceptual simplicity and power. Sadly, I fear that people driving development nowadays are more interested in building unbreakable walled gardens for advertisement or app stores, or trying to squeeze out some small gain when used on the cloud. I am more interested in general computing on the user side.

__david__19d ago

Having fd 4 mean something specific is no weirder than having fds 0,1, and 2 mean something specific, which is probably never going to change. At some point you just gotta embrace the Unix.

JdeBP19d ago

Heh! The Unix didn't embrace the idea of file descriptor 3 meaning something specific. (-:

* https://jdebp.uk/FGA/bernstein-on-ttys/cttys.html

Interestingly, on MS/PC/DR-DOS file descriptor 3 was stdaux and file descriptor 4 was stdprn.

171862744019d ago

Is it weirder than being able to pass a variable precisely as argument 4? You do need to pass information to a subprocess and there needs to be some agreement on what means what. Sure, maybe you could use names instead of fds, but that sounds needlessly complicated.

chasil19d ago

Well, Cygwin and Busybox have shown me that fork-heavy activities are about 100x slower on Windows than Linux.

The Windows approach may be correct, but it suffers in performance from the POSIX perspective.

I have heard that WSL1 improves this.

burnt-resistor19d ago

You're simply failing to grasp the value of the simplicity, compatibility, and portability of POSIX/*nix. Inventing yet another way to create a process would be complex and break things. It's a-la-carte to enable configuration after fork of the new CoW or non-CoW process but before exec (unless vfork or similar were used). This is the model.

If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.

jkrejcha18d ago

I kinda disagree, though I do see the usefulness here. While fork/exec can be useful in some cases, it'd be honestly pretty neat if the APIs took a pidfd argument (maybe with 0 meaning current process). The only problem is setuid/setgid binaries, I suppose, but maybe this case is better handled by special-casing `exec`.

For example

   pidfd_t ps = spawn(); // creates a process stopped (kernel does this anyway by default)
   setuid(ps, 33);
   capset(ps, ...);
   socket(ps, ...);
   mmap(ps, ...);
   process_vm_writev(ps, ...);
   exec(ps, ...);
   signal(ps, SIGCONT);
   // error handling elided

I guess this is me being a bit critical of the usual syscall APIs for not thinking about "what if I want to do this to another process I have access to" but...

It also makes things like thread safety even reasonably doable with fork. I do agree though that stuff like CreateProcess, which takes in a gazillion parameters, doesn't really make for the greatest of userspace APIs.

uecker18d ago

Maybe, a few people proposed this. It is a lot better than a single spawn call.

But how often would one actually need this? And what are the semantics? Do arguments (e.g. file descriptors) refer to the current process or the other one? How are cross-permissions handled? It seems a lot of complexity...

Someone proposed a ptrace_syscall which could achieve the same thing.

jkrejcha18d ago

> But how often would one actually need this?

Well, the idea is that it'd probably be close to the default API for spawning processes (and could even be the bedrock for posix_spawn and friends in libc (and potentially even "simple" fork cases[1])). fork/clone would be the special case

In most cases, most programs don't need special setup. Something like `ptrace_syscall` would also work for this and would be probably the way to do it with the backwards compat limitations of nowadays

ptrace-ability seems to be how permissions for this sort of thing are generally handled (see also procfs, process_vm_writev, ptrace, etc). The complication is a little bit around setuid programs, but you could either special-case execve to imply SIGCONT for setuid binaries or have execve always imply a SIGCONT.

[1]: Probably would be rare for a compiler to optimize it though

jcranmer18d ago

Calling that elegant is a path dependence of the history of fork+exec.

In an alternative world where fork+exec never existed, a lot of those "usual APIs" would probably have had an explicit pid argument to them that let you modify process configuration from a different process. (This is how Fuchsia works, e.g.). There's a lot of benefit to this world: the most obvious is that you don't have to magic up some IPC system just to report configuration errors, but there's actually a good amount of utility in being able to have a manager process that is tweaking attributes of its children (e.g., debuggers would love it).

trumpdong18d ago

Or you could call ptrace_syscall (that doesn't currently exist) on your child processes that you are tracing because you'd always be tracing them by default, or get an io_uring for the child process, or...

uecker18d ago

Weren't there enough parallel paths of development in this world?

fonheponho18d ago

> The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs.

Unfortunately, the opposite is true, when the parent process is multi-threaded. In the child process, only one thread exists (the thread returning from fork()), but the memory is an exact copy of the parent's. As a result, the child may inherit locks (resident in memory) that are in acquired state, but have no owner threads -- the threads that are responsible for eventually releasing those locks in the child's copy of the process memory do not exist in the child. If the single thread in the child process (returning from fork()) attempts to take such a lock (before exec), it deadlocks. This is why POSIX says that only async-signal-safe functions may be called in a child process, between fork and exec. And then, for example, "malloc" is not such a function (at least per POSIX), so the fork-to-exec environment in the child process is an extremely uncomfortable one. You've got to preallocate everything in the parent, can't report errors to stderr, etc.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/f...

https://pubs.opengroup.org/onlinepubs/9799919799/functions/V...

The fork(2) Linux manual page spells out the same restriction.

https://man7.org/linux/man-pages/man2/fork.2.html

https://man7.org/linux/man-pages/man7/signal-safety.7.html

"pthread_atfork" exists, but is effectively unusable.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/p...
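
The inherited-lock hazard described above can be sketched in a few lines of Python. This is an illustration under assumptions, not something from the linked specifications: the function name is hypothetical, it relies on CPython's `threading.Lock` state living in ordinary process memory, and the child uses a non-blocking acquire so that it reports the problem instead of deadlocking. Recent Python versions may also emit a DeprecationWarning when forking a multi-threaded process, for exactly this reason.

```python
import os
import threading


def child_can_take_lock_after_fork():
    """Return True if a forked child could acquire a lock that a
    different parent thread held at the moment of fork()."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()
            release.wait()

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()                      # the other thread now owns the lock

    pid = os.fork()
    if pid == 0:
        # Child: the holder thread does not exist here, but the lock's
        # memory was copied in its acquired state.
        got_it = lock.acquire(blocking=False)
        os._exit(0 if got_it else 1)

    release.set()                    # let the parent's holder thread finish
    thread.join()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

On CPython this is expected to return False: nothing in the child will ever release the lock, so a blocking acquire at that point would hang forever.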

uecker18d ago

Yes, threads are a complication, but this is still not "the opposite".

fanf219d ago

Yeah. The right way to eliminate fork() is to make the usual APIs that modify process state take an explicit process handle, so the same APIs can be used to set up an empty process. They can also be composed in other ways, eg for IPC or debugging.

zzo38computer18d ago

I agree with it, although fork is still expensive, as they mention. There is clone with some flags, although that does not really solve it.

I think one problem is that it is already how it is; making an entirely new operating system (that is not Linux, not GNU, and not POSIX) would solve it, but that is not the case here, so it would need to be done as it is.

One possibility would be a new function that creates a new empty child process, but the parent process specifies what system calls the child process executes, and can stop if specifying that exec or exit is (successfully) called by the child process, or if the parent process gives it the program memory to execute directly instead of using a file (since that use is also useful). The new function can still have some of the clone flags available. (I don't actually know how much better it would work.)

There are other possibilities as well.

The existing methods can also remain available for when they are helpful, but functions such as popen might be changed to use the new method.

trumpdong18d ago

It should be spawn, configure, exec. Configure can be done if the process starts with a ptrace attachment and no threads, so you can force it to do syscalls. Linux doesn't even have a concept of "process with no threads", so it'd probably have to have a dummy thread.

__david__19d ago

I agree. I think the current way is very nice to use (in c). I think the best way would be to have something similar to vfork() but not bound by posix rules. Then make the normal posix apis (close, setuid, etc.) act like the Rust “builder” pattern. Possibly giving them a prefix for explicitness. That way the “fill out a giant structure” people could have their wish and the people that just want a faster posix experience don’t have to learn an entirely new concept and api surface. It would be future extensible that way, too (just add more prefixed calls to the builder).

cryptonector18d ago

> something similar to vfork() but not bound by posix rules

POSIX says nothing much about vfork() anymore. It was a mistake removing it. Zealots failed to understand that vfork() >> fork(). https://news.ycombinator.com/item?id=30502392

pjc5018d ago

The flip side of this is that you have to be aware of the entire state of the process, including everything done in libraries, in order to correctly start a new process.

Quick, what's the highest numbered open file descriptor in your program?

This gets even worse if you have multiple threads running. Without looking it up, what is the state of all the various synchronization primitives in a forked process?

garaetjjte19d ago

That's mostly papering over the design mistake that most syscalls don't accept a target pid. Otherwise you could just create a suspended process, configure it with syscalls that explicitly take a target pid, and start it.

uecker19d ago

Maybe, I am not saying fork() + exec() model couldn't be improved, but most people saying it is "terrible" and it needs to die seem to go on to propose something substantially worse.

trumpdong18d ago

Or have a syscall that runs any other syscall in a different process.

matheusmoreira19d ago

The new system calls described in the article have an extensible declarative command interface built into them to do things like close or duplicate file descriptors. Not opposed to it but it definitely stood out to me.

PaulDavisThe1st19d ago

Whatever elegance fork(2) has (or doesn't) have, clone(2) has more.

sanderjd19d ago· 18 in thread

I just ran into this recently, where I had an obscure bug caused by needing to close more file descriptors in the forked process. "I want a clone of the current process" is just way less common in my experience than "I want a completely new process". It feels crazy that we don't have a way to directly express the latter thing, and can only approximate it by cloning and then fixing things up in post.

171862744019d ago

But you generally want to communicate with that process, so you do need to setup e.g. file descriptors and stuff, which needs information from the parent process to be passed.

yxhuvud19d ago

Yes, you do want to pass in some stuff. But by default you get every single open file descriptor and a copy of every single stack that any threads use for execution.

It shares way too much, and there are huge use cases where that is really, really bad.

gmueckl18d ago

A variant of exec could take an initial table of file descriptors in the current process that get cloned into the new child. Pipe creation could also get rolled into this mechanism. That should take care of the most obvious leaky bit of fork()/exec(), at least.

stefan_19d ago

Keep in mind that this is the only way to start any process. Even if you just want to launch some throwaway utility program.

jonhohle19d ago

Most programming languages abstract this out to be able to connect or drop the 3 standard pipes. Typically this is the only thing that can be shared anyway unless the other program is specifically shared and expects other file handles to be available, in which case fork might be the right system call anyway.

sanderjd18d ago

Nevertheless, inclusion would be a better default than exclusion in most use cases I've ever had for process spawning.

7jjjjjjj19d ago

>It feels crazy that we don't have a way to directly express the latter thing

Isn't that what posix_spawn is for?

toast019d ago

posix_spawn addresses the need from userspace. Under the hood, it's still doing more or less a fork/exec, with the baggage that comes with it. A syscall would be nicer.
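
For reference, this is roughly what the userspace side of posix_spawn looks like, using Python's `os.posix_spawn` wrapper (available on Unix in Python 3.8+). The helper name `spawn_and_capture` is hypothetical; the `file_actions` tuples are the wrapper's way of expressing "dup2 this, close that" declaratively instead of running code between fork and exec. How the C library implements the call underneath is not shown and varies by platform.

```python
import os
import sys


def spawn_and_capture(argv):
    """Spawn argv with stdout redirected into a pipe.

    Returns (exit_status, captured_stdout). Hypothetical helper.
    """
    r, w = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_DUP2, w, 1),   # child's stdout becomes the pipe
        (os.POSIX_SPAWN_CLOSE, r),     # child has no use for the read end
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), b"".join(chunks)


if __name__ == "__main__":
    print(spawn_and_capture([sys.executable, "-c", "print('spawned')"]))
```

The caller never executes its own code inside the child, which sidesteps the restrictions on what may run between fork and exec, at the price of being limited to the actions the interface happens to offer.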

yxhuvud19d ago

And how do you think posix_spawn is implemented?

JdeBP19d ago

This is an oft-overlooked point. An obvious place to look for improving fork+execve is to see whether posix_spawn can be given more efficient kernel mechanisms to be based upon.

And of course that has already been done. On NetBSD, posix_spawn() is a fully-fledged system call and much of the work is done in kernel mode.

* https://blog.netbsd.org/tnf/entry/posix_spawn_syscall_added

stabbles19d ago

Isn't that covered by O_CLOEXEC?

sanderjd18d ago

I think it is error prone to need to iterate file descriptors and set this in order to inherit nothing. Excluding by default would make sense IMO.
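
Some language runtimes already approximate "exclude by default" on top of O_CLOEXEC. As one example, Python has created file descriptors as non-inheritable by default since PEP 446, and `subprocess` only lets through the descriptors named in `pass_fds`. The sketch below uses a hypothetical helper, `child_can_write`, to show both cases; the exit code 3 is an arbitrary value chosen for the demo.

```python
import os
import subprocess
import sys


def child_can_write(fd_is_passed):
    """Spawn a child that tries to write to the parent's pipe by fd number.

    Returns (child_exit_code, bytes_received). Hypothetical helper.
    """
    r, w = os.pipe()                   # non-inheritable by default (PEP 446)
    assert not os.get_inheritable(w)
    code = (
        "import os, sys\n"
        "try:\n"
        f"    os.write({w}, b'hi')\n"
        "except OSError:\n"
        "    sys.exit(3)\n"            # the descriptor does not exist in the child
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        pass_fds=(w,) if fd_is_passed else (),   # explicit opt-in per descriptor
    )
    os.close(w)
    data = os.read(r, 16)              # b'' if the child never wrote
    os.close(r)
    return result.returncode, data
```

With `fd_is_passed=True` the child sees the descriptor under the same number; without it, the write fails in the child and the parent reads nothing. This only covers descriptors the runtime itself created, so it does not help with ones opened by C libraries that forget O_CLOEXEC.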

anarazel19d ago

There's a bunch of nastiness around that too. If you have e.g. library state that assumes the fd still works you can get very confusing bugs once another file is opened into that fd number...

dnw19d ago

What do you mean by "a completely new process"?

wongarsu19d ago

The equivalent of CreateProcessW https://learn.microsoft.com/en-us/windows/win32/api/processt...

sanderjd19d ago

A process that shares nothing with the process that spawned it.

jerf19d ago

A thing that makes that complicated is that while you want that conceptually, you don't want that in reality. For instance, if the spawning process is in a container of some sort and it spawned a process that "shares nothing with the process that spawned it", the spawned process would no longer be in that container, because the state of "being in the container" is one of the things it shares with the parent process.

This is just an example of I don't even know how many things a modern-day process will share from its parent.

By "complicated" I do not even remotely mean "unsolvable". I just mean that if you really dig down into what it means to "share nothing" in a modern operating system, it's a lot richer than it was back when fork+exec was a practical solution. There's a lot of fuzzy things that could go either way when you say "shares nothing".

JoBrad19d ago

That’s how you get zombie processes and memory leaks.

hparadiz19d ago· 9 in thread

Maybe tangentially related but I always think it's silly that every Linux process has the same libgcc_s.so.1 loaded into memory for each process even though the raw binary for the library is exactly the same so you end up with like 800 copies of libgcc_s.so.1 in memory.

I mean maybe this has been optimized for already and I don't know what I'm talking about but maybe someone with more knowledge about the kernel knows? Is this something we simply can't optimize for because of security implications?

20198419d ago

Shared libraries (and mmapped files in general) are deduplicated; it's nowhere near as bad as you think. The kernel loads a .so into memory once and then maps that memory into every process that mmaps it.

Editing to add: this deduplication is one of the greatest upsides to dynamic linking. Common libs like libgcc and libc only have to exist in memory once and can stay in CPU caches, whereas if they were statically linked into every binary, each binary would have a copy of that library that wouldn't be shared with anything else and you'd waste a lot of memory.

sjmulder19d ago

Doesn't the loaded code have to be patched for relocations?

saidinesh519d ago

Typically libgcc_s.so is loaded by the dynamic linker, which uses an mmap call to map the binary into the address space.

> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.

Relevant stack overflow answer: https://stackoverflow.com/questions/61950951/linux-shared-li...

mlaretallack19d ago

In Linux, when a shared lib is loaded by multiple processes, it's loaded once and not duplicated in RAM. Only if a memory page is modified by the process will the memory be duplicated. (Hope I have explained that correctly)

monocasa19d ago

Those mappings by default all go to the same shared memory.

Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.

BoingBoomTschak19d ago

Eh? Aren't shared libraries actually shared in memory?

171862744019d ago

Yeah, that's kind of the point.

johnthescott18d ago

Shared libraries do not imply shared in RAM only.

sirsinsalot19d ago

I have a rule for myself. If I think something is silly or stupid, I assume I don't understand it. I usually find I do not understand it, and it no longer seems silly when I do understand it.

In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.

mrkeen19d ago· 8 in thread

> fork() is a relatively expensive system call; it must copy the entire process state (including memory) for the child process. Many optimizations have been made over the years, but a fork is still a fundamentally costly operation. To make things worse, a fork() call is often immediately followed by an exec(), which will discard all of that memory that was so carefully copied for the child.

It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.

tux319d ago

This was left implicit in the article, but what they mean by copying the process state here is the memory management structures. That's mainly the page tables and the VMAs.

That means you have to allocate new pages to hold a copy of all these structures, even if the actual memory pointed by the pages is shared. And walking all those structures to make a copy is still costly.

thamer18d ago

Redis is the kind of process where this matters a lot, and while fork() doesn't copy the memory, it still needs to copy the page table. For a process holding tens of GBs of RAM, fork() can take a long time, and there's one every time Redis dumps its .rdb file or rewrites its binary log ("AOF").

Even back in 2012 this blog post showed the high cost of this operation: https://redis.io/blog/testing-fork-time-on-awsxen-infrastruc...

On an m2.xlarge using ~25GB of RAM, fork() took 5.67 seconds. That's a long pause when Redis clients typically experience single-digit msec latency for most operations. Yes, that's only the time needed to copy the page table. It's surprising they don't mention huge pages; that seems like a key consideration here.

No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.

epcoa19d ago

> It's weird to leave out a mention of copy-on-write

For the intended audience of such a paper this is base knowledge.

FooBarWidget19d ago

It says state. Copy on write still means it's O(number of page table entries) even if you don't copy the contents. It's a well known issue that forking a program with large virtual memory size is slow.
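To put rough numbers on the "O(number of page table entries)" point, here is a back-of-the-envelope sketch. It assumes 8-byte entries (as on x86-64), counts only the last-level entries, and takes the 25 GB figure from the Redis comment above; the helper name `leaf_ptes` is made up for illustration, and how many of these entries the kernel actually copies at fork() depends on the mappings.

```python
PTE_BYTES = 8          # assumed size of one page-table entry (x86-64)
GiB = 1024 ** 3


def leaf_ptes(mapped_bytes: int, page_bytes: int) -> int:
    """Number of last-level entries needed to map `mapped_bytes`."""
    return -(-mapped_bytes // page_bytes)  # ceiling division


rss = 25 * GiB
for label, page in (("4 KiB", 4 * 1024), ("2 MiB", 2 * 1024 ** 2)):
    n = leaf_ptes(rss, page)
    print(f"{label} pages: {n:,} entries, ~{n * PTE_BYTES / 2**20:.1f} MiB of leaf tables")
# 4 KiB pages: 6,553,600 entries, ~50.0 MiB of leaf tables
# 2 MiB pages: 12,800 entries, ~0.1 MiB of leaf tables
```

So even with copy-on-write, a fork of that process may have to walk and duplicate on the order of millions of entries with 4 KiB pages, versus thousands with 2 MiB huge pages.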

mort9619d ago

It says "(including memory)". It's pretty natural to read this as "(including the contents of allocated pages)".

m00x19d ago

On modern hardware a COW page copy should only take 1-5ms. Redis forks to save the db to disk and it's been a solid design choice.

I guess it depends on how sensitive your application is to main thread pauses.

3 more replies

cls5919d ago

Even with copy-on-write, fork() still has to pay the setup cost for COW. If the parent process has a lot of busy threads (e.g. Java), you can end up doing a lot of unnecessary COW before exec() fires.

josefx18d ago

Isn't that what vfork tried to address? No COW, the child starts in its parent's address space and only gets its own after calling exec.

1 more reply

Sophira19d ago· 8 in thread

I'm guessing that a big part of the problem with moving away from fork() in general is that each new process needs a copy of the parent process' environment anyway, right?

zerobees19d ago

The LWN article is incorrect in saying that it "must copy the entire process state (including memory) for the child process". There are some kernel structures and page tables that need to be initialized, plus you need a new stack, but it's not nearly as dramatic as implied. Most of the parent's memory is "incorporated by reference", so to speak.

In fact, if you profile it, in the fork() + execve() model, execve() is far more expensive, because not only does it replace the old process with a new one, but it also involves running the dynamic linker, which opens, parses, and mmaps library files.

It still makes sense to get rid of the fork() overhead if you're going to throw away the cloned process state soon thereafter, but if you wanted to make process execution radically faster, rethinking the exec architecture would probably offer more significant gains.

corbet19d ago

The kernel does not copy every page, but it does have to copy all of the VMAs. Setting memory to COW (which can involve changing a lot of page-table-entries) is not free either. I guess I could have mentioned copy-on-write explicitly, but I do not believe that what I wrote was incorrect.

nasretdinov19d ago

Fork becomes more and more expensive the higher the RSS of the process, roughly 1 ms per 1 GB of process size with 4 KB pages. Given that modern servers can easily support 1-2 TB of RAM, the fork() part can easily take several hundred milliseconds, blocking everything in the meantime. So for larger programs you kinda have to have a "fork helper" process if you need to execute external programs for some reason.
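A minimal sketch of what such a "fork helper" could look like, assuming a Unix system. The class name `SpawnHelper` and the one-JSON-argv-per-line protocol are invented for illustration; a real helper would also need error reporting and probably stdio passing.

```python
import json
import os
import subprocess


class SpawnHelper:
    """A small child forked early, before the parent grows large.

    The parent sends argv lists down a pipe; the helper runs them and
    reports the exit code back, so the big process never forks again.
    """

    def __init__(self):
        req_r, req_w = os.pipe()
        resp_r, resp_w = os.pipe()
        self._pid = os.fork()
        if self._pid == 0:
            os.close(req_w)
            os.close(resp_r)
            self._serve(req_r, resp_w)  # never returns
        os.close(req_r)
        os.close(resp_w)
        self._req = os.fdopen(req_w, "w")
        self._resp = os.fdopen(resp_r, "r")

    @staticmethod
    def _serve(req_fd, resp_fd):
        code = 0
        try:
            with os.fdopen(req_fd, "r") as requests, os.fdopen(resp_fd, "w") as responses:
                # One JSON-encoded argv per line; EOF ends the loop.
                for line in iter(requests.readline, ""):
                    rc = subprocess.run(json.loads(line)).returncode
                    responses.write(f"{rc}\n")
                    responses.flush()
        except BaseException:
            code = 1
        finally:
            os._exit(code)  # never fall back into the parent's code

    def run(self, argv):
        """Ask the helper to run argv and return its exit code."""
        self._req.write(json.dumps(argv) + "\n")
        self._req.flush()
        return int(self._resp.readline())

    def close(self):
        self._req.close()   # EOF tells the helper to exit
        self._resp.close()
        os.waitpid(self._pid, 0)
```

The idea is to create the helper at startup, e.g. `helper = SpawnHelper()`, and later call `helper.run(["git", "status"])` from the large process; only the small helper ever pays the fork cost.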

sanderjd19d ago

A lot of times you actively don't want the parent environment or any of the memory or file descriptors. And then you have to actively do work to fix all that stuff up after the fork.

dijit19d ago

I'm a bit naive, but I don't think that's necessarily a requirement.

It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.

Unless I've missed something (which is totally possible, this is not an area of OS design I've spent much time on).

lanstin19d ago

But also UID, groups, controlling TTY, process group, capabilities, pipes, shared memory, etc. and the file descriptors while maybe not inherently needed are how a lot of Unix plumbing works.

sjmulder19d ago

Even DOS has environment inheritance!

lokar19d ago

the environment is not that big

ajkjk19d ago· 7 in thread

Fork always seemed conceptually terrible even when I first learned about it. If you want to do one thing (start a process) you should not have to use a mysterious incantation that does a different unrelated thing (forks your process) in order to do it.

I am curious about what the best way to handle the example in the article of one process spawning many git subprocesses is. Surely it just doesn't make sense to repeatedly start git from scratch in the course of a long-running parent operation. What's the low cost abstraction for the same result, though?

spacechild118d ago

Yeah, as someone who originally came from Windows, the fork+exec model never made sense to me. Now I know it's just a historical quirk, but for some reason there are still people who pretend that fork+exec is actually a good thing...

kps18d ago

Fork is conceptually simple. Without bringing in any other layers, you start a process with the one thing known to exist: yourself.

Otherwise you need multiple steps to create a process, fill it with something to run, and arrange for it to execute. Or like Win32 you permanently smush them together with other layers, like filesystems and object loaders and linkers.

IshKebab18d ago

It's not conceptually simple. No other object creation API works by copying an existing thing and then modifying it. You don't create a new file by copying an existing one and then modifying it. You don't create a new window by copying an existing one and modifying it.

Attempting to justify clone/exec as a reasonable design is just Stockholm syndrome.

1 more reply

Too18d ago

Fill with what stuff exactly?

The only thing I want to inherit from the parent process is its cwd and environment variables, even those are often overridden. The rest can easily be passed explicitly through other channels like pipes or command line arguments.

Back to the example from the article. It makes no sense that a git subprocess forked from a web server needs to have any process state inherited from the web server.

1 more reply

ajkjk18d ago

I guess that way of thinking makes sense if you have a certain model of what a process is, in terms of the data structures and runtime state etc. But, tbh, I think of processes as glorified function calls, which happen to have that stuff involved as an implementation detail. And if spawning a process is supposed to act like a function call, then of course it should not inherit state. You should call the function you want to call, not call yourself with an instruction to switch over to it instead.

1 more reply

wmf19d ago

libgit2 exists. You could imagine communicating with some gitd over a pipe/socket but I don't know why that would be a good idea. Short of that you have to spawn processes.

trumpdong18d ago

On Windows maybe it would be a COM server, using IPC built into the OS. The client sees it like a local function call.

lokar19d ago· 6 in thread

This seems unnecessary to me. In the example, the core of git should be a library you can link so you don't need to run the binary. That would be better in every way.

171862744019d ago

But when you use a process, you get tons of things for free, the subtask is invoked in parallel, you get isolation and you can control execution for free. Unless you are already writing a multithreaded program or already accept passing objects in memory, using a process is actually easier to write than using a library.

If I use a library, I also need to start using threads and need to invent some core synchronization mechanism. I am essentially reinventing a small scheduler, when I already get this from the OS for free. Also, any crash in the third-party code will now crash the whole program, and the third-party code has access to the whole address space. When invoking a process, you also get a standardized API implemented by the OS.

lokar18d ago

I'm not sure what you mean by inventing a sync mechanism, all languages come with one. Same with a scheduler, either your language runtime or the OS (or both) will deal with scheduling.

omoikane19d ago

Launching git repeatedly was probably not the best example. But it's hard to think of good examples where launching processes repeatedly is the most performant thing to do, probably because launching processes had been expensive and everyone has learned to do something else (libraries, zygotes, etc). Maybe a different question is: if launching processes were cheap, is there something we would implement as processes instead of libraries?

I can recall just one program that's intentionally not implemented as a library, but I think people have since built a library on top of it:

https://dechifro.org/dcraw/#:~:text=Why%20don%27t%20you%20im...

sanderjd19d ago

There are lots of reasons to want to spawn fresh processes, which aren't solved by linking a library.

lokar19d ago

Sure, but not many times a second

2 more replies

aerzen19d ago

Spawning processes should not be on the hot path of any program.

2 more replies

asveikau19d ago· 5 in thread

The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things.

If you contrast that with win32, where you optionally pack a bunch of initial values into a struct, win32 is a much more narrow, less pleasant, less freeform interface, where it is harder to introduce more features.

But I think there is already posix_spawn to imitate that philosophy on Unix-like OSs.
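As a concrete sketch of the "do things between fork and exec" pattern, here is the dup2() plus process-group example using Python's thin wrappers over the same syscalls. The function name `run_with_stdout_pipe` is made up for illustration, and error handling is kept to the bare minimum.

```python
import os
import sys


def run_with_stdout_pipe(argv):
    """Fork, rewire the child's stdout to a pipe, give it its own process
    group, then exec argv. Returns (output_bytes, exit_status)."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: ordinary syscalls now configure the process-to-be.
        try:
            os.close(read_end)
            os.dup2(write_end, 1)     # stdout now points at the pipe
            os.close(write_end)
            os.setpgid(0, 0)          # become leader of a new process group
            os.execvp(argv[0], argv)  # replaces this process image on success
        finally:
            os._exit(127)             # reached only if something above failed

    # Parent: collect the child's output, then reap it.
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return output, os.WEXITSTATUS(status)
```

For example, `run_with_stdout_pipe([sys.executable, "-c", "print('hi')"])` should return the child's output and a zero exit status. Everything between `os.fork()` and `os.execvp()` is plain code, which is the freeform quality being praised here.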

dcrazy18d ago

posix_spawn is emulated on Linux, but it is a native syscall on macOS (and possibly other OSes?). As discussed in the linked article, there is interest in changing Linux to adopt this model, where posix_spawn is its own fundamental primitive.
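For comparison, a sketch of the same kind of setup (stdout redirected to a pipe, new process group) expressed declaratively through posix_spawn, here via Python's `os.posix_spawnp` (available since Python 3.8). The function name `spawn_with_stdout_pipe` is made up; whether the call is a real syscall or a library wrapper underneath depends on the OS, as noted above.

```python
import os
import sys


def spawn_with_stdout_pipe(argv):
    """Spawn argv with stdout redirected to a pipe and its own process
    group, without running any of our code in the child."""
    read_end, write_end = os.pipe()
    # The "between fork and exec" steps become a list of file actions.
    # Python creates pipe fds close-on-exec, so only the dup2 is needed.
    file_actions = [(os.POSIX_SPAWN_DUP2, write_end, 1)]
    pid = os.posix_spawnp(
        argv[0], argv, os.environ,
        file_actions=file_actions,
        setpgroup=0,               # 0 means: new group led by the child
    )
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return output, os.WEXITSTATUS(status)
```

The trade-off shows up in the shape of the API: anything not expressible as a file action or spawn attribute simply cannot be done this way.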

asveikau18d ago

Yeah, I think it is a reasonable transition path or implementation detail for some systems to implement it in userland atop fork(2), and others to natively spawn a new process without copying the old address space.

loeg18d ago

> The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things.

What do you mean underestimated? You can do anything between fork and exec; there are no limitations.

asveikau18d ago

That's not true. Just one example, if you do anything with threads you are pretty screwed. For example if another thread holds a mutex at the time of fork(2), and you also want that mutex.
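A small sketch of that mutex problem, using Python threads as a stand-in for pthreads (the function name is made up for illustration). Another thread holds the lock at the moment of fork; the child inherits the locked state but not the thread that would unlock it.

```python
import os
import threading
import warnings


def child_can_take_lock_held_by_other_thread(timeout=0.5):
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()
            release.wait()  # keep the lock until the parent says otherwise

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()  # the other thread now owns the lock

    with warnings.catch_warnings():
        # Python 3.12+ warns about fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, but the lock's
        # "locked" state was copied along with the rest of memory.
        got_it = lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 0
```

On the systems I'd expect this to run on, the function returns False: the child times out waiting for a lock nobody in its process will ever release. Without the timeout it would simply hang.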

2 more replies

dcrazy18d ago

That’s not true. man 7 signal-safety

1 more reply

ggm19d ago· 3 in thread

Aesthetically I have no intention of moving beyond. I'm content with my kernel's scheduler and how it maps "heavyweight" processes to cores.

I do use threaded code. It's significantly harder to write and reason about. (45 years in to a CS career, ageing out)

You have to be clever to do better than clever people. Clever people bootstrapped me into fork()/exec() and I know my limits.

redleader5519d ago

When cores start needing more than 9 bits to be represented and RAM is in terabytes, many of the old assumptions need to change. Schedulers need to be implemented in userspace, RAM needs to be allocated in GB, not in 4k, io needs to require less round-trips between kernel and user space and NICs need to do a lot more work before the data reaches the CPU.

skydhash19d ago

Does it need to be the same OS? Most consumer devices are in the low 16GB range for memory, with some outliers at 64 and 128 GB. 32 cores are still in the realm of specialized devices.

Yes, we’re not the ones paying for Linux development, but its subsystems are so complicated for general purpose computing. Like fitting Formula 1 car parts onto a Camry.

1 more reply

skydhash19d ago

I’m using Emacs and various cli tools and while threads are nice to have, they can easily ramp up the complexity of a program beyond what is necessary. I much prefer the boilerplate of setting up a thread pool and tasks queue, rather than dealing with all the await/async syntactic sugar.

jcalvinowens19d ago· 2 in thread

It is a weirdly common misconception that fork() is cheap... it is O(N) on the size of the process, and it always has been.

Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.

themafia18d ago

> the number of page table entries

This is not exactly fixed since you can vary the amount of memory each page maps with things like hugepages and the same process can run with different page sizes.

IshKebab18d ago

You can in theory, but it's rare in practice because it isn't always enabled and it requires root to configure.

ComputerGuru19d ago· 2 in thread

I'm not surprised Chen's patch was rejected; that's an extremely niche usecase not worth supporting. With my shell developer hat on, I agree with the closing "developers would likely welcome a native implementation that isn't (unlike the current implementation) hiding fork() and exec() under the covers".

smj-edison19d ago

It sounds like they're interested in the concept though, just not that specific implementation.

sanderjd19d ago

Yeah this seems like a promising discussion.

1 more reply

codedokode19d ago· 2 in thread

The problem with replacing exec/fork is that you usually want to configure the new process: for example, set up signal handlers, close or open FDs, switch namespaces, set up seccomp, adjust permissions. And all the system calls that do this apply only to the current process, so you need something to replace them. The proposal in the article was to create a new API for this.

My idea is that we could make a new syscall, for example "spawn", that creates a new empty process, loads some lightweight "loader" into it, and passes arbitrary configuration data. The loader configures the process and exec()'s the main program. This avoids forking the memory and keeps existing APIs, but still requires forking file descriptors and other things.

nyrikki18d ago

Luckily someone with a time machine saw your post and added it to POSIX.1-2001 :)

(Sorry if you weren't joking) but yes, posix_spawn() has been a thing, and in glibc fork is just an alias for clone()

Not exactly the OP's idea, but fork/exec is legacy really.

MayCXC17d ago

other people in the thread say that posix_spawn is more or less implemented as a fork+exec wrapper though? it sounds like the idea is more like if there were a separate deferred_fork that made an intermediate "process factory" that let you set up a process without actually creating a new one until the exec. obviously the if() construct would have to be replaced with an in-process handle that mimics calls to the posix api.

debatem119d ago· 2 in thread

There are a lot of slightly different fork-exec-like things in the concept space and it's hard to imagine one approach satisfying them all. IMO it would be interesting to take an approach analogous-ish to sched_ext_ops where you built the rough flow chart of a combined fork-exec, but with hooks built to enable ebpf to change behavior or skip the bits these sophisticated users don't want/need.

MBCook19d ago

Fork/exec is great if you actually want the traditional copy of your process for some reason.

For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.

Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.

debatem118d ago

Relatively rarely, but in some performance sensitive use cases. Mine happens to be fuzzers, where a very cheap fork-like primitive would be a really big win.

1 more reply

LoganDark18d ago· 1 in thread

Huh, LWN has moved to (sometimes) requiring a click to proceed past the subscription pitch to the actual article. I feel like this may have an inverse effect (insistent begging to the point of inserting additional obstacles = angry/insulted users that are less likely to pay).

corbet17d ago

It's an experiment. Compared to the text-obscuring popovers that are prevalent elsewhere on the net, it seems pretty low-key; as far as I know, this is the first complaint I've seen. I don't know if we will continue experimenting with those or not...better ideas for getting people to subscribe to the site would be more than welcome.

burnt-resistor19d ago· 1 in thread

> "If you are repeatedly creating large processes, you are already doing it wrong. The fix is in user space, not the kernel."

Every couple of years, someone claims they have "the solution" implying everyone else who came before them didn't know what they were doing.

yxhuvud19d ago

It can also mean that neither the hardware side nor the software side is static; both change over time. That means that their demands and what they allow also change over time. This leads to the insight that what was perhaps a good idea on 70s hardware/software is not necessarily a good, or even ok, idea 50 years later on modern hardware executing OSes and programs that have been kept up to date.

mike_hock19d ago· 1 in thread

The most astonishing part is that this is dated June 5th, 2026.

I.e. a year that starts with 20, not 19.

JdeBP19d ago

These discussions were definitely had back in the 20th century too. The spawn model versus the fork+execve model has been an on-going debate since the time of MS/PC/DR-DOS.

a-dub19d ago· 1 in thread

i thought this was all fixed with special modes of clone that are optimized and don't actually copy anything (ie, it creates a new deficient process that can pretty much only exec)?

zbentley17d ago

Kind of. Those exist, but because Linux’s formal ABI is syscalls and not libraries that combine them in known-safe ways, the clone speedups that make fork faster are a confusing and fragile API for low-level programmers to use.

That, and even those clone-without-pagetable-copy improvements leave a lot of slowness on the table. Being able to skip even disable-able functionality intended for fork would simplify code. Also, for programs that launch the same subprocess many times, a better API might allow caching away some of the pre-entrypoint initialization of exec.

Panzerschrek19d ago

The whole approach of using fork seems unnatural to me. In many cases (even the majority of them) there is no need to inherit the whole structure of the parent process, only to start a given executable. Windows does this better with its CreateProcessW interface.

ktpsns19d ago

There is lots of discussion on this old API here on hacker news, for instance https://news.ycombinator.com/item?id=31739794

mpweiher18d ago

I've always liked the Mach approach. You've got a few primitives:

- address space

- memory objects

- threads

Mix and match. A Task (process) is not a primitive, but a composite object combining address space with one or more threads. How you fill the address space with actual memory objects is up to you. Map from disk or COW your own address space...have fun!

https://developer.apple.com/library/archive/documentation/Da...

trumpdong18d ago

I liked the other proposal where you can create a blank process and then force it to make syscalls, ending with execve. That doesn't require a bunch of special data structures to hold the syscalls you want to do.

medoc18d ago

Related: executing small commands from the Recoll indexer: https://www.recoll.org/pages/idxthreads/forkingRecoll.html

foo-bar-baz52918d ago

This isn’t moving beyond fork and exec at all. It’s adding a complicated API for a marginal gain for a niche use case, and ignoring the actual big bottleneck of fork

stevefan199918d ago

If fork and exec can exhibit persistent and algebraic behavior (beyond its CoW nature) that would not only be more useful but more interesting to use, for example using it for doing lazy evaluation

tus66618d ago

How can you write for LWN and not have heard of clone(CLONE_THREAD) and multithreading?

high_byte18d ago

all this for 2%

rom1v19d ago· 35 in thread

Related to the discussion: "A fork() in the road": https://www.microsoft.com/en-us/research/wp-content/uploads/...

> ABSTRACT

Animats19d ago

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.

afiori18d ago

This comment starts with a no, but agrees with the parent...

not_a_bijection18d ago

bluepuma7718d ago

> It was needed back in the era of really expensive memory.

Well, it seems we are back in an era with really expensive memory.

1 more reply

BobbyTables218d ago

The QNX approach is also pretty much how the dynamic linker loads shared libraries today in Linux.

“An era of really expensive memory”. That sounds familiar…

vanviegen18d ago

1 more reply

lukan19d ago

It is almost as if you agree with the authors ..

"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"

(But thanks for the good explanation)

duped18d ago

cryptonector18d ago

> > The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

> No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.

Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().

dcrazy18d ago

purkka18d ago

2 more replies

fc417fc80218d ago

Yes, it can all be done in userspace. When the "fork in the road" paper came up a while back someone linked to an example. https://grugq.github.io/docs/ul_exec.txt

derriz18d ago

But why is having a pair of separate independent operations, fork and exec, required to achieve this? A single fexec call could be implemented to work in the way you describe, no?

dooglius17d ago

Fork isn't necessary for this, you could just exec directly?

cryptonector18d ago

Cygwin's fork() is similar to what you describe for QNX.

1 more reply

anarazel19d ago

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.

mort9619d ago

Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.

10 more replies

pjmlp19d ago

Because that OS's best practice is to use threads.

Traditionally Windows applications that create processes all the time come from UNIX heritage.

Contrary to UNIX, Windows NT was designed with a threads-first mentality from the get-go.

On UNIX they were added after the fact, and to this day there are gotchas when mixing POSIX threads with signals, fork and exec.

PaulDavisThe1st19d ago

You're right that POSIX semantics get tangled when using threads.

3 more replies

nine_k18d ago

POSIX threads having problems with signals is, imho, mostly the problem with signals in general. They are pretty poorly designed: https://lwn.net/Articles/414618/

sunshowers19d ago

The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.

1 more reply

knome19d ago

the only difference between a thread and a process on linux is how many structures they share. the function is identical.

1 more reply

zozbot23419d ago

Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.

4 more replies

aseipp19d ago

nvme0n1p119d ago

That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.

1 more reply

aseipp19d ago

omoikane19d ago

Discussion at the time:

https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)

jwilkOP19d ago

Discussed also in 2021: https://news.ycombinator.com/item?id=29709802 (16 comments)

pizlonator19d ago

Fork is marvelous for the zygote pattern

Hard to come up with an optimization that is equally efficient and elegant

toast019d ago

2 more replies

vlovich12319d ago

1 more reply

p_l18d ago

And so easy to make into bottleneck.

If you don't, you might wake up with fork() causing latency issues.

cyberax18d ago

Unless you want to create a thread in your zygote. Then it breaks down.

Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.

1 more reply

cryptonector18d ago

Ah, my one time on the HN front-page: Fork() is evil; vfork() is goodness; afork() would be better; clone() is stupid (https://news.ycombinator.com/item?id=30502392).

up2isomorphism18d ago

Not sure if fork is outdated or not, but people calling it a “hack” obviously have pretty bad engineering taste.

uecker19d ago· 26 in thread

amluto19d ago

Windows, for all its many, many faults, did not use fork+exec and instead mostly has options for how one creates a process. It wasn’t done elegantly, but it was the right decision.

uecker19d ago

Well, a lot of the power of the UNIX shell comes from this and I see this as a major advantage over Windows. So no, I do not think Windows got it right.

1 more reply

__david__19d ago

Having fd 4 mean something specific is no weirder than having fds 0,1, and 2 mean something specific, which is probably never going to change. At some point you just gotta embrace the Unix.

JdeBP19d ago

Heh! The Unix didn't embrace the idea of file descriptor 3 meaning something specific. (-:

* https://jdebp.uk/FGA/bernstein-on-ttys/cttys.html

Interestingly, on MS/PC/DR-DOS file descriptor 3 was stdaux and file descriptor 4 was stdprn.

171862744019d ago

2 more replies

chasil19d ago

Well, Cygwin and Busybox have shown me that fork-heavy activities are about 100x slower on Windows than Linux.

The Windows approach may be correct, but it suffers in performance from the POSIX perspective.

I have heard that WSL1 improves this.

1 more reply

burnt-resistor19d ago

If you want to greenfield re-engineer the world with all new system calls and a totally different execution model, feel free to go right ahead.

1 more reply

jkrejcha18d ago

For example

   pidfd_t ps = spawn(); // creates a process stopped (kernel does this anyway by default)
   setuid(ps, 33);
   capset(ps, ...);
   socket(ps, ...);
   mmap(ps, ...);
   process_vm_writev(ps, ...);
   exec(ps, ...);
   signal(ps, SIGCONT);
   // error handling elided

I guess this is me being a bit critical of the usual syscall APIs for not thinking about "what if I want to do this to another process I have access to" but...

uecker18d ago

Maybe, a few people proposed this. It is a lot better than a single spawn call.

Someone proposed a ptrace_syscall which could achieve the same thing.

jkrejcha18d ago

> But how often would one actually need this?

[1]: Probably would be rare for a compiler to optimize it though

jcranmer18d ago

Calling that elegant is a path dependence of the history of fork+exec.

trumpdong18d ago

1 more reply

uecker18d ago

Weren't there enough parallel paths of development in this world?

fonheponho18d ago

> The elegance of the fork() + exec() model is that every kind of configuration can be done after the fork using all the usual APIs.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/f...

https://pubs.opengroup.org/onlinepubs/9799919799/functions/V...

The fork(2) Linux manual page spells out the same restriction.

https://man7.org/linux/man-pages/man2/fork.2.html

https://man7.org/linux/man-pages/man7/signal-safety.7.html

"pthread_atfork" exists, but is effectively unusable.

https://pubs.opengroup.org/onlinepubs/9799919799/functions/p...

uecker18d ago

Yes, threads are a complication, but this is still not "the opposite".

fanf219d ago

zzo38computer18d ago

I agree with it, although still the fork is expensive like they mention. There is clone with some flags, although that does not really solve it.

There are other possibilities as well.

The existing methods can also remain available for when they are helpful, but functions such as popen might be changed to use the new method.

trumpdong18d ago

__david__19d ago

cryptonector18d ago

> something similar to vfork() but not bound by posix rules

POSIX says nothing much about vfork() anymore. It was a mistake removing it. Zealots failed to understand that vfork() >> fork(). https://news.ycombinator.com/item?id=30502392

pjc5018d ago

The flip side of this is that you have to be aware of the entire state of the process, including everything done in libraries, in order to correctly start a new process.

Quick, what's the highest numbered open file descriptor in your program?

This gets even worse if you have multiple threads running. Without looking it up, what is the state of all the various synchronization primitives in a forked process?
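For the file descriptor half of that question, a best-effort sketch of how a program could find out at runtime. It assumes a Linux-style `/proc/self/fd` (falling back to `/dev/fd` elsewhere); the helper name `open_fds` is made up, and the answer can already be stale by the time it returns if other threads are opening files.

```python
import os


def open_fds():
    """Best-effort sorted list of this process's open descriptors."""
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    fds = []
    for name in os.listdir(fd_dir):
        fd = int(name)
        try:
            os.fstat(fd)   # drops the descriptor listdir itself used
        except OSError:
            continue
        fds.append(fd)
    return sorted(fds)


print("highest open fd:", max(open_fds()))
```

That you need a procfs walk to answer this at all is part of the point: libraries can open descriptors the main program never hears about, and fork() hands all of them to the child.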

garaetjjte19d ago

uecker19d ago

Maybe, I am not saying fork() + exec() model couldn't be improved, but most people saying it is "terrible" and it needs to die seem to go on to propose something substantially worse.

trumpdong18d ago

Or have a syscall that runs any other syscall in a different process.

matheusmoreira19d ago

PaulDavisThe1st19d ago

Whatever elegance fork(2) has (or doesn't) have, clone(2) has more.

sanderjd19d ago· 18 in thread

171862744019d ago

But you generally want to communicate with that process, so you do need to setup e.g. file descriptors and stuff, which needs information from the parent process to be passed.

yxhuvud19d ago

Yes, you do want to pass in some stuff. But by default you get every single open file descriptor and a copy of every single stack that any threads use for execution.

It shares way too much, and there are huge use cases where that is really, really bad.

gmueckl18d ago

stefan_19d ago

Keep in mind that this is the only way to start any process. Even if you just want to launch some throwaway utility program.

jonhohle19d ago

1 more reply

sanderjd18d ago

Nevertheless, inclusion would be a better default than exclusion in most use cases I've ever had for process spawning.

7jjjjjjj19d ago

>It feels crazy that we don't have a way to directly express the latter thing

Isn't that what posix_spawn is for?

toast019d ago

posix_spawn addresses the need from userspace. Under the hood, it's still doing more or less a fork/exec, with the baggage that comes with it. A syscall would be nicer.

yxhuvud19d ago

And how do you think posix_spawn is implemented?

JdeBP19d ago

This is an oft-overlooked point. An obvious place to look for improving fork+execve is to see whether posix_spawn can be given more efficient kernel mechanisms to be based upon.

And of course that has already been done. On NetBSD, posix_spawn() is a fully-fledged system call and much of the work is done in kernel mode.

* https://blog.netbsd.org/tnf/entry/posix_spawn_syscall_added

stabbles19d ago

Isn't that covered by O_CLOEXEC?

sanderjd18d ago

I think it is error prone to need to iterate file descriptors and set this in order to inherit nothing. Excluding by default would make sense IMO.
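As an illustration of the exclude-by-default idea, Python has worked this way since 3.4 (PEP 446): descriptors it creates are marked close-on-exec, and inheritance is an explicit opt-in. A minimal sketch, with `inheritable_flags` being a hypothetical helper name:

```python
import os

def inheritable_flags():
    """Return the inheritable flag of a fresh pipe end before and after opting in."""
    r, w = os.pipe()                 # Python sets close-on-exec on both ends
    before = os.get_inheritable(w)
    os.set_inheritable(w, True)      # explicit opt-in clears close-on-exec
    after = os.get_inheritable(w)
    os.close(r)
    os.close(w)
    return before, after

if __name__ == "__main__":
    print(inheritable_flags())
```

This only covers descriptors opened through the runtime. Descriptors opened by C libraries without O_CLOEXEC still leak across exec, which is the error-prone part.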

anarazel19d ago

There's a bunch of nastiness around that too. If you have e.g. library state that assumes the fd still works you can get very confusing bugs once another file is opened into that fd number...

dnw19d ago

What do you mean by "a completely new process"?

wongarsu19d ago

The equivalent of CreateProcessW https://learn.microsoft.com/en-us/windows/win32/api/processt...

sanderjd19d ago

A process that shares nothing with the process that spawned it.

jerf19d ago

This is just an example of I don't even know how many things a modern-day process will share from its parent.

JoBrad19d ago

That’s how you get zombie processes and memory leaks.

hparadiz19d ago· 9 in thread

20198419d ago

sjmulder19d ago

Doesn't the loaded code have to be patched for relocations?

saidinesh519d ago

Typically libgcc_so.so is loaded by the linker, which uses an mmap call to map the binary into the address space.

> The kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.

Relevant stack overflow answer: https://stackoverflow.com/questions/61950951/linux-shared-li...

mlaretallack19d ago

monocasa19d ago

Those mappings by default all go to the same shared memory.

Unices have been sharing executable memory between processes longer than there's been mmap for user space to do the same thing themselves. I remember seeing it in the 2BSD kernel for instance.

BoingBoomTschak19d ago

Eh? Aren't shared libraries actually shared in memory?

171862744019d ago

Yeah, that's kind of the point.

johnthescott18d ago

shared libraries do not imply shared in ram only.

sirsinsalot19d ago

I have a rule for myself. If I think something is silly or stupid, I assume I don't understand it. I usually find I do not understand it, and it no longer seems silly when I do understand it.

In this case too, you think it is silly because you don't understand it. Your assumptions are wrong, making it seem silly.

mrkeen19d ago· 8 in thread

It's weird to leave out a mention of copy-on-write - the optimisation that means that you don't copy over all the memory.
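For anyone who wants to see the visible half of that behavior, here is a rough sketch (POSIX only, `fork_isolation` is a hypothetical name). It shows that a write in the child does not reach the parent. The lazy page copy itself happens inside the kernel and is not observable from this code.

```python
import os

def fork_isolation():
    """Return (parent's view, child's view) of a buffer the child modified after fork."""
    data = bytearray(b"parent")
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # The child's write lands in its own private copy of the memory.
            data[:] = b"child!"
            os.write(w, bytes(data))
        finally:
            os._exit(0)
    os.close(w)
    child_view = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    return bytes(data), child_view

if __name__ == "__main__":
    print(fork_isolation())
```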

tux319d ago

This was left implicit in the article, but what they mean by copying the process state here is the memory management structures. That's mainly the page tables and the VMAs.

thamer18d ago

Even back in 2012 this blog post showed the high cost of this operation: https://redis.io/blog/testing-fork-time-on-awsxen-infrastruc...

No doubt hardware is faster 14 years later, but Redis instances likely use more RAM too. It'd be interesting to see this benchmark revisited.

epcoa19d ago

> It's weird to leave out a mention of copy-on-write

For the intended audience of such a paper this is base knowledge.

FooBarWidget19d ago

mort9619d ago

It says "(including memory)". It's pretty natural to read this as "(including the contents of allocated pages)".

m00x19d ago

On modern hardware a COW page copy should only take 1-5ms. Redis forks to save the db to disk and it's been a solid design choice.

I guess it depends on how sensitive your application is to main thread pauses.

cls5919d ago

josefx18d ago

Isn't that what vfork tried to address? No COW, the child starts in its parent's address space and only gets its own after calling exec.

Sophira19d ago· 8 in thread

I'm guessing that a big part of the problem with moving away from fork() in general is that each new process needs a copy of the parent process' environment anyway, right?

zerobees19d ago

corbet19d ago

nasretdinov19d ago

sanderjd19d ago

A lot of times you actively don't want the parent environment or any of the memory or file descriptors. And then you have to actively do work to fix all that stuff up after the fork.

dijit19d ago

I'm a bit naive, but I don't think that's necessarily a requirement.

It might be commonly held convention, and thus, an assumption, in Linux (and, broadly, UNIX) but I don't think it's true inside VAX or even Windows, so I don't think it's a requirement.

Unless I've missed something (which is totally possible, this is not an area of OS design I've spent much time on).

lanstin19d ago

But also UID, groups, controlling TTY, process group, capabilities, pipes, shared memory, etc. and the file descriptors while maybe not inherently needed are how a lot of Unix plumbing works.

sjmulder19d ago

Even DOS has environment inheritance!

lokar19d ago

the environment is not that big

ajkjk19d ago· 7 in thread

spacechild118d ago

kps18d ago

Fork is conceptually simple. Without bringing in any other layers, you start a process with the one thing known to exist: yourself.

IshKebab18d ago

Attempting to justify clone/exec as a reasonable design is just Stockholm syndrome.

Too18d ago

Fill with what stuff exactly?

Back to the example from the article. It makes no sense that a git subprocess forked from a web server needs to have any process state inherited from the web server.

ajkjk18d ago

wmf19d ago

libgit2 exists. You could imagine communicating with some gitd over a pipe/socket but I don't know why that would be a good idea. Short of that you have to spawn processes.

trumpdong18d ago

On Windows maybe it would be a COM server, using IPC built into the OS. The client sees it like a local function call.

lokar19d ago· 6 in thread

This seems unnecessary to me. In the example, the core of git should be a library you can link so you don't need to run the binary. That would be better in every way.

171862744019d ago

lokar18d ago

I'm not sure what you mean by inventing a sync mechanism, all languages come with one. Same with a scheduler, either your language runtime or the OS (or both) will deal with scheduling.

omoikane19d ago

I can recall just one program that's intentionally not implemented as a library, but I think people have since built a library on top of it:

https://dechifro.org/dcraw/#:~:text=Why%20don%27t%20you%20im...

sanderjd19d ago

There are lots of reasons to want to spawn fresh processes, which aren't solved by linking a library.

lokar19d ago

Sure, but not many times a second

aerzen19d ago

Spawning processes should not be on the hot path of any program.

asveikau19d ago· 5 in thread

The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things.
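A minimal sketch of that window, assuming a POSIX system. `run_in_own_group` is a hypothetical helper. The child puts itself in a new process group and points stdout at a pipe before replacing itself with the target program.

```python
import os
import sys

def run_in_own_group(argv):
    """Fork, adjust the child's state, exec argv; return (exit code, captured stdout)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)   # child becomes leader of a new process group
            os.dup2(w, 1)      # stdout now refers to the pipe
            os.close(r)
            os.close(w)
            os.execv(argv[0], argv)
        finally:
            # Reached only if one of the calls above failed.
            os._exit(127)
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), b"".join(chunks)

if __name__ == "__main__":
    check = "import os; print(os.getpgrp() == os.getpid())"
    print(run_in_own_group([sys.executable, "-c", check]))
```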

But I think there is already posix_spawn to imitate that philosophy on Unix-like OSs.

dcrazy18d ago

asveikau18d ago

loeg18d ago

> The things you can do between fork and exec are sometimes underestimated. Off the top of my head, you can call dup2(), you can set a process group id, probably a few other things.

What do you mean underestimated? You can do anything between fork and exec; there are no limitations.

asveikau18d ago

That's not true. Just one example, if you do anything with threads you are pretty screwed. For example if another thread holds a mutex at the time of fork(2), and you also want that mutex.
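A rough sketch of the mutex case, in Python for brevity (the same hazard exists in C with pthread mutexes). `lock_state_in_child` is a made-up name. A second thread holds the lock when the main thread forks. Only the forking thread exists in the child, so the lock is still marked as held there and nobody is left to release it.

```python
import os
import threading
import warnings

def lock_state_in_child():
    """Return the child's exit code: 0 if it got the lock, 1 if the lock was stuck."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()
            release.wait()

    t = threading.Thread(target=holder)
    t.start()
    held.wait()  # make sure the other thread owns the lock before forking

    with warnings.catch_warnings():
        # Newer Python versions warn about fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the holder thread was not copied, but the locked state was.
        got_it = lock.acquire(blocking=False)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    t.join()
    return os.waitstatus_to_exitcode(status)

if __name__ == "__main__":
    print(lock_state_in_child())
```

A blocking acquire in the child would hang forever, which is why the sketch uses a non-blocking attempt.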

dcrazy18d ago

That’s not true. man 7 signal-safety

ggm19d ago· 3 in thread

Aesthetically I have no intention of moving beyond. I'm content with my kernel's scheduler and how it maps "heavyweight" processes to cores.

I do use threaded code. It's significantly harder to write and reason about. (45 years in to a CS career, ageing out)

You have to be clever to do better than clever people. Clever people bootstrapped me into fork()/exec() and I know my limits.

redleader5519d ago

skydhash19d ago

Does it need to be the same OS? Most consumer devices are in the low 16GB range for memory with some outliers in the 64 and 128 GB. 32 cores are still in the realm of specialized devices.

Yes, we’re not the ones paying for Linux development, but its subsystems are so complicated for general purpose computing. Like fitting formula 1 car parts onto a camry.

skydhash19d ago

jcalvinowens19d ago· 2 in thread

It is a weirdly common misconception that fork() is cheap... it is O(N) on the size of the process, and it always has been.

Yes, it's copy on write... but there is a linear relationship between the size of the process and the number of page table entries required to represent it.

themafia18d ago

> the number of page table entries

This is not exactly fixed since you can vary the amount of memory each page maps with things like hugepages and the same process can run with different page sizes.
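Some back-of-the-envelope arithmetic for both points. This is a simplification: it counts only leaf entries for a fully populated mapping and ignores the upper levels of the page table tree. The 4 KiB and 2 MiB page sizes are the common x86-64 values, assumed here for illustration, and `leaf_pte_count` is a hypothetical helper.

```python
def leaf_pte_count(mapped_bytes, page_size=4096):
    """Number of leaf page table entries needed to map mapped_bytes (rounded up)."""
    return -(-mapped_bytes // page_size)

GIB = 1024 ** 3
MIB = 1024 ** 2

if __name__ == "__main__":
    # An 8 GiB process with 4 KiB pages versus 2 MiB huge pages.
    print(leaf_pte_count(8 * GIB))            # 2097152
    print(leaf_pte_count(8 * GIB, 2 * MIB))   # 4096
```

The count stays linear in process size either way. Larger pages only shrink the constant, here by a factor of 512.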

IshKebab18d ago

You can in theory, but it's rare in practice because it isn't always enabled and it requires root to configure.

ComputerGuru19d ago· 2 in thread

smj-edison19d ago

It sounds like they're interested in the concept though, just not that specific implementation.

sanderjd19d ago

Yeah this seems like a promising discussion.

codedokode19d ago· 2 in thread

nyrikki18d ago

Luckily someone with a time machine saw your post and added it to POSIX.1-2001 :)

(Sorry if you weren't joking) but yes, posix_spawn() has been a thing and in glibc fork is just an alias to clone()

Not exactly the OP's idea, but fork/exec is legacy really.

MayCXC17d ago

debatem119d ago· 2 in thread

MBCook19d ago

Fork/exec is great if you actually want the traditional copy of your process for some reason.

For launching something totally new, like the example in the article of some tool calling git, I think it does make a ton of sense to make something new.

Especially since I suspect that is by far the more common case. I suspect “I want a clone of me“ is relatively rarely used at this point.

debatem118d ago

Relatively rarely, but in some performance sensitive use cases. Mine happens to be fuzzers, where a very cheap fork-like primitive would be a really big win.

LoganDark18d ago· 1 in thread

corbet17d ago

burnt-resistor19d ago· 1 in thread

> "If you are repeatedly creating large processes, you are already doing it wrong. The fix is in user space, not the kernel."

Every couple of years, someone claims they have "the solution" implying everyone else who came before them didn't know what they were doing.

yxhuvud19d ago

mike_hock19d ago· 1 in thread

The most astonishing part is that this is dated June 5th, 2026.

I.e. a year that starts with 20, not 19.

JdeBP19d ago

These discussions were definitely had back in the 20th century too. The spawn model versus the fork+execve model has been an on-going debate since the time of MS/PC/DR-DOS.

a-dub19d ago· 1 in thread

i thought this was all fixed with special modes of clone that are optimized and don't actually copy anything (ie, it creates a new deficient process that can pretty much only exec)?

zbentley17d ago

Panzerschrek19d ago

ktpsns19d ago

There is lots of discussion on this old API here on hacker news, for instance https://news.ycombinator.com/item?id=31739794

mpweiher18d ago

I've always liked the Mach approach. You've got a few primitives:

- address space

- memory objects

- threads

https://developer.apple.com/library/archive/documentation/Da...

trumpdong18d ago

medoc18d ago

Related: executing small commands from the Recoll indexer: https://www.recoll.org/pages/idxthreads/forkingRecoll.html

foo-bar-baz52918d ago

This isn’t moving beyond fork and exec at all. It’s adding a complicated API for a marginal gain for a niche use case, and ignoring the actual big bottleneck of fork

stevefan199918d ago

If fork and exec can exhibit persistent and algebraic behavior (beyond its CoW nature) that would not only be more useful but more interesting to use, for example using it for doing lazy evaluation

tus66618d ago

How can you write for LWN and not have heard of clone(CLONE_THREAD) and multithreading?

high_byte18d ago

all this for 2%
