Systemd vs. Docker

(lwn.net)

99 points · tomaac · 10y ago · 93 comments

js2 10y ago

> One of these is killing "zombie" processes that have been abandoned by their calling session.

That's funny terminology, isn't it? Killing a process usually means sending it a signal, typically TERM or KILL, that causes it to exit. But a zombie process is one that has already exited and hasn't yet been waited for by its parent, where its parent is either the process that spawned it or, if that process has died, the process with PID 1. This is usually referred to as reaping the zombie process, not killing it. AFAIK, a signal sent to a zombie process is simply ignored.
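For anyone who wants to see the difference between killing and reaping, here is a minimal sketch. It assumes Linux (it reads the process state from /proc), and the helper names `proc_state` and `demo_zombie` are made up for illustration.

```python
import os
import signal
import time


def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # The state field comes right after the parenthesised command name.
        return f.read().rsplit(")", 1)[1].split()[0]


def demo_zombie():
    pid = os.fork()
    if pid == 0:
        os._exit(0)                  # child exits immediately

    # The child has exited but nobody has waited for it yet: state "Z".
    while proc_state(pid) != "Z":
        time.sleep(0.01)
    before = proc_state(pid)

    os.kill(pid, signal.SIGKILL)     # accepted, but there is nothing left to kill
    time.sleep(0.05)
    after_kill = proc_state(pid)

    os.waitpid(pid, 0)               # reaping: this is what removes the entry
    reaped = not os.path.exists(f"/proc/{pid}")
    return before, after_kill, reaped
```

The state stays "Z" after the SIGKILL, and the process table entry only goes away once the parent calls waitpid.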

Or do the quotes around zombie imply a different meaning, such as "zombie-like"?

ChrisArgyle 10y ago

If we're being technical, the author should have written "reaping", not "killing". It's a very different process.

The use of quotes is probably an acknowledgement that the term "zombie" is not universal. For example, Linux uses "defunct" instead.

Basically, zombie processes happen when a child process exits but the parent process--the one that spawned it--doesn't reap it. [1]

[1] https://en.wikipedia.org/wiki/Zombie_process

masklinn 10y ago

No, it's a zombie in the normal sense. The killing here is not sending it a signal but reaping zombie processes (in the sense of personified death reaping souls) by waiting on them.

Things would probably be clearer if the quotes were around "killing" rather than "zombie"; mayhaps the interviewer/writer was unfamiliar with the terminology.

ibotty 10y ago

I strongly doubt that! Josh knows the terminology. It was surely just an oversight.

thwarted 10y ago

> Poettering says that PID 1 has special requirements. One of these is killing "zombie" processes that have been abandoned by their calling session. This is a real problem for Docker since the application runs as PID 1 and does not handle the zombie processes. For example, containers running the Oracle database can end up with thousands of zombie processes.

Why does Poettering keep claiming this when he's the one who submitted the patch that adds the PR_SET_CHILD_SUBREAPER prctl(2) [0] functionality?

[0] http://man7.org/linux/man-pages/man2/prctl.2.html

masklinn 10y ago

That doesn't have anything to do with Poettering's quote.

PR_SET_CHILD_SUBREAPER moves the ownership of an orphaned process to whichever process was selected rather than the default PID1, and that only works for descendants of the subreaper.

The problem pointed out by the quote is that normal software doesn't go around checking if it has zombie children and waiting on them, so in a container with random software S set as PID1 and creating subprocesses, zombies may accumulate until resources are exhausted[0].

PR_SET_CHILD_SUBREAPER is a way to cause that problem on a system with a proper init (or to test that your init works properly without needing to boot into it).

It's not a new observation: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zomb...

Previous HN discussion: https://news.ycombinator.com/item?id=8916785

[0] By default the limit is 32k processes, after which the kernel will simply refuse to create new ones.

thwarted 10y ago

Yes it does. He's claiming that systemd should manage the container processes as pid1, because systemd will then clean up the zombies. But anything that reaps zombies can be pid1 -- systemd isn't special in this regard. And even if you did use something that didn't reap zombies as pid1, you could leverage PR_SET_CHILD_SUBREAPER as some other non-pid1 process to grab zombies for descendants it spawns.

If you do use PR_SET_CHILD_SUBREAPER, then you need to reap whatever gets reparented to you; if you don't do this then the process table will eventually fill up with zombies. He is correct that few programs do that, but there's nothing that requires that to be done by pid1 if all the processes within the container are spawned by something that provides that functionality and uses PR_SET_CHILD_SUBREAPER.
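To make that concrete, here is a rough sketch of a non-pid1 process marking itself as a subreaper and then reaping whatever gets reparented to it. It assumes Linux 3.4 or later and calls prctl(2) through ctypes; the constant 36 is the value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h>, and the function names are made up for illustration.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_all():
    """Wait for every child, including adopted orphans; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.wait()
        except ChildProcessError:    # no children left
            return reaped
        reaped.append(pid)


def demo():
    become_subreaper()
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        os.close(r)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(w)
            os._exit(0)
        # Report the grandchild's PID, then exit without waiting for it,
        # which orphans it.
        os.write(w, str(grandchild).encode())
        os._exit(0)
    os.close(w)
    grandchild = int(os.read(r, 32))
    os.close(r)
    # The orphaned grandchild is reparented to us rather than to PID 1,
    # so it is now our job to wait for it.
    return child, grandchild, reap_all()
```

If `reap_all` were never called, both the child and the adopted grandchild would sit in the process table as zombies of this process.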

cpuguy83 10y ago

This is true, but what if the thing that spins up the actual container process sets this?

pas 10y ago

I guess he's saying that you can't just take any random binary and run it in a Docker container, because if that binary spawns a lot of children but does not wait for them, then you'll have a lot of zombies.

Docker could run a minimal pid1 in each container to address this. Though if this had been a big issue I guess this would have been already fixed.

Naturally, a proof of concept of the problem would be great. (Let's say a Dockerfile.)

vidarh 10y ago

It has been a reasonably big issue. E.g. I kept seeing zombies with Consul for a while until we realised that every single Consul Docker container on Dockerhub just ran Consul as pid 1 in the container (this was a while ago; no idea if that's still the case). Nobody had realised that Consul health checks could then end up as zombies if you weren't very careful about how you wrote them (typical example: spawning curl from a shell script, with a timeout on the health check that was shorter than any timeouts on the curl requests).

It's usually fairly simple to fix. For Consul, I raised it with the Consul guys and they said they'd look at adding waiting on children to it as a precaution (it's just a couple of lines). People building containers could also introduce a minimal init, or you can write your health checks to guard against it. But it happens all over the place, and people are often unaware of it, so they're not on the lookout for it and it may not be immediately obvious.

The reason I raised it as an issue for Consul, even though it wasn't really their fault but an issue with the containers, is that people need to be aware of the problem when packaging the containers: they need to be aware that a given application may spawn children, and that it may not wait for them. Even a lot of people aware of the zombie issue end up packaging software that they didn't realise was spawning child processes that could end up as zombies (in this case, it took running it in a container without a proper pid 1, using health checks, which not everyone will do, and writing the health checks in a particular way in order to notice the effects).

Thankfully there are a number of tiny little inits. E.g. there's suckless sinit [1], Tini [2], and here's a tiny little proof of concept Go init [3] I wrote (though frankly, suckless or Tini compiled with musl will give you a much smaller binary), as what little you actually need to do is very trivial.

[1] http://git.suckless.org/sinit

[2] https://github.com/krallin/tini

[3] https://gist.github.com/vidarh/91a110792c86d6c3bb41
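To give an idea of how little a minimal init has to do, here is a rough sketch in Python. It is not how sinit, Tini, or the Go gist above are implemented; `mini_init` is a made-up name, and the sketch deliberately leaves out signal forwarding, which a real init would also want.

```python
import os


def mini_init(argv):
    """Run argv as the main child, reap everything, return its exit code."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)            # exec failed

    while True:
        # Blocks until any child changes state. When running as pid 1 this
        # also collects orphans that the kernel has reparented to us.
        pid, status = os.wait()
        if pid == main_pid:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)   # shell-style code for signals
```

Something like `sys.exit(mini_init(sys.argv[1:]))` as the container entrypoint would then put the real application at pid 2 or later, with this loop doing the waiting.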

js2 10y ago

Just to clarify: even with a proper init, if a process spawns children and doesn't wait on them, you still have zombies until the parent either dies (allowing init to inherit the zombie, at which point it waits on it), or the parent waits. This is the reason behind the double-fork trick.
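For reference, a minimal sketch of the double-fork trick (the name `spawn_detached` is made up, and the pipe is only there so the caller can learn the grandchild's PID):

```python
import os


def spawn_detached(argv):
    """Start argv via a double fork so it never becomes our zombie."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(w)
            try:
                os.execvp(argv[0], argv)
            except OSError:
                os._exit(127)        # exec failed
        # Intermediate child: report the grandchild's PID and exit at once,
        # which orphans the grandchild.
        os.write(w, str(grandchild).encode())
        os._exit(0)
    os.close(w)
    grandchild = int(os.read(r, 32))
    os.close(r)
    os.waitpid(pid, 0)               # reap the short-lived intermediate child
    return grandchild
```

The parent only ever waits for the intermediate child, which exits immediately. The orphaned grandchild is inherited by init (or by a subreaper, if one is set), which is then responsible for reaping it, so this only helps when that process actually waits.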

drothlis 10y ago

See also the article linked earlier in the comments: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zomb...

atemerev 10y ago

Supervisord is the officially blessed solution:

https://docs.docker.com/engine/admin/using_supervisord/

storrgie 10y ago

why not systemd-nspawn (zoidberg voice)

Seems like the way Fedora is packaging systemd for 24 is going to move systemd-nspawn to a level of maturity that will likely surpass some of the clunky issues folks have with running docker.

pmoriarty 10y ago

CoreOS's Rocket is built around systemd?

That alone disqualifies it for me right there.

philips 10y ago

rkt isn't built around systemd. It does use it internally and integrates well with it.

Inside of rkt there is an internal logical separation between the tool that sets up the container filesystems and the one that executes them. We call those things stages[1].

Now inside of rkt we have a few different "stage1" options today:

- systemd: this means that your container has a real init system

- clear containers: execute the container inside of a virtual machine with lkvm.[2]

- direct execution w/ fly: no init system is involved for special privileged containers.[3]

If someone wanted to contribute a stage1 that used a different init system that would be great. But, today systemd works fine and is generally an implementation detail. We also get some bonuses by using systemd on systemd systems like machinectl integration, and journald integration.

Also, I should note that rkt should work on non-systemd systems as well. Again, because systemd is an internal detail.

[1] https://coreos.com/rkt/docs/latest/devel/architecture.html#s...

[2] https://coreos.com/blog/rkt-0.8-with-new-vm-support/

[3] https://coreos.com/blog/rkt-0.15.0-introduces-rkt-fly.html

bryanlarsen 10y ago

Why the systemd hate? Because it's a big monolithic project that takes over your system? You do realize that Docker is much more monolithic and opinionated than systemd, right?

kordless 10y ago

When you ask questions, especially leading ones, it causes a good deal of confusion around the topic at hand. The reasons behind this are complex, but they have something to do with our tendency to double bind each other.

Someone has the right to say why something is "disqualified" for them, even if it is devoid of context. What is awesome here is that the leading expert for this topic is replying directly to the negative (empty) opinion and actually presents a (rich) alternate opinion.

How does you asking unanswerable questions contribute to resolving the conversation to something we can all learn from?

qwertyuiop924 10y ago

Yes, but I will never have to use docker if I don't want to. For that matter, docker doesn't try to be cron, it doesn't want to handle mounting, didn't subsume udev, and doesn't encourage other projects to link against it and to drop all compatibility with non-linux systems. Systemd does, did, and is doing that right now.

agentgt 10y ago

I really hope unikernels take off because I really hate dealing with both (particularly docker more so than systemd).

atemerev 10y ago

Opinionated, yes. Monolithic, no. Huge mess of everything that deeply integrates in any system — of course not, your containers don't need to know anything about Docker and host system, you are absolutely free in choices. It's even possible to run (gasp!) multiple services with supervision inside Docker.

baldfat 10y ago

> That alone disqualifies it for me right there

For philosophy reasons? Can people just not accept that systemd is the main solution that the community has accepted and move along?

jimktrains2 10y ago

Or they can move to one of the BSDs and use jails, which are much more stable, secure, and tested than Linux containers.

michaelmrose 10y ago

The concept of the community is an abstraction and in this case a bad one. There is no community. There are a million different individuals and within that thousands of communities each composed of some subset of those individuals.

There is no reason each subset, or even each individual, shouldn't have their own opinion and base their actions upon it.

atemerev 10y ago

I don't always run containerized applications, but when I do, I prefer them completely systemd-free, thank you.

Sometimes I wonder if systemd is actually part of a big plan of moving everyone to microservices and containers and maybe even unikernels: anything, just anything without this abomination.

bryanlarsen 10y ago

Can you explain your position to me? I can understand somebody who dislikes systemd and dislikes docker. I can understand somebody who likes both systemd and docker. But disliking systemd but liking docker? That I don't understand. Any effective criticism of systemd that I've heard generally can also be applied to docker.

Like yours: "I wonder if systemd is actually part of a big plan of moving everyone to microservices and containers and maybe even unikernels" works even better if you replace systemd with docker.

atemerev 10y ago

Docker is just a toolkit for composing and networking layered OS images. It improves isolation of things and adheres to simple principles (immutable containers, restarting instead of attempting to recover, etc.). It structures things better. Inter-container communication is deliberately simple (env variables and, recently, networking).

Systemd spits on isolation, it embraces integration of everything. Supervision, logging, communication, IO, configuration, state management — everything goes through systemd. Everything is binary and opaque. Docker is transparent.
