So, in actuality, I think your assertion just taught us all something, because despite knowing that both the OOM killer and the Magic SysRq key[1] exist, I didn't know you could configure this as an input!
systemd-run --user --scope --unit=ff-$$.scope \
-p MemoryMax=4G -p MemoryHigh=3G \
-p MemorySwapMax=0 \
firejail firefox "$@"
cgroups v1 has a pretty nice API but it requires root. V2 does not require root but it’s a lot coarser and not as simple or reliable: https://unix.stackexchange.com/questions/753929/receive-a-me...
There surely is something absurd about having to register specific processes as exempt from the OOM killer. But given that the OOM killer exists, and could kill xlock...how should that be fixed?
The right way for this to work is for the X server to have an extension that lets a screen locker say "hey, I'm locking the screen now", and the X server should respond to that by pretending that the screen locker client is the only client that exists: no other client gets input or gets to draw. And if the screen locker crashes (or is killed), the X server should just put itself into a permanently-locked state where it will never again send any input to anything, and won't ever draw anything except a blank screen. That's not a desirable situation, of course, but it's better than unlocking the screen.
NT: Yes? Why not?
(note that this refers to the Windows NT kernel's operation, because it historically had a POSIX emulation layer (NT Personalities), not the modern WSL, which is just Linux in a Hyper-V VM)
No, you just account for it (commit the charge) in the bookkeeping. If a 1GB process forks, you decrement the amount of free memory by 1GB to ensure other processes don't overcommit such that you won't have 1GB of free memory if and when you actually needed to allocate that memory. If the forked process immediately exits, you just bump the free memory counter back up. This is what Solaris and Windows do.
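The bookkeeping described above can be sketched as a toy model. This is only an illustration in shell arithmetic, not how any kernel implements it: the 4096 MiB starting limit and the `charge` and `uncharge` names are assumptions made up for the example.

```shell
# Toy model of strict commit accounting (no overcommit).
# All numbers are in MiB and purely illustrative.

free_commit=4096   # hypothetical amount of memory still available to promise

# charge N: reserve N MiB up front, or fail the way fork()/malloc()
# would fail with ENOMEM under strict accounting.
charge() {
    if [ "$1" -gt "$free_commit" ]; then
        return 1
    fi
    free_commit=$((free_commit - $1))
}

# uncharge N: return N MiB to the pool, e.g. when the forked child exits.
uncharge() {
    free_commit=$((free_commit + $1))
}
```

In this model, `charge 1024` for a forking 1 GiB process succeeds and is refunded by `uncharge 1024` when the child exits, while a request larger than the remaining commit fails immediately instead of leading to a kill later.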
But precise accounting of memory is difficult if you didn't design for it in the first place. For example, you have to figure in the memory needed for page structures. (Though I think Linux can do that in particular, bugs notwithstanding.) Last time I checked (5+ years ago) Linux was incapable of such precise accounting across the board, so even if you disabled overcommit the kernel could still find itself in an OOM situation when the time comes to allocate memory it already promised or perform an operation it implicitly or explicitly guaranteed it could complete.
The expectation that Linux overcommits meant many Linux kernel developers didn't design subsystems in a way that the kernel as a whole could provide reliable, guaranteed, precise memory accounting. For example, some filesystems rely on being able to use the OOM killer to free up memory needed for an operation that they can't back out of once it starts, because they weren't written in a way that could either predetermine or bound their memory requirements, or cleanly back out of an operation they started.
To be fair I'm not sure any of the BSDs can do it either, at least when it comes to fork and CoW. IIRC, nor can macOS, though it will dynamically add swap so you won't get an OOM kill until you run out of disk space.
I don't think Linux was plausibly going to remove the OOM killer in 2004 or later. So the right solution for Linux is very much to tweak it to be less painful.
Nothing like statically allocating memory can work when overcommit is enabled, because the kernel is free to compress memory, page it out, etc., and then murder you the next time you try to perform any operation that it doesn't have the space for, no matter how safe and static your initialization was.
Note that overcommit is very useful in many cases including the ones where swap saves the stability of the system under conditions that would otherwise completely lock up or panic, so it's also not viable to just prevent it from being used.
This doesn't save you if someone else allocates and the OOM killer chooses you as its victim.
It's a funny reply. But what was not funny was the OOM killer killing my screen locker.
Joke all you want, but 22 years later I still stand by my view that I'd rather get a kernel panic than have the screen lock killed.
These days you can do OOM score adjusting, which is not as strong as a pardon. I may be taking too much credit, and may misremember the timeline, but I feel like someone took my crappy kernel patch, went "fine, I'll do it the right way", and merged OOM score adjusting maybe a year or so later.
Here's an LWN article about it, too: https://lwn.net/Articles/104179/
Writing -1000 to /proc/<pid>/oom_score_adj will cause the OOM killer not to consider the process at all :)
From the man page proc_pid_oom_score_adj(5):
> The value of oom_score_adj is added to the badness score before it is used to determine which task to kill. Acceptable values range from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). [...]. The lowest possible value, -1000, is equivalent to disabling OOM-killing entirely for that task, since it will always report a badness score of 0.
For example, KDE: https://preview.redd.it/plasma-lock-screen-messed-up-v0-zx7h...
GNOME: https://forums.freebsd.org/attachments/index-jpeg.8571/
I think this only works because there is top-down integration between the different parts. The compositor knows when it's supposed to be locked. Whereas the old screen lockers were just very aggressive Xorg apps that suffer from "What if two programs did this?" problems (https://devblogs.microsoft.com/oldnewthing/20110310-00/?p=11...)
An argument can be made that the kernel should not cover for architectural missteps of the X server, and that the X server should be the one to crash when its security-critical component is killed for whatever reason.
Also there are other safety and security critical reasons why you'd want to exempt some processes.
Arguably (and it definitely has been argued) the real architectural misstep is the Linux kernel overcommitting by default in the first place.
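For reference, Linux exposes its overcommit policy through the `vm.overcommit_memory` sysctl, readable at `/proc/sys/vm/overcommit_memory`. A minimal sketch for decoding that value follows; the `describe_overcommit_mode` helper is a hypothetical name, and the descriptions are paraphrases rather than the kernel's own wording.

```shell
# describe_overcommit_mode MODE: print a short description of a
# vm.overcommit_memory value (0, 1 or 2). Hypothetical helper.
describe_overcommit_mode() {
    case "$1" in
        0) echo "heuristic overcommit (default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, no overcommit" ;;
        *) echo "unknown mode: $1"; return 1 ;;
    esac
}

# Usage on a Linux system:
#   describe_overcommit_mode "$(cat /proc/sys/vm/overcommit_memory)"
```

Mode 2 is the setting usually meant by "disabling overcommit" in this discussion.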
- no system swap
- enough memory for core system services set aside in a cgroup for them to use
- by default, all prod service binaries load all code pages into ram at start, and lock them in (no paging out code pages at runtime)
- if needed (rare) services can mount some swap in their own cgroup, but very much discouraged
You need to know how much ram you are going to use, and actually stick to that. Very little is wasted in practice, and you don't have to deal with OOMs all the time. Everything is much more predictable.
It's a nice approach particularly because all OOMs become actionable: there's a bug in a service or a limit is wrong or traffic is changing in an unexpected way.
Systems built this way end up being extremely reliable in my experience.
It's an uphill battle both ways though and not everyone is up for that experience.
echo "-1000" > /proc/<pid>/oom_score_adj
to disable OOM killing for a process.
https://github.com/torvalds/linux/blob/master/include/uapi/l...
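A slightly more defensive version of that one-liner might check the value against the documented range of -1000 to +1000 before writing it. This is a sketch only: `valid_oom_score_adj` and `set_oom_score_adj` are hypothetical helper names, and lowering the value generally requires privileges (CAP_SYS_RESOURCE), so the write may fail for an unprivileged user.

```shell
# valid_oom_score_adj VALUE: succeed only if VALUE is an integer
# in the documented range -1000..1000.
valid_oom_score_adj() {
    case "$1" in
        # Reject empty input, a lone "-", any non-digit/non-minus
        # character, and a "-" anywhere except the first position.
        ''|-|*[!0-9-]*|?*-*) return 1 ;;
    esac
    [ "$1" -ge -1000 ] && [ "$1" -le 1000 ]
}

# set_oom_score_adj PID VALUE: validate VALUE, then write it for PID.
set_oom_score_adj() {
    valid_oom_score_adj "$2" || { echo "invalid oom_score_adj: $2" >&2; return 2; }
    printf '%s\n' "$2" > "/proc/$1/oom_score_adj"
}

# Usage (as root, with a hypothetical PID variable):
#   set_oom_score_adj "$LOCKER_PID" -1000
```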
A passenger buying a ticket is malloc(), but passengers don't always utilize the seat (use the memory). Normally this works out fine, but occasionally, there are too many passengers. Thankfully though instead of executing a couple passengers they give you a voucher.
In this worldview, malloc is like me buying a plane ticket at the counter for a specific flight that's going to leave soon. I'd be really annoyed if I were bumped off a flight I just paid for; I would rather have been told "that flight is full, try again later" (malloc returns NULL). This is, for example, what Windows does. Under memory pressure, it'll say to applications, "hey, no, I'm not in a giving mood for memory right now" (and will sometimes bump the size of the pagefile if configured to do this, but only up to a point).
The thought behind this is that well... applications have to handle malloc returning NULL anyway. Whether that's calling abort and giving up is one matter, another might be to retry the allocation at a later time (maybe after Windows has bumped the pagefile size), another might be to handle an error using some preallocated buffer or whatever.
"The protect command is used to mark processes as protected. The kernel does not kill protected processes when swap space is exhausted. [...] If you protect a runaway process that allocates all memory the system will deadlock."
[1] https://man.freebsd.org/cgi/man.cgi?query=protect&apropos=0&...