Wouldn't it make more sense to just have an 'apt reinstall all --with-frame-pointers' command that power users could run before they wanted to profile something?
Helping people find ~30-3000% perf wins, helping debugging and automated bug reports, is huge. For some sites it may be like 300 steps forward, one step back. But it's also not the end of the road here. Might we go back to frame pointer omission by default one day if some other emerging stack walkers work well in the future for all use cases? It's a lot of ifs and many years away, and assumes a lot of engineering work continues to be invested for recovering a gain that's usually less than 1%, but anything's possible.
There's a couple of problems with an apt reinstall. One is that people often don't work on performance until the system is melting down -- many times I've been handed an issue where apt is dreadfully slow due to the system's performance issue and just installing a single package can take several minutes -- imagine reinstalling everything, it could turn the outage into over an hour! The other is I'd worry that reinstalling everything introduces so many changes (updating library versions) that the problem could change and you'd have no idea which package update changed it. If there was such an apt reinstall command, I know of large sites (with experience with frame pointer overheads) that would run it and then build their BaseAMI so that it was the default. Which is what Ubuntu is doing anyway.
Right from the article. I find it a difficult subject, as a developer/poweruser I am happy to see framepointers. But I can not speak for others.
> Wouldn't it make more sense to just have an 'apt reinstall all --with-frame-pointers' command that power users could run before they wanted to profile something?
I don't see why it makes any more sense than just changing the default that the distribution uses. For one, it's way more work: maintaining another copy of everything for a ~1% performance difference is not an obviously good tradeoff for the distro teams to make. Not to mention it often isn't possible to do this in the cases where people want it, i.e. they want to continuously profile an existing production system that they can't just run apt on willy-nilly.
> I’ve enabled frame pointers at huge scale for Java and glibc and studied the CPU overhead for this change, which is typically less than 1% and usually so close to zero that it is hard to measure.
In 2023 it's not a 1-2% performance penalty anymore and certainly not for most use cases. Only if the 15th register is critical for performance on an x86_64 CPU.
Certain workloads might suffer more, but most will certainly suffer less than a 1-2% hit.
"can make use of this improved debugging information to diagnose and target performance issues that are orders of magnitude more impactful than the 1-2% upfront cost."
Also, can't you get reliable stack dumps when something goes wrong too?
Removing needless instructions and register pressure that 99.99% of users don't rely on in any way, and that the rest wouldn't need if they fixed their tools, is not premature optimization but simple common sense. Which is why it's on by default in the first place.
Calling 1% or even 0.1% optimizations that apply across the board "premature optimizing" is a great example of the culture of wastefulness that has made computers less responsive even though hardware has gotten a million times faster. These things do add up.
If so, I'm all for it. The win from easy access to profiling can dwarf this 1-2%
The AMD64 architecture fixed the underlying problem, so this is pretty much just a holdover. I'm surprised they even enabled it by default.
However, even under zero register pressure having a frame pointer is still an extra register that needs to be touched on every function invocation, extra instructions taking space in the I-cache, etc. It's a small thing, but it's still a cost that has to be paid by all compiled code.
I'm not going to claim that re-enabling frame pointers was the wrong choice -- the people involved in the debate know the tradeoffs and I have to start with the assumption that I would have made the same decision if I were in their position. It does make me slightly sad, though. The idea behind removing frame pointers isn't that backtraces aren't important, it's that computing the frame pointer after the fact is possible -- i.e. for normal functions without alloca() or dynamically sized stack arrays, you can map %rip -> frame size.
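To make the "%rip -> frame size" idea concrete, here is a minimal sketch in Python over a simulated stack. Everything in it is an assumption for illustration: the addresses, the `UNWIND_TABLE` contents, and the one-size-per-function simplification (real unwind tables vary per instruction and also record function end addresses).

```python
import bisect

# Hypothetical unwind table: (function start address, bytes of stack the
# function holds below its return address). Illustrative values only.
UNWIND_TABLE = [
    (0x1000, 0),    # "main"
    (0x2000, 16),   # "parse"
    (0x3000, 32),   # "hash"
]
STARTS = [start for start, _ in UNWIND_TABLE]
WORD = 8  # size of a return address on x86-64


def frame_size(rip):
    """Return the frame size of the function containing rip, or None."""
    i = bisect.bisect_right(STARTS, rip) - 1
    return UNWIND_TABLE[i][1] if i >= 0 else None


def unwind(rip, rsp, stack):
    """Walk a simulated stack (address -> word) without a frame pointer."""
    trace = []
    while rip != 0:
        size = frame_size(rip)
        if size is None:               # no table entry: cannot continue
            break
        trace.append(rip)
        ret_slot = rsp + size          # return address sits above the locals
        rip = stack.get(ret_slot, 0)   # 0 marks the outermost frame
        rsp = ret_slot + WORD          # pop the return address as well
    return trace


# "hash" was called from "parse", which was called from "main".
stack = {0x7020: 0x2040, 0x7038: 0x1008, 0x7040: 0}
trace = unwind(0x3010, 0x7000, stack)
print([hex(addr) for addr in trace])
```

The point of the sketch is that no register is reserved: the frame size comes from a lookup keyed on the current instruction address, which is cheap for the compiled code but pushes the work onto whichever tool does the unwinding.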
The problem seems to be that, despite years of experience with "no-frame-pointer" being the default, the profiling tools never got as reliable or as good as the with-frame-pointer variants. My personal hope was that the problem would fade over time as tools improved, but it seems that's unlikely to ever happen. After all, once no-frame-pointer stops being the default there won't be any pressure for tools to improve. The towel has been thrown in.
[0] Intel APX is extremely confusable with the iAPX 432, a failed non-x86 architecture Intel made that's completely unrelated to doubling the size of the x64 register file.
[1] https://www.intel.com/content/www/us/en/developer/articles/t...
How's that? You'd still need the debug symbols.
Also has anyone else noticed that running stuff through Valgrind is really only possible if the program was made with Valgrind in mind? For example, Python and its many extensions generate numerous errors and warnings, so many that any real problem becomes hidden.
I'd say that modern Linux systems are very far from being debuggable.
My tip is to run the thing once in one program execution, then run it twice within a single program execution, and diff the reports.
[1] https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
Yeah it's more like "...will stop omitting frame pointers by default".
The first effect is that it makes one additional general-purpose integer register unavailable for use for code. x86-64 has 16 general-purpose registers, but one of these is the stack pointer and basically can't be used for any other purpose; this would add a second reserved register for the frame pointer. This effect may cause slowdowns if the 15th register was critical for performance.
The second effect is on the ability to identify (and potentially unwind) the stack trace. With frame pointers, the pseudocode for computing a stack trace is essentially:
    do
        load return address, previous frame pointer from current frame pointer
        print return address
        move previous frame pointer into current frame pointer
    until current frame pointer is invalid
Without frame pointers, the way you have to do this procedure is:
    while current address has corresponding entry in unwind table:
        parse unwind table entry to find a program to run
        run this program on the current frame to generate return address
        print return address
        move return address to current address
It turns out that there is a full Turing-complete program described in the unwind tables to be able to generate a return address. This makes unwinding quite expensive, and also can create lots of security headaches if you want to do something like unwind in the kernel (since the unwind table is arbitrary user code!). It can also be pretty unreliable at times, especially in cases where your program crashed due to stack smashing, so that you have to expect the data to have been randomly overwritten with garbage and thus to be horrifically inaccurate.
Like, even for developers, I assume a random web development shop using Ubuntu and hosting stuff on Ubuntu would likely not ever attempt debugging a binary executable, and likely doesn't have any employees who could do it if they wanted. Of course there are companies who can and do debugging and profiling of binaries running on their servers, but IMHO those who are capable and willing to do that are a relatively small minority of users of Ubuntu systems.
https://www.brendangregg.com/flamegraphs.html
With systems like Phlare/Pyroscope SRE can monitor application performance in a very granular way in realtime.
https://grafana.com/blog/2023/03/15/pyroscope-grafana-phlare...
Stack frames are basically a (single) linked list of information about the call stack. Every frame corresponds to one function call, and says where the local variables are stored, and where the function should return after it has finished.
The head of this list is stored in a register (a scarce resource, superfast memory). So to use frame pointers, you have to spend one register, and also every function has to do some work to maintain the linked list: two instructions' worth of work when the function is entered, and one when it exits.
The alternative to doing this explicitly is to keep track of it all implicitly which is faster but a bit more complex.
[1] https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
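The linked-list view of stack frames described above can be sketched in a few lines of Python. This is a simulation, not real unwinder code: `memory` is a hypothetical dictionary standing in for the stack, and the layout (saved frame pointer at `[fp]`, return address at `[fp + 8]`) follows the usual x86-64 convention.

```python
WORD = 8  # bytes per stack slot on x86-64


def walk_frame_pointers(fp, memory, max_depth=64):
    """Follow the saved-frame-pointer chain and collect return addresses.

    memory maps address -> 8-byte word. Each frame is laid out as
    [fp] = caller's frame pointer, [fp + 8] = return address.
    """
    trace = []
    while fp in memory and fp + WORD in memory and len(trace) < max_depth:
        trace.append(memory[fp + WORD])
        next_fp = memory[fp]
        if next_fp <= fp:   # the stack grows down, so callers live higher
            break           # 0, a cycle, or garbage ends the walk
        fp = next_fp
    return trace


# Three simulated frames; the outermost one has a saved frame pointer of 0.
memory = {
    0x7000: 0x7020, 0x7008: 0x2040,   # innermost frame
    0x7020: 0x7050, 0x7028: 0x1008,   # its caller
    0x7050: 0x0000, 0x7058: 0x0800,   # outermost frame
}
print([hex(addr) for addr in walk_frame_pointers(0x7000, memory)])
```

The sanity check on `next_fp` is the kind of cheap validation that makes this walk tolerable on a corrupted stack: the loop stops instead of spinning when the chain stops moving toward higher addresses.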
Consider that you're a big company deploying software to hundreds or thousands of machines, and you hit a difficult-to-diagnose performance issue, crash, etc. You'll very much appreciate if the OS has made it easier for you to debug things.
Put another way, Fedora users/developers might appreciate having frame pointers because they have to debug more frequently, but RHEL/LTS release users might appreciate frame pointers because on the less-frequent occasion when they need to debug, the stakes are much higher.
Can't say it's been useful here, any development I do is miles away from this. It hasn't hurt either so... cool, I guess
Call me pessimistic, but I'm not convinced this being the default will lead to more profiling. There's plenty that could be done without this, that isn't, so I'm not buying it.
Profiling is one part, but the debuggability this enables is going to be huge in the long term I predict.
Including frame pointers should only have a performance effect for sample-based profiling, which does a very large number of backtraces. And the general fact is - people don't profile, and if they do, they don't do it correctly.
Omitting frame pointers has significant performance wins on platforms with about 6 registers, like 32-bit Intel x86. It's much less of a win on platforms with about 14 registers, like 64-bit x86 or 32-bit ARM, let alone platforms with about 30 registers, like 64-bit ARM.
Since modern architectures are strongly trending toward designs that support more registers, a frame pointer isn't unreasonable to choose. But that's still no excuse for all the shitty software that refuses to work correctly without them, rather than merely more slowly.
(Note that theoretically it is possible to design an ISA/ABI combo that supports easy and fast unwinding even without frame pointers, but there's always going to be some overhead and to my knowledge this choice hasn't been made.)
That said, no matter how we spin it, frame pointer unwinding is always going to be cheaper, and while profiling is getting better, I think I'm almost more excited about the other aspects of debuggability this is gaining: out-of-the-box working bpftrace, bcc-tools, and anything else that needs to deal with unwinding, with just about anything that's running on the box. I think we'll see a huge gain in capabilities over the next few years with frame pointers more prevalent in Fedora and Ubuntu, and I'm sure more will now follow.
[1] https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
There are some structural issues in the Linux world, too; the default of debug data contained within the binary is often undesirable, symbol servers (they finally learned about those in Ubuntu 22) require extra setup & tooling support that isn't often invested in, widely used libraries like libunwind are both arcane and terrible (yes, an instruction pointer of 0 will not have associated unwind information; use your brain and realize someone called a NULL function pointer).
Cargo culting culture in embedded devices of just using some old toolchain copy pasted from some vendor seems to be pervasive. I cut fresh compilers and align their output with the old crusty toolchains. Made entire classes of issues go away. Regardless, agreed the omission of the frame pointer was merely one issue at play with those particular core dumps on those particular devices, years back at this point. :)
My guess is that on average the potential performance discovered with the techniques this enables is higher than the guaranteed negligible performance loss.
[1] https://research.google/pubs/google-wide-profiling-a-continu...
This is one of the downsides of using C/C++ rather than a modern programming language like Rust or Zig. In the former case, the system maintainers reach across the table and change the settings, despite what the actual application developer has chosen. In the latter, the upstream developers' choices are respected more, mainly because the tooling is less standardized.
Shoutouts to this NixOS bug which is still ongoing after causing much pain for many years: https://github.com/NixOS/nixpkgs/issues/18995
For one thing, you might be the upstream developer of a library the program links against, and your stack frames might be hidden under frames from the program that didn’t save the frame pointer. Or if it’s a library that isn’t saving the frame pointer, vice versa.
But the more interesting use case is if you’re not the upstream developer of anything, but a skilled user who wants to get a view of your entire system and diagnose problems yourself. Personally I do that all the time, both in my spare time and at work. With respect to performance in particular I have more experience with the macOS tooling than the Linux tooling, but it’s analogous – with tools like Instruments and dtrace I can get a profile of any process I want or of the entire system, and I find that incredibly valuable. And that’s made possible in part by the stock macOS toolchain enabling frame pointers by default.
The NixOS case you linked sounds rather annoying, but turning on optimizations and -Werror and PIC by default is very different from just enabling frame pointers by default.
Not clearly written in the original article is that "many" (I'm not sure what actual percentage, but vaguely "most") packages already have frame pointers enabled.
The problem packages that don't are exactly all of those upstream projects that intentionally compile with -fomit-frame-pointer because of those small performance gains. And those are also usually the exact same projects you end up wanting to profile or otherwise analyse :)
I work in the Support organisation at Canonical and the two most frequent projects I run into this with are Ceph and Open vSwitch - they both compile with -fomit-frame-pointer by default upstream (and currently in the Ubuntu packages), which makes using perf (which I often need to do with both of those) a pain.
While you can do it, you have to record an extra ~8kB of stack for every sample (times 1000 per second, times the number of CPUs...) and then unwind it later with the DWARF debug symbols. The resulting perf exports are multiple gigabytes for 0-2 minutes, compared to maybe 25-200MB for frame-pointer-enabled cases, which I can usually capture easily.
The problem is the product or upstream project wants to claim the absolute best performance, even 1% better, but the end user rarely needs that last 1%, and both they and their support team would like to be able to easily use profiling in production to fix the random, much more significant 10-100% performance bugs that inevitably crop up when actually using it and not benchmarking it :)
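As a rough back-of-the-envelope check on those sizes, the sketch below multiplies out the per-sample cost. The 16 CPUs, the 60-second capture, and the 32-frame average call depth are assumptions picked for illustration, and the result is an upper bound that treats every CPU as busy for the whole capture.

```python
def capture_bytes(bytes_per_sample, hz, cpus, seconds):
    """Upper bound on raw sample payload, assuming every CPU is busy."""
    return bytes_per_sample * hz * cpus * seconds


HZ = 1000      # samples per second per CPU, as in the comment above
CPUS = 16      # assumed machine size
SECONDS = 60   # assumed capture length

# DWARF mode copies ~8kB of user stack into every sample.
dwarf = capture_bytes(8 * 1024, HZ, CPUS, SECONDS)

# Frame-pointer mode stores one 8-byte address per frame;
# 32 frames per sample is an assumed average call depth.
frame_pointer = capture_bytes(32 * 8, HZ, CPUS, SECONDS)

print(f"DWARF stack copies:       {dwarf / 1e9:.1f} GB")
print(f"frame-pointer callchains: {frame_pointer / 1e6:.0f} MB")
print(f"ratio: {dwarf // frame_pointer}x")
```

Under these assumptions the estimate lands in the same ballpark as the figures quoted above: gigabytes for stack copies versus hundreds of megabytes at most for frame-pointer callchains.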
For the long term, the more exciting option that's emerging is SFrame[1]. This is a new data section which would be generated by the compiler and contains unwind tables which the kernel will be able to understand. Unlike DWARF/.eh_frame, these tables would remain in the final binary (i.e. not be stripped away), and on exec(), the kernel would store them for use during profiling. Since the format is quite similar to ORC(*), and Steven Rostedt is quite invested in the format, it seems a safe bet that support will land in the kernel.
My hope isn't necessarily that a distribution completely disables frame pointers once this format becomes available... though it could be an interesting thing to try. Rather, there can be a conscious choice about whether frame pointers are used, or SFrame, which would be useful for cases like Python, where it's mentioned that frame pointers may still have a significant performance impact. The kernel should be able to fall back to frame pointers when SFrame is unavailable, which means that either will be acceptable. Ideally, in a few years' time we'll be able to go back to forgetting about frame pointers for most cases :)
---
* Ironically, the kernel itself tends not to use frame pointers! It has its own unwind format called ORC, which gets generated by an in-kernel program called "objtool" which essentially reverse engineers the assembly generated by the compiler. It's x86_64-specific and frequently needs adjustment when the compiler changes code generation. It can't be used for userspace programs.
** it also knows how to unwind kernel stacks with ORC (see above)
*** There is an option to allow perf to unwind with DWARF, but it's a total hack (though a very effective one). By passing --call-graph=dwarf, you can instruct the kernel to copy the userspace stack (by default, 8k bytes!) into the perf event buffer with each sample (this can be as many as 100 or 1000 samples per second, per CPU...). Later, the perf userspace program will use that info, along with information about each process's address space, and the debuginfo for each program, to unwind the stacks. This has huge performance overhead, and it requires that you have easy access to debuginfo, which may not be the case, especially for container workloads.
In the same spirit, it seems that the .eh_frame -> BPF unwind table process could be (relatively) easily modified to produce SFrame, which you could attach to the binaries if you have a trustworthy way of doing that (which is... a big if). So that once SFrame support becomes available in the kernel, you could apply it to applications without rebuilding them.
Remove snap (or choose an Ubuntu distro variant like Pop OS without snap)
I believe Firefox now points to its snap variant, which I discovered when it broke a bunch of my browser extensions. Switching to the official Mozilla PPA was easy enough, but left a bad taste; if Canonical continues down the route of silently nudging users onto snap, I’ll probably switch to Debian.
(I have no particular opinions about snap itself, other than that it seems poorly documented and doesn’t adhere to the “do what I say” philosophy when it’s secretly injected into apt.)
What do the Coding Guidelines listed in e.g. awesome-safety-critical say about Frame pointers? https://awesome-safety-critical.readthedocs.io/en/latest/#co...
(Edit)
/? "cert" "frame pointer" https://www.google.com/search?q=%22cert%22+%22frame+pointer%... :
- Stack buffer overflow > Exploiting stack buffer overflows: https://en.m.wikipedia.org/wiki/Stack_buffer_overflow :
> In figure C above, when an argument larger than 11 bytes is supplied on the command line foo() overwrites local stack data, the saved frame pointer, and most importantly, the return address
What about the Top 25?
/? site:cwe.mitre.org "frame pointer" https://www.google.com/search?q=site%3Acwe.mitre.org+%22fram... :
- CWE-121: Stack-based Buffer Overflow https://cwe.mitre.org/data/definitions/121.html
This is closer to a better approach for security, debuggability, and performance IMHO:
https://news.ycombinator.com/item?id=38138010 :
> gdb on Fedora auto-installs signed debuginfo packages with debug symbols; Fedora hosts a debuginfod server for their packages (which are built by Koji) and sets `DEBUGINFOD_URLS=`
> Without debug symbols, a debugger has to read unlabeled ASM instructions (or VM opcodes (or an LL IR)).
Could someone easily prepare a demo of a frame pointer buffer overflow exploit to explain?
Exciting!
https://community.ibm.com/community/user/wasdevops/blogs/kev...