AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy (opens in new tab)

(phoronix.com)

410 pointscrcastle1mo ago165 comments

https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

165 comments

Its worth reading this follow-up LKML post by Andres Freund (who works on Postgres): https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

aftbit1mo ago

>If this somehow does end up being a reproducible performance issue (I still suspect something more complicated is going on), I don't see how userspace could be expected to mitigate a substantial perf regression in 7.0 that can only be mitigated by a default-off non-trivial functionality also introduced in 7.0.

cr125rider1mo ago

They said the magic words to get Linus to start flipping tables. Never break userspace. Unusably slow is broken

anal_reactor1mo ago

> Maybe we should, but requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great.

Completely right. This sounds like a communication failure. Maybe Linux maintainers should pick a few applications that have "priority support" and problems with these applications are also problems with Linux itself. Breaking Postgres is a serious regression.

Reminds me of a situation where Fedora couldn't be updated if you had Wine installed and one side of the argument was "user applications are user problem" while the other was "it's Wine, like come on".

falcor841mo ago

I for one liked the old and simple WE DO NOT BREAK USERSPACE attitude.

https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE

gcr1mo ago

Performance regressions are different from ABI incompatibilities. If the kernel refused to do any work that slowed down any userspace program, the pace would go a lot slower.

2 more replies

reisse1mo ago

Not sure it is true anymore. I've encountered few userspace breaks in io_uring, at least.

jeffbee1mo ago

Funny how "use hugepages" is right there on the table and 99% of users ignore it.

bombcar1mo ago

I’m absolutely flabbergasted by the performance left on the table; even by myself - just yesterday I learned Gentoo’s emerge can use git and be a billion times faster.

globular-toast1mo ago

The time spent by emerge is utterly dwarfed by the time spent to build the packages, so who cares? Maybe it's different if installing a binary system but don't think most people are doing that.

3 more replies

justinclift1mo ago

Note that it's just not a single post, and there's additional further information in following the full thread. :)

adrian_b1mo ago

Yes, and in the following messages the conclusion was that the regression is mitigated when using huge pages.

devchix1mo ago

This seems bad, Splunk advises you to turn off THP due its small read/write characteristics: https://help.splunk.com/en/splunk-enterprise/release-notes-a...

Bad because as of Splunk 10.x, Splunk bundles postgres to integrate with their SOAR platform. Parenthetically, this practice of bundling stuff with Splunk is making vuln remediation a real pain. Splunk bundles its own python, mongod, and now postgres, instead of doing dependency checking. They're going to have to keep doing it as long as they release a .tgz and not just an RPM. The most recent postgres vuln is not fixed in Splunk.

1 more reply

jeltz1mo ago

Which you always should use anyway if you can.

1 more reply

TacticalCoder1mo ago

AIUI in that thread they're saying "0.51x" the perf on a 96-core arm64 machine and they're also saying they cannot reproduce it on a 96-core amd64 machine.

So it's not going to affect everybody both running PostgreSQL and upgrading to the latest kernel. Conditions seems to be: arm64, shitloads of core, kernel 7.0, current version of PostgreSQL.

That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks.

torginus1mo ago

It's a huge issue of ARM based systems, that hardly anyone uses or tests things on them (in production).

Yes, Macs going ARM has been a huge boon, but I've also seen crazy regressions on AWS Graviton (compared to how its supposed to perform), on .NET (and node as well), which frankly I have no expertise or time digging into.

Which was the main reason we ultimately cancelled our migration.

I'm sure this is the same reason why its important to AWS.

p_l1mo ago

Macs are actually part of pain point with ARM64 Linux, because the Linux arm set er tend to use 64 kB pages while Mac supports only 4 and 16, and it causes non trivial bugs at times (funnily enough, I first encountered that in a database company...)

zamalek1mo ago

It was later reproduced on the same machine without huge pages enabled. PICNIC?

anarazel1mo ago

Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.

With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.

1 more reply

MBCook1mo ago

So perhaps this is a regression specifically in the arm64 code, or said differently maybe it’s a performance bug that has been there for a long time but covered up by the scheduler part that was removed?

adrian_b1mo ago

The following messages concluded that using huge pages mitigates the regression, while not using huge pages reproduces it.

db48x1mo ago

Could be either of those, or something else entirely. Or even measurement error.

1 more reply

master_crab1mo ago

For production Postgres, i would assume it’s close to almost no effect?

If someone is running postgres in a serious backend environment, i doubt they are using Ubuntu or even touching 7.x for months (or years). It’ll be some flavor of Debian or Red Hat still on 6.x (maybe even 5?). Those same users won’t touch 7.x until there has been months of testing by distros.

crcastleOP1mo ago

Ubuntu is used in many serious backend environments. Heroku runs tens of thousands (if not more) instances of Ubuntu on its fleet. Or at least it did through the teens and early 2020s.

https://devcenter.heroku.com/articles/stack

2 more replies

pmontra1mo ago

A customer of mine is running on Ubuntu 22.04 and the plan is to upgrade to 26.04 in Q1 2027. We'll have to add performance regression to the plan.

1 more reply

fxtentacle1mo ago

.. which confirms all of my stereotypes. Looks like the AWS engineer who reported it used a m8g.24xlarge instance with 384 GB of RAM, but somehow didn't know or care to enable huge pages. And once enabling them, the performance regression disappears.

bushbaba1mo ago

Because such settings aren’t obvious to those not familiar with them. LLMs should make discoverability easier though

perrygeo1mo ago

Honest question: what's the value of running the benchmark and reporting a performance regression if the author is not familiar with basic operation of the software? I'd argue that not understanding those settings disqualifies you from making statements about it.

2 more replies

monocasa1mo ago

I feel like using spinlocks in user space at all without kernel support like rseq is just asking for weird performance degradations.

anarazel1mo ago

I really dislike the use of spinlocks in postgres (and have been replacing a lot of uses over time), but it's not always easy to replace them from a performance angle.

On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake). Turns out that that increase in memory barriers causes regressions that are nontrivial to avoid.

Another difficulty is that most of the remaining spinlocks are just a single bit in a 8 larger byte atomic. Futexes still don't support anything but 4 bytes (we could probably get away with using it on a part of the 8 byte atomic with some reordering) and unfortunately postgres still supports platforms with no 8 byte atomics (which I think is supremely silly), and the support for a fallback implementation makes it harder to use futexes.

The spinlock triggering the contention in the report was just stupid and we only recently got around to removing it, because it isn't used during normal operation.

Edit: forgot to add that the spinlock contention is not measurable on much more extreme workloads when using huge pages. A 100GB buffer pool with 4KB pages doesn't make much sense.

anarazel1mo ago

Addendum big enough to warrant a separate post: The fact the contention is a spinlock, rather than a futex is unrelated to the "regression".

A quick hack shows the contended performance to be nearly indistinguishable with a futex based lock. Which makes sense, non-PI futexes don't transfer the scheduler slice the lock owner, because they don't know who the lock owner is. Postgres' spinlock use randomized exponential backoff, so they don't prevent the lock owner from getting scheduled.

Thus the contention is worse with PREEMPT_LAZY, even with non-PI futexes (which is what typical lock implementations are based on), because the lock holder gets scheduled out more often.

Probably worth repeating: This contention is due to an absurd configuration that should never be used in practice.

menaerus1mo ago

Contention doesn't exist in older kernel versions even with huge-pages disabled, no?

1 more reply

amluto1mo ago

> On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

I think that, in common cache coherence protocols, this is kind of straightforward -- the unlock is a store-release, and as long as the cache line ends up being written locally, the hardware or ucode or whatever simply [1] needs to check whether a needs-notification flag is set in the same cacheline. Or the futex-wait operation needs to do a super-heavyweight barrier to synchronize with the releasing thread even though the releasing thread does not otherwise have any barrier that would do the job.

One nasty approach that might work is to use something like membarrier, but I'm guessing that membarrier is so outrageously expensive that this would be a huge performance loss.

But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.

I do expect that it would be nearly impossible to convince an x86 CPU vendor to commit to an answer either way.

(Do other architectures, e.g. the most recent ARM variants, have an RMW release operation that naturally does this? I've tried, and entirely failed AFAICT, to convince x86 HW designers to add lighter weight atomics.)

[0] Visible to the remote thread, but the kernel can easily mediate this, effectively for free.

[1] Famous last words. At least in ossified microarchitectures, nothing is simple.

anarazel1mo ago

> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

Hah.

> ... > But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.

I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Due to being a somewhat contended simple spinlock, often embedded on the same line as the to-be-protected data, it's common for the line not not be in modified ownership anymore at release.

1 more reply

adrian_b1mo ago

Using LOCK CMPXCHG or even plain CMPXCHG does not make sense unless it is done in a loop, which tests the success of the operation.

Implementing locks does not need this kind of loops, which may greatly increase the overhead, but only loops that do simple loads, for detecting changes, or the invocation of a FUTEX_WAIT, which is equivalent with that.

Besides loops that wait for changes, any kind of lock may be implemented with atomic read-modify-write instructions (e.g. on x86 XCHG, LOCK XADD, LOCK BTS and so on, and equivalent instructions on Armv8.1-A or later ISAs) that are not used in loops, so they have predictable overhead. For example, a futex may be used by a thread that waits for multiple events, if the other threads use a locked bit-test-and-set on the futex variable to signal the occurrence of an event, where each event is assigned to a distinct bit.

CMPXCHG and the equivalent load-and-lock/store conditional are really needed far less often than some people use them. The culprit is a widely-quoted research paper that has shown that these instructions are more universal than simple atomic fetch-and-operation instructions, allowing the implementation of lock-free algorithms, but the fact that they can do more does not mean that they should also be used when their extra power is not necessary, because that is paid dearly by introducing non-deterministic overhead.

A simple atomic instruction has an overhead much greater than an access to the L1 cache or the L2 cache, but typically the overhead is similar to that of a simple access to the L3 cache and significantly lower than the overhead of a simple access to the main memory, which remains the most expensive operation in modern CPUs.

Moreover, while mutual exclusion can be implemented reasonably efficiently with locks, it is also used far more often than necessary. It is possible to implement shared buffers or message queues that use neither mutual exclusion nor optimistic access that may need to be retried (a.k.a. lock-free access), but instead of those they use dynamic partitioning of the shared resource, allowing concurrent accesses without interference.

Organizing the cooperation between threads around shared buffers/message queues is frequently much better than using mutual exclusion, which stalls all contending threads, serializing their execution, and also much better than lock-free access, which may need an unpredictable number of retries when contention is high.

1 more reply

jcalvinowens1mo ago

That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty. And there are like ten open coded spin waits around the uses... you certainly have my empathy :)

This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?

Somebody tried a long time ago, it got dropped but I didn't actually see any major objection: https://lore.kernel.org/lkml/20070327110757.GY355@devserv.de...

anarazel1mo ago

> That 64-bit atomic in the buffer head with flags, a spinlock, and refcounts all jammed into it is nasty.

Turns out to be pretty crucial for performance though... Not manipulating them with a single atomic leads to way way worse performance.

For quite a while it was a 32bit atomic, but I recently made it a 64bit one, to allow the content lock (i.e. protecting the buffer contents, rather than the buffer header) to be in the same atomic var. That's for one nice for performance, it's e.g. very common to release a pin and a lock at the same time and there are more fun perf things we can do in the future. But the real motivation was work on adding support for async writes - an exclusive locker might need to consume an IO completion for a write that's in flight that is prevent it from acquiring the lock. And that was hard to do with a separate content lock and buffer state...

> And there are like ten open coded spin waits around the uses... you certainly have my empathy :)

Well, nearly all of those are all to avoid needing to hold a spinlock, which, as lamented a lot around this issue, don't perform that well when really contended :)

We're on our way to barely ever need the spinlock for the buffer header, which then should allow us to get rid of many of those loops.

> This got me thinking about 64-bit futexes again. Obviously that can't work with PI... but for just FUTEX_WAIT/FUTEX_WAKE, why not?

It'd be pretty nice to have. There are lot of cases where one needs more lock state than one can really encode into a 32bit lock state.

I'm quite keen to experiment with the rseq time slice extension stuff. Think it'll help with some important locks (which are not spinlocks...).

1 more reply

jcalvinowens1mo ago

> I feel like using spinlocks in user space at all without kernel support like rseq is just asking for weird performance degradations.

Yeah, exactly. "Doctor, help, somebody replaced my wooden hammer with a metal one, and now I can't hit myself in the face with it as many times."

If you use spinlocks in userspace, you're gonna have a bad time.

mgaunard1mo ago

Most people looking for performance will reach for the spinlock.

The expectation is that the kernel should somehow detect applications that are spinning, and avoid preempting them early.

IshKebab1mo ago

Well that seems like an unreasonable expectation no? Also isn't the point of spinlocks that they get released before the kernel does anything? Otherwise you could just use a futex... Which maybe you should do anyway...

https://matklad.github.io/2020/01/04/mutexes-are-faster-than...

1 more reply

silon421mo ago

If you are spinning so long that it requires preemption, you're doing something wrong, no?

1 more reply

jeltz1mo ago

PostgreSQL is old and had to support kernels which did not support spinlocks. But, yes, maybe PostgreSQL should stop doing so now that kernels do.

dsr_1mo ago

Nobody sensible runs the latest kernel; nobody running PG in production should be afraid of setting a non-default at either boot time or as a sysctl. So this will, most likely, be another step in building a PG database server (turn off pre-emption if your kernel is 7.0 or later and PG is pre-whatever-version).

At worst it might become a permanent part of building a PG server and a FAQ... but if it affects one thing this badly, it will affect others.

Meekro1mo ago

> Nobody sensible runs the latest kernel

From the article: "Linux 7.0 stable is due out in about two weeks. This is also the kernel version powering Ubuntu 26.04 LTS to be released later in April."

Unfortunately, lots of people will be running it in less than a month. At the moment, it'll take a kernel patch (not a sysctl) to undo this-- hopefully something changes.

Neywiny1mo ago

Not nobody but not everybody upgrades to the newest distros immediately. That's the advantage of LTS. I've even found that a lot of programs have poorer support on 24.04 than 22.04 due to security changes, so I'm fine sticking with 22.04 as my main dev system.

justinclift1mo ago

> ... not everybody upgrades to the newest distros immediately.

While that's true, for new deployments the story is often "deploy on the latest release of things available at the time".

So, there will probably be a substantial deployment of new projects / testing projects using the Linux 7.0 kernel along with the latest available software packages in a few weeks.

1 more reply

stingraycharles1mo ago

This seems to be brushing off a major performance regression just because you personally don’t upgrade for 4 years. I don’t think that’s common at all.

1 more reply

vasco1mo ago

Someone said "its fine nobody uses this" and someone else gave the world's biggest slam dunk of "Ubuntu in 1 month" and your reply is that "not everyone does it". How far from the point can you be!

In the Linux world this is the worst possible scenario, distro with the largest adoption, LTS.

2 more replies

esafak1mo ago

That's the advantage of LTS? 24.04 is the LTS, not the one you use, 22.04.

3 more replies

9999000009991mo ago

Depends on your shop.

As someone with a heavy QA/Dev Opps background I don't think we have enough details.

Is it only ARM64 ? How many ARM64 PG DBs are running 96 cores?

However...

This is the most popular database in the world. Odds are this will effect a bunch of other lesser known applications.

whilenot-dev1mo ago

Please follow the complete thread: https://lore.kernel.org/lkml/xxbnmxqhx4ntc4ztztllbhnral2adog...

> [...] used huge_pages=on - as that is the only sane thing to do with 10s to 100s of GB of shared memory [...] if I disable huge pages, I actually can reproduce the contention [...]

1 more reply

teekert1mo ago

I think most people on enterprise-y systems wait for (at least) 26.04.1, the window is 3 years (when on 24.04, which is supported until ~2029-04-30, it's 1 year when on 22.04) starting now, hardly anyone switches immediately.

tankenmate1mo ago

Not necessarily;

``` $ grep PREEMPT_DYNAMIC /boot/config-$(uname -r) CONFIG_PREEMPT_DYNAMIC=y CONFIG_HAVE_PREEMPT_DYNAMIC=y CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y ```

if your kernel has CONFIG_PREEMPT_DYNAMIC then you can go back to the pre 7.0 default by adding preempt=none to your grub config. I haven't seen any plans by Ubuntu to drop CONFIG_PREEMPT_DYNAMIC from the default kernel config.

tankenmate1mo ago

actually i just checked, yeah, ubuntu would have to add none back to the kernel and `CONFIG_PREEMPT_NONE=y` the config so that it can be selected at boot.

bombcar1mo ago

We need some sensible people running the latest and greatest or we won't catch things like this.

stingraycharles1mo ago

That may be the case, but it’s still not a great situation to be in and one has to wonder: if PostgreSQL is affected, what else is?

bombcar1mo ago

That's the big thing - PSQL will be tested, noticed, and fixed (and likely have a version that handles 7.0 by the time it's in common use).

But other software won't and may not even be noticed, except as a (I hate using the term) enshittification.

Better to introduce the "correct way" in 7.0 but not regress the old (translate the "correct" into the old if necessary) - and then in 8.0 or some future release implement the regression.

stingraycharles1mo ago

Exactly, this is how it’s usually done. As the developer on the mailing list mentions, implementing a new low level construct in 7.0 and a performance regression that requires said new low level construct to mitigate is not great. You need a grace period in which both old and new approach is fast.

Seattle35031mo ago

If you're running in a docker container you share the host kernel. You might not have a choice.

cwillu1mo ago

The option to set PREEMPT_NONE was removed for basically all platforms.

harshreality1mo ago

Background on PREEMPT_LAZY:

https://lwn.net/Articles/994322/

longislandguido1mo ago

Anyone check to see if Jia Tan has submitted any kernel patches lately?

rs_rs_rs_rs_rs1mo ago

They don't need to, there's about a billion bugs they can exploit.

FireBeyond1mo ago

Once upon a time, Linus would shout and yell about how the kernel should never "break" userspace (and I see in some places, some arguments of "It's not broken, it's just a performance regression" - personally I'd argue a 50% hit to performance of a pre-eminent database engine is ... quite the regression).

Now, the kernel engineer who introduced the brand new mechanism (introduced in Linux 7.0) for handling pre-emption says the "fix" is for Postgres to start using this new mechanism (I think the sister comment below links to what one of the Postgres engineers thinks of that, and I'm inclined mostly to agree).

shakna1mo ago

Freund seems to suggest that hugepages is the right way to run a system under this sort of load - which is the fix.

> Hah. I had reflexively used huge_pages=on - as that is the only sane thing to do with 10s to 100s of GB of shared memory and thus part of all my benchmarking infrastructure - during the benchmark runs mentioned above.

> Turns out, if I disable huge pages, I actually can reproduce the contention that Salvatore reported (didn't see whether it's a regression for me though). Not anywhere close to the same degree, because the bottleneck for me is the writes.

But, they can speak for themselves here [0].

[0] https://news.ycombinator.com/item?id=47646332

perching_aix1mo ago

Entertaining perspective - I thought that the whole "it's not an outage it's a (horizontal or vertical) degradation" thing was exclusive to web services, but thinking about it, I guess it does apply even in cases like this.

MBCook1mo ago

It wouldn’t be the first time one of the other maintainers ran afoul of “Linus’s law“.

He may simply be waiting until more is known on exactly what’s causing it.

bear86421mo ago

> I'd argue a 50% hit to performance [...] is ... quite the regression

Indeed! Especially if said regression happens to impact anything trade/market related...

quietsegfault1mo ago

This was my immediate thought - kernel doesn’t break software, or at least it didn’t used to.

arjie1mo ago

Well, the reason he'd yell about it is that someone did it. If no one ever did it, he'd never yell and we'd never have the rule. So one can only imagine that this is one of those things where someone has to keep holding the line rather than one of those things where you set some rule and it self-holds.

Doubtless someone will have to do the yelling.

cperciva1mo ago

This makes me feel better about the 10% performance regression I just measured between FreeBSD 14 and FreeBSD 15.0.

db48x1mo ago

Heh. Did they at least add useful features to balance out that cost?

cperciva1mo ago

FreeBSD 15 has lots of useful features! And better performance on other benchmarks; I just need to track down what's going wrong with this particular one.

db48x1mo ago

Certainly could be worse!

cdelsolar1mo ago

https://lkml.org/lkml/2012/12/23/75

bob10291mo ago

I'm struggling a bit with why we need all these fancy dynamic preemption modes. Is this about hyperscalars shoving more VMs per physical machine? What does a person trying to host a software solution gain from this kernel change?

If a user wants to spin in an infinite loop all day every day, I don't see the problem with that. Even if the spinning will provably never do any useful work.

ponco1mo ago

more throughput WITHOUT huge tail latency is my understanding. A user above posted this link https://lwn.net/Articles/994322/ which goes into the background. My mental model is "give the kernel more explicit information" and it will be able to make better decisions

teleforce1mo ago

Does the PostgresSQL 18 performance increased with the latest asynchronous I/O, smarter query planning with improved parallelism kind of offset this performance hits? [1].

"Enhanced and smarter parallelisation; initial benchmarks indicate up to 40% faster analytical queries".

[1] PostgreSQL 18 released: Key features & upgrade tips:

https://www.baremon.eu/postgresql-18-released-key-features-u...

anal_reactor1mo ago

Can someone explain to me what's the problem? I have very little knowledge of Linux kernel, but I'm curious. I've tried reading a little, but it's jargon over jargon.

alienchow1mo ago

I'm not familiar with the jargon either, but based on some reading it comes down to how the latest kernel treats process preempts.

Postgres uses spinlocks to hold shared memory for very critical processes. Spinlocks are an infinite loop with no sleep to attempt to hold a lock, thus "spinning". Previous kernels allowed spinlocking processes to run with PREEMPT_NONE. This flag tells the kernel to let the locking process complete their work before doing anything. Now the latest kernel removed this functionality and is interrupting spinlocking processes. So if a process that is holding a lock gets interrupted, all other postgres spinlocks processes that need the same lock spin in place for way longer times, leading to performance degradation.

anal_reactor1mo ago

Why does it only appear on arm64 and not x86?

adrian_b1mo ago

It was not architecture-related. Not using huge pages also reproduced the regression on x86.

I do not know why using huge pages mitigates the regression, but it could be just because when the application uses huge pages it uses spinlocks much less frequently so the additional delays do not accumulate enough to cause a significant performance reduction.

1 more reply

tijsvd1mo ago

From what I understand in the follow up: postgres uses shared memory for buffers. This shared memory is read by a new connection while locked.

In postgres, connections are handled with a process fork, not a new thread. If such a fork first reads memory, even if it already exists, that causes a minor page fault, which goes back to the kernel so it can update memory mapping tables.

The operation under lock is only a few instructions, but if it takes longer than expected, then that causes lock contention. Regression in the kernel handling minor faults?

The whole thing is then made worse because it's a spinlock, causing all waiting processes to contend over the cpus which adds to kernel processing.

Mitigated by using huge pages, which dramatically reduces the number of mapping entries and faults. I reckon that it could also be mitigated in postgres by pre-faulting all shared memory early?

Deeg9rie9usi1mo ago

Once again phoronix shoot out an article without further researching nor letting the mail thread in question cool down. The follow up mails make clear that the issue is more or less a non-issue since the benchmark is wrong.

adrian_b1mo ago

The following up mails conclude that the regression happens only when huge pages are not used.

While using huge pages whenever possible is the right solution and this should be enough for PostgreSQL, perhaps there are applications that cannot use huge pages and which are affected by the regression.

So I do not think that it is right to just ignore what happened.

scottlamb1mo ago

> While using huge pages whenever possible is the right solution and this should be enough for PostgreSQL, perhaps there are applications that cannot use huge pages and which are affected by the regression.

It will be more interesting to talk about those applications if and when they are found. And I wouldn't assume the solutions are limited to reverting this change, starting to use the new spinlock time-slice extension mechanism, and enabling huge pages.

It sounds like using 4K pages with 100G of buffer cache was just the thing that made this spinlock's critical section become longer than PostgreSQL's developers had seen before. So when trying to apply the solution to some hypothetical other software that is suddenly benchmarking poorly, I'd generalize from "enable huge pages" to "look for other differences between your benchmark configuration and what the software's authors tested on".

justinclift1mo ago

> It will be more interesting to talk about those applications if and when they are found.

Redis recommend disabling hugepages: https://redis.io/docs/latest/operate/oss_and_stack/managemen...

---

Actually, looks like they changed the log warning to be more specific, as it's just the "always" setting which seems to cause Redis grief?

https://github.com/redis/redis/issues/3895

1 more reply

Deeg9rie9usi1mo ago

I agree with you. The lurid headlines of phoronix.com just annoy me...

galbar1mo ago

It's not a good look to break userspace applications without a deprecation period where both old and new solutions exist, allowing for a transition period.

1 more reply

up2isomorphism1mo ago

Not sure why people have to upgrade to the newest major kernel version as soon as it is released.

conradludgate1mo ago

It's the performance team's job to test these things. Doesn't mean they're going to deploy it immediately.

Someone should be testing these things and reporting regressions

jeltz1mo ago

If nobody tests and reports these things when the version is released the regression would not be fixed when people start using it in production.

IshKebab1mo ago

Don't make excuses.

dboreham1mo ago

THP again?

dmitrygr1mo ago

And this is exactly why we need the old Linus. Someone needs to yell “we do not break user space“

carlsborg1mo ago

Perhaps in due time we will see workload specific forks of Linux maintained by a team of agents

j / k navigate · click thread line to collapse

165 comments

lfittl1mo ago