Windows vs. Linux Scaling Performance 16 to 128 Threads with Threadripper 3990X

(phoronix.com)

108 points · jjuhl · 6y ago · 52 comments

t0mas88 · 6y ago · 7 in thread

From my experience for a long time the Windows NT (from Win2K onwards) kernel and scheduler were actually better than Linux in several ways. That always amazed me, because Linux was a better server OS in many other ways.

Now at 64 cores and above it is clear that the Linux developers have spent a lot of time making the Linux kernel better. May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?

cesarb · 6y ago

> May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?

Not servers, but single system image supercomputers. Quoting Wikipedia (https://en.wikipedia.org/wiki/Altix):

"At product introduction, the system supported up to 64 processors running Linux as a single system image and shipped with a Linux distribution called SGI Advanced Linux Environment, which was compatible with Red Hat Advanced Server. By August 2003, many SGI Altix customers were running Linux on 128- and even 256-processor SGI Altix systems. SGI officially announced 256-processor support within a single system image of Linux on March 10, 2004 using a 2.4-based Linux kernel. [...]"

And as I recall, back then Linux suffered an issue similar to what Windows is facing now: a single 64-bit integer was no longer enough for a processor bitmask when you have more than 64 processors. IIRC, a lot of code had to be refactored to allow for a bigger bitmask, and an abstraction layer was put in place. Nowadays, the limit is 8192 processors according to arch/x86/Kconfig (see also: https://access.redhat.com/articles/rhel-limits).
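A minimal sketch of the wider-bitmask idea described above, assuming 64-bit words. The `CpuMask` class and its method names are hypothetical and chosen for illustration; this is not the kernel's actual cpumask code.

```python
BITS_PER_WORD = 64  # assumption: one machine word holds 64 CPU bits


class CpuMask:
    """Hypothetical CPU bitmask spread over several 64-bit words."""

    def __init__(self, nr_cpus):
        self.nr_cpus = nr_cpus
        # Round up so that e.g. 65 CPUs get two words.
        nwords = (nr_cpus + BITS_PER_WORD - 1) // BITS_PER_WORD
        self.words = [0] * nwords

    def _check(self, cpu):
        if not 0 <= cpu < self.nr_cpus:
            raise ValueError(f"cpu {cpu} out of range")

    def set(self, cpu):
        """Mark one CPU as present in the mask."""
        self._check(cpu)
        self.words[cpu // BITS_PER_WORD] |= 1 << (cpu % BITS_PER_WORD)

    def test(self, cpu):
        """Return True if the CPU is present in the mask."""
        self._check(cpu)
        return bool(self.words[cpu // BITS_PER_WORD] >> (cpu % BITS_PER_WORD) & 1)

    def weight(self):
        """Count how many CPUs are set across all words."""
        return sum(bin(word).count("1") for word in self.words)


# Usage: 128 hardware threads no longer fit in one 64-bit integer,
# so CPU 127 lands in the second word.
mask = CpuMask(128)
mask.set(0)
mask.set(127)
print(len(mask.words), mask.weight())
```

Callers only go through `set`, `test` and `weight`, so the number of words can grow without touching them, which is the point of putting an abstraction layer over the raw integer.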

twoodfin · 6y ago

It wouldn’t surprise me if Microsoft spends more time and effort on tuning and providing OS services for the particular commercial applications that are typically used on high core-count Windows server boxes, several of the most prominent of which are also Microsoft products.

I suspect SQL Server has no trouble scaling to 64 cores and beyond.

Guest42 · 6y ago

That is true. I have seen some terrible queries written on enterprise hardware that returned at the drop of a hat without being cached ahead of time.

blihp · 6y ago

I suspect it's just simple math. In the same way that Windows deployments vastly outnumber mainframes, (large scale server) Linux deployments vastly outnumber Windows. FAANG (and want-to-be FAANG) companies have been hammering away at Linux for decades with workloads that have grown to scales Windows has never seen. For example, how many billion user web services are running on Windows? What percentage of supercomputers today run Windows? etc.

oaiey · 6y ago

Well, the Windows kernel is known to be well engineered. The userland, including the user interface, is a different story. Here the backward compatibility and the huge feature sets kick in, including all their disadvantages (and advantages).

ksec · 6y ago

I have often wondered if Microsoft will one day open-source the Windows kernel.

lisk1 · 6y ago

My guess is that Linux is still mainly used in server environments, so utilizing as many threads as possible is a necessity, and more contributions go into this.

gmueckl · 6y ago · 6 in thread

There is no mention in the article whether the software suite was vetted for support of more than 64 threads on Win32. The API has a peculiar weakness that limits thread scheduling to a single processor group by default and a group can have no more than 64 hardware threads. To get above this limit, the application must explicitly adjust the processor affinity of its threads to include the additional hardware threads. MS was not in a hurry to adjust the C++ STL and their OpenMP runtime after the basic processor group API appeared in Vista. I am not sure if they managed to do it by now. Some of the benchmark results look to me like the missing scaling from 64 to 128 hardware threads on Windows might be caused by this.
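A rough sketch of what "explicitly adjust the processor affinity" amounts to, assuming groups are filled 64 hardware threads at a time (Windows may partition processors differently, for example along NUMA nodes). The function names are hypothetical. A real application would hand each (group, mask) pair to something like `SetThreadGroupAffinity`; this sketch only computes the plan and calls no Windows API.

```python
MAX_GROUP_SIZE = 64  # a processor group holds at most 64 hardware threads


def split_into_groups(total_hw_threads):
    """Return the size of each processor group (assumed layout, see above)."""
    sizes = []
    remaining = total_hw_threads
    while remaining > 0:
        size = min(remaining, MAX_GROUP_SIZE)
        sizes.append(size)
        remaining -= size
    return sizes


def plan_affinity(num_workers, total_hw_threads):
    """Assign workers round-robin to groups.

    Returns one (group_index, affinity_mask) pair per worker, where the
    mask allows every hardware thread inside that one group.
    """
    sizes = split_into_groups(total_hw_threads)
    plan = []
    for worker in range(num_workers):
        group = worker % len(sizes)
        mask = (1 << sizes[group]) - 1  # one bit per hardware thread
        plan.append((group, mask))
    return plan


# Usage: on a 128-thread machine, alternate workers between two groups.
for group, mask in plan_affinity(4, 128):
    print(group, hex(mask))
```

Without a step like this, every worker stays in the group the process started in, which would cap the application at 64 hardware threads.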

monocasa · 6y ago

It's not just the API, it's the scheduler in NT itself that won't move threads from one processor group of up to 64 hardware threads (on a 64-bit system) to another, so the application has to manage this manually if you want to scale out farther than that on NT.

Given that it's a fundamental limitation of the NT scheduler (not present in Linux), it seems like it'd be on the table for "yeah, windows makes this way harder, and a lot of applications won't scale the same way on windows as their Linux versions will", rather than "oh, that just doesn't count because they aren't using it right".

EDIT: As an aside, this kind of thing is exactly why Linux doesn't provide binary compatibility on the driver level. It's easy to paint yourself into a corner by making decisions that were perfectly sane 20 years ago. Now NT has fundamental limitations, hitting even harder in kernel space where nearly every driver out there has some macros compiled in that touched these structures. It's bad enough at the syscall layer, but it's even worse when you can't change things with code that's directly modifying internal structures.

This is exactly why Linux won't provide a driver ABI, and why it's a good thing.

gmueckl · 6y ago

I forgot that the API doesn't allow thread affinity across processor group boundaries. It's been a couple of years since I last touched all of this. Revisiting it, it becomes clear that this limitation actually prevents transparent support for >64 hardware threads in the C++ STL or pthreads on Windows.

zamadatix · 6y ago

I don't think it's a fundamental binary API limitation type thing as the issue does not exist in Windows Enterprise or Server. This was covered the last time this was posted https://www.anandtech.com/show/15483/amd-threadripper-3990x-...

tpxl · 6y ago

There was another phoronix article recently that mentioned this, so they must know this exists.

Either way, if it doesn't work in the benchmark, it doesn't really matter who's at fault (as long as the benchmark isn't completely synthetic)?

muststopmyths · 6y ago

How is it a "benchmark" if it doesn't fully utilize the operating system's APIs to maximize performance ?

I am not defending the utter shittiness of the Win32 Processor group API design, which is the usual overly complicated Win32 API that only makes sense to an NT kernel developer. I can see why no application developer bothered to incorporate that in their software. It's basically asking you to do thread scheduling yourself for >64 cores.

But I'm always annoyed to see people whinging about Windows performance when they're using "cross-platform" applications (that are using the lowest common denominator APIs) to supposedly measure perf.

For an application that knows the OS environment it's running in, the "scheduler" or the "kernel" is very unlikely to be an actual impediment most of the time. I use quotes because those words are usually used to mean "my app doesn't run as I expect and it can't be my fault".

This is true of all operating systems.

An old MSDN doc [1] actually says "The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application.". There is no newer doc that I could find, so perhaps the design hasn't changed in Windows 8 or 10.

"XYZ should be more than adequate for the typical application" is like a Microsoft basic design principle or something :-)

[1] https://docs.microsoft.com/en-us/previous-versions/windows/h...

908B64B197 · 6y ago

The 64 processor per processor group limitation makes sense to me. A lot of >64 processor configs on the market are NUMA systems where it's probably best to pin threads to the NUMA node where their data will be hosted.

They also used a version of Windows aimed at desktop/graphical uses and not Server. I wouldn't be surprised if Desktop Windows didn't like having apps use a lot of CPU (making the system feel unresponsive) but server didn't mind.

MrBuddyCasino · 6y ago · 6 in thread

This reveals one weakness of the Windows development model: if something isn’t a feature that is driven with a PM behind it, it won’t happen. On the other hand, if some obscure internal thing isn’t optimal yet, you can bet some obsessed hacker is going to tackle it one day. How many schedulers has Linux had already?

jdsully · 6y ago

I’m not sure about the Windows org, but Office always made time for dev-led architectural improvements. Back in the old days it was called Milestone Zero.

I’d be absolutely shocked if Windows didn’t have a similar process.

derision · 6y ago

I don't necessarily consider that a weakness

MrBuddyCasino · 6y ago

Ok, it's a trade-off. The year of the Linux desktop is surely coming soon. I must say I did not expect the scaling performance difference to be so large though.

908B64B197 · 6y ago

Performance for more than 64 cores on the retail non-server OS seems like a feature not a lot of users are asking for...

detaro · 6y ago

Do the server variants do better? If yes, one would at least wonder why the variants you'd expect on high-end workstations didn't get the same options.

bonzini · 6y ago

Three, four if you count Con Kolivas's.

newnewpdro · 6y ago · 3 in thread

The Linux kernel has been run on "big iron" for a long time now, so it would be surprising if it weren't better prepared for scaling to 128+ cores.

linux/Documentation/vm/numa.rst states it was started in 1999. Was Windows going anywhere near NUMA architectures back then?

the8472 · 6y ago

Since windows was mostly running on x86 and the memory controllers were in the northbridge back then even multi-socket systems wouldn't have been affected by NUMA. Moving them on-die only happened later.

thedance · 6y ago

There were NUMA x86 rigs long before the memory controller moved to the CPU. IBM xSeries and serverworks chipsets from around 2000 had NUMA topologies.

muststopmyths · 6y ago

Windows server 2003 had NUMA support. I am not sure that Server 2000 exposed any NUMA capability, but there were a lot of things cut from that project because it was running late. My guess is NUMA was one of the things that got pushed (in terms of release) to 2003.

They used to have a "Datacenter" SKU of the server where you'd find most of these kinds of features. This was only available with OEM hardware IIRC.

lisk1 · 6y ago · 3 in thread

Looking at the results makes me wonder if MS is keeping separate branches of Win 10 internally, or if some CPU-hogging services are disabled on the Win 10 Enterprise version.

wmf · 6y ago

Windows 10 Pro crippled the scheduler. Windows 10 Enterprise uses the same uncrippled scheduler as Windows Server. "CPU-hogging services" don't consume 32 full cores.

lisk1 · 6y ago

But this proves my theory that MS keeps different repositories for Win 10 internally. We also know that some tracking services are disabled on Win 10 Enterprise, which leads to the logical conclusion that tracking services could potentially limit OS I/O ops.

thom · 6y ago

Any insight into whether Pro for Workstations is better here?

streetcat1 · 6y ago · 2 in thread

So, to get max perf from the Windows kernel, the software should use the completion port API, and not regular threads/locks.

https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-co...

However, any software that does that will likely NOT be cross platform.

In addition, if you want to benchmark the kernel, you should run against ram disk and not SSD.

dragontamer · 6y ago

In the general case, maximizing performance on any platform requires you to use platform-specific code.

There are some decent "cross platform" platforms, such as Java or C#, which have a better degree of performance compatibility. But if you're working at the system level (aka: PThreads / epoll with Linux, or Windows Threads / Critical Sections / Completion Ports), you need to use the OS-specific code to truly reach best performance.

Java, especially with high-performance JVMs like Azul, can be surprisingly efficient. But achieving the best performance on the Azul Zing runtime means using Azul-specific libraries! Once again, you are tying yourself down to a platform.

As it turns out, performance is the hardest thing to port. You can somewhat easily port functionality to any system and kludge things together (with effort, your C# code can port over to Mono and run on Linux). But actually getting performance guarantees from primitives almost always requires platform-specific testing.

Case in point: you may make certain assumptions about the Linux scheduler, only for the Linux scheduler to change from O(n) to O(1) to Completely Fair, and today the System Admin can change scheduler details to better tune the needs of your application. These things have an effect on performance that make it difficult to port between systems... or even between the SAME system running slightly different configurations (ex: misconfigure Huge Pages on one box)
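A small sketch of how platform-specific detail leaks into even a trivial question like "how many threads should I start?". It assumes Python's standard library only. `os.sched_getaffinity` is not available on every platform (it exists on Linux, for example), so the hypothetical helper `usable_cpu_count` has to branch.

```python
import os


def usable_cpu_count():
    """Best-effort count of CPUs this process may actually run on."""
    if hasattr(os, "sched_getaffinity"):
        # Where available, this respects affinity limits set by the admin
        # or a container, which os.cpu_count() does not reflect.
        return len(os.sched_getaffinity(0))
    # Portable fallback: total CPUs in the machine, or 1 if unknown.
    return os.cpu_count() or 1


print(usable_cpu_count())
```

The portable fallback works everywhere but can overcount on a box where the process is pinned to a subset of cores, which is the kind of configuration difference that makes performance hard to port.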

streetcat1 · 6y ago

Right. So in this case, what does the article compare?

pstrateman · 6y ago · 2 in thread

They're using `Clear Linux 32280` which is a distro produced by Intel.

Presumably built using the Intel compiler which specifically penalizes using AMD CPUs.

Would explain the advantage at low core counts that windows has.

arianvanp · 6y ago

Interestingly the opposite is true. AMD performs surprisingly well on Intel Clear Linux https://www.forbes.com/sites/jasonevangelho/2020/02/12/surpr...

yxhuvud · 6y ago

To the point where the benchmarks in the post get a bit misleading, as Clear Linux will outperform Ubuntu (or Fedora or whatever a user is more likely to install) by quite a big margin.

arminiusreturns · 6y ago

Honestly a little surprised it's as close as it is. I have consistently hated having to deploy anything that requires lots of cores on a windows machine.

I have been keeping an eye on DragonFlyBSD for years now, it does some very interesting things, so this:

> Coming up next I will be looking at the FreeBSD / DragonFlyBSD performance on the Threadripper 3990X

has me excited.

adossi · 6y ago

I'd like to see comparisons of compilation time. I wish there was a standard for benchmarking CPUs by compilation time. I know quite often a compilation of the Firefox source code is used, as well as the Linux kernel, I just wish it was more prevalent in these reviews.

thedance · 6y ago

These are all embarrassingly parallel multiplication workloads. Would be nice for a change if anyone would run something like MySQL or a gRPC server or something like that, you know one where it actually makes a difference how threads get scheduled when they go to sleep and wake up and when packets arrive and so forth.

lostmsu · 6y ago

With no clear explanation of wildly varying results between different benchmarks, I wonder if the analysis is flawed.

Were those programs built with the same toolchain? Could it be that some library the lagging ones use is causing the problem?
