Now at 64 cores and above it is clear that the Linux developers have spent a lot of time making the Linux kernel better. May have something to do with the fact that a big proportion of servers in production with many CPUs/cores are Linux servers so they started investing in this quite early?
Not servers, but single system image supercomputers. Quoting Wikipedia (https://en.wikipedia.org/wiki/Altix):
"At product introduction, the system supported up to 64 processors running Linux as a single system image and shipped with a Linux distribution called SGI Advanced Linux Environment, which was compatible with Red Hat Advanced Server. By August 2003, many SGI Altix customers were running Linux on 128- and even 256-processor SGI Altix systems. SGI officially announced 256-processor support within a single system image of Linux on March 10, 2004 using a 2.4-based Linux kernel. [...]"
And as I recall, back then Linux suffered an issue similar to what Windows is facing now: a single 64-bit integer was no longer enough for a processor bitmask when you have more than 64 processors. IIRC, a lot of code had to be refactored to allow for a bigger bitmask, and an abstraction layer was put in place. Nowadays, the limit is 8192 processors according to arch/x86/Kconfig (see also: https://access.redhat.com/articles/rhel-limits).
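To illustrate the bitmask problem, here is a minimal Python sketch of the general idea: replace the single 64-bit word with an array of words and hide the indexing behind a few helpers. The CpuMask class and its method names are made up for this example and are not the kernel's actual cpumask API; it only shows why code that poked at a raw integer had to be refactored.

```python
WORD_BITS = 64  # width of one word, matching the old single-integer mask


class CpuMask:
    """CPU bitmask stored as a list of 64-bit words (hypothetical sketch)."""

    def __init__(self, nr_cpus):
        self.nr_cpus = nr_cpus
        # Round up so that, e.g., 65 CPUs get two words.
        self.words = [0] * ((nr_cpus + WORD_BITS - 1) // WORD_BITS)

    def _locate(self, cpu):
        # Translate a CPU number into (word index, bit index within the word).
        if not 0 <= cpu < self.nr_cpus:
            raise IndexError(f"cpu {cpu} out of range")
        return divmod(cpu, WORD_BITS)

    def set(self, cpu):
        word, bit = self._locate(cpu)
        self.words[word] |= 1 << bit

    def clear(self, cpu):
        word, bit = self._locate(cpu)
        self.words[word] &= ~(1 << bit)

    def test(self, cpu):
        word, bit = self._locate(cpu)
        return bool(self.words[word] >> bit & 1)

    def weight(self):
        # Number of CPUs currently set in the mask.
        return sum(bin(w).count("1") for w in self.words)


mask = CpuMask(128)
for cpu in (0, 64, 127):
    mask.set(cpu)
```

With 128 CPUs the mask needs two words, and CPU 64 lands in bit 0 of the second word, which is exactly the case a single 64-bit integer can't represent.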
I suspect SQL Server has no trouble scaling to 64 cores and beyond.
Given that it's a fundamental limitation of the NT scheduler (not present in Linux), it seems like it'd be on the table for "yeah, Windows makes this way harder, and a lot of applications won't scale the same way on Windows as their Linux versions will", rather than "oh, that just doesn't count because they aren't using it right".
EDIT: As an aside, this kind of thing is exactly why Linux doesn't provide binary compatibility at the driver level. It's easy to paint yourself into a corner with decisions that were perfectly sane 20 years ago. Now NT has fundamental limitations, and they hit even harder in kernel space, where nearly every driver out there has macros compiled in that touch these structures. It's bad enough at the syscall layer, but it's even worse when you can't change things because existing code is directly modifying internal structures.
This is exactly why Linux won't provide a driver ABI, and why it's a good thing.
Either way, if it doesn't work in the benchmark, it doesn't really matter who's at fault (as long as the benchmark isn't completely synthetic).
I am not defending the utter shittiness of the Win32 Processor group API design, which is the usual overly complicated Win32 API that only makes sense to an NT kernel developer. I can see why no application developer bothered to incorporate that in their software. It's basically asking you to do thread scheduling yourself for >64 cores.
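For a rough idea of what "doing thread scheduling yourself" looks like, here is a Python sketch of the bookkeeping only. It assumes a machine whose logical processors are split into groups of at most 64 (for example, two groups of 64 on a 128-thread part) and computes a (group, affinity mask) pair for each worker thread. plan_thread_groups is a hypothetical helper, not a Win32 function; on Windows the resulting pairs are the kind of thing you would hand to SetThreadGroupAffinity in a GROUP_AFFINITY structure, which this sketch does not call.

```python
def plan_thread_groups(group_sizes, n_threads):
    """Pin each worker thread to one logical CPU, spreading across groups.

    group_sizes: number of logical processors in each group (assumed <= 64).
    Returns a list of (group_index, affinity_mask) pairs, one per thread.
    """
    if not group_sizes or any(not 0 < size <= 64 for size in group_sizes):
        raise ValueError("each group must hold between 1 and 64 processors")

    # Flatten the groups into (group, cpu-within-group) slots.
    slots = [(group, cpu)
             for group, size in enumerate(group_sizes)
             for cpu in range(size)]

    plan = []
    for thread in range(n_threads):
        # Wrap around if there are more threads than logical CPUs.
        group, cpu = slots[thread % len(slots)]
        plan.append((group, 1 << cpu))  # the mask is relative to the group
    return plan


plan = plan_thread_groups([64, 64], 128)
```

The point is that the mask alone is no longer enough: thread 64 gets the same mask as thread 0 and differs only in its group number, so an application that only ever sets a plain 64-bit affinity mask stays inside one group.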
But I'm always annoyed to see people whinging about Windows performance when they're using "cross-platform" applications (that are using the lowest common denominator APIs) to supposedly measure perf.
For an application that knows the OS environment it's running in, the "scheduler" or the "kernel" is very unlikely to be an actual impediment most of the time. I use quotes because those words are usually used to mean "my app doesn't run as I expect and it can't be my fault".
This is true of all operating systems.
An old MSDN doc [1] actually says "The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application." There is no newer doc that I could find, so perhaps the design hasn't changed in Windows 8 or 10.
"XYZ should be more than adequate for the typical application" is like a Microsoft basic design principle or something :-)
[1] https://docs.microsoft.com/en-us/previous-versions/windows/h...
They also used a version of Windows aimed at desktop/graphical uses and not Server. I wouldn't be surprised if desktop Windows didn't like having apps use a lot of CPU (making the system feel unresponsive) but Server didn't mind.
I’d be absolutely shocked if Windows didn’t have a similar process.
linux/Documentation/vm/numa.rst states it was started in 1999. Was Windows going anywhere near NUMA architectures back then?
They used to have a "Datacenter" SKU of the server where you'd find most of these kinds of features. This was only available with OEM hardware IIRC.
https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-co...
However, any software that does that will likely NOT be cross platform.
In addition, if you want to benchmark the kernel, you should run against a RAM disk and not an SSD.
There are some decent "cross platform" platforms, such as Java or C#, which have a better degree of performance compatibility. But if you're working at the system level (e.g. pthreads / epoll on Linux, or Windows Threads / Critical Sections / Completion Ports), you need to use the OS-specific code to truly reach best performance.
Java, especially with high-performance JVMs like Azul, can be surprisingly efficient. But achieving the best performance on the Azul Zing runtime means using Azul-specific libraries! Once again, tying yourself down to a platform.
As it turns out, performance is the hardest thing to port. You can somewhat easily port functionality to any system and kludge things together (with effort, your C# code can port over to Mono and run on Linux). But actually getting performance guarantees out of primitives almost always takes platform-specific testing.
Case in point: you may make certain assumptions about the Linux scheduler, only for the Linux scheduler to change from O(n) to O(1) to Completely Fair, and today the sysadmin can change scheduler details to better suit the needs of your application. These things have an effect on performance that makes it difficult to port between systems... or even between the SAME system running slightly different configurations (e.g. Huge Pages misconfigured on one box).
Presumably built using the Intel compiler which specifically penalizes using AMD CPUs.
That would explain the advantage Windows has at low core counts.
I have been keeping an eye on DragonFlyBSD for years now, it does some very interesting things, so this:
> Coming up next I will be looking at the FreeBSD / DragonFlyBSD performance on the Threadripper 3990X
has me excited.
Were those programs built with the same toolchain? Could it be that some library the lagging ones use is causing the problem?