What specific configurations (CPU, MB, RAM) are known to work?
Let's say I have a Ryzen system, how can I check if ECC really works? Like, can I see how many bit flips got corrected in, say, the last 24h?
You must check the specifications of the motherboard to see if ECC memory is supported.
As a rule, all ASRock MBs support ECC, as do some ASUS MBs, e.g. all ASUS workstation motherboards.
I have no experience with Windows and Ryzen, but I assume that ECC should also work there.
With Linux, you must use a kernel with all the relevant EDAC options enabled, including CONFIG_EDAC_AMD64.
For the new Zen 3 CPUs, i.e. Ryzen 5xxx, you must use kernel 5.10 or later for ECC support.
On Linux, there are various programs, e.g. edac-utils, to monitor the ECC errors.
To be more certain that the ECC error reporting really works, the easiest way is to change the BIOS settings to overclock the memory, until memory errors appear.
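To watch the counters while provoking errors this way, or to answer the "last 24h" question, you can read the EDAC counters straight from sysfs and diff two snapshots. A minimal sketch, assuming the EDAC driver is loaded and exposes the usual /sys/devices/system/edac/mc/mc*/ce_count and ue_count files; the function names are mine, not part of edac-utils:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree.

    Assumes each memory controller directory (mc0, mc1, ...) contains
    ce_count and ue_count files. Returns {} if the tree is missing.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def counts_delta(before, after):
    """Per-controller difference between two snapshots (after - before)."""
    result = {}
    for name, (ce, ue) in after.items():
        old_ce, old_ue = before.get(name, (0, 0))
        result[name] = (ce - old_ce, ue - old_ue)
    return result
```

The counters are cumulative since the driver was loaded, so a "last 24h" figure means saving one snapshot (e.g. from a daily cron job) and diffing it against the next; a reboot in between resets the counters and will show up as a negative delta.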
Looking back at my notes, the output of journalctl -b should say something like, "Node 0: DRAM ECC enabled."
Then 'edac-ctl --status' should tell you that drivers are loaded.
Then you run 'edac-util -v' to report on what it has seen:
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
You can also use memtest86+ for this, although I don't recall if it requires specific configuration for ECC testing.
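If you want totals rather than the per-row listing from 'edac-util -v', the output can be summed with a few lines of Python. This is only a sketch: it assumes every count line looks like the ones quoted above ("<n> Corrected Errors" / "<n> Uncorrected Errors") and that no error is listed on more than one line; the function name is made up for illustration.

```python
import re

# Matches the "<n> Corrected Errors" / "<n> Uncorrected Errors" part of a line.
COUNT_RE = re.compile(r"(\d+)\s+(Corrected|Uncorrected)\s+Errors")


def sum_edac_util(output):
    """Sum the counts in 'edac-util -v' style text.

    Returns (corrected_total, uncorrected_total).
    Lines without a count (e.g. "No errors to report.") are ignored.
    """
    corrected = 0
    uncorrected = 0
    for line in output.splitlines():
        match = COUNT_RE.search(line)
        if match is None:
            continue
        value = int(match.group(1))
        if match.group(2) == "Corrected":
            corrected += value
        else:
            uncorrected += value
    return corrected, uncorrected


SAMPLE = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
edac-util: No errors to report.
"""

if __name__ == "__main__":
    print(sum_edac_util(SAMPLE))
```

To use it on a live system you would feed it the text captured from 'edac-util -v', e.g. via subprocess.run([...], capture_output=True, text=True).stdout.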
They aren't tested on it, so it's possible to get a dud, but it's a minuscule chance that isn't worth bothering about.
Now, to actual issues you can encounter: motherboards
The problem is that ECC means you need to have, iirc, 8 more data lines between the CPU and the memory module, which of course means more physical connections (don't remember how many right now). Those also need to be properly done and tested, and you might encounter a motherboard where that wasn't done. Not sure how common that is, unfortunately.
Another issue is motherboard firmware. Even though AMD supplies the memory init code, the configuration can be tweaked by the motherboard vendor, and they might simply break ECC support accidentally (even by something as simple as making a toggle default to false and then forgetting to expose it in the configuration menu).
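For where the "8 more" comes from: a conventional single-error-correct, double-error-detect (SECDED) Hamming code over a 64-bit word needs 8 check bits, which is why ECC modules are 72 bits wide instead of 64. The sketch below just does that arithmetic; it assumes a plain SECDED code, and the thread does not say which code the Ryzen memory controller actually uses.

```python
def secded_check_bits(data_bits):
    """Check bits needed by a SECDED Hamming code for data_bits of data.

    A single-error-correcting Hamming code needs the smallest r with
    2**r >= data_bits + r + 1; double-error detection adds one overall
    parity bit on top of that.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    data = 64
    check = secded_check_bits(data)
    print(f"{data} data bits + {check} check bits = {data + check} bits per word")
```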
Those are the two issues you can encounter.
The difference with Threadripper PRO and EPYC is that, AFAIK, AMD includes ECC in its test and certification programs for them, which kind of enforces support.
I think some Gigabyte boards are infamous for this in certain circles.
OTOH: Gigabyte might have a Threadripper PRO motherboard (WRX80 chipset) coming out in the future
PS C:\> wmic memphysical get memoryerrorcorrection
MemoryErrorCorrection
6
SuperUser has a convenient decoder[1], but modern systems will report "6" here if ECC is working.
When Windows detects a memory error, it will record it in the system event log, under the WHEA source. As a side note, this is also how memory errors within the CPU's caches are reported under Windows.
[1] https://superuser.com/questions/893560/how-do-i-tell-if-my-m...
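For reference, the number can also be decoded with a small lookup table. The values below are the ones I believe Microsoft documents for Win32_PhysicalMemoryArray.MemoryErrorCorrection; treat the table and the helper name as assumptions and check them against the linked decoder.

```python
# Assumed mapping for Win32_PhysicalMemoryArray.MemoryErrorCorrection.
MEMORY_ERROR_CORRECTION = {
    0: "Reserved",
    1: "Other",
    2: "Unknown",
    3: "None",
    4: "Parity",
    5: "Single-bit ECC",
    6: "Multi-bit ECC",
    7: "CRC",
}


def decode_memory_error_correction(wmic_output):
    """Find the first numeric line in wmic output and decode it.

    Returns (code, description), or None if no numeric line is present.
    """
    for line in wmic_output.splitlines():
        line = line.strip()
        if line.isdigit():
            code = int(line)
            return code, MEMORY_ERROR_CORRECTION.get(code, "Unrecognized")
    return None


if __name__ == "__main__":
    print(decode_memory_error_correction("MemoryErrorCorrection\n6\n"))
```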
*not officially, and the memory controller provides no report for 'fixed' errors.
Edit: as detaro mentioned in the reply, there is, and here's the source [0] -- that's what they mean by "RAS" on promotional pages [1]. That indeed looks like a nice option.
[0] https://www.amd.com/system/files/documents/updated-3000-fami...
[1] https://www.amd.com/en/products/embedded-epyc-3000-series
There are computers in the Intel NUC form factor with ECC support (e.g. with the Ryzen V2718), for example from ASRock Industrial.