- 1S/2S is obviously where the pie is. Few servers are 4S.
- 8 DDR4 channels per socket is twice the memory bandwidth of LGA 2011, and still more than LGA 3647 (or whatever the number was)
- First x86 server platform with SHA1/2 acceleration
- 128 PCIe lanes in a 1S system is unprecedented
All in all, Naples seems like a very interesting platform for throughput-intensive applications. Overall it seems that Sun, with its Niagara approach (massive number of threads, lots of I/O on-chip), was just a few years too early (and likely a few thousand dollars per system too expensive ;)
Yes, definitely drooling at this. Assuming a workload that doesn't eat too much CPU, this would make for a relatively cheap and hassle-free non-blocking 8 GPU @ 16x PCIe workstation. I wants one.
This one will be interesting. The current Ryzen (like most of the Intel desktop range) has two memory channels, but everyone has been benchmarking it against the i7-6900K because they both have eight cores. The i7-6900K is the workstation LGA 2011 part with four channels. If the workstation Ryzen has eight channels...
By comparison, SPARC's performance improved substantially moving from the T1 and T2 to the T3 and beyond. The T1 used a round-robin policy to issue instructions from the next active thread each cycle, supporting up to 8 fine-grained threads in total. That made it more like a barrel processor.
Starting with the T3, two of the threads could execute simultaneously. Then, starting with the T4, SPARC added dynamic threading and out-of-order execution. Later versions are even faster, and clock speeds have risen considerably.
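The round-robin issue policy described above can be sketched as a toy model: each cycle, the core issues one instruction from the next thread in a fixed rotation, skipping threads that have run dry. This is an illustrative simulation, not Sun's actual microarchitecture:

```python
from collections import deque

def barrel_issue(threads, cycles):
    """Toy model of round-robin (barrel) instruction issue.

    threads: dict mapping thread id -> list of instructions.
    Returns the issue order: one instruction from the next
    ready thread each cycle, skipping drained threads.
    """
    order = deque(sorted(threads))
    issued = []
    for _ in range(cycles):
        for _ in range(len(order)):
            tid = order[0]
            order.rotate(-1)  # advance the rotation
            if threads[tid]:
                issued.append((tid, threads[tid].pop(0)))
                break
        else:
            break  # every thread is drained
    return issued

issued = barrel_issue({0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1"]}, 10)
# interleaves threads 0, 1, 2 in turn until each runs out of work
```

The upside is that per-thread pipeline stalls are hidden behind the other threads' work, which is why the T1 favored throughput over single-thread latency.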
Every mainframe interface is basically an offload interface: "computers" DMAing and processing data to the CPs and to each other. Every I/O device has a command processor, so it can handle channel errors and integrated PCIe errors in a way PCs cannot.
A PC with Chelsio NICs doing TCP offload with direct data placement or RDMA, plus Fibre Channel storage, would be mini/mainframe-ish.
And AMD should dump SHA1 acceleration in the next generation.
The cost of having that on silicon is probably close to zero. If you think SHA1 is just going to magically disappear because you want it to, well, you'll be in for a SHA1-sized surprise. Our grandkids will still have SHA1 acceleration.
>ARMv8 has had it for like 2-3 years now...
Because ARM cores don't remotely have the CPU heft of an Intel x86-64 chip, ARM needs all this acceleration because it's typically used in very low-power mobile scenarios. On top of that, Intel claims AES-NI can be used to accelerate SHA1.
https://software.intel.com/en-us/articles/improving-the-perf...
https://en.wikipedia.org/wiki/Intel_SHA_extensions
>There are seven new SSE-based instructions, four supporting SHA-1 and three for SHA-256:
>SHA1RNDS4, SHA1NEXTE, SHA1MSG1, SHA1MSG2, SHA256RNDS2, SHA256MSG1, SHA256MSG2
The only processors so far with these extensions are low power Goldmont chips.
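For reference, the SHA1RNDS4/SHA256RNDS2-family instructions accelerate the round computations of the standard hashes, so the digests come out identical with or without the hardware path. A minimal sanity check using Python's `hashlib` and the well-known FIPS 180 test vectors for the message "abc" (whether a hardware SHA path is actually used depends on the CPU and the library build):

```python
import hashlib

# FIPS 180 test vectors for the message "abc"; the hardware
# instructions compute exactly these rounds when available.
assert hashlib.sha1(b"abc").hexdigest() == \
    "a9993e364706816aba3e25717850c26c9cd0d89d"
assert hashlib.sha256(b"abc").hexdigest() == \
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```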
At the previous job where I built the 64-core system, I even emailed the AMD marketing department to see if we could do some PR campaign together, but I think it was too soon before the Naples drop, because I never got a response. Here's to hoping Supermicro does a 4-CPU board for this... 128 cores would be amazing. (But I'll take 64 Naples cores as long as it gets rid of the bugs and issues I found with the Opterons.)
Very few of the operations used GPU. Things may have changed since I was working there, but the work at the time wasn't suited for a GPU architecture.
The initial step was sequence cleanup, which is a hidden Markov model executed over a collection of sequences of varying length, so hard to parallelize. Sequence annotation is embarrassingly parallel on a per-library basis (each sequence can be annotated independently of the others), but the computational work is fuzzy string matching, which is once again hard to GPU-ize. Another major computational job was contig assembly, which is somewhat parallelizable (pairwise sequence comparisons), but once again involves fuzzy string matching, so not GPU-izable.
So that's just sequence genetics. Don't know if GPUs are used in other areas.
Lots of cores, lots of threads, and lots of main memory. That was the key.
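The fuzzy string matching mentioned above is essentially dynamic programming over pairs of sequences. A minimal Levenshtein-distance sketch shows the shape of the workload (illustrative only; real sequence aligners add affine gap penalties, banding, and scoring matrices):

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance.

    Each cell depends on its three neighbors, and real aligners
    add data-dependent branching (gap penalties, early exits),
    which is part of what makes this awkward to map onto GPUs.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

edit_distance("kitten", "sitting")  # 3
```

On a many-core CPU you can simply run one pairwise comparison per thread, which is exactly the "lots of cores, lots of threads" profile described above.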
http://ce-publications.et.tudelft.nl/publications/1520_gpuac...
A quote from an AnandTech forum post [0] reads promising:
"850 points in Cinebench 15 at 30W is quite telling. Or not telling, but absolutely massive. Zeppelin can reach absolutely monstrous and unseen levels of efficiency, as long as it operates within its ideal frequency range."
A comparison against a Xeon D at 30W would be interesting.
The possibility of this monster maybe coming out sometime in the future is also quite nice: http://www.computermachines.org/joe/publications/pdfs/hpca20...
[0] https://forums.anandtech.com/threads/ryzen-strictly-technica...
With that said, I'm looking forward to these systems.
However, there is an important difference. AMD seems to be putting multiple dies into the same package, whereas Intel (as the Cluster-on-Die name implies) has everything on the same die. So my fear is that the interconnect between dies may not be fast enough to paper over our NUMA weaknesses.
"x U" (or "x HE", if you're talking to a German manufacturer, they like to make that mistake ;) are rack units, i.e. how large the case is.
Also, what about ECC? Does Ryzen support it or not?
The underperformance in gaming was tracked down to software issues according to AMD. Namely:
- bugs in the Windows process scheduler (scheduling 2 threads on the same core, and moving threads across CPU complexes, which loses all L3 cache data since each CCX has its own cache)
- buggy BIOSes accidentally disabling Boost or the High Performance mode (a feature that lets the processor adjust voltage and clock every 1 ms instead of every 40 ms)
- games containing Intel-optimized code
More info: http://wccftech.com/amd-ryzen-launch-aftermath-gaming-perfor...
Furthermore, hardcore gamers usually play at 1440p or higher, in which case there is no difference in performance between Intel and AMD, as demonstrated by many benchmarks (because the GPU is always the bottleneck at such high resolutions).
BTW they advertised it as good for gaming + streaming (h264 CPU encoding at the same time on the same machine). And "content creation", which pretty much always means video editing.
IIRC Ryzen supports unbuffered ECC if the mainboard supports it.
2. Compared to the desktop/Windows ecosystem, there is much more open-source software on the server side, along with the usual open-source compilers. That means any AMD Zen optimization will be far easier to deploy than for games and apps on the desktop coded and compiled with Intel's ICC.
3. The sweet spot for server memory is still 16GB DIMMs. 256GB of memory for your caching needs or in-memory database will now be much cheaper.
4. When are we going to get much cheaper 128GB DIMMs? Fitting 2TB of memory per socket, and 4TB per U, along with 128 lanes for NVMe SSD storage, the definition of Big Data just grew a little bigger.
5. Between now and 2020, the roadmap has Zen+ and 7nm, along with PCIe 4.0. I am very excited!
Yes, and it's rumored that the top-end 7nm chip will be 48 cores (codenamed Starship). Exciting times ahead now that the competition is back.
How will Naples fare on this front?
I'm glad I don't own any Intel stock atm :)
A high core count, energy efficient CPU with IO out the wazoo?
I'm happy I bought AMD stock over the summer (:
The main scalability issue I have with Postgres is its horrible layout of data pages on disk. You can't order rows to be laid out on disk according to the primary key. You can CLUSTER the table every now and then, but that's not really practical for most production loads.
I don't think there's been any work on it yet though.
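The cost of an unclustered layout can be sketched with a toy model: rows live on fixed-size pages, and a range scan over the primary key has to read every page that holds a matching row. Hypothetical numbers and layout, not Postgres internals:

```python
import random

PAGE_SIZE = 4  # rows per page (toy value)

def pages_touched(rows_in_disk_order, key_range):
    """Count distinct pages a range scan must read.

    rows_in_disk_order: primary keys in their on-disk order.
    key_range: (lo, hi) inclusive range of keys being scanned.
    """
    lo, hi = key_range
    return len({i // PAGE_SIZE
                for i, key in enumerate(rows_in_disk_order)
                if lo <= key <= hi})

keys = list(range(32))
clustered = keys                             # disk order follows the key
random.seed(0)
scattered = random.sample(keys, len(keys))   # heap-like insertion order

n_clustered = pages_touched(clustered, (8, 15))  # 2: the range is contiguous
n_scattered = pages_touched(scattered, (8, 15))  # typically many more pages
```

This is exactly why periodic CLUSTER helps range queries: it restores the contiguous layout, until new writes scatter the rows again.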
My guess is the 1-socket option scales great. 2 sockets are less than ideal, and you will not double the 1-socket performance.
I've seen benchmarks on the -hackers mailing list with 88 core Intel servers (4s 22c) in regard to eliminating bottlenecks when you have that many cores. So even if it's not 100% there yet, it will be soon.
https://semiaccurate.com/2016/11/17/intel-preferentially-off...
Basically, using multiple dies significantly increases latency between cores on different dies. This will affect performance. I will not judge till I see the benchmarks though :-)
http://wccftech.com/amd-exascale-heterogeneous-processor-ehp...
I'd like to have that in the old project quantum package: http://wccftech.com/amd-project-quantum-not-dead-zen-cpu-veg...
That would be a TFLOPS level supercomputer on your desk.
Well, not with HBM (which is DRAM), but huge amounts of L3 SRAM on an MCM... POWER5, I believe.
Though I am kind of worried concerning memory access. Latency penalties when accessing non-local memory are very high on Zen CPUs due to the multi-die architecture design.
Does that mean we will finally see some serious interest in shared-nothing designs and the like in the future?
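A shared-nothing layout in miniature: each shard owns a disjoint partition of the data and requests are routed by key, so no memory is shared across partitions (and on a multi-die part, each partition could in principle be pinned to its local NUMA node). A hypothetical sketch, not tied to any particular framework:

```python
class ShardedStore:
    """Toy shared-nothing key-value store: each shard owns its
    keys exclusively, so shards never touch each other's memory."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard(self, key):
        # Route by hash so every key has exactly one owner shard.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

store = ShardedStore(4)
store.put("naples", 32)
store.get("naples")  # 32
```

Because a key is only ever touched by its owning partition, cross-die memory traffic is limited to the request routing, which is the property that makes this design attractive when remote-memory latency is high.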
So the single-socket systems can have more PCIe lanes available, but the dual-socket systems have fewer per socket because some of those lanes are used for HyperTransport.
What I can't figure out is why Intel and AMD aren't using similar links (HyperTransport for AMD and QPI for Intel) to connect directly to GPUs in a cache-coherent way. These days the faster interconnects spend a decent fraction of their latency just getting across the PCIe bus twice.
So 100 Gbit networks, InfiniBand, GPUs, etc. could all take advantage of a lower-latency, cache-coherent interface, but it's not available.
I suspect mainly because QPI and HyperTransport are incompatible, and PCIe is good enough for the high-volume cases.