Comparative Benchmark of Arm, AMD, and Intel for Cloud-Native Workloads

(kinvolk.io)

65 points · blixtra · 6y ago · 22 comments

22 comments · 8 top-level

alain94040 · 6y ago · 4 in thread

"In multi-thread benchmarks of raw memory I/O we found a clear performance leader in Ampere’s eMAG, outperforming both AMD’s EPYC and Intel’s XEON CPUs by a factor of 6 or higher"

That doesn't sound right. Neither AMD nor Intel get more than a handful of GB/s in basic memory I/O? Any idea what could be wrong?

lettergram · 6y ago

Upfront they say:

"It should be noted that Kinvolk has ongoing cooperation with both Ampere Computing and Packet, and used all infrastructure used in our benchmarking free of charge. Ampere Computing furthermore sponsored the development of the control plane automation used to issue benchmark runs, and to collect resulting data points, and to produce charts."

I'm not saying anything was intentionally done, but optimizations were likely done on the Ampere side.

wmf · 6y ago

It doesn't sound remotely right to me either. It could be NUMA since the Intel and AMD systems are NUMA but the eMAG is not. The code for this benchmark appears to be https://github.com/akopytov/sysbench/blob/master/src/tests/m... which... is not an interesting way to benchmark a large server IMO. Running a single process with a lot of threads and a lot of RAM on a NUMA server is going to perform poorly (unless you do a lot of tuning which I don't recommend either). "Microservices" might run a lot faster.
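For anyone who wants to check the NUMA theory on their own box, here is a minimal sketch. It assumes Linux and the usual sysfs layout under /sys/devices/system/node; the function names are made up for illustration and have nothing to do with sysbench or the article's tooling.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty string, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU ids; returns {} if the sysfs tree is missing."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node1"
        node_id = int(node_dir[len("node"):])
        with open(path) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node{node_id}: {len(cpus)} CPUs")
```

More than one node in the output means a single many-threaded process can end up touching remote memory. Rerunning the benchmark pinned to one node (for example with numactl's --cpunodebind and --membind options) would be one way to test whether that explains the gap.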

lrem · 6y ago

Are you sure about that? I got the impression that should work well under Linux, unless you create a lot of contention.

trynumber9 · 6y ago

I ran sysbench memory on a dual-channel, six-core MacBook and it scores 18311.19 MiB/sec, higher than either of those x64 behemoths. Something seems off.
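To put that figure next to the "handful of GB/s" mentioned upthread, a quick unit-conversion sketch. The peak-bandwidth helper is the usual back-of-envelope formula (channels x transfers per second x bytes per transfer); the DDR4-2400 figure in the usage note is an assumption for illustration, since the comment does not say what memory the MacBook has.

```python
MIB = 2**20  # bytes per mebibyte
GB = 10**9   # bytes per decimal gigabyte


def mib_per_s_to_gb_per_s(mib_per_s):
    """Convert a sysbench-style MiB/sec figure to decimal GB/s."""
    return mib_per_s * MIB / GB


def peak_dram_gb_per_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical DRAM peak: channels x transfers/s x bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / GB


if __name__ == "__main__":
    print(f"measured: {mib_per_s_to_gb_per_s(18311.19):.1f} GB/s")
    # Hypothetical configuration, not stated in the thread:
    print(f"peak (2ch DDR4-2400): {peak_dram_gb_per_s(2, 2400):.1f} GB/s")
```

So 18311.19 MiB/sec is roughly 19.2 GB/s, and under the assumed dual-channel DDR4-2400 configuration the theoretical ceiling would be 38.4 GB/s.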

spamizbad · 6y ago · 3 in thread

It should be mentioned that the AMD CPUs featured are previous-generation (7401) parts.

wtallis · 6y ago

So are the Xeons. But AMD's latest generation was a bigger change than Intel's latest update.

floatboth · 6y ago

The current eMAG (Skylark), though it is current, is not very new either. It's a 16 nm design from last year. They wanted to launch Quicksilver in 2019, but there's like one month left.

ac29 · 6y ago

Sounds like it will be sampling this year: https://www.datacenterknowledge.com/hardware/ampere-gears-la...

andy_ppp · 6y ago · 3 in thread

Wouldn't most things need Hyperthreading off to be secure on Intel or is it fine if you have your own hardware?

sigio · 6y ago

That's only fine if you know all the code running in all parts (containers) on the same hardware node. Code running in one container can influence data or code in other containers (when some third party has a form of code execution).

loeg · 6y ago

Privilege-aware scheduling could colocate only same-container (or same-user, or same-process) threads on HT pairs.
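A toy sketch of the placement idea, not a real scheduler: hand out whole physical cores, so two tenants never share a pair of hyperthreads. The function name and the tenant/demand inputs are hypothetical. On Linux the sibling sets could be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, but here they are passed in directly.

```python
def assign_whole_cores(cores, demands):
    """Give each tenant whole physical cores only.

    cores:   list of tuples, each holding the sibling CPU ids of one core,
             e.g. [(0, 4), (1, 5)] for a 2-core, 4-thread part.
    demands: {tenant_name: number_of_threads_wanted}.
    Returns {tenant_name: [cpu ids]}. A tenant may get more CPUs than it
    asked for, because requests are rounded up to whole cores.
    """
    free = list(cores)
    placement = {}
    for tenant, n_threads in demands.items():
        cpus = []
        while len(cpus) < n_threads:
            if not free:
                raise RuntimeError(f"not enough free cores for {tenant!r}")
            cpus.extend(free.pop(0))  # take every sibling of the next core
        placement[tenant] = cpus
    return placement


if __name__ == "__main__":
    cores = [(0, 4), (1, 5), (2, 6), (3, 7)]
    print(assign_whole_cores(cores, {"container-a": 3, "container-b": 2}))
```

The cost is visible in the example: container-a asks for 3 threads and occupies 4, so some capacity is traded for isolation, though less than is lost by turning SMT off entirely.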

otakucode · 6y ago

Their tests disabled hyperthreading on Intel due to the security concerns and also on AMD on speculation that security concerns might arise in the future (if I read everything correctly).

baybal2 · 6y ago · 3 in thread

Very impressive perf on ARM's side, given it competes against decades of x86-specific optimisation in the code.

Intel, for example, has long intentionally kept float performance close to that of integers of the same size, so there was no perf difference in scripting languages that use floats internally for all computations.

ARM sucks at web benchmarks because ARM never put any emphasis on fp perf. Many ARM cores simply don't have fp units at all. The most popular JS VM, V8, does a lot of useless float-to-integer and back conversions under the hood, and that doesn't help either. They are almost free on x86, but degrade JS perf on smartphones by double digits.

Second, vector math and vector float math have close to no use in web loads, but a lot of devs still try to put SSE instructions everywhere simply because SSE is many times faster than simple math and many binary manipulations on x86.

ARM, on the other hand, is relatively good at doing a lot of ops on byte and double data, because it was historically never aimed at number crunching with extra-wide vector instructions.

For the same reason, ARM's UCS-2 and UTF-16 parsing performance is that bad. All kinds of parsers exploit fast register renaming on x86 to run tzcnt with very good perf, but they have to revert to relatively slow SIMD bitmasks on ARM. You can feel that a lot when you work with VMs/interpreters that use UCS-2 as their internal Unicode implementation.
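For readers unfamiliar with the pattern being described, a plain-Python sketch of the bitmask-plus-tzcnt loop that vectorised parsers typically use: compare a chunk against a byte of interest, collapse the result into a bitmask, then walk the set bits with count-trailing-zeros. It only illustrates the logic; it says nothing about how fast either architecture runs it, and the helper names are invented.

```python
def match_mask(chunk, needle):
    """Bit i is set when chunk[i] == needle (stand-in for SIMD compare + movemask)."""
    mask = 0
    for i, byte in enumerate(chunk):
        if byte == needle:
            mask |= 1 << i
    return mask


def tzcnt(x):
    """Index of the lowest set bit; x must be non-zero."""
    return (x & -x).bit_length() - 1


def match_positions(chunk, needle):
    """Offsets of every occurrence of needle, found by iterating over set bits."""
    mask = match_mask(chunk, needle)
    positions = []
    while mask:
        positions.append(tzcnt(mask))
        mask &= mask - 1  # clear the lowest set bit and continue
    return positions


if __name__ == "__main__":
    print(match_positions(b"a,b,,c", ord(",")))
```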

Hardware peripherals were always x86-optimised too. Yes, almost every device you can hook onto PCIe has been extensively optimised to work well with x86-style DMA, some higher-level APIs like I/O virtualisation and DMA offload engines, and assumptions about typical controller, memory, and cache latency.

Yes, even endianness conversion is there to make x86 jump ahead. Almost all "enterprise hardware" intentionally uses little endian in its protocols, to avoid endianness conversion on x86. Of course, that comes at the cost of doing it on big-endian machines, which include ARM.

P.S. On the other hand, nearly all peripheral ICs aimed at the embedded market prefer big endian for the opposite reason.

ComputerGuru · 6y ago

Modern ARM is bi-endian but is rarely run in big-endian mode.
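A small sketch of what the endianness conversion under discussion amounts to, using Python's standard struct module. The value 0x11223344 and the swap32 helper are just for illustration.

```python
import struct
import sys

VALUE = 0x11223344

# The same 32-bit value laid out in each byte order.
little = struct.pack("<I", VALUE)  # least significant byte first
big = struct.pack(">I", VALUE)     # most significant byte first


def swap32(x):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


if __name__ == "__main__":
    print("host byte order:", sys.byteorder)
    print("little:", little.hex(), "big:", big.hex())
    print("swapped:", hex(swap32(VALUE)))
```

On a host whose byte order matches the wire format, no swap is needed. That is why the byte order a CPU actually runs in matters here, and not merely which orders it is capable of.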

gnufx · 6y ago

ARM is doing OK with the current generation of HPC systems, and the post-K system, whose name I forget, should be rather impressive at floating point. SIMD width is not all that matters, after all. (Obviously this is v8 and up, which requires floating point.)

Taniwha · 6y ago

ARMs are usually run little-endian.

userbinator · 6y ago · 1 in thread

"In the memcopy benchmark, which is designed to stress both memory I/O as well as caches, Intel’s XEON shows the highest raw performance"

I am not surprised by that, given that x86 has a single instruction that will copy an arbitrary number of bytes in cacheline-sized chunks --- something that ARM does not have.

wmf · 6y ago

It seems like the bottleneck should be the memory hierarchy, not executing instructions. /RISC4EVER
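One rough way to see where a copy bottlenecks is to time it at different buffer sizes. The sketch below is not the benchmark used in the article; it times a same-size bytearray slice assignment, which in CPython is expected to boil down to a bulk memory copy, and reports the best of a few runs. Function name and defaults are arbitrary.

```python
import time


def copy_throughput_gib_s(size_bytes=64 * 2**20, repeats=5):
    """Best-of-N throughput (GiB/s) for copying one buffer into another."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-size slice assignment: a bulk copy
        best = min(best, time.perf_counter() - start)
    return size_bytes / best / 2**30


if __name__ == "__main__":
    # Small buffers tend to stay in cache; large ones spill to DRAM.
    for size in (256 * 2**10, 8 * 2**20, 256 * 2**20):
        print(f"{size / 2**20:8.2f} MiB: {copy_throughput_gib_s(size):.1f} GiB/s")
```

If throughput drops sharply once the buffer no longer fits in cache, the memory hierarchy is the limit and not the copy instruction.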

sanxiyn · 6y ago

If you are interested in development workload (ARM porting) instead of "cloud-native" workload, I did one here: https://github.com/sanxiyn/blog/blob/master/posts/2019-11-12...

In addition to Packet, both AWS and Scaleway were also benchmarked.

NicoJuicy · 6y ago

And for the Worldwide LHC Computing Grid, the Power8 came out on top.

Dutch article:

https://tweakers.net/reviews/7426/datavloedgolf-lhc-op-komst...

fhcoso · 6y ago

I'm very impatient to see RISC-V arrive and to see how it does on performance and security. Don't forget to disable a lot of Intel features, such as SMT/Hyper-Threading, if you want a fully secure environment.

As for ARM, Cloudflare uses it: https://blog.cloudflare.com/arm-takes-wing/
