Amazon's Arm-based Graviton2 Against AMD and Intel: Comparing Cloud Compute
by Andrei Frumusanu on March 10, 2020 8:30 AM EST- Posted in
- Servers
- CPUs
- Cloud Computing
- Amazon
- AWS
- Neoverse N1
- Graviton2
SPEC - MT Performance (4xlarge 16 vCPU)
The 64-core results were quite interesting and put the Graviton2 in a very competitive performance position, but all this talk about performance scaling varying depending on the loaded core count of the system made me wonder how the EC2 instances would perform at lower vCPU counts.
I fired up the same tests, just this time around with only rate-16 to match the number of vCPUs. These are 4xlarge EC2 instances with corresponding 16 vCPUs, but there’s one large caveat in this comparison that we must keep in mind: The Graviton2 instances very likely have no neighbours at this point in time in the test preview, meaning the performance scaling we’re seeing here is very much a best-case scenario for the Amazon chip. EC2 global capacity floats around at 60% active usage, and I imagine Amazon distributes this horizontally across the available sockets in their datacentres. How these performance figures will look like in the real world once Graviton2 ramps up in public availability is anybody’s guess.
The AMD system likely won’t care too much about such scenarios as their NUMA nature means they’re isolated from noisy neighbours anyhow, and we’re just seeing use of a single 8-core chip with its own memory controllers, but the Intel system will have possibly some neighbours doing some activity on the same socket and shared resources. I only ran one test run here; you’d probably need a lot of data to get a representative figure across EC2 usage.
For the Intel m5n instances, using an 4xlarge instance actually means you're only on on single socket this time around, meaning that the scaling behaviour in favour of higher per-thread performance isn't to be expected as high as on the Graviton2 system, as system DRAM bandwidth and L3 is halved compared to the 16xlarge figures on the previous page.
Also, since we’re testing 16 vCPU setups here, we can have an apples-to-apples comparison between the first- and second-generation Graviton systems which should be a fun comparison.
The comparison between the two generations of Graviton processors here is also astounding. Memory intensive workloads favour the newer Graviton2 by at least a factor of 2x, more often 3x, 4x, 5x and even up to 7x in libquantum.
The AMD system as expected doesn’t gain much scaling from using less cores as there’s no more shared resources available on a per-thread basis. The Intel chip fares slightly better per-thread, but doesn’t see the same higher performance scaling (Or should I say, reverse-scaling) as achieved by the Graviton2.
In fp2006, we see more or less the same kind of results.
Overall, in the 16-vCPU rate results the Graviton2 surpasses the performance advantage it showcased in the 64-core results, ending up with an even bigger margin.
The SPEC2017 results again show the same conclusion – the Graviton2 really gains a ton of per-thread performance through the ability to use more of the chip’s L3 cache and 8 memory channels. Whilst on the 64-rate results the Graviton2 and the Xeon were neck-in-neck in fp2017, here the Graviton ends up with a 44% performance advantage.
Again, I can’t put enough emphasis on this, but these results are a best-case scenario for the 4xlarge 16vCPU results of the Graviton2. If production instances are able to achieve such figures will very largely depend on the draw of luck on whether you’re going to be alone on the physical hardware or whether you’ll have any neighbours on the chip. And even if you have neighbours, the performance figures will largely depend on what kind of workloads they will be running alongside your use-cases.
I saw a few articles out there comparing the performance between the m6g instances against the m5 generation instances (Skylake-SP hardware), but most of these tests were done only on medium (1 vCPU) to xlarge (4 vCPUs). When reading such pieces, it’s naturally important to keep in mind the vast scaling advantage the Graviton2 chip has – the smaller your instance is the more chance you’ll have noisy neighbours on the hardware, something that currently just doesn’t happen in the Graviton2’s preview phase.
96 Comments
View All Comments
jbrower - Saturday, July 24, 2021 - link
Well at least you have a troll -- mark of success for authors, heheProDigit - Wednesday, March 11, 2020 - link
110W is very pessimistic, and would make no sense at all, considering that the ryzen 9 3900x uses 105W at 12 cores 24 threads at 4.6Ghz and 7nm, and the 3950 does the same with 4 more cores.Plus, regular arm based (AMLogic) boxes use 3Watt in total under load (that includes CPU+Ethernet+RAM+Emmc) for 4 CPU cores running at 1,9Ghz.
If you ask me, 64 core arm CPUs running at 2Ghz should run at around just over 1 watt per core, making it a 65W tdp chip
Andrei Frumusanu - Wednesday, March 11, 2020 - link
There's 64 PCIe4 lanes and 8 memory controllers in there as well.cdome - Wednesday, March 11, 2020 - link
Quick question. Does Graviton2 have support for SVE2 vector extension? if yes how wide are execution units? thank youAndrei Frumusanu - Wednesday, March 11, 2020 - link
No, there's 2x128b v8 ASIMD/NEON pipes.Soulkeeper - Wednesday, March 11, 2020 - link
What was used to generate the images on page 2 ?ie: https://images.anandtech.com/doci/15578/AMD-Epyc-6...
Is this app/source available to download ?
Thanks
sharath.naik - Wednesday, March 11, 2020 - link
Whats behind the name Annapurna? The name is Indian in origin but the company is Israeli.nijimon - Thursday, March 12, 2020 - link
Judging by the logo it could be referring to the massif in the Himalayas.https://en.wikipedia.org/wiki/Annapurna_Massif
Andy Chow - Thursday, March 12, 2020 - link
"I recently had the time to write a new custom microbenchmark for testing synchronisation latencies of CPU cores, exhibiting some of the cache-coherency as well as physical layouts of current designs."Wow, and what a benchmark that turned out to be. Please consider packaging it and releasing it. Or giving us the code so we can run it. I would really love to run that test on a few of my machines. I am frustrated with current benchmarks on this area also, and you seem to have built the perfect solution.
ballsystemlord - Thursday, March 12, 2020 - link
1 Grammar error:"Overall, it's a bit odd to see GCC ahead in that many workloads given that LLVM the is the primary compiler for billions of Arm devices in the mobile space."
Extra "the":
"Overall, it's a bit odd to see GCC ahead in that many workloads given that LLVM is the primary compiler for billions of Arm devices in the mobile space."