AMD’s 7nm Vega 20 GPU die. Image credit: Anandtech
On November 6th, AMD held an event dubbed “Next Horizon,” during which they formally announced the next generation of EPYC “Rome” high-performance CPUs and Radeon Instinct machine learning/AI GPUs for the data center. These chips are manufactured on TSMC’s bleeding edge 7 nanometer fabrication process, said to deliver 2x the density and a 50% reduction in power consumption versus the currently used 14nm LPP node from GlobalFoundries. The day prior, Intel made somewhat of an attempt to upstage AMD, announcing its Cascade Lake-AP server CPUs, still manufactured on 14nm. Based on the specifications of the upcoming chips, AMD appears poised to take significant share from Intel in the data center market next year.
First, let’s dive into the known specifications of the EPYC chips:
|CPU|AMD EPYC “Rome”|Intel Xeon “Cascade Lake-AP”|AMD EPYC 7601 (“Naples”)|Intel Xeon Platinum 8180M “Skylake-SP”|
|---|---|---|---|---|
|Node/uArch|7nm Zen 2|14nm++ Cascade Lake|14nm Zen|14nm+ Skylake-SP|
|Cores/Threads|64/128|48/96|32/64|28/56|
|Base/Boost Clock|1.8GHz (ES) / ?|?|2.2GHz / 3.2GHz|2.5GHz / 3.8GHz|
|Memory Controller|Octa-Channel DDR4 (3200?)|12-Channel DDR4|Octa-Channel DDR4-2666|Hexa-Channel DDR4-2666|
|Max Memory|Up to 4TB per socket|Up to 3TB per socket|Up to 2TB per socket|Up to 1.5TB per socket|
|I/O|128x PCI-E 4.0|96x PCI-E 3.0|128x PCI-E 3.0|48x PCI-E 3.0|
|Socket|Socket SP3 (LGA 4094) (<2P)|BGA 5908 (<2P)|Socket SP3 (LGA 4094) (<2P)|LGA 3647 (<8P)|
The first thing to note is the sheer size of this monstrosity: 64 cores, 128 threads, and 160MB of combined L2 + L3 cache. AMD achieved this in part by moving to an even more modular architecture than the original EPYC’s, coupled with a new version of Infinity Fabric that reduces latency and increases bandwidth.
Image credit: Tom’s Hardware
The new EPYC incorporates 8 tiny CPU dies, each containing a complex of 8 cores, tied together through a massive I/O die which is still manufactured on the 14nm node. Components like I/O controllers don’t shrink well to smaller nodes, and the benefits of doing so are negligible. This approach allows AMD to significantly improve yields and production costs, and to greatly alleviate the problems caused by non-uniform memory access: high latencies between dies and multiple hops for data through Infinity Fabric.
The I/O die contains the PCI-E controller, the memory controller, and likely an L4 cache (although this remains unconfirmed). This eliminates NUMA and non-uniform memory latency, ensuring that only one hop to the I/O die is ever necessary and allowing the chip to behave like a true single-socket part. The L4 cache, if implemented, would be fully inclusive of the L3 (which is already inclusive of the L2), meaning any data that would otherwise have to be pulled from another die’s cache would already be present in the I/O die, a dramatic improvement over “Naples”‘ wildly varying cache latency.
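To illustrate why this matters, here is a toy Python model, with purely hypothetical latency figures chosen for illustration only (not measurements), contrasting “Naples”‘ four-node topology, where an access may land behind the local die or a remote one, with “Rome”’s uniform hop through the central I/O die:

```python
# Toy model (hypothetical latencies, NOT measured figures) contrasting
# "Naples"' four-NUMA-node layout with "Rome"'s uniform I/O-die hop.
import random

random.seed(0)

LOCAL_NS, REMOTE_NS = 90, 140  # assumed "Naples" local/remote-die latencies
UNIFORM_NS = 105               # assumed "Rome" latency, identical for every access

def naples_access() -> int:
    # With pages placed randomly across 4 dies, only 1 in 4 accesses is local.
    return LOCAL_NS if random.random() < 0.25 else REMOTE_NS

samples = [naples_access() for _ in range(100_000)]
print(f"Naples: min {min(samples)}ns, max {max(samples)}ns, "
      f"avg {sum(samples) / len(samples):.0f}ns")
print(f"Rome:   {UNIFORM_NS}ns for every access")
```

The point is not the absolute numbers but the variance: on “Naples” software had to be NUMA-aware to avoid the remote-die penalty, while the I/O-die design makes every access cost the same.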
“Rome” is also the first x86 CPU to implement the PCI-E 4.0 specification, doubling bandwidth for peripherals like graphics cards to 64GB/s bidirectional. It also boasts new Infinity Fabric Links, offering 200GB/s of bidirectional bandwidth to compatible Radeon Pro/Instinct GPUs as well as between CPUs in a dual-socket configuration. This puts it miles ahead of Intel’s current Xeon offerings, which provide only 48 PCI-E 3.0 lanes per socket. Unlike EPYC, where dual-socket configurations repurpose half of each CPU’s PCI-E lanes for inter-socket communication (leaving the total lane count unchanged), Intel’s lane count is unaffected in multi-socket configurations, so a system with two Xeons supports up to 96 PCI-E 3.0 lanes. Even so, that falls far short of EPYC, with less than half the total I/O bandwidth.
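The quoted 64GB/s figure can be sanity-checked from the PCI-E signaling rates: 4.0 runs at 16 GT/s per lane with 128b/130b encoding, which works out to roughly 31.5 GB/s per direction for an x16 link, about 63 GB/s bidirectional (marketing rounds up):

```python
# Usable PCI-E bandwidth from raw signaling rate and line-encoding efficiency.
# PCI-E 3.0 and 4.0 both use 128b/130b encoding; the rate doubles per generation.
GENS = {
    "3.0": (8.0, 128 / 130),   # 8 GT/s per lane
    "4.0": (16.0, 128 / 130),  # 16 GT/s per lane
}

def pcie_bandwidth_gbs(gen: str, lanes: int) -> float:
    """Usable bandwidth in GB/s, per direction."""
    gts, eff = GENS[gen]
    return gts * eff * lanes / 8  # GT/s -> GB/s (8 bits per byte)

if __name__ == "__main__":
    x16 = pcie_bandwidth_gbs("4.0", 16)
    print(f"x16 PCI-E 4.0: {x16:.1f} GB/s per direction, "
          f"{2 * x16:.0f} GB/s bidirectional")
```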
AMD didn’t reveal clock speeds or exact performance figures, but a footnote in their press release suggested a 29% IPC improvement over “Naples”. Even if this is a best-case scenario and typical workloads see only half that improvement, it is nonetheless very impressive. They discussed numerous architectural improvements, such as an improved front end and branch predictor, lower latencies, and a doubling of the FPU width to 256 bits. This means they’re tackling the key weaknesses of their previous lineup versus Intel’s CPUs, chiefly workloads that are latency-sensitive or that utilize 256-bit AVX.
At the event, they demonstrated one 64-core “Rome” CPU being benchmarked in C-Ray against two 28-core Xeon Platinum 8180M CPUs (the top of the line from Intel, costing $13000 each) in a dual-socket config. The EPYC machine finished the benchmark 7% quicker than the Xeons. Furthermore, AMD hinted that power consumption would stay the same with “Rome” (180W TDP), whereas the two Xeons have a combined TDP of 410W and also require a chipset which consumes about 20W. If this is even remotely indicative of typical performance, this chip will put Intel in the toughest competitive position it’s been in since 2005.
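For a rough sense of the efficiency gap implied by those demo figures, here is a back-of-the-envelope estimate (not a measured result, and C-Ray is only one benchmark):

```python
# Back-of-the-envelope perf/W from the C-Ray demo figures quoted above.
# Scores are normalized: the dual-Xeon system = 1.0; Rome finished ~7% quicker.
rome_perf, rome_watts = 1.07, 180           # one 64-core "Rome", 180W TDP
xeon_perf, xeon_watts = 1.00, 2 * 205 + 20  # two 8180M (205W each) + ~20W chipset

rome_eff = rome_perf / rome_watts
xeon_eff = xeon_perf / xeon_watts
print(f"Rome perf/W advantage: {rome_eff / xeon_eff:.2f}x")
```

Even granting generous error bars, a roughly 2.5x perf/W gap is the kind of difference data-center buyers notice, since power and cooling dominate operating costs.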
Intel’s response, which will probably arrive at least a quarter or two later than “Rome” if their recent antics are anything to go by, is a 48-core Cascade Lake-AP CPU. Last year, Intel famously berated AMD for using 4 “glued together” dies in “Naples”; now, this CPU will utilize two of Intel’s 28-core dies on an MCM package, migrated to the more refined 14nm++ node. Note that 4 cores have been disabled on each die, likely to rein in what will already be excessive heat and power consumption. As each die supports 6 memory channels, the CPU will support 12 channels of DDR4 memory. Intel announced no other concrete information, leading this author to believe the product is nowhere near launch, much like their 28-core desktop CPU announced 5 months ago which is still nowhere in sight.
Regardless of when it comes out, Intel’s CPU will likely have a hard time competing. It’s still manufactured on 14nm, using two gigantic 698mm^2 dies (“Rome” uses dies in the 70mm^2 range), and expectations point to a TDP of 300W or higher, compared to 180W for EPYC. Moreover, based on the available data, it looks like “Rome” will have equal, if not measurably better, per-clock and per-core performance compared to Intel’s aging “-Lake” architecture. With nearly double the power consumption, 3/4 the cores, a far higher price tag, and less than half the I/O bandwidth, it’s hard to see the appeal of this part next to EPYC. The main thing Intel has going for it right now is its reputation: it’s thoroughly established and entrenched in the data center, with a reliable track record and countless existing contracts, whereas AMD has been absent from this market for the past few years. “Rome” might just be enough to make many large customers change platforms, though.
AMD made no announcements regarding desktop parts, but rest assured that Zen 2-based chips will be coming to Socket AM4 in 2019. Beyond that is the realm of rumor and speculation.
The Radeon Instinct chips are a lot less exciting, so I won’t cover them in as much depth. There are two models: the MI60, featuring 64 compute units and a TDP of 300W, and the MI50, featuring 60 compute units and a TDP of 150W. The chip is a die shrink of the current 14nm “Vega 10” silicon to 7nm, dubbed “Vega 20”, with enhancements such as dedicated INT8/INT4 hardware delivering 59/118 TOPS respectively and a bump to a 1:2 FP64:FP32 ratio from Vega 10’s 1:16. Otherwise, the architecture is the same. The core config of 4096 SP, 256 TMU, and 64 ROP is retained, while the number of HBM stacks is doubled (running at 2000MHz effective) for a total of 16GB/32GB (MI50/MI60) and 1TB/s of bandwidth.

These cards are marketed for AI and machine learning, but they are arguably even better suited to scientific and HPC workloads thanks to the FP64 units. There they will compete with NVIDIA’s Tesla V100 PCI-E card, which offers similar theoretical performance on the 12nm node. For machine learning, NVIDIA’s Tesla T4 (also 12nm) should offer superior performance at a fraction of the power draw (75W vs 150W/300W for Radeon Instinct). These cards are not being marketed to gamers, and for gaming workloads we would not expect them to exceed 1080 Ti performance. We are eagerly awaiting AMD’s “next gen” graphics architecture, coming ~2020-2021, as GCN (which is approaching its 8th birthday) is simply not exciting anymore and does not compete strongly. However, AMD will likely compete on price.
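The 1TB/s figure checks out against the HBM2 configuration described above: four stacks, each with a 1024-bit interface, transferring at 2000MT/s effective:

```python
# Sanity check of the quoted 1TB/s memory bandwidth: four HBM2 stacks,
# 1024-bit interface per stack, 2000 MT/s effective transfer rate.
stacks = 4
bus_bits_per_stack = 1024
mt_per_s = 2000  # effective transfers per second, in millions

bytes_per_transfer = stacks * bus_bits_per_stack / 8  # full bus width in bytes
bandwidth_gbs = bytes_per_transfer * mt_per_s / 1000  # MB/s -> GB/s
print(f"{bandwidth_gbs:.0f} GB/s")  # 1024 GB/s, i.e. ~1 TB/s
```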
But back to the exciting stuff: the CPUs. Is it over? Is Intel finished? How will 7nm Ryzen materialize? Let us know what you think in the comments.