Intel Brings Chiplets to Data Center CPUs

By Sally Ward-Foxton

Sapphire Rapids CPUs are split into four dies linked via Intel's EMIB technology.

Intel Corp.’s fourth-generation Xeon processor, codenamed Sapphire Rapids, consists of four chiplets, the company revealed during its Architecture Day event.

This marks the first time Intel has integrated chiplets into its Xeon data center CPU line, having previously deployed the approach in its Stratix 10 FPGA line. Stratix 10 was the first product to incorporate Intel's advanced packaging technology, the embedded multi-die interconnect bridge (EMIB), which uses small silicon bridges embedded in the package substrate to connect dies. Previously, Intel combined its CPUs with an AMD GPU in a product called Kaby Lake-G, which did not use EMIB. Intel has also developed a vertical die-stacking technology called Foveros (used in its Lakefield line), but confirmed this week that it will not be used in Sapphire Rapids.

Nevine Nassif (Source: Intel)

In a pre-Architecture Day interview with EE Times, Intel’s chief engineer for Sapphire Rapids, Nevine Nassif, cited cost and yield advantages enabled by EMIB as the impetus behind the chiplet strategy.

“We started looking at this a long time ago, 15-16 years ago, but what we didn’t have at that time was EMIB technology,” Nassif said. “In the past, whenever we tried breaking a design into two dies, the overhead area you needed for the [interconnect] was too big – that power between the dies was too big, and the latencies would kill performance.”

EMIB has enabled Intel to scale down to a 55-micron bump pitch, shrinking the area overhead of the die-to-die interconnect while reducing latency to manageable levels.

Sapphire Rapids' chiplet development began when Intel engineers sought to slice a die in two or four to accelerate the availability of prototype silicon, particularly given the yields of its process technology at the prototype stage (Sapphire Rapids is built on the Intel 7 process).

After problems at the previous process node delayed Sapphire Rapids' predecessor, Ice Lake, the chiplet approach emerged as an attractive option.

“Later in the process, we wanted to come back to something more monolithic… [but] as time went on and things got delayed, in order to stay competitive, we wanted more cores,” Nassif explained. Chiplets “turned out to be a good way to go beyond the reticle limits. We had to break the die into at least two to fit in the number of cores we wanted.”

Modular die fabric

The technology used to connect the dies is called modular die fabric (MDF), and it carries the full bandwidth of the mesh between dies. All four dies are connected by one mesh, preserving the monolithic properties of the design: any core can talk to cores on any of the four dies and access the shared cache and I/O across all four quadrants.

Aside from being mirrored pairs, Sapphire Rapids' four chiplets are identical. (Image: Intel)

“One core is not restricted to its quadrant, though it’ll see better performance of course, because things are nearby, but everything is available to every core,” Nassif said. “We always have our two and four socket systems that go across UPI [Intel Ultra-Path Interconnect], so we can talk via UPI to a core on the other socket.”

With Stratix 10, EMIB was used to marry heterogeneous dies (one FPGA die plus four transceiver dies plus two HBM chiplets). For simplicity, the four Sapphire Rapids dies are identical, aside from being mirrored pairs.

“Since this was our first foray into this kind of thing, we just went with something simpler – it wasn’t simple – but simpler,” noted Nassif. “From a design perspective, they’re identical, so that we only had to validate that one die that was working and then validate the interfaces going from die to die. After that, we were pretty certain that we could get this to work.”

Future generations of data center CPUs may use heterogeneous dies in package, but not Sapphire Rapids.

Development challenges

Maintaining a four-die design that looks as much as possible like a monolithic die caused some development headaches, Nassif acknowledged.

Ideally, Intel wanted to run the die-to-die links at four times the mesh clock to reduce the silicon area required, but the bit-error rate was too high, and that configuration would have required error-correction circuitry that increased latency. Sending data at the mesh speed, meanwhile, was too conservative, carrying too much area overhead.

Sapphire Rapids is the first generation of Xeon data center CPUs to make use of chiplet technology (Image: Intel)

“We had to try to find a way to minimize the impact to latency, minimize area overhead, minimize power overhead and make sure we didn’t have errors from die to die,” she said. “We had to find the sweet spot.”

Voltage for die-to-die communication also had to be optimized to reduce power consumption while maintaining signal integrity.

Data bandwidth between dies also had to be maintained to keep up with the mesh. Designers studied the data patterns of different workloads to balance bandwidth against power consumption.

“We tried to do smart things around how to send the data across so that we wouldn’t be burning a ton of power,” Nassif said. “We also watched the data as it went across to make sure that we would pull back in certain cases or push harder in other cases, to make sure that we weren’t burning unnecessary power as we send the data across.”

Different data transfer modes optimized for certain workloads have been around since Skylake (2015). Sapphire Rapids defaults to Quad mode, in which traffic is confined to one die, but customers can switch to other modes at the expense of power consumption.
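
Modes like these are typically visible to system software as NUMA topology. The short C sketch below, using libnuma (link with -lnuma), simply reports how many nodes the operating system sees and keeps allocations local; treating each Sapphire Rapids quadrant as its own NUMA node in a given mode is an assumption made here for illustration, not a detail Intel confirmed.

```c
// Illustrative only: report the NUMA topology the OS exposes and keep
// memory allocations local. Whether each Sapphire Rapids quadrant
// appears as a separate NUMA node depends on the clustering mode;
// that mapping is assumed here, not confirmed by Intel.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        puts("libnuma: NUMA not available on this system");
        return 1;
    }
    printf("OS-visible NUMA nodes: %d\n", numa_max_node() + 1);

    // Allocate from the local node so traffic stays nearby --
    // analogous in spirit to Quad mode's locality.
    numa_set_localalloc();
    return 0;
}
```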

Chiplet future

While Nassif confirmed that Sapphire Rapids CPUs will only have four chiplets, future generations may well have more.

“There’s nothing constraining us to four die, and as we were working on Sapphire Rapids at different times, we considered four, six, or eight,” she said. “There’s nothing in the technology itself that constrained us to four. It had more to do with what we had available to us, and the risk we wanted to take on the first generation.”

It comes down to balancing the performance advantages of being on the same die against reticle limitations and yield. The sweet spot seems to be a die of roughly 400 mm².

Given the way the Sapphire Rapids dies are connected, however, this particular design could only ever use four dies.

“We are looking at ways to make things modular, so I could, say, run with only two die for certain products, then maybe add another die when I want to increase cores or I/Os,” Nassif said. “We want to make things more modular in terms of how you mix and match the die. It’s not something we will have in [the current] generation, it’s something we’re looking at for the future.”

Memory and I/O

DDR5 support is also new with Sapphire Rapids, pushing memory frequencies higher than Ice Lake's. The latest Xeon processor has a DDR5 memory controller on each die, supporting eight channels in total (two per die).

Future versions will support four stacks of high-bandwidth memory (HBM). Those iterations will require an HBM controller on each die, connecting to the HBM via EMIB. On those versions, the CPU dies will be slightly different from the DDR5-only versions: some accelerators for cryptography, compression and other tasks will be bumped out to make room for the HBM controller (though the new data streaming accelerator, DSA, which moves data between memory and I/O without tying up the CPU cores, will be retained). There are also some changes to the mesh to support HBM bandwidth, Nassif said. HBM-enabled Sapphire Rapids dies will retain their DDR5 controllers, so DDR5 and HBM can be used together.

Ice Lake's delay led Intel to introduce PCIe Gen5 in Sapphire Rapids, along with more PCIe lanes. The lanes support the new Compute Express Link (CXL) protocol, which in turn supports coherent memory transactions. CXL is also used for accelerator and memory expansion in data centers. While PCIe lanes can be bifurcated, current-generation CXL lanes cannot, though Nassif said they will be in future products.

Golden Cove core

Sapphire Rapids CPUs will use Intel's new Golden Cove core architecture, specifically the P-core version optimized for performance. Client-side processors like Alder Lake will combine P-cores with E-cores, a version optimized for power efficiency. Sapphire Rapids won't use E-cores, but future generations of Xeon might.

The Golden Cove core featured in Sapphire Rapids is optimized for performance rather than power efficiency (Image: Intel)

The Golden Cove core includes the next generation of Intel's AI acceleration technology, DL Boost. In this version, the chipmaker added instructions for matrix multiplication, a common mathematical operation in AI workloads. The new instructions, called advanced matrix extensions (AMX), come in addition to the advanced vector extensions (AVX-512) instructions previously added to DL Boost.

There are two components in the AMX architecture:

  • Tiles, a new state component consisting of eight two-dimensional registers, each 1 KB in size. The register file supports basic data operations such as load, store and clear; more complex operations are performed by coprocessors that operate on tiles. The tile state is OS-managed, which required new extensions to the XSAVE architecture.
  • TMUL (tile matrix multiply unit), billed as the first coprocessor attached to the tiles. This systolic array supports flavors of INT8 with 32-bit accumulation and BFloat16 with single-precision accumulation (a minimal code sketch follows below).
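
For developers, AMX is exposed through compiler intrinsics in immintrin.h. The minimal sketch below, assuming a Linux x86-64 system and GCC or Clang with AMX enabled (-mamx-tile -mamx-int8), configures three tiles and performs one INT8 tile multiply-accumulate. The tile-configuration layout and permission syscall are standard AMX usage; the specific matrix shapes and the pre-packed layout of B are illustrative choices rather than details from the article.

```c
// Minimal AMX sketch (illustrative dimensions): C[16][16] += A * B,
// INT8 inputs with INT32 accumulation via the TMUL unit.
// Build: gcc -O2 -mamx-tile -mamx-int8 amx_demo.c
#include <immintrin.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

// 64-byte tile configuration consumed by _tile_loadconfig().
typedef struct {
    uint8_t  palette_id;   // palette 1: eight 1-KB tile registers
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];    // bytes per row, per tile register
    uint8_t  rows[16];     // rows, per tile register
} __attribute__((packed)) tile_cfg;

static int8_t  A[16][64];  // 16x64 INT8 operand
static int8_t  B[16][64];  // B pre-packed in the 4-byte VNNI layout
static int32_t C[16][16];  // 16x16 INT32 accumulator

int main(void) {
    // Linux requires user space to request tile-data permission first:
    // 0x1023 = ARCH_REQ_XCOMP_PERM, 18 = XFEATURE_XTILEDATA.
    if (syscall(SYS_arch_prctl, 0x1023, 18))
        return 1;

    tile_cfg cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tile 0: C (16 x 16 INT32)
    cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tile 1: A (16 x 64 INT8)
    cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tile 2: B (packed)
    _tile_loadconfig(&cfg);

    _tile_zero(0);               // clear the accumulator tile
    _tile_loadd(1, A, 64);       // load A; stride is bytes per row
    _tile_loadd(2, B, 64);       // load pre-packed B
    _tile_dpbssd(0, 1, 2);       // signed INT8 dot products into INT32
    _tile_stored(0, C, 64);      // write the result tile back to memory

    _tile_release();             // hand the OS-managed tile state back
    return 0;
}
```

In the two-component split described above, the loads, stores and _tile_zero come from the tile register file, while _tile_dpbssd is the TMUL coprocessor at work; the BFloat16 flavor has an equivalent intrinsic, _tile_dpbf16ps.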

Intel's matrix-multiplication micro-benchmarks showed that acceleration with AMX was 7.8 times faster than using AVX-512 vector neural network instructions alone. AMX peak compute throughput in the current implementation of Sapphire Rapids is 2,000 INT8 operations per cycle per core (eight times higher than AVX-512) or 1,000 BFloat16 operations per cycle per core.
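
For a sense of scale, that per-cycle figure converts to peak throughput in the usual way; in the quick calculation below, the clock frequency and core count are placeholder assumptions, since the article specifies neither.

```c
// Back-of-the-envelope peak INT8 throughput from the per-cycle figure
// above. The frequency and core count are assumed for illustration.
#include <stdio.h>

int main(void) {
    const double ops_per_cycle = 2000.0; // AMX INT8 ops/cycle/core (article)
    const double freq_ghz = 2.0;         // assumed sustained clock
    const int cores = 56;                // assumed core count
    // giga-ops/s per core times cores, divided by 1,000 to get TOPS
    printf("Peak INT8: %.0f TOPS\n", ops_per_cycle * freq_ghz * cores / 1e3);
    return 0;
}
```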

This article was originally published on EE Times.

Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EE Times Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master's degree in Electrical and Electronic Engineering from the University of Cambridge.

 
