Next Exascale System Powered by AMD Chips

Article By: Rick Merritt

Frontier, the second of three U.S. exascale supercomputers, will use AMD CPUs and GPUs, claiming a performance lead the company hasn't had since 2008

AMD snagged a design win for next-generation CPUs, GPUs, and interconnects in Frontier, the second of three U.S. exascale-class supercomputers. The total contract, valued at more than $600 million, is the largest deal to date for AMD, system integrator Cray, and the U.S. Department of Energy.

When installed sometime in 2021, Frontier is expected to deliver more than 1.5 exaflops, making it slightly higher in performance than Aurora, the first U.S. exascale system being built by Intel and Cray. A third system, dubbed El Capitan, is expected to be awarded to the team of IBM and Nvidia, who built Summit and Sierra, the current leading supercomputers.

The deal is a landmark for AMD’s renewed focus on high-performance chips. To date, Intel has held as much as 95% of the CPU sockets in top supercomputers, with IBM’s Power chips taking a significant slice of what remained.

A Frontier node will consist of a custom AMD Epyc CPU linked coherently to four upgraded Radeon GPUs over an enhanced version of the company’s Infinity fabric. The Epyc will use “future-generation” Zen cores and sport an updated microarchitecture packing new instructions for AI and supercomputing jobs.

The Radeon will use high-bandwidth memory, sport new compute cores, and enable mixed-precision operations for deep learning. AMD declined to say how many chips Frontier will use or what process the chips are made in, leaving analysts to speculate that it could be TSMC’s 7+-, 6-, or even 5-nm nodes.

“We believe that at the time of power-on, it will be the most powerful supercomputer in the world,” said AMD chief executive Lisa Su. She said that the new Radeon chip will eventually become a standard product.

The milestone marks the first time that AMD has powered the world’s top supercomputer since 2008, when IBM used Opteron CPUs along with a version of its Cell processors in Roadrunner, the first petaflop system. The following year, AMD powered the three top supercomputers, two designed by Cray.

The AMD processors will “likely have improved double- and single-precision floating-point performance but will also have 16-bit floating point as well” for deep-learning jobs, said analyst Kevin Krewell of Tirias Research. “The ability to coherently attach four Radeon GPUs to the one Epyc CPU is a key design feature — it gives each node tremendous computational performance.”

By contrast, the 143-petaflop Summit system packs two IBM Power 9 processors and six Nvidia Volta GPUs in a node using Nvidia’s proprietary coherent interconnect.

A Frontier node consists of an Epyc CPU and four Radeon GPUs on a coherent Infinity fabric. (Image: AMD)

With Frontier, Cray pushes the envelope in compute density

Frontier, to be installed at Oak Ridge National Lab (ORNL), will consist of more than 100 Cray Shasta cabinets, consuming about 40 MW of power, enough to run a small city. Cray will develop Epyc and Radeon boards with power delivery and “integrated direct liquid cooling” that aims to push the envelope in compute density.

By contrast, the Aurora system being built at Argonne National Lab will use about twice as many Shasta cabinets, a sign of the aggressiveness of Cray’s density target with Frontier. Frontier is expected to occupy about 7,300 square feet of floor space, not quite two basketball courts in size.
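For a rough sense of the density and efficiency these figures imply, the back-of-the-envelope sketch below combines numbers quoted in this article (more than 1.5 exaflops, about 40 MW, more than 100 cabinets). The rounded inputs and the variable names are assumptions made for illustration; actual per-cabinet values depend on the final configuration.

```python
# Rounded figures quoted in the article, treated as exact for a rough estimate.
PEAK_FLOPS = 1.5e18   # "more than 1.5 exaflops"
POWER_WATTS = 40e6    # "about 40 MW"
CABINETS = 100        # "more than 100" Shasta cabinets

# Energy efficiency in gigaflops per watt.
gflops_per_watt = PEAK_FLOPS / POWER_WATTS / 1e9

# Compute and power per cabinet if the system had exactly 100 cabinets.
pflops_per_cabinet = PEAK_FLOPS / CABINETS / 1e15
kw_per_cabinet = POWER_WATTS / CABINETS / 1e3

print(f"Efficiency: ~{gflops_per_watt:.1f} gigaflops per watt")
print(f"Per cabinet: ~{pflops_per_cabinet:.0f} petaflops, ~{kw_per_cabinet:.0f} kW")
```

With these inputs, the estimate works out to roughly 37.5 gigaflops per watt, and about 15 petaflops and 400 kW per cabinet. Because both the performance and the cabinet count are stated as lower bounds, these are order-of-magnitude figures only.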

Both Aurora and Frontier will use Slingshot, Cray’s top-of-rack switch, which can support up to 250,000 nodes on a three-hop network. Each switch packs 64 200-Gbit/s ports, for 12.8 Tbits/s of aggregate bandwidth, connects in a dragonfly topology, and is compatible with Ethernet.
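As a quick sanity check, the 12.8-Tbits/s figure equals the sum of one switch's port speeds: 64 ports at 200 Gbits/s each. A minimal sketch of that arithmetic, with a function and variable names chosen purely for illustration:

```python
# Figures quoted in the article for a single Slingshot switch.
PORTS_PER_SWITCH = 64
PORT_SPEED_GBITS = 200  # Gbits/s per port


def aggregate_tbits(ports: int, gbits_per_port: int) -> float:
    """Return the sum of all port speeds on one switch, in Tbits/s."""
    return ports * gbits_per_port / 1000


switch_tbits = aggregate_tbits(PORTS_PER_SWITCH, PORT_SPEED_GBITS)
print(f"{switch_tbits} Tbits/s per switch")
```

The result, 12.8 Tbits/s, matches the quoted number, which suggests the figure describes the capacity of a single switch rather than the network as a whole.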

Each AMD GPU will have its own port on a Slingshot switch that can read or write directly to the GPU’s memory. AMD and Cray will enhance AMD’s open-source ROCm software to let programmers tap into the direct links.

Overall, the deal “bodes well for AMD’s future, as this is technology that should be in the mainstream market after 2021, and it shows that the ROCm software stack has some legs,” said analyst Patrick Moorhead, president of Moor Insights & Strategy.

The Slingshot packs 64 200-Gbit/s links with enhanced QoS. (Image: Cray)

More than $100 million of the deal will pay for creating a new programming environment for the hybrid CPU/GPU system. GPU accelerators have long been popular in supercomputers, especially after the IBM/Nvidia Sierra and Summit systems started a trend of designing supercomputers to support legacy and new AI apps.

“Closely integrating artificial intelligence with data analytics and modeling and simulation will drastically reduce the time to [scientific] discovery by automatically recognizing patterns in data and guiding simulations beyond the limits of traditional approaches,” said Thomas Zacharia, a director at ORNL.

The U.S. Department of Energy (DoE) will spend just short of $2 billion to commission the three exascale systems — Aurora, Frontier, and the El Capitan system coming to Lawrence Livermore Lab.

“This is a DoE effort to make sure the U.S. remains in the forefront of this important technology — not only because it drives competition in the IT space, but it also drives competition in the overall economy and jobs,” said Zacharia.

In recent years, China has led the Top 500 list several times, and it now has more Top 500 supercomputers than the U.S.

“I’ve been a scientist who has grown up using supercomputers,” said Zacharia. “For the last decade or more, I’ve procured a number of No. 1 systems, but this is the most expensive machine by far I’ve ever procured, so my signature [on the contract] didn’t look like my usual signature.”

ORNL listed a handful of potential applications queued up to use the machine. They span modeling and analysis of nuclear, particle and plasma physics, advanced materials, astronomy, and energy generation — including modeling and analyzing materials and biological structures at atomic scales.

While the Frontier system will be installed in 2021, it could take until late 2022 before ORNL is ready to make it available to the wider research community. Researchers can apply for access to the system on the Frontier website.
