Intel Doubles Down On AI With Latest Xeon Scalable, FPGA

Article By : Sally Ward-Foxton

Intel unveils Cooper Lake, with added support for BF16 format, and adds a dedicated AI engine to its FPGA offering for the data center...

As part of its AI strategy for the data center, Intel announced several devices tailored specifically for this market. These include the third generation of its Xeon Scalable CPU, a new Stratix FPGA with a dedicated AI engine, new Optane persistent memory and NAND SSDs. The company also shed a little light on its AI strategy for the data center for the first time since acquiring dedicated data center AI accelerator company Habana Labs six months ago.

Cooper Lake
Intel unveiled Cooper Lake, the first tranche of parts in the third generation of its flagship Xeon Scalable CPU line for the data center.

With more than 35 million Xeon Scalable processors deployed, this series is the clear market leader, and the only mainstream CPU offering for the data center that includes deep learning acceleration. The previous (second) generation of parts introduced DL Boost, which added a single instruction to Intel’s AVX-512 (Advanced Vector Extensions) instruction set to handle INT8 convolutions, a mathematical operation common in deep learning inference. Convolution had previously taken three separate instructions to complete.
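To illustrate the "three instructions into one" point, here is a minimal scalar sketch of a single 32-bit lane of an INT8 dot-product-and-accumulate. It is an illustration under stated assumptions, not Intel's implementation: the function names are hypothetical, the mapping of the steps to the AVX-512 instructions VPMADDUBSW, VPMADDWD and VPADDD (three-step path) and VPDPBUSD (fused path) comes from general knowledge of the instruction set rather than from this article, and 32-bit wraparound is ignored.

```python
def saturate_int16(value):
    """Clamp to the signed 16-bit range, as a 16-bit saturating add would."""
    return max(-32768, min(32767, value))


def dot4_three_step(acc, a_u8, b_s8):
    """Model of the older three-step path for one 32-bit lane (assumed mapping)."""
    # Step 1 (VPMADDUBSW-like): u8 * s8 products, adjacent pairs summed
    # and saturated to int16.
    pair0 = saturate_int16(a_u8[0] * b_s8[0] + a_u8[1] * b_s8[1])
    pair1 = saturate_int16(a_u8[2] * b_s8[2] + a_u8[3] * b_s8[3])
    # Step 2 (VPMADDWD-like, multiplying by ones): widen and add the pairs.
    widened = pair0 + pair1
    # Step 3 (VPADDD-like): add into the 32-bit accumulator.
    return acc + widened


def dot4_fused(acc, a_u8, b_s8):
    """Model of the fused path: four products summed straight into the accumulator."""
    return acc + sum(x * y for x, y in zip(a_u8, b_s8))


activations = [1, 2, 3, 4]    # unsigned 8-bit values
weights = [-1, 2, -3, 4]      # signed 8-bit values
print(dot4_three_step(5, activations, weights))  # 15
print(dot4_fused(5, activations, weights))       # 15
```

In this model the two paths agree for typical values. They differ only when an intermediate pair sum overflows 16 bits, for example `[255, 255, 0, 0]` against `[127, 127, 0, 0]`, where the three-step model saturates to 32767 and the fused model returns 64770.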

Cooper Lake, the third generation of the Xeon Scalable family, integrates support for bfloat16 (BF16), a number format invented by Google that is becoming the standard for AI training as it offers an optimal balance of compute efficiency and prediction accuracy. This update to DL Boost represents the industry’s first x86 support for BF16 and vector neural network instructions (VNNI).

“One of the nice things about bfloat 16 is that it requires very minimal software changes for customers,” said Lisa Spelman, Intel corporate vice president and general manager, Xeon and memory group. “It allows them to achieve improved hardware efficiency for both training and for inference, without needing a tremendous amount of software work, which can often be the barrier to unlocking more AI performance.”
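One reason the software changes can be small is that BF16 is, in effect, the upper half of a standard 32-bit float: 1 sign bit, 8 exponent bits and 7 mantissa bits. The sketch below shows the conversion in plain Python. It assumes finite inputs within float32 range and round-to-nearest-even; the function names are hypothetical, and it says nothing about how Cooper Lake performs the conversion in hardware.

```python
import struct


def float32_to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x (finite, float32-range input assumed)."""
    # Reinterpret x as the 32 bits of an IEEE 754 single-precision float.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Expand a BF16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


bf16 = float32_to_bf16_bits(3.14159)
print(hex(bf16))                 # 0x4049
print(bf16_bits_to_float(bf16))  # 3.140625
```

The round trip keeps the full float32 exponent range but only about two to three significant decimal digits, which is the trade-off the format makes.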

AI training and inference performance
Performance on ResNet-50 training (left) and inference (right) showing Intel’s previous generation Xeon processors, Xeon Scalable processors with AVX-512 instruction set, and DL Boost with BF16 capability (Image: Intel)

Intel’s figures have Cooper Lake at up to 1.93X the training performance and 1.87X the inference performance of second-generation parts for image classification. For natural language processing training, performance is 1.7X that of the previous generation.

Other features carried over from the second generation include Speed Select, which allows control over the base and turbo frequencies of specific cores to maximise performance of high priority workloads.

Cooper Lake is intended to make AI training and inference more widely deployable on general-purpose CPUs. It targets 4- to 8-socket implementations, and the 11 SKUs announced today are already shipping to customers (Facebook has announced that its server design is based on Cooper Lake, and Alibaba, Baidu, Tencent and others are adopting it too). General OEM availability is expected in the second half of 2020. Ice Lake processors, the third-generation Xeon Scalable processors for 1- to 2-socket implementations, are expected to be available later this year.

The fourth generation of the Xeon Scalable family, Sapphire Rapids, has just had its silicon powered on at Intel HQ. These devices will use the new advanced matrix extension (AMX) instruction set, whose spec is due to be published this month.

Stratix 10-NX
Intel’s first AI-optimized FPGA, the Stratix 10-NX, embeds a new AI compute engine that the company calls the AI tensor block. The new engine delivers up to 15X the INT8 performance of today’s Stratix 10-MX for AI workloads.

Stratix 10-NX Tensor block
The Stratix 10-NX includes new tensor blocks (right) (Image: Intel)

“The design of the AI tensor block is focused on accelerating AI applications. Specifically optimizing highly efficient tensor pipelines, at commonly used precision – reduced precision integer and floating point formats that are commonly used in the AI space,” said Intel’s David Moore, corporate vice president and general manager, programmable solutions group. “These innovations allow us to pack 15X more compute into the same footprint as our standard DSP compute block, and each Stratix 10-NX has got thousands of these AI tensor blocks, making it capable of delivering the real time performance in the most demanding AI applications.”

Stratix 10-NX block diagram
The Stratix 10-NX features high performance AI tensor blocks, integrated high-bandwidth memory, high-bandwidth networking and the ability to extend via chiplets (shown here in red) (Image: Intel)

Aimed at high-bandwidth, low latency AI acceleration that may require many variables to be evaluated in real-time across multiple nodes, the Stratix 10-NX can also be pooled efficiently to support today’s extremely large models.

“The Stratix 10-NX complements Xeon, and broader Intel portfolio elements as a high performance, low latency, multi-functional accelerator, and will specifically address applications that demand hardware customization,” Moore said.

The Stratix 10-NX will be available later this year.

AI Strategy
Intel’s intent is to offer AI acceleration across its entire data center offering, whether that’s built into a Xeon CPU, an FPGA or a dedicated accelerator from Habana Labs.

“The continued success of our AI business is about [the idea] that one size does not fit all when it comes to AI and that the total performance and total cost of the workload is a more important customer metric than just a standalone AI acceleration benchmark or comparison,” said Lisa Spelman. “We have customers using Xeon for things like recommendation engines that have a long and complex workload where the AI is a portion of it, and then you’ll see [FPGA or Habana customers who] really benefit from having a partner in solving the AI challenge to their Xeon infrastructure.”

As for Xeon vs Habana, Spelman said that customers’ typical total cost of ownership calculations show that the economics favor switching to dedicated accelerators once AI exceeds a certain percentage of the workload. Recommendation systems, she said, are particularly suited to CPU processing because of the flow of the workload and the different tasks that need to be done, whereas training for image processing might work best with a dedicated accelerator, even though the accelerator is utilised for a narrower portion of the problem.

“We really look at it as a continuum and with our customers, we talk about the total cost of ownership they’re trying to achieve, the different software optimizations that might be required from making those choices, and the constraints they may face – whether they’re at the edge, what their power constraints are within the data center, kind of holistic type of view. And then we can work on recommendations to match their use case,” said Spelman.
