MLPerf Deadlock: Google and Nvidia Tied for First Place

By Sally Ward-Foxton

The fourth round of the MLPerf AI training system benchmarks attracted 650 submissions – how has performance increased since last year?

Google and Nvidia tied for first place in the fourth round of MLPerf Training benchmark scores, each winning four of the eight benchmarks in the closed division with their large-scale AI accelerator systems.

First-time contributor Graphcore showed off the capabilities of its 16- and 64-chip pods featuring the second-generation intelligence processing unit (IPU). Habana Labs entered its Gaudi chip for the first time (the company entered its inference chip, Goya, in a previous inference round).

Other entries included a large-scale system based on the Huawei Ascend 910, Intel's Xeon CPUs without accelerators, and a range of Nvidia A100-based systems from third parties.

Nvidia DGX-A100 system with eight A100 GPUs (Image: Nvidia)

The range of submissions stretched from enterprise-class systems to supercomputers, with 13 organizations submitting 650 peer-reviewed scores, roughly six times the number submitted in the last round.

Executive director of MLCommons David Kanter told EE Times that the aim of MLPerf is to increase performance across the board.

“One of the things we hope to accomplish is to help drive performance for everyone in the industry,” he said. “[In this round], folks have been tuning software, optimizing the network, building larger systems, and using newer processors and accelerators.”

Compared to the previous round, the best benchmark results improved by up to 2.1X. ResNet-50 performance has improved more than 25X in two years, though this is the product of larger-scale systems as well as faster hardware and optimized software.

Closed division

The closed division contains results from systems meeting strict setup specifications, intended as a framework for direct comparison.

Two new benchmarks were added this round: RNN-T, a speech-to-text network used at Google on a wide variety of devices, and UNet-3D, a medical imaging network used to look for cancer cells in 3D scans of the kidneys. The translation models NMT and Transformer used previously have been retired as they are no longer state of the art.

These new benchmarks join the existing ones: ResNet-50 for image classification, object detection networks SSD and Mask R-CNN, natural language processing network BERT, DLRM (deep learning recommendation model) and Minigo, a reinforcement learning network that learns the game of Go.

Enterprise scale

The majority of submissions for enterprise scale systems came from Nvidia or server makers using Nvidia GPUs. All the Nvidia-based submissions used the Ampere A100 GPU, which appears to be the new industry standard AI accelerator for the data center.

Habana GAUDI HLS-1 AI Training System

Nvidia said that its own A100 scores improved around 2.1X compared to the previous round, for several reasons. Nvidia's CUDA software stack has further minimized the required communication with host CPUs. A technique called SHARP has doubled effective bandwidth between nodes by offloading CPU operations to the network, decreasing the data traversing between endpoints. Spatial data parallelism can now split a single image across 8 GPUs. And the use of HBM2e memory has increased the A100's memory bandwidth by nearly 30%.
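Spatial data parallelism is easiest to picture with a toy sketch. The Python below is our illustration of the general idea, not Nvidia's implementation; the image size, tile count and halo width are assumptions chosen for readability. One large image is split along its height so each worker processes only a slice, and a real implementation would exchange the overlapping "halo" rows between neighbours so convolutions at the tile edges see the correct pixels.

```python
import numpy as np

# Toy illustration of spatial data parallelism: one large image is split
# along its height so each of 8 workers convolves only a slice of it.
# The shapes and halo width are illustrative assumptions, not MLPerf settings.

NUM_WORKERS = 8
HALO = 1  # rows of overlap a 3x3 convolution would need from each neighbour

image = np.random.rand(1024, 1024, 3)               # one H x W x C image
tiles = np.array_split(image, NUM_WORKERS, axis=0)  # split along height

def with_halo(idx):
    """Return worker idx's tile plus the boundary rows of its neighbours."""
    start = sum(t.shape[0] for t in tiles[:idx])
    stop = start + tiles[idx].shape[0]
    lo = max(0, start - HALO)
    hi = min(image.shape[0], stop + HALO)
    return image[lo:hi]

# Each worker would convolve its padded slice in parallel; the outputs are
# then concatenated (minus the halos) to rebuild the full activation map.
padded = [with_halo(i) for i in range(NUM_WORKERS)]
print([t.shape for t in padded])
```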

Graphcore submitted its first set of four MLPerf Training scores, for 16- and 64-chip systems training ResNet and BERT. Running TensorFlow/Poplar software, the 16-IPU system could train ResNet in 37.12 minutes, while the 64-chip system could do it in 14.48 minutes.

For a rough comparison, Dell's ResNet score for a system with 16x Nvidia A100 accelerators was 20.98 minutes (40GB A100s using MXNet). Nvidia's own score for ResNet on a 64-chip system (80-GB A100s running MXNet) was 4.91 minutes.

Graphcore has argued previously that its customers don’t care how many accelerator chips are in a system, and that a comparison normalised on price would mean comparing systems with multiple Graphcore IPUs to a single A100.

Closest to Graphcore’s 16-IPU ResNet score (37.12 minutes) were Dell and Supermicro systems, each with 8x A100-40GB accelerators (36.37 and 36.20 minutes, respectively). Closest to Graphcore’s 16-IPU BERT score (34.49 minutes) was a Supermicro 8x A100-80GB system (28.32 minutes). Should we then assume that a ballpark performance figure would put 2x IPUs up against 1x A100? The scores aren’t directly comparable, but may give us a rough idea.

Graphcore said its scores were the result of novel techniques including hybrid model and data parallelism, FP16 master weights, the use of external streaming memory, and small-batch training. It also uses a packed sequencing technique for BERT, which involves packing unrelated short sequences from the training dataset together to build full-length sequences; this avoids using padding tokens and is therefore more efficient.
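The packing idea can be sketched in a few lines. The snippet below is our simplified greedy first-fit packer, not Graphcore's code; the 512-token limit, the longest-first ordering and the fake data are assumptions used purely for illustration.

```python
# Simplified greedy first-fit sequence packing, in the spirit of the technique
# Graphcore describes for BERT: combine unrelated short sequences into one
# full-length sequence so padding tokens are (mostly) avoided.
# The 512-token limit and greedy strategy are illustrative assumptions.

MAX_LEN = 512

def pack_sequences(sequences):
    """Greedily pack token sequences into slots of at most MAX_LEN tokens."""
    packs = []          # each pack is a list of sequences sharing one slot
    pack_lengths = []   # running token count of each pack
    # Longest-first ordering tends to leave fewer gaps.
    for seq in sorted(sequences, key=len, reverse=True):
        for i, used in enumerate(pack_lengths):
            if used + len(seq) <= MAX_LEN:     # first pack it fits into
                packs[i].append(seq)
                pack_lengths[i] += len(seq)
                break
        else:                                  # no existing pack has room
            packs.append([seq])
            pack_lengths.append(len(seq))
    return packs

# Example: 1,000 fake "tokenised" sequences of varying length.
import random
random.seed(0)
fake_data = [[1] * random.randint(20, 500) for _ in range(1000)]
packed = pack_sequences(fake_data)
print(f"{len(fake_data)} sequences packed into {len(packed)} slots "
      f"(vs. {len(fake_data)} padded slots without packing)")
```

In the real training setup the attention mask also has to keep the packed sequences from attending to one another, which is the detail that keeps accuracy unchanged.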

By entering its third-generation Xeon Scalable CPUs, which have some AI acceleration features, Intel wanted to show that the CPU servers customers most likely already have are perfectly capable of handling AI workloads. Intel also wanted to show that Xeon systems can scale with the size of the workload by submitting scores for systems with 4 to 64 CPUs. The 8-Xeon system trained ResNet in 943.97 minutes, while the 64-Xeon system did it in 213.92 minutes, about 4.4 times faster despite using 8x as many CPUs.

Intel also presented a range of scores for the DLRM benchmark for systems with 4x to 16x Xeon CPUs, which trained it in times ranging from 124.95 minutes down to 48.26 minutes.

DLRM is traditionally seen as a workload where accelerators don’t have as much of an advantage. For comparison with the Intel Xeon CPU systems, a Dell server equipped with 8 A100-PCIE-40GB (250W) accelerators managed the same benchmark in 99.86 minutes. A Supermicro server with the same GPU setup as the Dell, using the same MXNet software framework (but with 2x AMD host CPUs instead of Intel) trained it in 153.38 minutes.

Nvidia’s own optimized scores for DLRM with 8x meatier versions of the A100 (the A100-SXM4-80GB, 400W version) blew them all out of the water at 1.96 minutes. One of server maker Nettrix’s systems with the same accelerator setup (but different software) beat Nvidia’s own score at 1.92 minutes.

Minigo is also perceived to be difficult for accelerators of all kinds. 32x Intel Xeon CPUs without accelerators can do it in 409.00 minutes. 64x Intel Xeons can do it in 271.23 minutes, a similar score to 8x Nvidia A100-80GB accelerators at 269.54 minutes.

So is there a clear winner for enterprise-scale systems? Companies’ purposes for submitting scores, and therefore their definitions of winning, might be quite different. Two companies said they had submitted scores representing out-of-the-box systems; that is, they had not attempted any optimization at all.

Habana Labs said submitting out-of-the-box scores meant it would be easy for customers to make small adjustments to the model (change the data, switch layers) while maintaining similar performance to the scores submitted. Its system, with 8x Gaudi accelerators, trained ResNet in 62.55 minutes and BERT in 164.37 minutes.

The Graphcore IPU-Machine with four Graphcore IPU AI accelerators (Image: Graphcore)

For a rough comparison, the closest ResNet scores to Habana’s were 61.48 minutes from Lenovo (4x A100-40GB system), backed up by 61.57 from Supermicro (same accelerator and software as Lenovo but using AMD rather than Intel host CPUs).

Fujitsu said it would be easy for anyone to reproduce its results. Fujitsu’s ResNet score for a system with 4x A100-40GBs was 73.49 minutes. For comparison, Lenovo’s optimized score for a system with the same accelerators and software framework was 71.96 minutes, illustrating the difference software optimization can make.

Larger scale

Three organizations submitted scores for large-scale systems. The biggest were Google’s 3456x TPU system, Nvidia’s 4096x A100 system, and PCL & PKU’s 1024x Huawei Ascend 910 system.

Unsurprisingly, these supercomputers trained the benchmark networks in the fastest times. Winning scores were split evenly between Google and Nvidia – each won four benchmarks overall. Google won ResNet, SSD, BERT and DLRM, while Nvidia won UNet-3D, Mask R-CNN, RNN-T and Minigo.

Peng Cheng Laboratory (PCL), a research facility in Shenzhen, used its Peng Cheng Cloud Brain II supercomputer, which combines Huawei Ascend 910 AI accelerators and ARM processors. The facility collaborated with Peking University (PKU) for its submissions.

On BERT, the fastest score was from Google’s 3456x TPU system, which trained it in 0.29 minutes. Nvidia’s score for training BERT on 4096x A100s exactly tied with Google’s 2048x TPU score (both at 0.32 minutes). By comparison, PCL & PKU’s top score for BERT was 2.40 minutes with 256x Huawei Ascend 910 Premium A accelerators. The same number of Ascend 910 Pro A chips ran it in 2.69 minutes.

Google said it achieved a roughly 1.7X speedup in top-line submissions using the TPUv4, compared to last year’s scores, which were based on the TPUv3. For example, 3456x TPUv4s can train ResNet in 0.23 minutes, while last year’s 4096x TPUv3 system could do it in 0.48 minutes.

Beyond the new architecture of the TPU chips themselves, Google said the performance improvements were down to high interconnect bandwidth in each TPU pod, at-scale software optimizations that exploit features of the new hardware, and new features in its XLA compiler.

The company also said TPUv4 pods are already widely deployed throughout its data centers and will be available for customer workloads later this year.

Open division

There were five submissions to the open division, from Graphcore, Intel and Google. The open division is intended for submitters who want to make tweaks that wouldn’t be allowed in the closed division. Scores are less directly comparable as a result, but it’s interesting that all three companies that submitted in this division made changes to the optimizer algorithm (the part of the training algorithm that adjusts the network’s weights based on errors from the previous pass).
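For readers unfamiliar with the term, the toy update rules below show the kind of component being swapped. This is a generic illustration, not any submitter’s actual code; the learning rate and momentum values are arbitrary assumptions. Plain SGD adjusts each weight by the gradient alone, while a momentum variant also mixes in the previous step, and it is this update rule that open-division submitters are free to retune or replace.

```python
import numpy as np

# Toy illustration of what "the optimizer" is: the rule that turns gradients
# (errors from the previous pass) into weight updates. Open-division
# submitters may tune or replace this rule; closed-division submitters may not.
# Learning rate and momentum values here are arbitrary illustrations.

def sgd_step(weights, grads, lr=0.01):
    """Plain stochastic gradient descent: step against the gradient."""
    return weights - lr * grads

def momentum_step(weights, grads, velocity, lr=0.01, beta=0.9):
    """Momentum SGD: blend the new gradient with the previous update."""
    velocity = beta * velocity - lr * grads
    return weights + velocity, velocity

# One update on a fake four-weight layer.
w = np.zeros(4)
g = np.array([0.5, -0.2, 0.1, 0.0])
v = np.zeros_like(w)

w_sgd = sgd_step(w, g)
w_mom, v = momentum_step(w, g, v)
print(w_sgd, w_mom)
```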

Graphcore tuned additional hyperparameters in the optimizer to make training converge faster. This boosted its BERT results from 34.49 to 27.75 minutes for the 16-IPU system, and from 11.69 to 9.39 minutes for the 64-IPU system, a speedup of 1.24X in each case.

Intel used two different optimizer algorithms for different parts of the DLRM benchmark to allow it to increase batch size without losing accuracy or slowing convergence. Google used a second-order optimizer (which is not allowed in the closed division) which helped training converge faster for its ResNet score.

View the entire list of submitted scores here.
