The training benchmarks highlight the competition among Nvidia, Graphcore and Intel-Habana accelerators to boost AI training performance.
In the latest round of MLPerf AI training benchmarks, Microsoft Azure demonstrated the world’s fastest cloud for AI using large-scale Nvidia-powered instances. Azure’s NDm A100 v4 series of virtual machines ran benchmarks on up to 2,048 Nvidia A100-80GB GPUs, completing each benchmark in under 18 minutes.
Nvidia led on seven of the eight benchmarked workloads in the closed division with systems containing up to 4,320 A100 accelerators. Microsoft Azure topped the eighth category (medical imaging) with its Nvidia-powered cloud instance. Graphcore and Habana Labs also submitted improved results for ResNet-50 and BERT benchmarks.
The Azure system used for Microsoft’s MLPerf submission is ranked tenth among the world’s top 100 supercomputers. Nvidia’s in-house AI supercomputer, Selene, is about twice the size and currently ranks sixth.
Azure’s NDm A100 v4 series of virtual machines offers scalability from 1 to more than 256 virtual machines, or from 8 to 2,048 GPUs, as required. The 2,048 GPUs used in the Azure cloud demonstrated the ability to train an entire BERT natural language processing model in just over 25 seconds. The most difficult benchmark, MiniGo, was trained in under 17.5 minutes using 1,792 GPUs. Azure topped the 3D U-Net benchmark, used for three-dimensional medical images, with a training time of 1.262 minutes using 768 GPUs (Nvidia’s 768-GPU result for 3D U-Net was 1.373 minutes).
Among Microsoft’s goals was demonstrating that Azure cloud performance is comparable to on-premises equipment.
Nvidia’s submissions were designed to demonstrate the company’s capabilities for large-scale AI training.
“Scaling to larger clusters is really the hardest part of training AI, and it’s one where Nvidia’s AI platform has tremendous strengths,” claimed Paresh Kharya, Nvidia’s senior director of product management for accelerated computing. “Scaling is really important because everything becomes a bottleneck. It’s a very hard problem. From distributing work, coordinating work to moving data, everything becomes a bottleneck.”
Training huge, cutting-edge models can take months, even on Selene, Kharya said, adding that advancing state-of-the-art AI models would be impossible without scaling.
Scale is also important, Kharya said, since the ability to iterate fast on AI projects is vital. “One of the common misperceptions we see is to use just the cost of the infrastructure for the [return on investment] for training models,” he added. Users “care about the cost of infrastructure, but also the productivity of their expensive data science teams, and ultimately the time to bring their products and updates to their products to market faster than the competition.”
Benchmarks run on Selene scaled up to 4,320 GPUs, the largest system in this round. Nvidia said its results were 30 times faster than those of the fastest Graphcore system (256 accelerators) and 53 times faster than those of Habana Labs’ biggest system (also 256 accelerators).
As for per-accelerator chip performance, Nvidia claimed victory over Graphcore and Habana Labs accelerators, though it trailed Google TPU v4’s ResNet-50 score from the previous round of training benchmarks.
Nvidia also noted its steadily improving scores. Compared to MLPerf Training scores from July 2020 (when the A100 was introduced), Nvidia A100-based systems performed five times faster at scale and twice as fast at the chip level.
Software changes account for the performance gains. They include CUDA graphs, which reduce CPU bottlenecks by submitting the entire sequence of kernels in a single launch rather than one at a time, so that the full training iteration runs directly on the GPUs. CUDA streams improved parallelism by introducing a fine-grained overlap of computation and communications.
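To illustrate why launching kernels one by one can leave a GPU waiting on the CPU, here is a toy timing model. It is only a sketch under stated assumptions: it does not call CUDA, the function names are hypothetical, and the kernel durations and launch overheads are made-up numbers rather than figures from Nvidia’s submission.

```python
def eager_time_us(kernel_times_us, launch_overhead_us):
    """Toy model of launching kernels one at a time.

    Assumption: the CPU needs launch_overhead_us to issue each kernel, and a
    kernel starts only once it has been issued and the previous one is done.
    """
    cpu_clock = 0.0  # time at which the CPU has finished issuing launches
    gpu_clock = 0.0  # time at which the GPU has finished its last kernel
    for kernel_time in kernel_times_us:
        cpu_clock += launch_overhead_us
        start = max(cpu_clock, gpu_clock)
        gpu_clock = start + kernel_time
    return gpu_clock


def graph_time_us(kernel_times_us, graph_launch_overhead_us):
    """Toy model of a graph launch: one submission, then back-to-back kernels."""
    return graph_launch_overhead_us + sum(kernel_times_us)


# Hypothetical workload: 100 short kernels of 5 us each, 10 us per launch.
short_kernels = [5] * 100
print(eager_time_us(short_kernels, 10))  # 1005.0, the GPU mostly waits on the CPU
print(graph_time_us(short_kernels, 10))  # 510
```

In this model the benefit disappears once each kernel runs longer than the launch overhead, which is consistent with the idea that graph launches matter most when a training iteration consists of many short kernels.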
Nvidia’s NCCL and SHARP technologies were used to improve multi-GPU and multi-node processing. NCCL optimizes data aggregation based on available bandwidth and network latency. SHARP improves performance by offloading operations from the CPU onto the switch, eliminating the need to send data multiple times between different endpoints and servers. Meanwhile, an updated MXNet implementation improved the efficiency of memory copies for operations like concatenation and split.
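The data aggregation that a library such as NCCL performs is typically an all-reduce, in which every worker ends up with the sum of all workers’ gradients. The sketch below simulates one well-known scheme, ring all-reduce, in plain Python. It is an assumption that this is representative: the article does not say which algorithm was used in the benchmarked runs, and the function name and the requirement that the buffer length divide evenly by the worker count are simplifications for illustration.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over len(buffers) workers arranged in a ring.

    Simplifying assumption: every buffer has the same length and that length
    is divisible by the number of workers.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers)
    assert length % n == 0
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n
    # to its right-hand neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][span((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            current = data[dst][span(c)]
            data[dst][span(c)] = [a + b for a, b in zip(current, payload)]

    # Worker r now holds the fully summed chunk (r + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)])
                 for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload
    return data


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
for result in ring_allreduce(grads):
    print(result)  # each worker prints [111, 222, 333, 444, 555, 666]
```

Each worker only ever talks to its neighbour and sends one chunk per step, which is why this pattern spreads traffic evenly across the available links.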
Graphcore demonstrated scaling on larger systems, including those with 128 and 256 IPU accelerators.
For 16- and 64-accelerator systems, Graphcore’s ResNet-50 scores improved 24 percent on the IPU-Pod16 and 41 percent on the IPU-Pod64. For BERT, IPU-Pod16 scores improved 5 percent and IPU-Pod64 scores rose 12 percent. Again, software optimization helped boost performance.
Graphcore’s results compare its IPU-Pod16 performance to Nvidia’s DGX-A100, even though the Graphcore platform includes twice the number of accelerator chips. Graphcore maintained the systems are equivalent in size (the IPU-Pod16 is 5U versus the DGX-A100 in 6U) and roughly equivalent on power consumption and price. It should be noted that Graphcore is the only company to use this comparison. Graphcore claimed its IPU-Pod16 outperformed Nvidia’s DGX-A100 on ResNet-50 (28.3 minutes to train on Graphcore; 29.1 minutes to train on Nvidia).
Graphcore’s BERT scores reflect systems with fewer host CPUs per accelerator than ResNet-50. BERT scores were benchmarked on systems with one host CPU per 32 IPUs, while ResNet-50 scores were benchmarked on systems with one host CPU per 8 IPUs.
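As a rough illustration of what those ratios mean for system size, the following sketch counts host CPUs for a given number of IPUs. The ratios (32 and 8 IPUs per host CPU) come from the figures above; the function name and the choice to round up are assumptions made for the example.

```python
import math


def host_cpus_needed(num_ipus, ipus_per_host_cpu):
    """Host CPUs required for num_ipus at a given IPU-to-host ratio (rounded up)."""
    return math.ceil(num_ipus / ipus_per_host_cpu)


# Ratios reported for Graphcore's submissions.
BERT_IPUS_PER_HOST = 32
RESNET50_IPUS_PER_HOST = 8

for pod_size in (16, 64, 256):
    bert_hosts = host_cpus_needed(pod_size, BERT_IPUS_PER_HOST)
    resnet_hosts = host_cpus_needed(pod_size, RESNET50_IPUS_PER_HOST)
    print(pod_size, bert_hosts, resnet_hosts)
```

For a 64-IPU system, for instance, this works out to 2 host CPUs at the BERT ratio versus 8 at the ResNet-50 ratio.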
“We have the flexibility to vary this property per workload, which is unusual,” said Dave Lacey, Graphcore’s chief software architect. “That enables us to experiment… and get these points of efficiency.”
Lacey added that this approach allows users to perform more computing on a single host server without moving to distributed CPU computation that requires additional infrastructure.
“This is also an important factor of cost,” Lacey said. “All these systems have very hefty CPUs on them, and that’s a significant cost to your system. If you can get away with the best ratio, the smallest number of CPUs, the accelerators are really doing the very heavy lifting here. Then that cost optimizes best for that particular workload.”
Lacey said Graphcore made a deliberate design choice for its IPU to push application logic onto the accelerator. The connection between host and accelerator is only used for training data – no code, no heavy synchronization, just data, he added.
How far the number of host CPUs can be reduced depends on the workload and the data it uses. “It [depends on] how much preparation or other non-AI type tasks are being done on the CPU, and also how much is traveling between the CPU and the accelerator,” Lacey said.
The effect is particularly pronounced for BERT workloads, where the input data is much smaller than the images required for other workloads. Image processing workloads like ResNet-50 require additional non-AI tasks like image decompression, which is better suited to the host CPU. Hence, more hosts are required.
Ethernet connections between host and accelerator also provide flexibility to reconfigure the number of host CPUs accordingly.
Graphcore’s comparisons for the ratio between host CPUs and accelerators are based on one Graphcore chip to one Nvidia or Habana chip. If a single Graphcore IPU-Pod16 equals a single Nvidia DGX-A100, as Graphcore sought for its ResNet-50 time-to-train comparison, ResNet-50 training would require the same number of host CPUs (any advantage is for BERT only in this example).
Intel Habana Labs
Intel’s Habana Labs submitted its second round of MLPerf training scores using its Gaudi training accelerator chip. Since the last round, Gaudi’s performance has doubled for BERT. ResNet-50 scores also improved by 11 percent.
Habana also demonstrated the scalability of its Gaudi technology, presenting similar results for naïve and weak scaling (weak scaling is not covered in MLPerf results).
Itay Hubara, Habana’s senior researcher, said naïve scaling considers the time to train for systems at different scales. Weak scaling is derived from naïve scaling results. Increasing the number of accelerators typically entails increasing batch size (the number of training data samples simultaneously fed into the system) in order to keep the hardware fully utilized. But increased batch size usually requires more training iterations since weights are updated after processing more data samples. That means more training data are required to achieve the same result in larger systems. Weak scaling is the naïve scaling score normalized per throughput, or to the same amount of data being processed.
“Our weak scaling and naïve scaling figures are very close for up to 64 Gaudi chips because we didn’t have to increase the batch size. We can work with a small local batch size,” Hubara said. “When [switching] to 16 [accelerators from eight], I don’t have to increase the global batch size by 2x… The architecture of Gaudi enables us to get high utilization even if I don’t take the maximum batch size that I can put into the device.”
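The relationship between the two scaling measures can be sketched with a little arithmetic. The code below is one plausible reading of Hubara’s description, not Habana’s actual formula, which the article does not give: naïve scaling is taken as the ratio of times to train, and weak scaling as the ratio of throughputs (samples processed per minute). The function names and all numbers are hypothetical.

```python
def naive_speedup(base_minutes, scaled_minutes):
    """Speedup measured purely on time to train."""
    return base_minutes / scaled_minutes


def weak_speedup(base_minutes, base_samples, scaled_minutes, scaled_samples):
    """Speedup normalized to the amount of data processed (throughput ratio)."""
    base_throughput = base_samples / base_minutes
    scaled_throughput = scaled_samples / scaled_minutes
    return scaled_throughput / base_throughput


# Hypothetical case 1: doubling the chips without growing the batch size,
# so the larger system needs the same number of samples to converge.
print(naive_speedup(100, 50))                        # 2.0
print(weak_speedup(100, 1_000_000, 50, 1_000_000))   # 2.0

# Hypothetical case 2: a larger batch size costs 10 percent more samples.
print(naive_speedup(100, 55))                        # about 1.82
print(weak_speedup(100, 1_000_000, 55, 1_100_000))   # 2.0
```

Under this reading, the two figures coincide whenever the larger system needs no extra training data, which matches Hubara’s point about not having to increase the global batch size.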
Habana’s scores have improved over the last round, once again as a result of software optimizations.
BERT training times were halved thanks to data-packing techniques, where shorter sentences in the training data were packed together into one multi-sequence. (Shorter sentences would otherwise be padded with zeros to achieve a fixed input size.) Data packing is handled in pre-processing, and is not part of the benchmarked training time.
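A minimal sketch of the packing idea follows. The article does not say which packing algorithm Habana used, so the first-fit-decreasing heuristic here is an assumption, as are the function names, the example sentence lengths and the 512-token limit. A real implementation would also need to keep packed sentences from attending to one another, which this sketch ignores.

```python
def pack_sequences(lengths, max_len):
    """Group sequence lengths into packs whose total stays within max_len.

    Uses first-fit decreasing: take sequences longest first and put each
    into the first pack that still has room.
    """
    packs = []      # each pack is a list of sequence lengths
    free = []       # remaining token slots in each pack
    for length in sorted(lengths, reverse=True):
        if length > max_len:
            raise ValueError("sequence longer than max_len")
        for i, room in enumerate(free):
            if length <= room:
                packs[i].append(length)
                free[i] -= length
                break
        else:
            packs.append([length])
            free.append(max_len - length)
    return packs


def padding_tokens(num_rows, lengths, max_len):
    """Zero-padding needed when the data occupies num_rows rows of max_len."""
    return num_rows * max_len - sum(lengths)


lengths = [400, 300, 200, 112, 100, 60, 52]   # hypothetical token counts
packs = pack_sequences(lengths, max_len=512)
print(packs)                                      # [[400, 112], [300, 200], [100, 60, 52]]
print(padding_tokens(len(lengths), lengths, 512))  # 2360 without packing
print(padding_tokens(len(packs), lengths, 512))    # 312 with packing
```

In this made-up example, seven padded inputs shrink to three packed ones, so far fewer of the tokens the accelerator processes are zeros.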
Habana also implemented light checkpoint saving, since the time required to save checkpoints becomes significant. Rather than writing the checkpoint in one piece, each worker saves a subset of the model weights, which speeds up the save.
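The idea of each worker saving only part of the model can be sketched as follows. This is an illustration under assumptions, not Habana’s implementation: the article does not describe the file format or how weights are divided, so the round-robin split, the JSON files and every name below are hypothetical. In real training each worker would write its shard in parallel; the loop here simply stands in for that.

```python
import json
import os
import tempfile


def shard_for_worker(weights, rank, world_size):
    """Pick the subset of weights this worker is responsible for (round-robin)."""
    names = sorted(weights)
    return {name: weights[name]
            for i, name in enumerate(names) if i % world_size == rank}


def save_shard(weights, rank, world_size, directory):
    """Write only this worker's subset of the weights to its own file."""
    path = os.path.join(directory, f"shard_{rank}.json")
    with open(path, "w") as f:
        json.dump(shard_for_worker(weights, rank, world_size), f)
    return path


def load_checkpoint(directory, world_size):
    """Rebuild the full set of weights by merging every worker's shard."""
    merged = {}
    for rank in range(world_size):
        with open(os.path.join(directory, f"shard_{rank}.json")) as f:
            merged.update(json.load(f))
    return merged


def demo():
    """Save a toy model from four simulated workers and read it back."""
    weights = {
        "embed.w": [0.5, -1.0, 2.0],
        "layer0.w": [1.0, 2.0],
        "layer0.b": [0.0],
        "layer1.w": [3.0, 4.0],
        "layer1.b": [0.25],
    }
    world_size = 4
    with tempfile.TemporaryDirectory() as directory:
        for rank in range(world_size):
            save_shard(weights, rank, world_size, directory)
        restored = load_checkpoint(directory, world_size)
    return weights, restored


original, restored = demo()
print(original == restored)  # True
```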
Asked whether Habana accelerators could operate with fewer host CPUs, Hubara said: “The ratio of host CPUs to Gaudi cards can be changed; it is not a limit of our Gaudi card. Yet, a typical system has two Xeon sockets for eight accelerators. We use this configuration since we aim to replace GPU-based systems, and our customers prefer dual-socket systems.”
Google did not submit MLPerf training scores into the closed division, but did submit two scores in the open division for a pair of very large models, both architecturally similar to MLPerf’s BERT model but with larger dimensions and more layers.
One submission trained a 480-billion-parameter, Transformer-based, encoder-only benchmark using TensorFlow running on a 2,048-accelerator TPUv4 system, training in approximately 55 hours.
The other submission trained a 200-billion-parameter JAX model on a 1,024-chip TPUv4 system, training in approximately 40 hours.
Google said that each training run achieved a computational efficiency of 63 percent.
The full list of MLPerf AI Training benchmark scores is here.
This article was originally published on EE Times.
Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EE Times Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master’s degree in Electrical and Electronic Engineering from the University of Cambridge.