Nvidia Shows Off Hopper MLPerf Training Benchmarks

By Sally Ward-Foxton

MLPerf training and tinyML benchmarks show wins for Nvidia, Habana, and GreenWaves.

The latest MLPerf training scores, as well as inference scores for tinyML hardware, are out.

In the MLPerf training round, Nvidia exhibited training benchmarks for its new H100 GPU for the first time. There were also strong results from Intel, Habana Labs, and MosaicML in this latest round, but nothing from Nvidia challengers Graphcore or Google.

In the tinyML benchmarks, GreenWaves Technologies’ multi-core RISC-V design swept the board for both latency and energy efficiency.

Nvidia H100

Nvidia debuted its H100 GPU, submitting scores across all training benchmarks.

Dave Salvator, director of AI, Benchmarking, and Cloud at Nvidia, presented H100’s results plotted against those of its previous-generation part, the A100, at the A100’s MLPerf training debut two-and-a-half years ago. Acknowledging the huge role software plays in performance, Salvator argued that it’s fair to compare debut scores for the two parts; A100’s scores have improved 2.5× on average since its debut. Salvator’s comparison showed the H100 delivering up to 6.7× higher performance than the A100 did at its debut.

“We have a track record of extracting more and more performance over time both on existing workloads as well as novel workloads when they come to market,” Salvator told EE Times in an interview. “I can’t say we’ll get [another] 2.5× out of Hopper just from software like we did with Ampere… but we will get more performance out of Hopper.”

This round of MLPerf training scores features debut entries from Nvidia’s next-gen H100 GPU (Source: Nvidia)

When compared to A100 training scores from the current round, H100’s performance was an average of 1.9× better across the eight benchmarked workloads. The biggest performance boost was on BERT, where the uplift was 2.6×; here Nvidia applied its transformer engine, software that varies precision between layers while preserving the accuracy of the end result.

Why not apply the transformer engine to reduce precision intelligently on other workloads?

“It’s something we will look to do,” Salvator said. “Anywhere we can reduce precision and maintain accuracy is a win for us, for customers, and for developers. We will look to do that over time.”

H100’s TDP is 700 W while the A100’s was 400 W—an increase of 1.75×. Some of H100’s benchmarks were less than 1.75× better compared to A100 benchmarks in the current round. Does this mean some workloads are less power efficient running on H100 than A100 today?

“TDP is not a great proxy [for power consumed],” Salvator said. “Most processors have rarely, if ever, touched their actual TDP limit; it’s a conservative number with a lot of guard band baked into it… it’s an area under the curve problem. You also have to look at the time domain, which is to say, yes I’m using more power, but if I’ve used it for a shorter amount of time, you’ve actually used less energy.”
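Salvator’s time-domain argument comes down to energy being power multiplied by time. The sketch below works through that arithmetic under a deliberately pessimistic assumption that both GPUs draw their full TDP for the entire run, which Salvator says real workloads rarely do. The function name `relative_energy` is hypothetical and the result is an illustration, not a measured figure.

```python
# Illustrative only: assumes each GPU draws its full TDP for the whole
# training run, which Salvator says real workloads rarely do.
H100_TDP_W = 700.0
A100_TDP_W = 400.0


def relative_energy(speedup: float) -> float:
    """Return H100 energy per training run relative to A100 (1.0 = equal).

    Energy is power multiplied by time, so a chip drawing 1.75x the power
    uses less energy whenever it finishes more than 1.75x faster.
    """
    power_ratio = H100_TDP_W / A100_TDP_W  # 1.75
    return power_ratio / speedup


# Speedups quoted in the article for the current round.
for label, speedup in [("Minigo", 1.08), ("average", 1.9), ("BERT", 2.6)]:
    print(f"{label}: {relative_energy(speedup):.2f}x the A100's energy")
```

Under this worst-case assumption the break-even point is a 1.75× speedup; any workload that speeds up by more than that uses less energy per run on the H100. Measured power draw, rather than TDP, would move that threshold.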

H100’s score on the reinforcement learning benchmark Minigo was just 1.08× the A100’s. Salvator said this workload, which is notoriously difficult to accelerate, was “not a strong area of focus” for Nvidia as the company poured its efforts into getting the transformer engine to perform this time around.

The eight H100 chips in Nvidia’s DGX-H100 system can train BERT in 6.4 minutes.

Habana Labs Gaudi2

Habana’s Gaudi2 training chip has improved MLPerf training scores since the last round (Source: Intel/Habana Labs)

Intel’s Habana Labs submitted improved scores for its second-generation Gaudi2 training accelerator.

An 8-Gaudi2 system can train ResNet in 16.6 minutes or BERT in 15.6 minutes. This represents a slight improvement over Gaudi2 scores from July 2022, which Eitan Medina, COO at Habana Labs, told EE Times was down to optimizations in the company’s SynapseAI software stack.

These scores win in the “available” category for 8-accelerator systems, beating the closest 8x Nvidia A100 scores of 28.2 and 16.8 minutes, respectively (Nvidia’s H100 is in the “preview” category as it’s not commercially available yet). Medina points out that the A100 is on the same process node as Gaudi2, but that Gaudi2 has more memory (96 GB compared to 80 GB for the A100) and also integrates networking on-chip.
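To put the size of Gaudi2’s lead in perspective, the time-to-train figures quoted above can be turned into speedup ratios. This is a minimal sketch using only the numbers in this article; the dictionary layout and the `speedup` helper are illustrative names, not part of any MLPerf tooling.

```python
# Time-to-train figures (minutes) quoted in the article for 8-accelerator
# systems in the "available" category.
TIMES_MIN = {
    "ResNet": {"Gaudi2": 16.6, "A100": 28.2},
    "BERT": {"Gaudi2": 15.6, "A100": 16.8},
}


def speedup(baseline_min: float, contender_min: float) -> float:
    """How many times faster the contender finishes than the baseline."""
    return baseline_min / contender_min


for model, t in TIMES_MIN.items():
    print(f"{model}: Gaudi2 is {speedup(t['A100'], t['Gaudi2']):.2f}x faster")
```

On these figures the margin is wide on ResNet (roughly 1.7×) and narrow on BERT (under 1.1×).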

Medina fancies Gaudi2’s chances against H100, given that Gaudi2’s scores in this round used BF16; he expects that using FP8 for future submissions will further boost Habana’s scores (Gaudi2 supports both FP8 formats).

“We have double the compute in FP8… this is something we’re really looking forward to enabling for our customers,” he said. “We do expect that additional software optimization, just good old engineering, will reveal more and more things we can do, both on the host side, as well as what [the tensor processing core] does, and how the graph compiler works.”

While Gaudi2’s power envelope is slightly bigger than A100’s, Gaudi2’s on-chip RoCE (RDMA over Converged Ethernet) networking reduces component count, meaning customers shouldn’t notice a big difference overall when comparing power consumption at the server level, Medina said.

Habana Labs improved its Gaudi2 scores on both BERT and ResNet 50 (Source: Intel/Habana Labs)

Intel Sapphire Rapids

Intel submitted its first set of training scores for its fourth-generation Xeon Scalable CPUs, code named Sapphire Rapids. A total of 32 Sapphire Rapids CPUs can train BERT in 47.3 minutes, or ResNet in 89.0 minutes. Two CPUs can train DLRM in under an hour.

“We proved that on a standard 2-socket Xeon scalable processor, that you can train,” Jordan Plawner, senior director of AI products at Intel, told EE Times. “And within 1-16 nodes, you can train intuitively, in a reasonable amount of time.”

Fourth-gen Intel Xeon Scalable Processor, code named Sapphire Rapids (Source: Intel)

While CPU training won’t suit all users of AI, some data center customers will be more than happy with this, Plawner said.

“Part of the market is very happy using shared, general-purpose infrastructure to do intermittent training,” he said. “Look at the number of minutes and the number of nodes. This either resonates with you or it doesn’t. Either you’re in this camp or you’re not.”

New to fourth-gen Xeons is AMX (advanced matrix extensions), a set of new instructions specifically for accelerating matrix multiplication in AI/ML workloads. Plawner expects a speedup of between 3× and 6× across inference and training for different model types; the MLPerf scores also reflect Sapphire Rapids’ larger size and core count.

Against Intel’s previous MLPerf training submissions for third-gen Xeons (Cooper Lake) from July 2021, the only possible comparison was on the recommendation benchmark DLRM. (DLRM may not be a good indicator of AMX’s contribution, since the workload is typically more memory-bound than compute-bound, but Sapphire Rapids has more memory bandwidth and a hardware data streaming accelerator block, which no doubt contributes here.)

Four Sapphire Rapids CPUs can train the DLRM benchmark in 38.0 minutes, 3.3× faster than four Cooper Lake CPUs, and for 8x CPU systems, the improvement was 2.9×.

Plawner said that Intel is currently running fine-tuning/transfer learning experiments using Sapphire Rapids, a type of setup where big models trained on accelerator systems can be fine-tuned with a small amount of training in just a few minutes.

MosaicML

MosaicML showed scores in the open division (unlike the closed division, the open division allows changes to the model).

MosaicML took a popular version of BERT and trained it on a typical DGX-A100 system (8x Nvidia A100 GPUs). They added their own efficiency speedups via the company’s software library, Composer, which reduced the time to train from 21.4 to 7.9 minutes—a factor of 2.7×. This actually brings Mosaic’s A100 score close to Nvidia’s H100 score (6.4 minutes, albeit for a slightly different version of BERT).

“We’ve built a software optimization framework that makes it easy for folks to plug and play different software,” Hanlin Tang, co-founder of MosaicML, told EE Times. “To get these speedups, we did a few things. We added a whole bunch of our efficiency methods, some are system-level things like kernel fusions, and better kernels including better attention kernels such as [HazyResearch’s] FlashAttention… and the third thing is tuning, which leads to better data efficiency for the model.”

Better data efficiency—training to the same accuracy with less data—means training can be completed faster. This has implications for large language models where size can be limited by access to enough training data today.

Tang said that data quality also matters. For Mosaic’s previous ResNet submissions, the company used techniques such as training on smaller images in the initial parts of training, when the model is learning coarse-grained features. The company intends to apply techniques like this to NLP training in the future.

“A general concept that we’re seeing more and more of is that the neural network architecture starts becoming less important over time,” Naveen Rao, co-founder of MosaicML, told EE Times. “It’s really about how you select data to cause more learning. Our brains do this very well; we get naturally filtered data points that are more useful, and throw away the ones that are less useful. Not every data point has something to be learned from, and I think that’s a key concept.”

MosaicML runs customer training in its Nvidia A100-powered cloud, where optimizations can be invoked with a single command. While the optimization concepts aren’t unique to particular hardware, the implementations are; much is hardware- and system-specific, which is why the company offers a cloud service. The company’s aim is to offer training for very large models at efficient cost points.

“One of the reasons we founded the company was to have these state-of-the-art methods be accessible to many industries,” Rao said. “The problem we now have is [AI] can do amazing things, but it’s just being used by a small number of large tech companies. That’s not really what we want to see.”

GreenWaves Technologies NE16

As well as training results, this round of MLPerf also showcased new tinyML inference scores.

In the tinyML category, European startup GreenWaves Technologies, a first-time submitter, swept the board with its 10-core RISC-V GAP9 processor, featuring the NE16 AI accelerator.

GreenWaves’ multi-core GAP9 processor (Source: GreenWaves Technologies)

Martin Croome, VP of marketing at GreenWaves, told EE Times that the company’s staple diet is bigger audio networks, but there are some instances where customers have many smaller networks they want to run simultaneously.

GreenWaves’ GAP9 can perform keyword spotting inference in 0.73 ms using 18.6 µJ, or 0.48 ms using 26.7 µJ. This is both faster and lower energy than nearest challenger Syntiant, but Croome stressed that GreenWaves’ product is for a different market with a different cost point.
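The two GAP9 operating points trade latency against energy, and dividing energy by latency gives the average power each one implies. The sketch below uses only the figures quoted above; the labels "energy-optimized" and "latency-optimized" and the helper name `average_power_mw` are assumptions made for illustration, not GreenWaves’ terminology.

```python
# Keyword-spotting operating points quoted in the article for GAP9:
# (latency in milliseconds, energy per inference in microjoules).
# The labels are hypothetical, chosen here for readability.
GAP9_POINTS = {
    "energy-optimized": (0.73, 18.6),
    "latency-optimized": (0.48, 26.7),
}


def average_power_mw(latency_ms: float, energy_uj: float) -> float:
    """Average power in milliwatts: microjoules / milliseconds = milliwatts."""
    return energy_uj / latency_ms


for label, (latency_ms, energy_uj) in GAP9_POINTS.items():
    power = average_power_mw(latency_ms, energy_uj)
    print(f"{label}: {power:.1f} mW average during inference")
```

On these figures the faster point finishes about 1.5× sooner but draws roughly twice the average power, which is why its energy per inference is higher.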

The company had several tricks up its sleeve for smaller networks like the tiny MLPerf benchmarks. For most of the benchmarks, GreenWaves’ team was able to keep everything in the device’s large shared L1 cache between layers, minimizing data transfer and the associated energy. Almost all weights were quantized to 6-bit (the NE16 can support down to 2-bit weights).
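For readers unfamiliar with low-bit weights, the sketch below shows one common scheme, symmetric uniform quantization, applied at 6 bits. It is a generic illustration with made-up weight values; the article does not say which quantization scheme GreenWaves’ tools use, and the function names are hypothetical.

```python
def quantize_symmetric(weights, bits=6):
    """Map float weights to signed integer codes of the given bit width.

    Returns (codes, scale) such that weight ~= code * scale.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for 6-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Made-up example weights, not taken from any benchmark model.
weights = [0.82, -0.41, 0.05, -0.97, 0.33]
codes, scale = quantize_symmetric(weights, bits=6)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_error:.4f}")
```

With this scheme the rounding error per weight is bounded by half the scale step, and 6-bit codes need three quarters of the storage of 8-bit ones, which helps keep a whole network in on-chip memory.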

“NE16 has proven to be very good at optimizing pointwise convolutions, and the overall architecture is good, and we’ve done a lot of work on the tools over the last five years, so it’s a combination of multiple things,” Croome said.

GreenWaves uses a combination of custom and non-custom kernels assembled together by the company’s GAPflow toolchain, which can fuse together convolution, pooling layers, activations of different types, and more. This is particularly useful in the world of audio, GreenWaves’ target market, where neural networks are in general more diverse than in computer vision.
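Operator fusion of the kind described here can be illustrated with a toy one-dimensional example. The sketch below is a generic illustration, not GAPflow’s implementation: the unfused path materializes a full intermediate list after each layer, while the fused path computes convolution, ReLU, and max pooling in a single pass per output element. All function names are hypothetical.

```python
def conv1d(x, w):
    """Valid 1-D convolution (correlation form) of signal x with kernel w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def relu(v):
    return [max(0.0, a) for a in v]


def maxpool(v, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(v[i:i + size]) for i in range(0, len(v) - size + 1, size)]


def unfused(x, w, size=2):
    # Three separate passes, each writing a full intermediate buffer.
    return maxpool(relu(conv1d(x, w)), size)


def fused(x, w, size=2):
    # One pass: each pooled output is computed directly from the input.
    k = len(w)
    n_conv = len(x) - k + 1
    out = []
    for start in range(0, n_conv - size + 1, size):
        best = 0.0  # starting at 0 folds the ReLU into the pooling max
        for i in range(start, start + size):
            acc = sum(x[i + j] * w[j] for j in range(k))
            best = max(best, acc)
        out.append(best)
    return out


x = [1.0, -2.0, 3.0, 0.0, -1.0, 2.0, 4.0]
w = [1.0, -1.0]
print(fused(x, w), fused(x, w) == unfused(x, w))
```

Both paths give the same result, but the fused version never stores the convolution or activation outputs, which is the kind of saving that matters when intermediate data would otherwise have to move between memories.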

Plumerai

European startup Plumerai’s software solution is an inference engine for any tinyML models running on Arm Cortex-M, which typically halves memory footprint and increases inference speed by as much as 70% without affecting accuracy, according to the company (compared to TF Lite for Micros and CMSIS-NN). This is achieved without additional quantization.

Plumerai submitted scores using its inference engine for Arm Cortex-M33, M4, and M7 microcontrollers. Inference speed was improved 2-6% over Plumerai scores in the last round.

Compared to other results on the same Arm Cortex-M4 microcontroller (STM32L4R5ZIT6U) running at the same clock speed, Plumerai’s image classification scores were 1.3× faster than STMicro’s own result, which was in turn faster than OctoML. (STMicro’s inference engine, part of its X-Cube-AI software stack, is based on an optimized version of CMSIS-NN).

On Cortex-M33, Plumerai again beat STMicro and OctoML scores, even getting faster latency than Silicon Labs’ M33 device with on-chip accelerator (Plumerai did not submit power scores).

Other notable MLPerf Tiny entries

Syntiant submitted a second round of scores for its NDP120, which features a second-generation Syntiant in-memory compute core. Keyword spotting results improved from 1.80 ms to 1.48 ms (1.2×) and from 35.29 µJ to 31.5 µJ (1.1×).

This is notably the first time Syntiant submitted benchmarks for workloads other than keyword spotting—visual wake words and image classification scores are also available for the NDP120. The company said all the tinyML benchmarks used less than a third of the on-chip resources, meaning it’s suited to applications like sensor fusion or running more than one neural network simultaneously.

STMicro improved its inference latency by up to 33% and energy scores by up to 37% compared to the last round. This was achieved by adding more optimizations in the company’s X-Cube-AI stack—users can now optimize for memory, latency, or a balance of the two.

Silicon Labs once again entered its MG24 part, a multi-protocol SoC for IoT applications, which includes Silicon Labs’ home-grown AI accelerator alongside an Arm Cortex-M33 core. The MG24’s scores improved 1.5-1.9× across the board, for both latency and energy consumption, except for anomaly detection which was similar to the last round. Silicon Labs used TensorFlow Lite for microcontrollers and CMSIS-NN for its submissions.

OctoML’s offering is a developer platform focused on code portability, performance, and tools, which is intended to allow model deployment without specialized ML expertise. The company submitted scores using two different compilation flows: The baseline used Apache TVM and CMSIS-NN, while the other scores used microTVM and OctoML’s AutoTuning optimization process.

The company’s intent was to show that autotuning with native microTVM schedules achieved similar performance to CMSIS-NN; visual wake words were within 11% on latency and power.

Qualcomm submitted in the preview category (for systems not yet commercially available) with a next-gen version of the Qualcomm Sensing Hub. The Sensing Hub is an on-chip AI accelerator block designed for always-on processing of sensor data (mainly camera data) in smartphones.

In Snapdragon mobile processors, this block offloads always-on processing from the Hexagon processor, which is used for bigger AI tasks. This latest generation of the Sensing Hub includes a second AI accelerator core alongside DSP and memory. It can perform anomaly detection in less than 0.1 ms, faster than any score submitted in the “available” category.

 

This article was originally published on EE Times.

Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EETimes Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a Master’s degree in Electrical and Electronic Engineering from the University of Cambridge.

 
