In the latest round of MLPerf AI inference benchmarking scores, Nvidia showed up to 4.5× the performance of the A100 for its new flagship GPU.
Nvidia used the latest round of MLPerf inference scores to debut public benchmarks for its latest flagship GPU, the H100. H100 is the first chip to be built on the company’s Hopper architecture with its specially designed transformer engine. H100 outperformed Nvidia’s current flagship, the A100, by 1.5-2× across the board, except for the BERT scores where the advantage was more pronounced with up to 4.5× uplift.
With triple the raw performance of the A100, why are some of H100’s benchmark scores less than double?
“While the FLOPS and TOPS numbers are a useful initial set of guideposts, they don’t necessarily predict application performance,” Dave Salvator, Nvidia’s director of AI inference, benchmarking, and cloud, told EE Times in an interview. “There are other factors, [including] the nature of the architecture of the network you’re running. Some networks are more I/O bound, some networks are more compute bound… it varies by network.”
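Salvator's point about compute-bound versus I/O-bound networks can be illustrated with a simple roofline-style estimate. The sketch below is a hypothetical model, not Nvidia's methodology: the absolute figures for the two chips are invented placeholders, and only the ratios (roughly triple the peak compute and double the memory bandwidth) are taken from this article. The function and variable names are likewise assumptions made for illustration.

```python
def attainable_tflops(peak_tflops, mem_bw_tbps, flops_per_byte):
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    return min(peak_tflops, mem_bw_tbps * flops_per_byte)

# Hypothetical chips. Only the 3x compute and 2x bandwidth ratios reflect the
# article; the absolute numbers are placeholders chosen for easy arithmetic.
OLD_CHIP = {"peak_tflops": 100.0, "mem_bw_tbps": 1.0}
NEW_CHIP = {"peak_tflops": 300.0, "mem_bw_tbps": 2.0}

def speedup(flops_per_byte):
    """Ratio of new-chip to old-chip attainable throughput at a given intensity."""
    new = attainable_tflops(flops_per_byte=flops_per_byte, **NEW_CHIP)
    old = attainable_tflops(flops_per_byte=flops_per_byte, **OLD_CHIP)
    return new / old

# Low intensity is bandwidth bound, high intensity is compute bound.
for intensity in (20, 120, 500):
    print(intensity, round(speedup(intensity), 2))
```

Under these assumptions a bandwidth-bound network sees only the 2× bandwidth gain, a compute-bound one sees the full 3×, and networks in between land somewhere in the middle, which is one plausible reading of why raw FLOPS alone do not predict benchmark uplift.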
Salvator added that there is headroom for H100’s scores to improve as its software stack matures.
“This is a first showing for Hopper… there is still gas left in the tank,” he said.
Salvator pointed out that the A100’s results have improved 6× since that accelerator’s first MLPerf showing in July 2020. “Most of that came from software tuning optimizations, many of which make their way onto our containers on NGC [Nvidia’s software portal] that developers can use.”
H100’s standout result was on BERT-Large, where it performed as much as 4.5× better than the A100. Among H100’s new features is a hardware and software transformer engine that manages the precision of calculations during training for highest throughput while maintaining accuracy. While this functionality is more relevant to training, it does apply to inference, Salvator said.
“It’s largely the FP8 precision that’s coming into play here, but it’s also some other architectural aspects of H100. The fact that we have more compute capability plays a role, more streaming processors, more tensor cores, and more compute,” he said. H100 also has approximately double the memory bandwidth of A100.
Some parts of the BERT 99.9 benchmark ran in FP16 and some in FP8. The secret sauce here is knowing when to jump to higher precision to preserve accuracy, which is part of what the transformer engine does.
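The idea of falling back to higher precision only where accuracy demands it can be sketched in a few lines. This is an illustrative heuristic, not a description of how Nvidia's transformer engine actually decides: the error metric, the tolerance, and every function name here are assumptions. The only format facts used are that the FP8 E4M3 variant carries 3 mantissa bits and FP16 carries 10; exponent range and scaling are ignored for simplicity.

```python
import math

def round_to_mantissa(x, mantissa_bits):
    """Round x to the nearest value with the given explicit mantissa bits.

    Exponent range limits are ignored, so this models rounding error only.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)   # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

def relative_error(values, mantissa_bits):
    """Root-mean-square rounding error relative to the tensor's magnitude."""
    num = sum((v - round_to_mantissa(v, mantissa_bits)) ** 2 for v in values)
    den = sum(v * v for v in values)
    return math.sqrt(num / den) if den else 0.0

def choose_precision(values, tolerance):
    """Hypothetical policy: use FP8 (3 mantissa bits) unless the error is too high."""
    return "fp8" if relative_error(values, 3) <= tolerance else "fp16"

print(choose_precision([1.0, 0.5, 2.0], tolerance=1e-3))     # exactly representable values
print(choose_precision([1.03, 0.97, 1.01], tolerance=1e-3))  # values that round coarsely
```

A production system would also have to manage scaling factors and dynamic range per tensor; the sketch only shows the shape of the trade-off between throughput and accuracy.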
Nvidia also showed an approximately 50% energy efficiency improvement for its edge SoC Orin, which Salvator put down to recent work to find an operational sweet spot for frequency and voltage (MaxQ).
Benchmark scores for Grace CPU systems, Grace Hopper, and power measurements for H100 should be available once the products reach the market in the first half of next year, Salvator said.
Nvidia’s main challenger, Qualcomm, focused on energy efficiency for its Cloud AI 100 accelerator. Qualcomm runs the same chip in different power envelopes for data center and edge use cases.
There were over 200 Cloud AI 100 scores submitted by Qualcomm and its partners, including Dell, HPE, Lenovo, Inventec, and Thundercomm. Three new edge platforms based on Snapdragon CPUs with Cloud AI 100s were also benchmarked, including Foxconn Gloria systems.
Qualcomm entered the largest system (18 accelerators) in the available category of the closed data center division and claimed the crown for the best ResNet-50 offline and server performance. The 8x Cloud AI 100 scores, however, were easily bested by Nvidia’s 8x A100 PCIe system. (Nvidia’s H100 is in the “preview” category because it is not yet commercially available.)
Qualcomm also claimed the best power efficiency across the board in the closed edge system and closed data center system divisions.
Chinese GPU startup Biren offered its first set of MLPerf scores since emerging from stealth last month.
The Chinese startup presented scores for its BR104 single-chiplet accelerator in the PCIe form factor alongside its BirenSupa software development platform. For both ResNet-50 and BERT 99.9, the Biren 8-accelerator system offered similar performance to Nvidia’s DGX-A100 in server mode, where there is a latency constraint, but comfortably outperformed Nvidia DGX-A100 in offline mode, which is a measure of raw throughput.
Biren’s BR100—which has a pair of the same chiplets used singly in the BR104—was not benchmarked.
Chinese server maker Inspur also submitted results for a commercially available system with 4x BR104 PCIe cards.
Another new entrant was Sapeon, a spin-out of Korean telecoms giant SK Telecom. Before spinning out, Sapeon had been working on its accelerator since 2017; the X220, a second-generation chip, has been on the market since 2020. The company said its chip is in smart speakers and security camera systems. It claimed victory over Nvidia’s A2, an Ampere-generation part intended for entry-level servers in 5G and industrial applications.
Sapeon showed scores for the X220-compact, a single-chip PCIe card consuming 65 W, and the X220-enterprise, which has two X220 chips and consumes 135 W. The company pointed out that the X220-compact beat Nvidia A2 by 2.3× in terms of performance and was also 2.2× more power efficient, based on maximum power consumption. This is despite the X220’s low-cost 28-nm process technology (Nvidia A2 is on 7 nm).
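The relationship between a performance ratio and a power-efficiency ratio "based on maximum power consumption" is simple arithmetic, sketched below. The function name is an assumption, and the baseline power values passed in are hypothetical placeholders rather than published figures for any Nvidia part; only the 2.3× performance ratio and the 65 W figure come from the article.

```python
def efficiency_ratio(perf_ratio, new_max_w, baseline_max_w):
    """Performance-per-watt of the new part relative to the baseline.

    (perf_new / new_max_w) / (perf_base / baseline_max_w)
    simplifies to perf_ratio * baseline_max_w / new_max_w.
    """
    return perf_ratio * baseline_max_w / new_max_w

# If both cards had the same maximum power, the efficiency ratio would
# equal the performance ratio (hypothetical baseline of 65 W).
print(efficiency_ratio(2.3, new_max_w=65.0, baseline_max_w=65.0))

# A baseline with a slightly lower maximum power (hypothetical 60 W)
# pulls the efficiency ratio below the performance ratio.
print(efficiency_ratio(2.3, new_max_w=65.0, baseline_max_w=60.0))
```

Read this way, an efficiency advantage slightly smaller than the performance advantage is what one would expect if the comparison part has a somewhat lower maximum power rating than the 65 W card.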
Sapeon is planning a third-generation chip, the X330, for the second half of 2023, which the company says will offer higher precision and will handle both inference and training workloads.
Intel submitted preview scores for its delayed Sapphire Rapids CPU. This four-chiplet Xeon data center CPU is the first to get Intel’s advanced matrix extensions (AMX), which Intel says enable 8× the operations per clock compared to previous generations.
Sapphire Rapids also offers more compute, more memory and more memory bandwidth than previous generations. Intel said Sapphire Rapids’ scores were between 3.9× and 4.7× those of its previous-generation CPUs for offline mode and between 3.7× and 7.8× for server mode.
Chinese company Moffett submitted scores in the open division for its platform, which includes its Antoum chips, its software stack, and the company’s own sparse algorithms. The company has the S4 (75 W) chip available with S10 and S30 (250 W) still in the preview category. The Antoum architecture uses Moffett’s own sparse processing units for native sparse convolution alongside vector processing units, which add workload flexibility.
Startup Neural Magic has developed a sparsity-aware inference engine for CPUs. Combined with Neural Magic’s compression framework, which takes care of pruning and quantization, the inference engine enables neural nets to run efficiently on CPUs by changing the order of execution so that information can be kept in the CPU’s cache (without having to go to external memory). The company’s scores were submitted on Intel Xeon 8380 CPUs.
Israeli software startup Deci submitted results for its version of BERT in the open division, running on AMD Epyc CPUs. Deci’s software uses neural architecture search to tailor the neural network’s architecture for the relevant CPU, and often reduces its size in the process. Speedup was between 6.33× and 6.46× versus the baseline.
This article was originally published on EE Times.
Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EETimes Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master’s degree in Electrical and Electronic Engineering from the University of Cambridge.