Nvidia Exhibits Hopper in Latest MLPerf Benchmarks

Article By : Sally Ward-Foxton

In the latest round of MLPerf AI inference benchmarking scores, Nvidia's new flagship GPU showed up to 4.5× the performance of the A100.

Nvidia used the latest round of MLPerf inference scores to debut public benchmarks for its latest flagship GPU, the H100. H100 is the first chip to be built on the company's Hopper architecture with its specially designed transformer engine. H100 outperformed Nvidia's current flagship, the A100, by 1.5-2× across the board, except for the BERT scores, where the advantage was more pronounced with up to a 4.5× uplift.

Nvidia's graph shows the performance of the new H100 relative to the company's previous-generation part (the A100) as well as versus competing hardware. (Source: Nvidia)

With triple the raw performance of the A100, why are some of H100’s benchmark scores less than double? 

“While the FLOPS and TOPS numbers are a useful initial set of guideposts, they don’t necessarily predict application performance,” Dave Salvator, Nvidia’s director of AI inference, benchmarking, and cloud, told EE Times in an interview. “There are other factors, [including] the nature of the architecture of the network you’re running. Some networks are more I/O bound, some networks are more compute bound… it varies by network.”  

Salvator added that there is headroom for H100’s scores to improve as its software stack matures. 

“This is a first showing for Hopper… there is still gas left in the tank,” he said.  

Salvator pointed out that the A100’s results have improved 6× since that accelerator’s first MLPerf showing in July 2020. “Most of that came from software tuning optimizations, many of which make their way onto our containers on NGC [Nvidia’s software portal] that developers can use.” 

H100's standout result was on BERT-Large, where it performed as much as 4.5× better than the A100. Among H100's new features is a hardware and software transformer engine that manages the precision of calculations during training for the highest throughput while maintaining accuracy. While this functionality is more relevant to training, it does apply to inference, Salvator said.

"It's largely the FP8 precision that's coming into play here, but it's also some other architectural aspects of H100. The fact that we have more compute capability plays a role: more streaming processors, more tensor cores, and more compute," he said. H100 has also approximately doubled its memory bandwidth compared to the A100.

Some parts of the BERT 99.9 benchmark ran in FP16 and some in FP8; the secret sauce here is knowing when to jump to higher precision to preserve accuracy, which is part of what the transformer engine does.
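Nvidia has not published the transformer engine's internals, so the following is only a conceptual sketch of the idea Salvator describes: track each tensor's observed range, scale it into FP8 where possible, and fall back to FP16 when the dynamic range is too wide to survive the narrower format. The constants, threshold, and function names below are illustrative assumptions, not Nvidia's API.

```python
import numpy as np

# FP8 E4M3 represents magnitudes up to roughly 448, with the smallest normal
# value around 2**-6; both constants here are illustrative.
FP8_E4M3_MAX = 448.0
FP8_E4M3_MIN_NORMAL = 2.0 ** -6

def choose_precision(tensor: np.ndarray, running_amax: float) -> str:
    """Pick FP8 or FP16 for one layer's tensors (conceptual sketch only)."""
    amax = max(float(np.abs(tensor).max()), running_amax)
    scale = FP8_E4M3_MAX / amax              # map the observed maximum onto FP8's range
    scaled = np.abs(tensor) * scale
    nonzero = scaled[scaled > 0]
    # If too many values would vanish below FP8's smallest normal magnitude,
    # the dynamic range is too wide: fall back to FP16 for this layer.
    underflow = float(np.mean(nonzero < FP8_E4M3_MIN_NORMAL)) if nonzero.size else 0.0
    return "fp8_e4m3" if underflow < 0.01 else "fp16"

x = np.random.randn(1024, 768).astype(np.float32)
print(choose_precision(x, running_amax=1.0))
```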

Nvidia also showed an approximately 50% energy efficiency improvement for its edge SoC Orin, which Salvator put down to recent work to find an operational sweet spot for frequency and voltage (MaxQ). 

Orin's improvement in energy efficiency (taller bars are better) versus the last round of scores. (Source: Nvidia)
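Finding that sweet spot is essentially a search over operating points: run the workload at each candidate clock, measure throughput and power, and keep the point that maximizes inferences per joule. The sketch below illustrates the search with made-up throughput and power curves; it is not based on Orin's actual DVFS tables or tools.

```python
# Conceptual sketch of a MaxQ-style sweep: evaluate candidate clock points and
# keep the one that maximizes inferences per joule. The throughput and power
# models below are synthetic stand-ins, not measurements from Orin.

def throughput(freq_ghz: float) -> float:
    # Synthetic model: throughput saturates as the workload becomes I/O bound.
    return 1000 * freq_ghz / (1 + 0.4 * freq_ghz)

def power(freq_ghz: float) -> float:
    # Synthetic model: dynamic power grows superlinearly with frequency,
    # since voltage also rises with clock speed.
    return 10 + 8 * freq_ghz ** 2.4

def find_max_efficiency(freqs):
    # Return the (frequency, inferences-per-joule) point with the best efficiency.
    return max(((f, throughput(f) / power(f)) for f in freqs), key=lambda p: p[1])

candidates = [0.6 + 0.1 * i for i in range(10)]   # 0.6 GHz .. 1.5 GHz
freq, eff = find_max_efficiency(candidates)
print(f"best point: {freq:.1f} GHz, {eff:.1f} inferences per joule")
```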

Benchmark scores for Grace CPU systems, Grace Hopper, and power measurements for H100 should be available once the products reach the market in the first half of next year, Salvator said.  

Qualcomm 

Nvidia’s main challenger, Qualcomm, focused on energy efficiency for its Cloud AI 100 accelerator. Qualcomm runs the same chip in different power envelopes for data center and edge use cases.  

There were over 200 Cloud AI 100 scores submitted by Qualcomm and its partners, including Dell, HPE, Lenovo, Inventec, and Thundercomm. Three new edge platforms based on Snapdragon CPUs with Cloud AI 100s were also benchmarked, including Foxconn Gloria systems. 

Qualcomm entered the largest system (18 accelerators) in the available category of the closed data center division and claimed the crown for the best ResNet-50 offline and server performance. The 8x Cloud AI 100 scores, however, were easily bested by Nvidia’s 8x A100 PCIe system. (Nvidia H100 is in the “preview” category as it isn’t commercially available yet). 

Qualcomm also claimed the best power efficiency across the board in the closed edge system and closed data center system divisions.  

Qualcomm's Cloud AI 100, run at 75 W TDP or below, fared well on power efficiency for edge devices. (Source: Qualcomm)
Qualcomm also claimed a win on power efficiency in the closed data center category, with the Cloud AI 100 again limited to 75 W TDP. (Source: Qualcomm)

Biren

Chinese GPU startup Biren offered its first set of MLPerf scores since emerging from stealth last month.  

Biren presented scores for its BR104 single-chiplet accelerator in the PCIe form factor, alongside its BirenSupa software development platform. For both ResNet-50 and BERT 99.9, the 8-accelerator Biren system offered similar performance to Nvidia's DGX-A100 in server mode, where there is a latency constraint, but comfortably outperformed it in offline mode, which is a measure of raw throughput.

Biren's BR100, which has a pair of the same chiplets used singly in the BR104, was not benchmarked.

Chinese server maker Inspur also submitted results for a commercially available system with 4x BR104 PCIe cards. 

Sapeon

Another new entrant was Sapeon, a spin-out of Korean telecoms giant SK Telecom. SK Telecom had been working on the accelerator since 2017, before the spin-out; the X220, a second-generation chip, has been on the market since 2020. The company said its chip is in smart speakers and security camera systems. It claimed victory over Nvidia's A2, an Ampere-generation part intended for entry-level servers in 5G and industrial applications.

Sapeon showed scores for the X220-compact, a single-chip PCIe card consuming 65 W, and the X220-enterprise, which has two X220 chips and consumes 135 W. The company pointed out that the X220-compact beat the Nvidia A2 by 2.3× on performance and was also 2.2× more power efficient, based on maximum power consumption. This is despite the X220's low-cost 28-nm process technology (the Nvidia A2 is built on 7 nm).

Sapeon is planning a third-generation chip, the X330, for the second half of 2023, which the company says will offer higher precision and will handle both inference and training workloads.  

Intel  

Intel submitted preview scores for its delayed Sapphire Rapids CPU. This four-chiplet Xeon data center CPU is the first to get Intel's Advanced Matrix Extensions (AMX), which Intel says enable 8× the operations per clock compared to previous generations.

Sapphire Rapids also offers more compute, more memory, and more memory bandwidth than previous generations. Intel said Sapphire Rapids' scores were between 3.9× and 4.7× those of its previous-generation CPUs in offline mode, and between 3.7× and 7.8× in server mode.
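Developers generally don't program AMX directly; it is reached through libraries such as oneDNN, which frameworks call into for int8 and bfloat16 matrix math. As a hedged illustration only (assuming a PyTorch build with oneDNN CPU support, and arbitrary model shapes), the path looks roughly like this: run the model under CPU autocast in bfloat16 and let the library pick AMX kernels where the hardware supports them.

```python
import torch

# Minimal sketch: on a Sapphire Rapids CPU, PyTorch's oneDNN backend can
# dispatch bfloat16 matrix multiplies to AMX tile instructions. The model and
# sizes here are arbitrary; whether AMX is used depends on the build and CPU.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval()

x = torch.randn(64, 1024)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.shape, y.dtype)
```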

Other Notable Results 

Chinese company Moffett submitted scores in the open division for its platform, which includes its Antoum chips, its software stack, and the company's own sparse algorithms. The S4 (75 W) chip is commercially available, with the S10 and S30 (250 W) still in the preview category. The Antoum architecture uses Moffett's own sparse processing units for native sparse convolution alongside vector processing units, which add workload flexibility.

Startup Neural Magic has developed a sparsity-aware inference engine for CPUs. Combined with Neural Magic’s compression framework, which takes care of pruning and quantization, the inference engine enables neural nets to run efficiently on CPUs by changing the order of execution so that information can be kept in the CPU’s cache (without having to go to external memory). The company’s scores were submitted on Intel Xeon 8380 CPUs.  
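Neural Magic has not published its scheduler, but the execution-order idea can be shown with a toy example: instead of materializing each layer's full activation matrix before starting the next layer, push a small tile of inputs through the whole network at a time, so the working set for that tile can stay in cache. The shapes, tile size, and dense (unpruned) weights below are arbitrary assumptions for illustration.

```python
import numpy as np

# Toy 3-layer MLP; the weights are dense here, but the same reordering applies
# when most weights have been pruned to zero and can be skipped.
rng = np.random.default_rng(0)
W = [rng.standard_normal((512, 512)) * 0.05 for _ in range(3)]
x = rng.standard_normal((4096, 512))

def layer_by_layer(x):
    # Conventional order: each layer's full activation matrix is materialized,
    # typically too large to stay in cache between layers.
    a = x
    for w in W:
        a = np.maximum(a @ w, 0.0)
    return a

def depth_wise(x, tile=128):
    # Reordered: push one small tile of rows through every layer before moving
    # to the next tile, so the intermediates for a tile can remain in cache.
    out = np.empty((x.shape[0], W[-1].shape[1]))
    for i in range(0, x.shape[0], tile):
        a = x[i:i + tile]
        for w in W:
            a = np.maximum(a @ w, 0.0)
        out[i:i + tile] = a
    return out

# Both orders compute the same result; only the memory access pattern differs.
assert np.allclose(layer_by_layer(x), depth_wise(x))
```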

Israeli software startup Deci submitted results for its version of BERT in the open division, running on AMD Epyc CPUs. Deci's software uses neural architecture search to tailor the neural network's architecture to the target CPU, often reducing its size in the process. The speedup was between 6.33× and 6.46× versus the baseline.

Deci's version of BERT was able to run much faster than the baseline on the same hardware. (Source: Deci)
