The latest round of MLPerf Inference scores includes new models for varied but realistic workloads such as recommendation, speech-to-text, medical imaging and more.
Benchmarking organization MLCommons has released a new round of MLPerf Inference scores. This latest round is separated into classes of device to make comparisons easier. The results also feature a range of new AI models, intended to represent workloads that are commercially deployed but still considered state-of-the-art.
Nvidia-accelerated systems accounted for about 85% of the total submissions and won every category they entered. However, there were no Nvidia submissions in the mobile or notebook classes (Nvidia has no AI acceleration products in those markets). There were also several interesting submissions from startups, and overall a greater tendency for submitters to fill in multiple columns, making comparisons easier.
Changes From Last Round
The first major change this round is that systems have been separated into classes: data center, edge, mobile and notebook. Mobile phones and notebooks have very specific form factors and performance profiles, which makes them easy to separate from the wider edge list.
“If you’re talking about a notebook, it’s probably running Windows, if you’re talking about a smartphone you’re probably running iOS or Android,” David Kanter, executive director of MLCommons, told EE Times. “Separating these results out from the larger pool of inference scores is very helpful in making things clearer.”
The benchmarks for this second round of inference scores have also been revamped to include AI models that represent modern use cases. While the previous round focused on vision and image-processing models, this time the data center and edge classes include the recommendation model DLRM, the medical imaging model 3D-UNet (used to look for tumors in MRI scans), the speech-to-text model RNN-T and the natural language processing (NLP) model BERT.
“[Model selection] is driven by customer input, but we don’t want to fall into the trap of having the students set their own test,” said Kanter, explaining that the aim was to identify cutting edge models that are in production, not just in the research phase. “DLRM and 3D-UNet, those were very informed [choices] driven by our advisory board, folks from the medical world, folks that do recommendation at scale… That sort of informed workload construction is tremendously valuable.”
The mobile and notebook classes use MobileNetEdge for image classification, SSD-MobileNetv2 for object detection, Deeplabv3 for image segmentation and Mobile BERT for NLP.
Across the board, accuracy targets have also been increased to reflect real-world deployments.
The analysis below refers only to the “closed” division, in which all submitters run the same models, for fair comparison.
Data center results
As expected, the majority of the submissions in the data center class used Nvidia GPU accelerators. The rest used Intel CPUs for the AI processing, with a couple of exceptions (see below). There were no submissions from Google for its TPU this time, and none from the vocal community of startups establishing themselves in this space (Graphcore, Cerebras, Groq, etc.).
“[Nvidia’s] performance lead over the CPUs has increased from about 6X to 30X on a basic computer vision model called ResNet, and on advanced recommendation system models… Nvidia A100 is 237 times faster than [Intel’s] Cooper Lake CPU,” said Paresh Kharya, senior director of product management and marketing at Nvidia. “A single DGX-A100 provides the same performance on recommendation systems as 1000 CPU servers, and astounding value for customers.”
Mipsology was the only commercially available non-CPU, non-GPU entrant in this division. The company has an accelerator technology called Zebra which runs on Xilinx FPGAs (in this case, a Xilinx Alveo U250). Its technology can handle 4096 ResNet queries per second in server mode (compared to roughly 5563 for an Nvidia T4), or 5011 samples per second in offline mode (compared to roughly 6112 for the T4).
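Put side by side, the quoted figures work out to Zebra delivering roughly three-quarters of the T4's server-mode throughput and just over four-fifths of its offline throughput. A quick check on the numbers as quoted above:

```python
# ResNet throughput figures as quoted above (queries/samples per second).
zebra = {"server": 4096, "offline": 5011}   # Mipsology Zebra on Alveo U250
t4 = {"server": 5563, "offline": 6112}      # Nvidia T4 (approximate)

for scenario in ("server", "offline"):
    ratio = zebra[scenario] / t4[scenario]
    print(f"{scenario}: Zebra at {ratio:.0%} of the T4")
# → server: Zebra at 74% of the T4
# → offline: Zebra at 82% of the T4
```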
Taiwanese company Neuchips submitted a score in the Research, Development or Internal category, which means the device it used is not commercially available and most likely won’t be for at least another six months. RecAccel is designed specifically to accelerate DLRM, the recommendation model used in this benchmark. It uses a massively parallel design running on an Intel Stratix FPGA for AI inference. Its results on DLRM were comparable to or worse than Intel’s Cooper Lake CPUs, and no match for Nvidia.
Edge results
The edge category was dominated by scores accelerated by Nvidia’s A100, T4, AGX Xavier and Xavier NX.
Centaur Technology entered results from its commercially available reference design system, which uses Centaur’s server processor based on its in-house x86 microarchitecture, plus a separate in-house AI accelerator as a co-processor. This reference design is a server-class system for on-premises or private data center applications, optimized for cost and form factor (rather than power consumption or peak performance), according to Centaur.
On ResNet image classification (single stream latency), Centaur’s system was faster than Nvidia’s own submissions for server systems equipped with the Tesla T4. However, the T4 beat Centaur’s design on ResNet offline samples processed per second. Centaur did not fare quite as well on object detection, coming in somewhere between Nvidia’s two embedded edge modules, the Xavier NX and the AGX Xavier.
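Centaur winning on single-stream latency while losing on offline throughput is not contradictory: the two scenarios score different things. Below is a toy sketch (not the actual MLPerf LoadGen harness, and the 1 ms model cost is an arbitrary stand-in) of how the two metrics are derived:

```python
import time

def infer(batch):
    """Stand-in for a real model: pretend each sample costs ~1 ms."""
    time.sleep(0.001 * len(batch))

# Single-stream: issue one query at a time; the score is per-query latency.
queries = 20
t0 = time.perf_counter()
for _ in range(queries):
    infer([0])
latency_ms = (time.perf_counter() - t0) / queries * 1000

# Offline: issue every sample at once; the score is aggregate throughput.
samples = list(range(200))
t0 = time.perf_counter()
infer(samples)
throughput_sps = len(samples) / (time.perf_counter() - t0)

print(f"single-stream latency: {latency_ms:.2f} ms")
print(f"offline throughput: {throughput_sps:.0f} samples/s")
```

A design tuned for fast individual queries can still fall behind on offline throughput if it cannot exploit large batches, which is the pattern in the Centaur vs. T4 numbers above.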
British engineering consultancy dividiti, which specializes in objectively evaluating ML hardware and software systems, submitted a raft of scores on systems ranging from Firefly and Raspberry Pi boards to the Nvidia AGX Xavier. Seemingly identical Raspberry Pi systems in fact ran different operating systems (32-bit Debian vs. 64-bit Ubuntu; Ubuntu was roughly 20% faster). The company’s results differed from Nvidia’s own results for the AGX Xavier because Nvidia used both the AGX Xavier’s GPU and its two on-chip deep learning accelerators for its ResNet offline and multistream scores, whereas dividiti used only the GPU.
A dividiti spokesperson also told EE Times that while the company had managed to “more or less” reproduce Nvidia’s scores in the previous inference round, this time a performance regression crept into its test harness and was noticed only minutes before the submission deadline (fixing the issue later improved some results by up to 20%). This illustrates how strongly the exact hardware/software combination influences results.
New entries in this category include IVA Technologies and Mobilint, both in the Research, Development or Internal category.
IVA Technologies, a Russian designer and manufacturer of IT equipment, has been working on an AI accelerator chip that supports convolutional, 3D-convolutional and LSTM models. The company submitted a score labelled “FPGA,” which may be a prototype of the accelerator ASIC implemented on an FPGA. ResNet single-stream latency was 12.23 ms, roughly 4x that of the Xavier NX, and it processed 89 offline samples per second, less than a tenth of the Xavier NX’s figure. However, the edge category is broad and not much is known about the design; it could be intended for smaller devices than the Xavier NX.
Mobilint, a Korean AI accelerator ASIC startup, submitted a score for its Mobilint Edge design, which EE Times suspects was implemented as a prototype on a Xilinx Alveo U250 FPGA card. On ResNet, its latency was much longer than IVA Technologies’ design at 37.46 ms, but it processed more offline samples per second (107). The company also submitted scores for object detection.
While neither IVA Technologies nor Mobilint produced groundbreaking scores, there is certainly value in benchmarking prototypes, since it shows their accompanying software stacks are ready.
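The two prototypes also show how the benchmark scenarios can rank designs differently: IVA’s part answers a single query about three times faster, while Mobilint’s pushes through more samples in offline mode. Checking the quoted numbers:

```python
# ResNet figures as quoted above for the two edge prototypes.
iva      = {"latency_ms": 12.23, "offline_sps": 89}
mobilint = {"latency_ms": 37.46, "offline_sps": 107}

# Lower single-stream latency favors IVA...
assert iva["latency_ms"] < mobilint["latency_ms"]
print(f"latency ratio: {mobilint['latency_ms'] / iva['latency_ms']:.1f}x")  # → 3.1x

# ...but offline throughput favors Mobilint.
assert mobilint["offline_sps"] > iva["offline_sps"]
```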
Mobile results
In the new mobile SoC category, there were three fairly well-matched submissions, with no clear winner.
MediaTek submitted scores for its Dimensity 820 (in the Xiaomi Redmi 10X 5G smartphone). This device uses MediaTek’s own AI processing unit (APU) 3.0, an FP16- and INT16-capable accelerator optimized for camera/imaging functions. The SoC also has a five-core GPU.
The Qualcomm Snapdragon 865+ uses the company’s Hexagon 698 processor designed for AI acceleration, which clocks in at 15 TOPS, alongside the Adreno 650 GPU. The benchmarks were run on an Asus ROG Phone 3.
Samsung’s Exynos 990 was benchmarked as part of the Galaxy Note 20 Ultra. This device contains a dual-core NPU (neural processing unit) and an Arm Mali-G77 GPU alongside various Arm CPU cores.
Samsung’s Exynos 990 did best on image classification and NLP; the MediaTek Dimensity 820 was very close on image classification but Samsung had a clearer lead on NLP. MediaTek had a clear lead in object detection, with the Qualcomm Snapdragon 865+ in second place. MediaTek also won the image segmentation benchmark, ahead of Qualcomm by a narrow margin.
Notebook results
There was only one entry in the notebook category – an Intel reference design that uses the forthcoming Intel Xe-LP GPU as an accelerator. The Xe-LP is the low-power version of the Xe-HP and Xe-HPC, which are aimed at data center AI acceleration and HPC; neither of the bigger devices was benchmarked.
Because there was only one entry in this class, it’s tricky to interpret the Xe-LP’s results. However, the notebook category used the same AI models as the mobile category, so some comparison is inevitable. Xe-LP’s biggest advantage over the mobile SoCs was on image segmentation (Deeplabv3), where it outperformed the mobile winner by a factor of 2.5 on throughput (frames per second). Its weakest showing was on object detection (SSD-MobileNetv2), where its throughput advantage over the mobile winner was just 1.15x.
Moving forward, Kanter is hopeful that future rounds of the benchmarks will include more non-Nvidia, non-Intel entries, saying the organization has gone out of its way to encourage startups and smaller companies to submit results.
“We have an open division, where you can submit any network you want,” he said. “One of the nice things about that is if a customer says I want X, and you do all the enablement for that, you can use X, as long as you can drop in the code so we can see what you’re running.”
Companies can submit results for as little as one AI model to keep engineering effort low, and can even submit their own models into the open category.
Kanter also mentioned that it is the organization’s intent to introduce a power measurement dimension to the next round of scores. Work is already in progress.
“One of the things we’d love to get people involved with is helping build the power measurement infrastructure – help us build out the tools to make those measurements,” Kanter said.
The full list of MLPerf Inference results in detail is available here.