Graphcore benchmarked its 4-chip system against a single Nvidia GPU. Is that comparing apples to oranges? Does it matter?
By publishing an array of in-house benchmark figures, British AI chip startup Graphcore has mounted a challenge against the market leader for AI acceleration in the data center, Nvidia. Graphcore is claiming significant performance advantages for its second-generation IPU versus state-of-the-art Nvidia GPUs. However, Graphcore has put systems of different sizes head-to-head, saying that rather than matching chip counts, it has compared its systems with the Nvidia product that’s closest in price.
“The Graphcore numbers are misleading,” said Kevin Krewell, principal analyst at Tirias Research. “Many companies self-publish benchmark and performance data, but those should always be viewed sceptically. The use of performance per dollar is not a good measure for AI systems purchases because there are many other factors in the cost of ownership. Often, performance per rack space is a critical factor.”
The ancient practice of specmanship, it seems, is alive and well in the AI accelerator chip industry. Claims made by Graphcore include between 3.7x and 18x higher throughput for AI training and between 3.4x and 600x higher throughput for AI inference of various models compared to Nvidia GPUs.
Amongst other claims, Graphcore says its IPU-M2000 can achieve ResNet-50 training throughput of 4326 images/sec (batch=1024), which according to the company is 2.6x better than the Nvidia A100. On ResNet-50 inference, the IPU-M2000 can process 9856 images/sec, which Graphcore says is 4.6x higher throughput than the Nvidia A100.
Graphcore has not been shy about going head-to-head with AI chip leader Nvidia in the past, but this latest announcement seems particularly bold. The majority of Graphcore’s benchmarks compare the IPU-M2000, a system with four IPU-MK2 chips, against a single Nvidia A100 GPU. The company also compares its IPU-Pod64, a system with 64 chips, against one or two Nvidia DGX-A100 systems (8x or 16x A100 chips). The scale of the systems compared in Graphcore’s announcement seems inconsistent, but as with all performance benchmarks, the devil is in the details.
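To see what the difference in system size means for the ResNet-50 figures quoted above, the throughput can be normalized per chip. The following is a minimal Python sketch, not Graphcore's or Nvidia's methodology. The function and variable names are hypothetical, the A100 throughput is only implied by dividing Graphcore's figure by its claimed speedup, and dividing by chip count assumes perfectly linear scaling, which real systems rarely achieve.

```python
# Figures quoted above are Graphcore's own claims, not independently verified.
IPU_M2000_CHIPS = 4  # IPU-MK2 chips in one IPU-M2000
A100_CHIPS = 1       # most comparisons are against a single A100

# workload -> (IPU-M2000 images/sec, claimed speedup over one A100)
CLAIMS = {
    "ResNet-50 training": (4326, 2.6),
    "ResNet-50 inference": (9856, 4.6),
}


def per_chip_throughput(system_throughput, chips):
    """Naive per-chip figure; assumes linear scaling across chips."""
    return system_throughput / chips


def per_chip_speedup(system_speedup, chips_a, chips_b):
    """Rescale a system-A-over-system-B speedup to a chip-for-chip ratio."""
    return system_speedup * chips_b / chips_a


for workload, (throughput, speedup) in CLAIMS.items():
    implied_a100 = throughput / speedup  # implied by the claim, not measured
    ipu_chip = per_chip_throughput(throughput, IPU_M2000_CHIPS)
    ratio = per_chip_speedup(speedup, IPU_M2000_CHIPS, A100_CHIPS)
    print(f"{workload}: {ipu_chip:.1f} images/sec per IPU chip, "
          f"implied A100 {implied_a100:.1f} images/sec, "
          f"chip-for-chip ratio {ratio:.2f}x")
```

Under those assumptions, the claimed 2.6x training advantage works out to roughly 0.65x chip for chip, and the 4.6x inference advantage to roughly 1.15x. This illustrates only how much the choice of normalization matters; it says nothing about price, power or rack space.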
Apples and oranges
“Graphcore’s comparisons are apples versus oranges, in terms of the models, algorithms and system configurations used, and they lacked key details like accuracy the models were trained to,” Paresh Kharya, Nvidia’s senior director of product management for accelerated computing, told EE Times. “When compared using consistent methodology, Nvidia A100 offers much higher performance, versatility to run all AI models and a mature software stack so developers are productive from day one.”
Nvidia is a substantial contributor to the industry-wide independent AI benchmark, MLPerf, in terms of number of scores submitted for both training and inference benchmarks. Nvidia pointed out that in the latest MLPerf round, 11 companies used Nvidia’s software stack to submit performance scores for their Nvidia-based systems. While these results effectively back up Nvidia’s own results, they also validate the maturity of Nvidia’s software stack, and reflect its large community of developers.
Graphcore’s software stack, Poplar, is in version 1.4 and supports TensorFlow, PyTorch, ONNX and Alibaba’s Halo platform, with interfaces for PaddlePaddle and Jax on the roadmap.
“Benchmarking is nuanced and has many variables that can impact the performance and real customer experience,” said Kharya. “That’s why MLPerf was created, to enable apples to apples comparisons by standardizing the algorithms and measurement criteria, having peer reviews and making them representative of what customers run.”
Price and power
So how does Graphcore justify making performance comparisons between systems of different sizes?
“There are lots of variables you can normalize for when making a product comparison, but we don’t see number of chips being what customers care about. Our customers make a performance per dollar evaluation,” Chris Tunsley, director of product marketing at Graphcore, told EE Times.
“The closest comparison for price and power consumption is 1x IPU-M2000 to 1x A100 DGX,” Tunsley added. “There won’t always be a precise 1:1 correlation as it is not possible to compare ‘fractions’ of a system. Our product has 4 MK2 IPUs (IPU-M2000) and this is the building block in sets of 4 up to an IPU-Pod64 which has 64 IPUs in it. You can buy one [Nvidia] A100 DGX-based product, or a DGX box with 8 chips. To help customers make their own analysis and comparisons, we now provide all the data for our performance results as data in a table on our website.”
Graphcore has said previously that the IPU-M2000 has a recommended retail price of $32,450, though this does not include a CPU server also needed to run the system (Graphcore says this enables freedom of server choice). By comparison, the 8-GPU DGX-A100 starts at $199,000. An Nvidia A100-accelerated server with 4x A100 GPUs (Supermicro A+ Server 2124GQ-NART) including CPU starts in the region of $57,000.
Performance benchmarks don’t normally have a price dimension, perhaps because price feels too arbitrary to be used as an absolute metric. After all, it is set by the manufacturer based on its pricing strategy.
Aside from scale and price, there are of course many other considerations to practical systems, such as power consumption, cooling requirements and physical size (for example, an IPU-M2000 is 1U, versus the 4x A100 server mentioned above which is 2U, or the DGX-A100 at 6U). Factors like return on investment, as well as the value placed on ease of use, time to solution and infrastructure flexibility, are obviously weighted differently by individual customers.
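The list prices and rack sizes quoted above can be put on a per-accelerator footing with some simple arithmetic. The Python sketch below is illustrative only. The `System` class and helper names are hypothetical, the prices are the starting or recommended figures cited in this article rather than quotes, and the IPU-M2000 figure excludes the host CPU server it needs, so its true cost and rack footprint will be higher than shown.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    accelerators: int      # AI chips in the box
    list_price_usd: float  # starting or recommended price cited in the article
    rack_units: int        # chassis height in U
    includes_cpu: bool     # whether the price covers a host CPU


SYSTEMS = [
    # Price excludes the separate CPU server the IPU-M2000 requires.
    System("Graphcore IPU-M2000", 4, 32_450, 1, False),
    System("Supermicro 2124GQ-NART (4x A100)", 4, 57_000, 2, True),
    # The DGX-A100 is a complete server with its own CPUs.
    System("Nvidia DGX-A100", 8, 199_000, 6, True),
]


def price_per_accelerator(system):
    """List price divided by chip count; ignores power, cooling and support."""
    return system.list_price_usd / system.accelerators


def accelerators_per_rack_unit(system):
    """Chip density per U of rack space for the chassis alone."""
    return system.accelerators / system.rack_units


for s in SYSTEMS:
    host_note = "" if s.includes_cpu else " (host server not included)"
    print(f"{s.name}: ${price_per_accelerator(s):,.0f} per accelerator, "
          f"{accelerators_per_rack_unit(s):.2f} accelerators per U{host_note}")
```

On these figures the price per accelerator is about $8,100 for the IPU-M2000, $14,250 for the 4-GPU server and $24,875 for the DGX-A100. Combining such numbers with throughput to get performance per dollar would inherit every caveat attached to the underlying benchmark results.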
Some of Graphcore’s published comparisons use benchmark figures published by Nvidia for the A100, while others rely on results measured on Nvidia hardware in the cloud. While it could be argued that such measured results represent a realistic customer experience of the product, they will probably differ from a hardware manufacturer’s own highly optimized benchmark scores.
Benchmarking is a complex and nuanced activity, and for AI training and inference workloads this is especially so. For this reason, there is growing momentum behind the MLPerf AI training and inference benchmarks, which are administered by the open engineering consortium MLCommons.
“Machine learning systems are extraordinarily complex and require careful optimization across a complete software and hardware stack including factors such as pre-processing, numerics, accuracy, and latency,” David Kanter, executive director of MLCommons, told EE Times. “MLPerf establishes a clear set of tasks and performance metrics as well as the right approach for making comparisons using either common machine learning models or more novel implementations.”
This is the first set of benchmarks Graphcore is releasing for its second-generation IPU, which launched in July. Graphcore also announced last week that it will join MLCommons and plans to submit benchmark scores to MLPerf in 2021. The next round of MLPerf inference benchmark scores will be published in the later part of Q1 2021, with the next round of training scores following in Q2.
“I am glad to see Graphcore join MLCommons and promise to publish further benchmarks in 2021,” said Tirias Research’s Kevin Krewell. “Those benchmark scores will have far more scrutiny by the community. I believe [Graphcore’s self-published] benchmarks will age poorly.”
With the next MLPerf round so close, why put out a set of in-house benchmark scores now? Why not work towards MLPerf scores sooner instead?
“Graphcore has grown to a size and level of maturity now that we can devote the considerable time required to submit scores to MLPerf, even as we continue to develop our hardware, software and other parts of the business,” Graphcore’s Tunsley said. “Up until now, we have been focused on customer models. We are building a specific benchmarking team to develop our benchmarking capability and first on the list is MLPerf.”
Graphcore’s current headcount is 440 and the company has offices from Bristol to Oslo to Palo Alto and Beijing. As one of the earliest AI chip startups to launch its silicon, Graphcore also has a global network of distributors, resellers and system integrators in place.
Graphcore is widely known as one of the AI chip startup “unicorns”; the company’s most recent funding round in February 2020 valued the company at $1.95 billion, with $450 million raised to date. According to reports from the UK press, the company could IPO as early as 2022.
AI training and inference workloads and systems have a huge number of variables and nuances that can drastically affect performance. It is therefore extremely difficult to make fair and accurate comparisons using tests that were not explicitly designed to be run as benchmarks and peer-reviewed by a community of experts from across the industry.
Graphcore is far from the only AI accelerator company publicly comparing its in-house performance results to figures published by the market leader. Independent benchmarks, developed in partnership with the industry and designed to reflect both real and emerging AI workloads, can provide the accurate, comparable results that the nascent but competitive AI chip industry desperately needs.