SAN JOSE, Calif. — Google and Baidu collaborated with researchers at Harvard and Stanford to define a suite of benchmarks for machine learning. So far, AMD, Intel, two AI startups, and two other universities have expressed support for MLPerf, an initial version of which will be ready for use in August.

Today’s hardware falls far short of running neural-network jobs at the performance levels users want. A flood of new accelerators is coming to market, but the industry lacks ways to measure them.

To fill the gap, the first release of MLPerf will focus on training jobs on a range of systems from workstations to large data centers, a big pain point for web giants such as Baidu and Google. Later releases will add inference jobs, eventually including those run on embedded client systems.

“To train one model we really want to run would take all GPUs we have for two years,” given the size of the model and its data sets, said Greg Diamos, a senior researcher in Baidu’s deep-learning group, giving an example of the issue for web giants.

“If systems become faster, we can unlock the potential of machine learning a lot quicker,” said Peter Mattson, a staff engineer on the Google Brain project who announced MLPerf at a May 2 event.

An early version of the suite running on a variety of AI frameworks will be ready to run in about three months. At that time, organizers aim to convene a working group to flesh out a more complete version.

“We’re initially calling it a version 0.5 release … we did this with a small team, and now we want the community to put its stamp on a version 1.0 to be something everyone owns,” said Mattson. “We encourage feedback … to suggest workloads, benchmark definitions, and results so we can rapidly iterate” the benchmark.


MLPerf has both backers and a rival

About 35 people from six chip companies, four data center operators, and four universities got a first look at the plan at a closed-door meeting on April 12. Since then, organizers have expanded their efforts to win over supporters.

Other announced supporters include the University of California at Berkeley, the University of Minnesota, and the University of Toronto as well as two AI startups, SambaNova and Wave Computing.

In December, the Transaction Processing Performance Council announced that it was forming a group to define AI benchmarks. “Several benchmarks planned in this space are involved in our effort now … there’s long-term benefit in focusing on one benchmark for this space,” said Mattson.

Baidu was an early mover, releasing DeepBench in September 2016. The open-source, low-level benchmark for training uses workloads from the China-based search giant. Diamos said that the company will now focus on MLPerf, which targets application-level performance.

“DeepBench was focused on low-level programming interfaces because they are portable across hardware, but to get more accurate metrics, we need to evaluate full apps” and include workloads from many companies, said Diamos.

Initially, MLPerf will measure the average time to train a model to a minimum quality, probably in hours. Given that these jobs are run on large banks of servers, the benchmark may not report performance per watt. It will take the cost of jobs into consideration as long as prices do not vary with the time of day at which the jobs are run.
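
The time-to-train idea can be illustrated with a minimal Python sketch. It is not MLPerf code: the function names, the epoch-based loop, and the injectable clock are all assumptions made for illustration. It times a single run until a quality target is reached, then averages several runs and reports the result in hours.

```python
import time
from statistics import mean
from typing import Callable, Optional, Sequence


def time_to_quality(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target_quality: float,
    max_epochs: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds elapsed until evaluate() first reaches target_quality.

    Returns None if the target is not reached within max_epochs.
    All parameter names here are hypothetical, not part of any benchmark spec.
    """
    start = clock()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target_quality:
            return clock() - start
    return None


def average_hours(run_seconds: Sequence[float]) -> float:
    """Average the wall-clock times of completed runs and convert to hours."""
    if not run_seconds:
        raise ValueError("need at least one completed run")
    return mean(run_seconds) / 3600.0
```

A caller would pass its own training step and evaluation function, collect the elapsed time of each run that reached the target, and hand that list to `average_hours`.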

Nvidia’s Pascal-based P100 chip will be a reference standard because it is widely employed by data centers for training. The group aims to update published results every three months.

MLPerf will use two modes. A closed metric geared for commercial users will specify a model and data set to be used and restrict the values of key parameters such as batch size. An open metric aimed at researchers will apply fewer restrictions so that users can experiment with new approaches.
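
The difference between the two modes can be sketched in Python as a simple rule check. Everything below is hypothetical: the class names, the field names, and the example values do not come from MLPerf, and the open mode is modeled as having no restrictions at all, which is a simplification of "fewer restrictions."

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Submission:
    mode: str          # "closed" or "open"
    model: str
    dataset: str
    batch_size: int


@dataclass(frozen=True)
class ClosedRules:
    # Hypothetical rule set: a fixed model and data set plus a batch-size range.
    model: str
    dataset: str
    min_batch_size: int
    max_batch_size: int


def check_submission(sub: Submission, rules: ClosedRules) -> List[str]:
    """Return a list of rule violations; an empty list means the entry is accepted."""
    if sub.mode == "open":
        # Simplification: this sketch applies no checks to open entries.
        return []
    if sub.mode != "closed":
        raise ValueError(f"unknown mode: {sub.mode!r}")

    problems = []
    if sub.model != rules.model:
        problems.append(f"model must be {rules.model}")
    if sub.dataset != rules.dataset:
        problems.append(f"dataset must be {rules.dataset}")
    if not rules.min_batch_size <= sub.batch_size <= rules.max_batch_size:
        problems.append(
            f"batch size must be in [{rules.min_batch_size}, {rules.max_batch_size}]"
        )
    return problems
```

Under these assumptions, a closed entry that swaps the model or exceeds the batch-size range is reported with one message per violation, while the same entry submitted as open passes unchanged.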