PALO ALTO, Calif. — Huawei presented at Hot Chips an AI accelerator it aims to scale from inference on wearables to training jobs in data centers. It also described systems based on the accelerator, spanning SoCs for smartphones, cars, and cellular base stations as well as servers and a 512-petaflop cluster.

The presentation showed state-of-the-art work in silicon, software, and systems. In many respects Huawei appeared ahead of rivals such as Intel and even risk-taking startups such as Cerebras, which is narrowly focused on data center training.

The first generation of chips based on DaVinci cores was designed in just 11 months. The company expects to deploy 100 million devices using the cores this year, said Heng Liao, an R&D manager at Huawei who presented by video given political tensions from the U.S.-China trade war, much of it focused on his company.

“This has been an international conference for years, and I’m glad you got to present here,” said an Intel engineer in a Q&A session, generating spontaneous applause of approval from attendees.

DaVinci core

The DaVinci core includes traditional, vector and matrix processors. (Images: Huawei)

“We have client versions [of DaVinci] for cellphones and smart cameras using the same architecture as the data center. We realize creating a software stack is a tremendous effort, so we want the same software and architecture to be used from very small to very large devices,” said Liao.

The software includes MindSpore, Huawei’s own AI framework. The company also developed two layers of software that translate models from MindSpore, TensorFlow, and PyTorch to its hardware.

Within the next quarter, Huawei will release benchmarks both on individual DaVinci chips and its 2,048-node cluster. The results will include participation in the MLPerf benchmarks, Liao said.

The flagship is the N7+ Ascend 910, a 182 mm² die with 32 DaVinci cores, delivering 256 TFlops on 16-bit floating point operations and dissipating 350 W. Huawei designed a 6,000 W server node packing eight of the chips and 1.5 TBytes of DRAM.
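As a rough cross-check of the figures quoted above, the short Python sketch below derives per-core throughput, efficiency, and aggregate numbers from them. It is only back-of-the-envelope arithmetic, not Huawei's own accounting. The function name is hypothetical, and the sketch assumes that each of the 2,048 nodes in the cluster is a single Ascend 910 chip and that a petaflop is counted as 1,024 teraflops, since those assumptions reproduce the 512-petaflop figure.

```python
# Figures quoted in the article for the Ascend 910 and its systems.
TFLOPS_PER_CHIP = 256   # FP16 teraflops per chip
CORES_PER_CHIP = 32     # DaVinci cores per chip
WATTS_PER_CHIP = 350    # power dissipated per chip
CHIPS_PER_SERVER = 8    # chips in one server node
SERVER_WATTS = 6000     # power of one server node
CLUSTER_NODES = 2048    # ASSUMPTION: one cluster node = one chip


def ascend_910_estimates():
    """Derive rough per-core, per-watt, and aggregate figures (hypothetical helper)."""
    cluster_tflops = CLUSTER_NODES * TFLOPS_PER_CHIP
    chip_watts_in_server = CHIPS_PER_SERVER * WATTS_PER_CHIP
    return {
        "tflops_per_core": TFLOPS_PER_CHIP / CORES_PER_CHIP,
        "tflops_per_watt": TFLOPS_PER_CHIP / WATTS_PER_CHIP,
        "server_tflops": CHIPS_PER_SERVER * TFLOPS_PER_CHIP,
        "chip_watts_in_server": chip_watts_in_server,
        # Share of the server power budget consumed by the accelerators alone.
        "chip_share_of_server_power": chip_watts_in_server / SERVER_WATTS,
        # ASSUMPTION: binary convention, 1 petaflop = 1,024 teraflops.
        "cluster_pflops_binary": cluster_tflops / 1024,
        # Same total under the decimal convention, 1 petaflop = 1,000 teraflops.
        "cluster_pflops_decimal": cluster_tflops / 1000,
    }


if __name__ == "__main__":
    for name, value in ascend_910_estimates().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions each core contributes 8 TFlops, the eight accelerators account for 2,800 W of the 6,000 W server budget, and the cluster total works out to 512 petaflops in binary units, or about 524 in decimal units.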

Liao described a next-generation version of the 910 using a 3D stack of SRAM cache as well as HBM main memory. A version for smartphones will use a custom Wide-IO DRAM 3D-bonded onto an apps processor.

“The analysis we have done indicated in a 300-400W range the problems can be solved, bigger challenge is stacking DRAM, given challenges thinning it and DRAM has a tighter thermal budget,” Liao said.

One attendee asked Liao about the current debate over next-generation HBM stacks possibly moving on top of the accelerator chip to deal with decreasing reach. “It seems like HBM on the side is still the most practical approach, but we will try both side and top mounting,” Liao said.

“Our belief is the [current HBM] binding technique needs to be improved and the density of wiring can be increased. This will help significantly reduce power,” he added.

Huawei stack

Huawei proposed the novel idea of a 3D SRAM cache.