AUSTIN, Texas — Arm sketched the inner workings of its machine-learning core at a press and analyst event here. Engineers are nearly finished with RTL for the design with hopes of snagging a commitment within weeks for use in a premium smartphone in 2019 or later.

Analysts generally praised the architecture as a flexible but late response to a market that is already crowded with dozens of rivals. However, Arm still needs to show detailed performance and area numbers for the core, which may not see first silicon until next year.

The first core is aimed at premium smartphones that are already using AI accelerator blocks from startup DeePhi (Samsung Galaxy), Cambricon (Huawei Kirin), and in-house designs (iPhone). The good news for Arm is that it is already getting commercial traction for its open-source neural-network software, which runs on its cores and sits under frameworks such as TensorFlow.

Winning the hearts and minds of software developers is increasingly key in getting design wins for hardware sockets, said Dennis Laudick, a vice president of marketing for Arm’s machine-learning group. He helped build partnerships around Arm’s Mali GPU cores, once a crowded market led by others but now dominated by Arm.

Long-term, deep-learning accelerators could be even more significant than graphics processors. “This is kind of the start of software 2.0,” said Laudick. “For a processor company, that is cool. But it will be a slow shift, there’s a lot of things to be worked out, and the software and hardware will move in steps.”

In a sign of Arm’s hunger to unseat its rivals in AI, the company has “gone further than we normally would, letting [potential smartphone customers] look under the hood” of the core’s design, he said.

At least one smartphone maker is already kicking the tires of the beta RTL. “A couple [of premium smartphone makers] aren’t interested [in the core], a couple are very interested, and a couple are somewhere in between,” said Laudick, adding that a production release of the RTL is on track for mid-year.

The first core targets 4.6 tera operations/second (TOPS) and 3 TOPS/W at 7 nm for high-end handsets. Arm plans simpler variants using less memory for mid-range phones, digital TVs, and other devices.

Theoretically, the design scales from 20 GOPS to 150 TOPS, but the demand for inference in the Internet of Things will pull it first to the low end. Arm is still debating whether it wants to design a core for the very different workloads of the data center that includes training.

“We are looking at [a data center core], but it’s a jump from here,” and it’s still early days for thoughts on a design specific to self-driving cars, said Laudick.

Arm’s ML core marries MACs, SRAM, and a streamlined controller on each of up to 16 slices. (Images: ARM)


A deeper look inside Arm’s ML core

Initial comparisons peg Nvidia’s open-source NVDLA core as smaller and lower-power given its dedicated blocks for several inference functions. Arm’s approach is more flexible. It uses up to 16 engines, each with 128 multiply-accumulate (MAC) units, a programmable engine, and a configurable SRAM block, typically about a megabyte.
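Those slice counts line up roughly with the 4.6-TOPS target quoted earlier. A back-of-the-envelope check, assuming a fully populated 16-slice core and a clock near 1.1 GHz (the clock is an assumption here; Arm has not published one):

```python
# Rough sanity check of the 4.6-TOPS figure from the disclosed slice counts.
# The clock rate below is an illustrative assumption, not an Arm number.
slices = 16           # maximum compute engines per core
macs_per_slice = 128  # MAC units per slice
ops_per_mac = 2       # one multiply plus one accumulate per cycle
clock_hz = 1.1e9      # assumed clock frequency

peak_ops = slices * macs_per_slice * ops_per_mac * clock_hz
print(f"peak throughput ≈ {peak_ops / 1e12:.1f} TOPS")
```

At that assumed clock, the math lands at about 4.5 TOPS, close to the stated 4.6-TOPS target.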

“They could have just built a big MAC array; they’ve learned from what others have done to build something more purpose-built — it’s a complete standalone core with a programmable approach,” said Mike Demler of the Linley Group.

OEMs will need the core’s ability to scale in multiple dimensions, “but there’s not data yet on its performance relative to CPUs and GPUs,” said Kevin Krewell of Tirias Research.

The Arm team started with a clean sheet of paper and chose to focus on quantized 8-bit integer data. The design is tuned for 16-nm and 7-nm nodes, with MAC and SRAM blocks hardened to reduce their power consumption and area.
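The int8 focus implies a quantization step somewhere in the tool flow. A minimal sketch of symmetric 8-bit quantization, the generic style of scheme such a data path implies (the scale derivation here is illustrative, not Arm's actual recipe):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the largest magnitude to ±127.
    # This is a generic textbook scheme, not Arm's documented method.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the int8 codes.
    return q.astype(np.float32) * scale

weights = np.array([0.82, -1.3, 0.05, 0.4], dtype=np.float32)
q, scale = quantize_int8(weights)
print(q, dequantize(q, scale))
```

The payoff is that each weight shrinks from 32 bits to 8, and the MAC array can run on cheap integer arithmetic at the cost of a small rounding error.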

The team is tracking research on data types down to 1-bit precision, including a novel 8-bit proposal from Microsoft. So far, the alternatives lack support in tools to make them commercially viable, said Laudick.

“I haven’t seen anything in research suggesting [that lower-precision data] will be a revolution in performance — ultimately, researchers may simplify net architectures without needing to change data types,” he added.

The core supports pruning and clustering of weights to maximize performance. It uses tiling to keep working data sets in SRAM and reduce the need to access external DRAM.
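Pruning and clustering are general compression techniques rather than anything Arm-specific; a sketch of both, with arbitrary thresholds and cluster counts chosen for illustration:

```python
import numpy as np

def prune(weights, threshold=0.1):
    # Magnitude pruning: zero out small weights so hardware can skip them.
    # The threshold is an arbitrary example value.
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def cluster(weights, n_clusters=4):
    # Weight clustering: snap each weight to the nearest of a few shared
    # values, so only a short index per weight needs to be stored.
    centers = np.linspace(weights.min(), weights.max(), n_clusters)
    idx = np.abs(weights[:, None] - centers[None, :]).argmin(axis=1)
    return centers[idx], idx

w = np.array([0.02, -0.5, 0.9, -0.03, 0.4])
print(prune(w))
print(cluster(w))
```

Both transforms shrink the weight footprint, which dovetails with the tiling strategy: the smaller the working set, the more of it fits in each slice's SRAM.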

The programmable layer engine (PLE) on each slice of the core offers “just enough programmability to perform [neural-net] manipulations” but doesn’t include the legacy baggage of instruction fetch-and-decode blocks, said Robert Elliot, a technical director in Arm’s machine-learning group.

The Programmable Layer Engine takes data from a MAC unit (MCE) and delivers results to SRAM on the core’s main block.

The PLE includes a vector engine that acts as a lookup table to handle pooling and activation tasks. “Ninety percent of the core’s computations are in the MAC units,” he added.
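The lookup-table approach replaces per-element math with a single table read. A sketch of how an int8 activation can be precomputed into a 256-entry table, assuming a sigmoid and example scales (the function choice and scales are illustrative, not Arm's fixed feature set):

```python
import numpy as np

def build_sigmoid_lut(in_scale=0.05, out_scale=1 / 255):
    # Precompute the activation for every possible int8 input, so runtime
    # work is one table read per value. Scales here are example choices.
    x = np.arange(-128, 128) * in_scale   # dequantized input range
    y = 1.0 / (1.0 + np.exp(-x))          # sigmoid in float, done once
    return np.clip(np.round(y / out_scale), 0, 255).astype(np.uint8)

lut = build_sigmoid_lut()
acts = np.array([-128, 0, 127], dtype=np.int8)
out = lut[acts.astype(np.int16) + 128]    # shift int8 codes into table indices
print(out)
```

Because the table is built offline, the per-element cost at inference time is a memory access, which is why the heavy arithmetic can stay concentrated in the MAC units.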

Arm will release more data on the core’s performance when it is launched, probably in mid-June. But don’t expect detailed guidance on when to run what AI jobs on its CPU, GPU, or new machine-learning cores, a complex issue that the company, so far, is leaving to its SoC and OEM customers.

One customer is already using Arm’s open-source libraries for neural networks to support jobs across a third-party accelerator and a Mali GPU. The code supports both Android and embedded Linux.


Arm has already released open-source neural-net software for its cores being used by at least one customer on third-party IP and a Mali GPU.

— Rick Merritt, Silicon Valley Bureau Chief, EE Times