SAN JOSE, Calif. — Nearly a dozen processor cores for accelerating machine-learning jobs on client devices are racing for spots in SoCs, with some already designed into smartphones. Their makers aim to gain a time-to-market advantage over processor-IP giant Arm, which is expected to announce its own core soon.

The competition shows that much of the action in machine-learning silicon is shifting to low-power client blocks, according to market watcher Linley Gwennap. However, a race among high-performance chips for the data center is still in its early stages, he told EE Times in a preview of his April 11 keynote for the Linley Processor Conference.

“Arm has dominated the IP landscape for CPUs and taken over for GPUs as well, but this AI engine creates a whole new market for cores, and other companies are getting a head start,” said Gwennap.

The new players getting traction include:

  • Apple’s neural engine in the A11 Bionic SoC in its iPhone
  • The DeePhi block in Samsung’s Exynos 9810 in the Galaxy S9
  • The neural engine from China’s Cambricon in Huawei’s Kirin 970 handset
  • The Cadence P5 for vision and AI acceleration in MediaTek’s P30 SoC
  • Possible use of the Movidius accelerator in Intel’s future PC chipsets

The existing design wins have locked up many of the sockets in premium smartphones, which represent about a third of the overall handset market. Gwennap expects that AI acceleration will filter down to the rest of the handset market over the next two to three years.

Beyond smartphones, cars are an increasingly large market for AI chips. PCs, tablets, and IoT devices will round out the market.

To keep pace, Arm announced in February a blanket effort that it calls Project Trillium. But “what they need to be competitive is some specific hardware accelerator to optimize power efficiency,” said Gwennap.

“I would expect Arm to produce that kind of accelerator. The fact is that they are behind, which has created an opportunity for the newer companies to jump in.”

Arm is likely to show its cards at its annual October event in Silicon Valley. But there’s no guarantee that Arm will make up lost ground because there’s not necessarily a close tie between neural net engines and CPUs.


AI block performance comparison
Raw performance numbers of client inference accelerators announced so far are just part of the story. (Chart: The Linley Group)


Waiting on benchmarks and data center rivals

Ultimately, the winning chips in this still-new battle will be the ones with the best combination of performance, power, and die area.

“The problem is that we see the raw performance, but it really comes down to delivered performance on neural networks, so what we need is a good benchmark like the number of images classified per second,” said Gwennap.

Baidu was early to release AI benchmarks as open source, but they have not been widely adopted. The Transaction Processing Performance Council (TPC) formed a working group late last year to attack the problem, but it has yet to report any progress.

“It’s easy coming up with a benchmark, but hard to get companies to agree and compare results … and things are changing, so any benchmark will have to evolve to stay relevant,” he said.

So far, Gwennap reports that the multi-core v-MP6000 of Videantis has a slight edge in raw performance over its closest rival, Ceva’s NeuPro, which combines a SIMD DSP with a systolic MAC array.

Other players include Synopsys with its EV64, combining a SIMD DSP with custom logic for activation and pooling. Like Videantis, AImotive’s AIware uses many custom hardware blocks.

Among low-cost blocks, VeriSilicon’s VIP8000-O delivers the most raw performance using a GPU with up to eight deep-learning engines. Ironically, Cambricon’s CPU with a small matrix engine offers the lowest performance of announced chips, but it still got a significant design win in the Huawei smartphone.

Imagination is also a player with its PowerVR 2NX, a custom, non-GPU architecture with a MAC array. Nvidia hopes to act as a spoiler, making the IP for the NVDLA core in its Xavier processor free and open-source and winning support from Arm.

Overall, Gwennap said that as many as 40 companies are now designing custom AI silicon. Many target the data center, where Nvidia’s Volta GPU currently goes largely unchallenged as the training engine of choice for giants including Amazon.

“The competitors we see now are Google’s TPU and Microsoft’s FPGA-based Brainwave that is being deployed widely, but there’s not a lot of merchant alternatives now,” said Gwennap.

“Wave Computing seems to be ahead of the pack in bringing a new AI data center architecture to production this year.”

Wave’s decision to sell full systems suggests that it is targeting second- and third-tier players, not the largest data centers that prefer making their own optimized boxes.

Intel’s Nervana recently made clear that it will not have production silicon until 2019. Startup Graphcore suggested that it will announce its chip later this year. Another startup, Cerebras, remains quiet, while bitcoin ASIC maker Bitmain announced plans late last year for an AI chip for data centers.

“There’s a ton of companies working on this kind of stuff,” said Gwennap. “People see this as the next gold rush, and they are all trying to jump in.”

— Rick Merritt, Silicon Valley Bureau Chief, EE Times