CEVA Revamps AI Accelerator Engine IP

Article By : Sally Ward-Foxton

The new-generation AI core, NeuPro-M, can boost performance 5-15X compared to the last-generation core.

CEVA Inc. has revamped its NeuPro AI accelerator engine IP, adding specialized co-processors for Winograd transforms and sparsity operations, plus a general-purpose vector processing unit, alongside the engine’s MAC array. The new-generation engine, NeuPro-M, can boost performance 5-15X (depending on the exact workload) compared to CEVA’s second-generation NeuPro-S core (released September 2019). For example, ResNet-50 performance improved 4.9X without the specialized engines, rising to 14.3X with the specialized co-processors, according to CEVA. Results for YOLOv3 showed similar speedups. The core’s power efficiency is expected to be 24 TOPS/Watt at 1.25 GHz.

[Figure: Performance of CEVA’s NeuPro-M (NPM) engine versus CEVA’s previous-generation engine, the NeuPro-S (Source: CEVA)]
[Figure: CEVA’s NeuPro-M architecture includes a shared local memory for the different accelerators within the engine (Source: CEVA)]

The NeuPro-M engine architecture allows for parallel processing on two levels — between the engines (if multiple engines are used), and within the engines themselves. The main MAC array has 4,000 MACs capable of mixed-precision operation (2-16 bits). Alongside this are new, specialized co-processors for certain AI tasks. Local memory in each engine breaks the dependence on the core’s shared memory and on external DDR; the co-processors in each engine can work in parallel on the same memory, though they can also transfer data from one to another directly, without passing through memory. The size of this local memory is configurable based on network size, input image size, the number of engines in the design, and the customer’s DDR latency and bandwidth.

One of the specialized co-processors is a Winograd transform accelerator (the Winograd transform approximates convolution operations with less compute). CEVA has structured this to accelerate 3×3 convolutions — the most common in today’s neural networks. CEVA’s Winograd transform can roughly double performance for 8-bit 3×3 convolution layers, with only a 0.5% reduction in prediction accuracy (using the Winograd algorithm out of the box, without retraining). It can also be used with 4-, 12- and 16-bit data types. The gains are more pronounced for networks with a higher proportion of 3×3 convolutions (see the performance graph above comparing ResNet-50 and YOLOv3).
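To illustrate the idea behind the technique (not CEVA’s actual hardware implementation, which is not public), here is a minimal sketch of the classic one-dimensional Winograd F(2,3) algorithm: it computes two outputs of a 3-tap filter with 4 multiplications instead of the naive 6, which is where the roughly 2X speedup for multiply-dominated convolution layers comes from.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices (Lavin & Gray formulation).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])               # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation from 4 input samples,
    using only 4 elementwise multiplications."""
    return AT @ ((G @ g) * (BT @ d))

# Compare against the direct (6-multiply) computation.
d = np.array([1.0, 2.0, 3.0, 4.0])   # input samples
g = np.array([0.5, 1.0, -1.0])       # filter taps
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The 2-D case for 3×3 convolutions nests the same transforms over rows and columns; in fixed-point hardware, the transform arithmetic introduces the small accuracy loss mentioned above.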

CEVA’s unstructured sparsity engine can take advantage of zeros present in neural network weights and data, though it works especially well if the network is pre-trained using CEVA’s tools to encourage sparsity. Gains of up to 3.5X are possible under certain conditions. Unstructured sparsity techniques also help maintain prediction accuracy compared with structured schemes.
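The principle can be sketched in a few lines (a hypothetical software analogy, not CEVA’s hardware design): if MAC operations are issued only for nonzero weight/activation pairs, the zeros in weights and data translate directly into skipped work.

```python
import numpy as np

def sparse_dot(weights, activations):
    """Dot product that skips zero operands, counting the MACs
    actually executed — a toy model of a sparsity engine."""
    macs = 0
    acc = 0.0
    for i in np.nonzero(weights)[0]:       # weight sparsity: only nonzero taps
        if activations[i] != 0:            # data sparsity: skip zero inputs too
            acc += weights[i] * activations[i]
            macs += 1
    return acc, macs

w = np.array([0.0, 2.0, 0.0, 0.0, -1.0, 0.0, 3.0, 0.0])  # 62.5% zero weights
x = np.array([1.0, 1.0, 5.0, 2.0,  0.0, 4.0, 2.0, 3.0])
acc, macs = sparse_dot(w, x)
assert acc == np.dot(w, x)   # identical result to the dense dot product
assert macs == 2             # only 2 of 8 possible MACs executed
```

Because zeros can appear anywhere (unstructured sparsity), the result is exact; the speedup then depends on how many zeros the pre-training encourages.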


CEVA’s Deep Neural Network (CDNN) compiler and toolkit enables hardware-aware training. A system architecture planner tool configures criteria such as in-engine memory size and optimizes the number of NeuPro-M engines required for the application. CDNN’s compiler features asymmetric quantization capabilities. Overall, CEVA’s stack can support all types of neural networks, including those with many hundreds of layers. CDNN-Invite lets customers connect their own custom accelerator IP into designs, and networks or network layers can be kept private from CDNN if required.
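Asymmetric quantization, the general technique CDNN supports (the sketch below is a simplified illustration, not CDNN’s actual implementation), maps floating-point values to integers using a scale plus a zero-point, so the 8-bit range can cover a value range that is not centered on zero — common for post-ReLU activations:

```python
import numpy as np

def quantize_asymmetric(x, num_bits=8):
    """Affine (asymmetric) quantization: q = round(x/scale) + zero_point.
    Simplified sketch; production schemes also constrain the zero-point."""
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# All-positive activations: an asymmetric scheme wastes none of the range.
x = np.array([0.1, 0.5, 1.2, 2.0, 2.5], dtype=np.float32)
q, scale, zp = quantize_asymmetric(x)
x_hat = dequantize(q, scale, zp)
assert np.max(np.abs(x - x_hat)) <= scale   # error within one quantization step
```

A symmetric scheme would have to span [-2.5, 2.5] here, halving the effective resolution; the zero-point is what recovers it.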

Safety and security

Customers’ neural network models can be closely guarded IP, so there is a need to keep weights and data secure. The NeuPro-M architecture supports secure access in the form of optional root of trust, authentication, and cryptographic accelerators. NeuPro-M’s security IP originated with Intrinsix, a company CEVA acquired in May 2021 that develops chiplets and secure processors for aerospace and defense customers, including DARPA. Crucially, the IP is applicable to both SoC-level and die-to-die security.

For the automotive market, NeuPro-M cores, along with CEVA’s CDNN compiler and toolkit, comply with the ISO 26262 ASIL-B standard and meet the quality assurance standards IATF 16949 and A-SPICE.


Two pre-configured cores are available now: the NPM11, with a single NeuPro-M engine, which can achieve up to 20 TOPS at 1.25 GHz; and the NPM18, with eight NeuPro-M engines, which can achieve up to 160 TOPS at 1.25 GHz.

This article was originally published on EE Times.

Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EE Times Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master’s degree in Electrical and Electronic Engineering from the University of Cambridge.

