While artificial intelligence and machine learning computation is often performed at large scale in datacentres, the latest processing devices are enabling a trend towards embedding AI/ML capability into IoT devices at the edge of the network. AI on the edge can respond quickly, without waiting for a response from the cloud. There’s no need for an expensive data upload, or for costly compute cycles in the cloud, if the inference can be done locally. Some applications also benefit from reduced concerns about privacy.

Outside of the two most common applications, speech processing and image recognition, machine learning can of course be applied to data from pretty much any type of sensor. In today’s smart factories, machine learning techniques might apply to complex sensor fusion in industrial process control, or anomaly detection and predictive maintenance schemes in industrial machines.

MCU Anomaly Detection

NXP showed a module for exactly this purpose at the Microsoft Build developer conference earlier this month. The 30x40mm board combines a handful of sensors with a powerful microcontroller – the i.MX RT1060C, a high performance Cortex-M device which runs at a very fast 600MHz – plus connectivity capability (see figure 1).

Figure 1. The NXP AI-powered industrial anomaly detection module, front and back. (Source: NXP)

“The idea is that you attach this board to a rotating machine, a compressor or a motor maybe, but it could also be used for something as simple as detecting gases in a mine,” says Denis Cabrol, Executive Director and GM, IoT and Security Solutions, NXP. “There are two modes: first, the device gathers data on a normally operating system, and creates a model of normality. Then the device is deployed… we will see small variations due to temperature or normal wear of the device, but if something starts to go wrong, a bearing starts to break apart, or something gets out of balance, we can send a technician or shut the device down right away.”
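The two modes Cabrol describes can be illustrated with a minimal sketch. This is not NXP's algorithm, which the article does not detail; it swaps in a simple per-channel z-score baseline, and the class name, the threshold and the sample readings are all hypothetical.

```python
from statistics import mean, pstdev


class BaselineDetector:
    """Learns per-channel mean and spread from normal data, then flags outliers.

    A stand-in for the module's model of normality, not the vendor's method.
    """

    def __init__(self, threshold=4.0):
        self.threshold = threshold  # hypothetical cut-off, in standard deviations
        self.means = []
        self.stds = []

    def fit(self, normal_samples):
        # Mode 1: summarise each sensor channel of the normally operating machine.
        channels = list(zip(*normal_samples))
        self.means = [mean(c) for c in channels]
        # Guard against zero spread on a channel that never varies.
        self.stds = [max(pstdev(c), 1e-9) for c in channels]

    def score(self, sample):
        # Largest absolute z-score across all channels.
        return max(
            abs(x - m) / s for x, m, s in zip(sample, self.means, self.stds)
        )

    def is_anomaly(self, sample):
        # Mode 2: compare a live reading against the learned baseline.
        return self.score(sample) > self.threshold


# Hypothetical (vibration, temperature) readings from a healthy machine.
normal = [(1.00, 40.0), (1.02, 40.5), (0.98, 39.5), (1.01, 40.2), (0.99, 39.8)]
detector = BaselineDetector(threshold=4.0)
detector.fit(normal)
print(detector.is_anomaly((1.01, 40.1)))  # small variation, inside the baseline
print(detector.is_anomaly((1.60, 40.0)))  # vibration far outside the baseline
```

A wider threshold tolerates the small drifts from temperature or normal wear mentioned above, at the cost of reacting later to a genuine fault.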

Results of the AI-powered anomaly detection may be seen in a GUI (figure 2).

Figure 2. GUI for the NXP anomaly detection module. On the right, blue points represent normal operation, pink triangles are the limit of normal operation, and red points represent an anomaly. (Source: NXP)

The NXP module can run machine learning either completely locally, with training and inference both taking place on its MCU, or it can connect to Microsoft Azure and send all data into the cloud for training, inference, or both.

“Actually, the optimal use case is somewhere in between; to use the local intelligence to process the bulk of the data, because most of the time, the local processor has plenty of power to decide that everything is fine,” says Cabrol. “Then you only send a few data bytes to the cloud [with operating status], not the raw data.”
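To make the bandwidth saving concrete, the sketch below packs an operating status into a few bytes and compares that with the size of one raw sensor window. The message layout, device ID, window length and sample width are illustrative assumptions; none of them comes from NXP or from the Azure SDK, and no cloud call is shown.

```python
import struct

STATUS_OK, STATUS_ANOMALY = 0, 1
# Hypothetical layout: device id (uint16), status (uint8), score (float32),
# little-endian with no padding.
STATUS_FORMAT = "<HBf"


def encode_status(device_id, is_anomaly, score):
    """Pack the locally computed verdict into a compact status message."""
    status = STATUS_ANOMALY if is_anomaly else STATUS_OK
    return struct.pack(STATUS_FORMAT, device_id, status, score)


def decode_status(payload):
    """Unpack a status message on the receiving side."""
    device_id, status, score = struct.unpack(STATUS_FORMAT, payload)
    return device_id, status == STATUS_ANOMALY, score


def raw_window_bytes(samples, axes, bytes_per_sample=2):
    """Size of one window of raw readings if it were uploaded unprocessed."""
    return samples * axes * bytes_per_sample


payload = encode_status(device_id=17, is_anomaly=False, score=0.5)
print(len(payload), "status bytes vs", raw_window_bytes(1024, 3), "raw bytes")
```

With these assumed figures, the status message is 7 bytes, against 6,144 bytes for a single window of 1,024 three-axis 16-bit samples.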

Used in this way, the microcontroller is more than capable of running training tasks at the edge. That matters in this application because the machine’s exact environment will differ depending on where it is placed; training is required to create a picture of normal operation.

There are several things to look for when choosing an MCU for this type of application.

“You need to have the right processing power to make training as accurate as possible, and you need to have a large amount of memory,” says Cabrol. “The devices we have here are between 512kB and 1MB of SRAM; for an MCU that's a large amount of SRAM, so you can create a very large model.”
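As a rough guide to what that SRAM range means for model size, the arithmetic can be sketched as follows. The assumptions are mine, not NXP's: 1kB is taken as 1,024 bytes, half of the SRAM is reserved for buffers, stack and the rest of the application, and parameters are stored as either 32-bit floats or 8-bit integers.

```python
def max_parameters(sram_kib, bytes_per_param, reserved_fraction=0.5):
    """Upper bound on model parameters that fit in the unreserved SRAM.

    reserved_fraction is a hypothetical allowance for everything else
    that must live in SRAM alongside the model.
    """
    usable_bytes = int(sram_kib * 1024 * (1 - reserved_fraction))
    return usable_bytes // bytes_per_param


for sram_kib in (512, 1024):
    float32_params = max_parameters(sram_kib, bytes_per_param=4)
    int8_params = max_parameters(sram_kib, bytes_per_param=1)
    print(sram_kib, "KiB:", float32_params, "float32 or", int8_params, "int8")
```

Under these assumptions, quantising weights to 8 bits quadruples the number of parameters that fit in the same memory.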

The module discussed here would scale to lower performance MCUs if the application was simpler, but in practice, industrial applications may not be terribly cost-sensitive. A smaller MCU may help with energy efficiency if the device is battery powered, however.

How do MCUs compare to MPUs in this context? While microprocessors provide an easier solution, microcontrollers still win on form factor and power consumption, says Cabrol.

“If you want more or less unlimited performance, you don't want to spend too much time optimising your software, or you just want to try something quickly, use a microprocessor – it runs Linux, it has a lot more processing power, it has DRAM,” he says. “But that’s at the expense of higher power consumption, it’s significantly more costly, and typically the footprint is bigger. So it probably won’t be battery operated, and you’d need at least another $6 of hardware cost, excluding the price difference between the two chips [for supporting hardware such as DRAM, boot ROM, Flash memory, and typically a PMIC].”

FPGA AI Acceleration

FPGAs are also suited to AI at the edge as they can provide programmable hardware acceleration. While a large, high end FPGA might be overkill for our smart factory example, smaller programmable logic devices are available that may fit the bill. Lattice offers extremely low power FPGAs that have shipped in devices like smartphones: the iCE40 UltraPlus uses as little as 1mW (see figure 3).

Figure 3. Smaller FPGA devices may be suited to AI on the edge applications. (Source: Lattice Semi)

“This class of FPGA is well-suited to some of these lower end sensor processing applications that are non-vision,” says Deepak Boppana, Senior Director, Segment and Solutions Marketing, Lattice Semiconductor. “With pressure and thermal sensors you might find in a factory environment, they are not very performance intensive. I would probably put them somewhere in between the two extremes [between vision and speech processing].”

Boppana points out that while AI inference could be done on edge devices in the smart factory, more processing power might be required if the data is instead aggregated in some kind of gateway – scaling up the number of sensors feeding into each gateway would mean the performance requirements add up.

“In a sensor fusion scenario, you could have separate neural networks for each kind of sensor, or you could first do some kind of sensor fusion and then apply a common neural network on the combined model. So there are different ways of doing it,” he says.
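The two arrangements Boppana outlines can be contrasted in a toy sketch. Each "network" here is a single neuron with made-up weights, standing in for a trained model; the function names, weights and two-value sensor readings are all hypothetical and do not use the sensAI toolchain.

```python
import math


def dense_sigmoid(inputs, weights, bias):
    """One-neuron stand-in for a network: weighted sum, then a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def late_fusion(pressure, thermal):
    # Option 1: a separate model per sensor type, with scores combined afterwards.
    pressure_score = dense_sigmoid(pressure, [0.8, -0.2], 0.0)
    thermal_score = dense_sigmoid(thermal, [0.5, 0.5], -1.0)
    return max(pressure_score, thermal_score)


def early_fusion(pressure, thermal):
    # Option 2: fuse the readings first, then apply one common model.
    fused = list(pressure) + list(thermal)
    return dense_sigmoid(fused, [0.8, -0.2, 0.5, 0.5], -1.0)


pressure_reading = [1.2, 0.4]  # hypothetical normalised pressure features
thermal_reading = [0.9, 1.1]   # hypothetical normalised thermal features
print(late_fusion(pressure_reading, thermal_reading))
print(early_fusion(pressure_reading, thermal_reading))
```

Separate models can be trained and swapped independently for each sensor type, while a common model on fused inputs can pick up relationships between sensors.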

FPGAs offer several benefits over MCUs in this application. They can provide interface flexibility, which is useful when many different types of sensors are used in the same system, and they can help with future-proofing.

“The issues with MCUs are the I/O flexibility and the performance,” says Boppana. “There’s always a question mark on whether the performance will be sufficient, because some of these industrial systems are not usually replaced for 10 years or more. So it's important to have that future proof capability, to be able to scale with new requirements and have the flexibility as well as performance headroom to implement newer algorithms. That's something you can get with FPGAs a lot more effectively.”

Lattice’s AI offering is based around its sensAI hardware/software stack, which runs on two different FPGA platforms (iCE40 UltraPlus and ECP5). The stack includes software tools, compilers and reference designs, and while it is based around vision applications, there is scope to apply it to other applications (see figure 4).

Figure 4. Lattice’s sensAI hardware/software stack includes two low power FPGA platforms, IP cores, software tools and reference designs. (Source: Lattice Semi)

GPU Deep Learning

What about using a more specialised hardware accelerator such as a GPU? GPUs are highly optimised to perform simple operations on large amounts of data in parallel, making them an efficient choice for AI. However, while GPU makers are targeting AI at the edge, they are somewhat geared towards computer vision and object recognition applications.

A few weeks ago, Nvidia released the Jetson Nano, a small board which comes in two versions: a $99 dev kit and a $129 production-ready module (see figure 5). It includes 128 CUDA GPU cores, with a 4-core CPU and 4GB memory.

Figure 5. The Nvidia Jetson Nano board comes in dev kit (left) and production-ready versions. (Source: Nvidia)

“We call this a low power AI computer, because for the first time ever people can do meaningful AI, meaningful deep learning, for $99,” said Murali Gopalakrishna, Head of Product Management for Intelligent Machines, Nvidia.

Could a relatively basic application, like our non-vision smart factory example, run on a GPU?

“Yes, absolutely,” he says. For simple applications, the system could run very quickly, “but is that application $99 worth? If not, you can probably use a microcontroller. It all boils down to the use case and how much flexibility you want, how much scaling you want, and how future proof you want your solution to be.”

The Jetson platform, including the Nano and Nvidia’s edge GPU, the TX2, supports all types of neural network.

“GPUs are very flexible,” says Gopalakrishna. “GPUs enable you to do [more things than] custom SoCs or custom ASICs… If you're using a neural network to solve a problem and you need flexibility, performance and speed, then Jetson is the right choice. If you have a very specific need that is very specifically fine tuned to a use case and you don't care about anything but that use case, then you can use an FPGA or [Google] TPU… If you want to future proof it, and continuously improve it, if you want to have the flexibility of using any network and more than one network running independently, then choose the Jetson Nano.”

Training at the edge on GPUs is of course perfectly possible since, as Gopalakrishna adds, the same GPU cores are used in cloud infrastructure as on edge devices. It’s just a matter of having enough memory and enough time to complete the training process.

In the end, machine learning models are algorithms, so they may be able to run on any type of processor. Optimising for the exact application means taking into account things like the amount of memory required, the amount of power available, how training will be done and how long it will take, and possibilities for future-proofing. Most major vendors of MCUs, MPUs, FPGAs and GPUs have solutions available for embedded AI, along with resources such as software tools, from the most complex AI application to the most basic.