Edge Computing Intelligence

Article by: Ron Martino, Robert Oshana, Natraj Ekambaram, Ali Osman Örs, and Laurent Pilati, NXP Semiconductors

Here's a look at the benefits of machine learning inference at the edge, and what designers need to know regarding the workflows, frameworks and tools, and hardware and software.

We anticipate a huge leap in the smart things that make up the edge of everything in our homes, offices, factories, and vehicles. The 50 billion smart connected devices expected by 2025 can influence how individuals, communities and entire industries communicate, learn and operate. These devices will anticipate our needs and automate our environments. As an industry, it is up to all of us to ensure they meet the ultimate goals for society: making it greener, safer and more productive. No single company can do this alone. It’s a collective effort that requires expertise across an ecosystem of semiconductor suppliers, software partners and device makers.

Widespread adoption of these smart connected devices means enormous amounts of data will be created, which is why the edge is fast becoming a requirement for the next era of the IoT. The edge puts processing power where data is generated. At the heart is silicon, but the end-to-end architecture of an edge device is much more complex than the silicon itself. Deeply intertwined in the silicon is software that brings advancements in security, low power, machine learning, and connectivity.

This book was created to share knowledge and insights to help the industry unravel this complexity and drive forward the enormous potential of the edge. Whether you’re creating SoCs or edge products, you are an enabler of a society that is greener, safer and more productive. I hope you find the information in this book useful as you work toward realizing the future edge of everything.

This chapter presents the benefits of machine learning inference at the edge, such as uninterrupted processing, lower latency and user privacy. It examines workflows, frameworks and tools, hardware, software, application examples and other edge processing machine learning topics.

Machine learning at the edge

Machine learning (ML) is a subset of artificial intelligence (AI) that enables computer algorithms to improve automatically through experience. ML can be classified into supervised ML and unsupervised ML categories. In supervised ML, algorithms are “trained” using large sets of previously collected and labeled data from one or more sensors. In unsupervised ML, the algorithm learns over time to identify outliers and differences in the sensor data it is exposed to during operation.

ML is predominantly conducted in the cloud, with servers providing large compute and storage capacities. However, as ML models and algorithms have matured, ML inferencing has moved from the cloud to edge devices. Billions of internet of things (IoT) devices perform control and data-gathering operations. Compute power continues to increase as more complex control and operational decisions are moved to edge devices. These secure and self-reliant, albeit memory- and power-constrained, edge devices can perform real-time ML inferencing tasks locally, with only occasional cloud connectivity.

For example, ML in the cloud is the key technology at work when anyone uses a voice assistant on a smartphone or smart speaker. It is also the technology that lets social media platforms, and even smartphones, group together photos featuring a specific person. But those use cases all rely on ML running on a server somewhere in the cloud.

Running ML inference at the edge has advantages. All the ML inference runs locally on an edge processor, which means the application continues to run even if access to the network is disrupted. This is critical for applications such as surveillance or a smart-home alarm hub, or when operating in remote areas without network access. It also provides much lower decision-making latency than sending the data to a server for processing and waiting for the results to come back. Low latency is important, for example, when performing visual inspection on an industrial factory floor and deciding whether to accept or reject products whizzing by.


Another key benefit of ML on the edge is user privacy. Personal data collected by the edge device, such as voice communications and commands, facial-recognition data, video and images, is processed locally and stays on the edge. Information is not sent to the cloud for processing, where it could be recorded and tracked. The user’s privacy remains intact, and individuals can choose whether to share their personal information in the cloud.

Given the need for ML on the edge, the question becomes how much ML performance is needed. One way to measure ML performance requirements is the number of operations per second, usually expressed in TOPS, or tera (trillion) operations per second. This is a rudimentary benchmark, because overall system performance depends on many other factors. Nonetheless, it is one of the most widely quoted ML measurements.

For example, performing full speech recognition (not just keyword spotting) on the edge takes around 1 to 2 TOPS, depending on the algorithm used and whether one wants to understand what the user is saying rather than just convert speech to text. Performing object detection (using an algorithm such as YOLOv3) at 60 frames per second (FPS) takes around 2 to 3 TOPS.
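
To make the TOPS arithmetic concrete, the short Python sketch below estimates the compute budget for a detector running at 60 FPS. The MAC count per frame is an assumed placeholder rather than a measured figure for any specific model.

    # Rough compute estimate: ops/s = 2 ops per MAC x MACs per frame x frames per second.
    macs_per_frame = 20e9   # assumption: ~20 billion MACs per frame for a mid-size detector
    fps = 60                # target frame rate
    ops_per_mac = 2         # one multiply plus one add

    required_tops = ops_per_mac * macs_per_frame * fps / 1e12
    print(f"Required compute: {required_tops:.1f} TOPS")   # prints 2.4 TOPS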

ML development workflow

Figure 1 shows a high-level ML development workflow. Developing ML technology for deployment to an edge node involves a sequence of operations and data flows. These steps include:

  • Collecting raw data — Identify and collect data that will be used to train an ML model.
  • Augmenting data — Artificially expand labeled training datasets to improve the performance of an ML model.
  • Extracting features — Reduce the number of features in the dataset by creating new features that summarize the original features more compactly.
  • Creating training and validation sets — Separate the raw, augmented data into two datasets: one for training the ML model and one for validating the model. To test for bias and overfitting, these should be separate datasets.
  • Selecting a model — Choose a model that meets application and performance requirements (image classification, object detection, speech recognition, anomaly detection, etc.).
  • Training the model — Use an ML algorithm and the training data to create a model for making predictions on new data.
  • Validating the model — Run a separate set of data through the trained ML model to test for accuracy and correctness.
  • Converting and quantizing the model — Approximate a floating-point ML network with a low-precision fixed-point model to reduce memory bandwidth and computational cost. In neural networks, quantization is the conversion of floating-point values to fixed-point values. Edge devices with ML accelerators are largely designed to compute at 8-bit fixed-point precision. Converting a 32-bit floating-point value to an 8-bit fixed-point integer value immediately reduces model size by a factor of four. Quantization also speeds up weight transfers from main memory to local compute engines because less data must be moved. ML accelerators typically have large local memories that can store the weights, so they benefit further from the reduced data transfers between main memory and local memory. (A minimal quantization sketch appears after Figure 1.)
  • Inferencing — Run new data (from a sensor or other data collection mechanism) through the ML algorithm (or model) to determine an output (for example, the classification of an object).

Figure 1: A high-level ML development workflow
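
As one concrete way to carry out the convert-and-quantize step, the Python sketch below uses TensorFlow Lite post-training quantization. The tiny Keras model and random calibration samples are placeholders standing in for a trained model and representative data; other toolchains follow a similar pattern.

    import numpy as np
    import tensorflow as tf

    # Placeholder model and calibration data; substitute the trained model and a few
    # hundred representative samples from the real dataset.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(96, 96, 3)),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    calib_samples = np.random.rand(100, 96, 96, 3).astype(np.float32)

    def representative_dataset():
        for sample in calib_samples:
            yield [np.expand_dims(sample, axis=0)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]        # enable quantization
    converter.representative_dataset = representative_dataset   # calibrate activation ranges
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8                    # fully integer inputs/outputs
    converter.inference_output_type = tf.int8

    tflite_model = converter.convert()
    with open("model_int8.tflite", "wb") as f:
        f.write(tflite_model)   # roughly 4x smaller than the FP32 model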

Edge ML tools

Edge-based ML requires tools for creating and deploying ML models, starting in the cloud and ending with inferencing performed by the edge device’s ML software stack, whose runtimes are optimized for the key edge hardware: graphics processing units (GPUs), central processing units (CPUs), digital signal processors (DSPs) and ML or neural processing unit (NPU) accelerators. These components are shown in Figure 2 along with an example ML workflow. ML toolkits are used by people in different roles, such as embedded developers, data scientists and ML algorithm experts. Because many cloud vendors provide tools for model training, edge-based tools should support the deployment of ML models from the cloud to an edge device.

Figure 2: ML workflow and Edge ML tools.

Edge ML frameworks

Edge ML tools use several open ML frameworks. An ML framework combines libraries and tools that enable embedded developers to build, optimize and deploy ML models more easily and quickly. These frameworks democratize the development of ML models and abstract away some, but not all, of the underlying algorithmic detail. Some of these frameworks provide pretrained models for speech recognition, object detection, natural language processing (NLP) and image recognition and classification, among others. Table 1 describes some of the popular edge-based ML frameworks.

Table 1: Common edge processing ML frameworks

Edge ML hardware

Artificial neural networks (ANN), commonly called neural networks (NN), are computing systems inspired by biological neural networks. Loosely modeled on the neurons in a biological brain, an ANN is based on a collection of connected units or nodes called artificial neurons. NN-based ML algorithms such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been shown to be very effective in ML inference tasks. These algorithms consist of multiple compute and transform layers that analyze data to detect patterns.

CNNs are feed-forward neural networks. In feed-forward networks, all the computation is performed as a sequence of operations on the outputs of the previous layer. The final set of operations generates the output of the network, such as the probability that an image contains a particular object, that an audio sequence contains a particular word, that a bounding box in an image surrounds an object or that a proposed action should be taken.

In CNNs, the network has no memory, and the output for a given input is always the same irrespective of the sequence of inputs previously given to the network. RNNs, by contrast, use internal memory to allow long-term dependencies to affect the output. RNNs exploit time-series information, producing outputs and predicting future actions and results from current and past data, for tasks such as language processing, sensor analytics and anomaly detection.

The major computation in CNNs and RNNs is the “weighted sum” operation, which typically uses multiply-accumulate (MAC) operations. Because a MAC involves a multiplication followed by an addition, each MAC comprises two operations. Both CNN and RNN computations benefit from fast memory on the embedded hardware device. RNN performance and support are affected more by limited memory because their feedback paths require additional storage.
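
The NumPy sketch below shows the weighted-sum computation for a single fully connected layer and counts the corresponding MACs and operations; the layer sizes are arbitrary placeholders.

    import numpy as np

    # Toy weighted-sum (fully connected) layer: y = W @ x + b
    n_inputs, n_outputs = 128, 64
    x = np.random.rand(n_inputs).astype(np.float32)
    W = np.random.rand(n_outputs, n_inputs).astype(np.float32)
    b = np.random.rand(n_outputs).astype(np.float32)

    y = W @ x + b                    # each output accumulates n_inputs multiply-add pairs

    macs = n_outputs * n_inputs      # one MAC per weight
    ops = 2 * macs                   # one multiply + one add per MAC
    print(f"{macs} MACs = {ops} operations for this layer")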

Typically, neural networks are trained with data in a 32-bit floating-point (FP32) representation. But because memory and compute are limited in embedded devices, inference on the edge creates a strong incentive to quantize from FP32 to a fixed-point integer representation that is 16-bit, 8-bit or even lower. Lower-bit mathematical operations on quantized parameters, combined with quantized intermediate calculations, result in large computational gains and higher performance. Quantization decreases accuracy, so additional methods are needed to recover the accuracy loss to an acceptable level. The energy and area of a fixed-point multiplier scale approximately quadratically with the number of bits. Reducing the precision also reduces the energy and area cost of storage, which is important because memory access and data movement dominate energy consumption and memory is limited in embedded systems.
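
A minimal sketch of the underlying idea, assuming simple symmetric per-tensor quantization of a weight matrix (production toolchains typically use per-channel scales, zero points and quantized activations as well):

    import numpy as np

    w_fp32 = np.random.randn(64, 128).astype(np.float32)   # placeholder FP32 weights

    scale = np.abs(w_fp32).max() / 127.0                    # map the FP32 range onto int8
    w_int8 = np.clip(np.round(w_fp32 / scale), -128, 127).astype(np.int8)

    w_restored = w_int8.astype(np.float32) * scale          # dequantize to check the error
    print("max abs error:", np.abs(w_fp32 - w_restored).max())
    print("size reduction:", w_fp32.nbytes // w_int8.nbytes)  # 4x smaller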

Based on estimates done in 45nm technology:

  • An 8-bit integer ADD operation consumes 30X less energy than a 32-bit floating point ADD.
  • An 8-bit integer MUL operation consumes 18.5X less energy than a 32-bit floating point MUL.

Edge processing ML can be performed on SoC processing elements such as CPUs, GPUs, DSPs and dedicated accelerators, or a combination of these processing elements. Each has advantages and disadvantages. CPUs work well for embedded applications that require parsing or interpreting complex logic in code. They are not optimized for ML computation, but they can be used if necessary. CPUs dedicate more area to caches and control flow to handle complex logic and more sequential processing.

DSPs have been used in embedded systems for many years to efficiently and economically handle various forms of complex signal processing. Using DSPs to analyze sensor data for feature extraction is common. DSPs, which continue to evolve, use special instructions such as MACs to accelerate common signal processing structures. Vector processing units built around MAC units are being used to accelerate neural network computations. Wider single instruction, multiple data (SIMD) units are also being used with very long instruction word (VLIW) DSP architectures.

Figure 3 shows a DSP used for ML processing on a low-cost edge device. In this example, the Arm® Cortex Microcontroller Software Interface Standard DSP library (CMSIS-DSP) standardizes the DSP code running on Cortex-M cores. PowerQuad, a coprocessor designed by NXP to improve energy efficiency and performance when implementing DSP algorithms on its Cortex-M33-based MCUs, can be used through this application programming interface (API). Supported operations include preprocessing math functions such as the FFT and square root, activation functions such as sigmoid and softmax, and matrix operations.

Figure 3: Using a PowerQuad coprocessor on a low-cost edge device for ML.

GPUs originated as dedicated graphical rendering engines for computer games. They have evolved to accelerate additional geometric calculations, such as transforming polygons or rotating images into different coordinate systems. GPUs contain many more logical cores, such as arithmetic logic units (ALUs), than CPUs, which allows them to process many computations simultaneously. ML also requires large amounts of data, which suits GPUs architected for high memory bandwidth. GPUs are primarily designed for pixel processing; however, their shader cores can perform the highly parallel matrix mathematics required for ML computations.

NPUs are optimized for common edge-based use cases, such as object detection and segmentation, at much higher levels of performance and much lower power than CPUs. The accelerators in Figure 4 process complex workloads under a rich OS in Cortex-A systems with wide bus interfaces (128-bit) and dynamic random access memory (DRAM) support. Other optimizations, such as an integrated direct memory access (DMA) engine connected to system memory, allow neural network weights and activations to be fetched ahead of time. The heavy compute operations, such as convolution, pooling, activation functions and primitive element-wise functions, run directly on the NPU. Other kernels run automatically on a tightly coupled CPU (such as a Cortex-M). Another approach to increasing performance and reducing memory requirements is offline compilation and optimization of neural networks, including operator and layer fusion as well as layer reordering.

Figure 4: An NPU for edge ML processing

Edge processing SoCs contain multiple processing elements, including one or more of the types mentioned previously. These processing elements can be used independently or together to perform ML at the edge, and various optimized ML pipelines can be designed to efficiently leverage the available processing power of the SoC. Edge ML computing is a system-level optimization exercise in which the multiple processing elements of an SoC (see Figure 5), properly enabled, work together to support advanced real-time edge ML processing.

Consider the ISP shown in Figure 5. Camera-based systems always include image signal processor (ISP) functionality, though it may be integrated into a camera module or embedded in an applications processor and potentially hidden from the user. ISPs typically perform many types of image enhancement along with their key purpose: converting the one-color-component-per-pixel output of a raw image sensor into the RGB or YUV images that are more commonly used elsewhere in the system.

Applications processors without ISPs work well in vision-based systems when the camera inputs are coming from network or web cameras that are typically connected to the applications processor by Ethernet or USB. For these applications, the camera can be up to 100m away from the processor. The camera itself has a built-in ISP and processor to convert the image sensor information and encode the video stream before sending it over the network.

For relatively low-resolution cameras, applications processors without ISPs also work well. At resolutions of 1 megapixel or below, image sensors often feature an embedded ISP and can output RGB or YUV images to an applications processor, so an ISP is not needed in the processor.

But at a resolution of around 2 megapixels (1080p) or higher, most image sensors do not have an embedded ISP; instead, they rely on an ISP somewhere else in the system. This may be a stand-alone ISP chip (which works but adds power and cost to the system) or an ISP integrated in the applications processor as shown in Figure 5.

With the combination of an ML accelerator and an ISP, the edge SoC processor can run embedded vision applications at the edge, whether for smart homes, smart buildings, smart cities or industrial IoT. With its embedded ISP, the edge SoC processor can be connected directly to local image sensors to create systems optimized for high image quality. It can even feed this image data to the latest ML algorithms, all offloaded to the local ML accelerator.

A more generic ML development approach for edge processing includes these steps:

  1. Define the use case and the corresponding type of machine learning and model.
  2. Use an ML framework that is self-contained and does not rely on the underlying hardware.
  3. Prototype the chosen ML paradigms with the framework on a PC, in the cloud or on a higher-end embedded device.
  4. Characterize the network model in terms of memory and computational overhead.
  5. Choose a hardware platform while considering the memory and computational constraints. Then cross-compile the network for the specific embedded device.
  6. Train the model on a higher-end machine and transfer the weights over to the embedded device (the weights do not change, so they can be stored as a constant array in memory; a minimal export sketch follows these steps).
  7. Perform relevant network optimizations (pruning, quantization, precision reduction).
  8. Perform relevant hardware optimizations (alignment, SIMD instructions).
  9. Test the performance of the deployed network model and determine whether the implementation can be iterated on after deployment.
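
As a minimal illustration of step 6, the Python sketch below writes a quantized weight tensor out as a constant C array that can be linked into the firmware image; the weight values and names are placeholders.

    import numpy as np

    # Placeholder int8 weights for one layer; in practice these come from the trained model.
    weights = np.round(np.random.randn(8, 4) * 32).clip(-128, 127).astype(np.int8)

    lines = [f"const int8_t layer0_weights[{weights.size}] = {{"]
    lines.append("    " + ", ".join(str(v) for v in weights.flatten()))
    lines.append("};")

    with open("layer0_weights.h", "w") as f:
        f.write("\n".join(lines) + "\n")   # include this header and keep the array in flash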

Edge processing also can be implemented on low-end microcontrollers. A possible development flow for ML on a low-end MCU includes these steps:

  1. Upload the labeled data to a PC. You can use a universal asynchronous receiver-transmitter (UART) or a Secure Digital (SD) card.
  2. Experiment with the data in an ML toolkit such as scikit-learn. Make sure an off-the-shelf method can produce good results before moving forward (a minimal sketch of steps 2 through 6 follows this list).
  3. Experiment with feature engineering and selection. Try to achieve the smallest feature set possible to save resources.
  4. Write an ML method to use on the embedded system (perceptrons or decision trees are good choices because they don’t need a lot of memory). If no floating-point unit is available, integers and fixed-point arithmetic can be used.
  5. Implement the normal training procedure. Use cross-validation to find the best tuning parameters, integer bit widths, radix positions, etc.
  6. Run the final trained predictor on your holdout testing set.
  7. If the trained predictor’s performance on the testing set is satisfactory, move the prediction code and the trained model (for example, the weights) to the MCU. The model weights will not change, so they can be stored as a constant array in nonvolatile flash memory.
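
The scikit-learn sketch below illustrates steps 2 through 6; the random features and labels stand in for the data uploaded from the device, and the tree depths being searched are arbitrary.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder labeled data: 500 samples, 8 features, 3 classes.
    X = np.random.rand(500, 8).astype(np.float32)
    y = np.random.randint(0, 3, size=500)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Cross-validate tuning parameters on the training split only.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          {"max_depth": [2, 4, 6]}, cv=5)
    search.fit(X_train, y_train)

    # Final check on the holdout set before porting the tree to the MCU.
    print("holdout accuracy:", search.best_estimator_.score(X_test, y_test))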

Figure 5: Edge processing SoC with NPU for ML acceleration.

Optimizing ML pipelines for edge devices

Embedded edge devices are growing more complex and powerful as they incorporate more hardware components, such as CPUs, GPUs, DSPs and ML accelerators, to perform various forms of ML. However, these complex hardware components must be used efficiently. Edge devices with dedicated accelerators such as GPUs and NPUs can perform matrix multiplication significantly faster than CPUs, and ML frameworks can efficiently leverage these hardware components. For example, TensorFlow Lite interpreters use the concept of “delegates” that can hand the compute-intensive operations over to dedicated hardware for acceleration. Software architectures built to support ML can optimize the execution flow of ML on the SoC to provide high-performance, low-power solutions.
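
The sketch below shows the delegate mechanism using the tflite_runtime Python package. The delegate library name is platform-specific (an NXP VX delegate name is used here purely as an example), and the model path is a placeholder.

    import numpy as np
    import tflite_runtime.interpreter as tflite

    # Hand compute-intensive ops to an accelerator through a TensorFlow Lite delegate.
    # "libvx_delegate.so" and "model_int8.tflite" are assumptions for this sketch.
    delegate = tflite.load_delegate("libvx_delegate.so")
    interpreter = tflite.Interpreter(model_path="model_int8.tflite",
                                     experimental_delegates=[delegate])
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # Run one inference on dummy data shaped to the model's input.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    result = interpreter.get_tensor(out["index"])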

The application-specific processing pipeline shown in Figure 6 is designed in multiple stages, with several steps in the pipeline that can be leveraged for ML processing. Key application segments include:

  • Vision pipeline for object/face detection/recognition
  • Voice and audio pipeline for speech analysis
  • Time-series data processing pipeline for anomaly detection

Processing pipelines and flexible software architectures provide out-of-the-box SoC and application-type optimized run-time support. This facilitates complete exploitation of heterogeneous SoC capabilities for ML and maximizes component reuse. Key benefits of this approach include improved out-of-box experience (OOBE) and ease of use; comprehensive SoC and hardware resource usage, with configurability over I/O interfaces; acceleration option configuration for different use cases; processing domains for easier customization; scalability across SoCs; and the use of open-source and other community components.

Figure 6: Edge device optimized ML pipeline

As an example of an ML-optimized pipeline in Figure 6, consider the growing demand for video intelligence (industrial inspection; face, person and object detection and classification; action recognition). This demand has pushed the vision paradigm to quickly incorporate ML-based techniques. Traditional vision techniques based on handcrafted feature extraction are still widely used, but the emergence of powerful hardware to run inference engines, combined with widely available ML frameworks and vision models, has lowered the barriers to fully (or almost fully) using ML to address machine vision use cases.

A capable edge SoC for ML processing in this application must first be chosen. The device in Figure 5 embeds an NPU, 2D and 3D GPUs, a dual-image signal processor and two camera inputs for an effective advanced vision system. This SoC has all the hardware elements required to address complex ML-based vision use cases.

Software must enable these hardware components. Figure 7 shows an example of an edge device software architecture to support optimized ML at the edge. This software includes:

  • Video streams and image processing from the Linux® kernel drivers to the de facto standard media stream framework GStreamer. These software components enable local and remote camera capture, local and remote video stream and picture presentation, and hardware-accelerated single picture processing (scaling, rotation, color space conversion).
  • Adaptation and optimization of the major neural network frameworks (TensorFlow Lite, ONNX, ArmNN, Glow) to run efficiently on the SoC’s NPU, GPU and (coming soon) DSP.
  • GStreamer plug-ins that provide a vendor-agnostic neural network integration framework, easing the integration and connection of the different hardware and software components involved in a machine vision use case. This open-source framework, NNStreamer, supports the major ML frameworks (TensorFlow Lite, ArmNN, Caffe2) and features the following (a minimal pipeline sketch follows this list):
    • Neural network framework connectivity — Connect neural network frameworks (TensorFlow, Caffe, etc.) with stream frameworks such as GStreamer.
    • AI project streaming — Apply efficient and flexible stream pipelines to neural networks.
    • Intelligent media filters — Use a neural network model as a media filter/converter.
    • Composite models — Apply multiple neural network models in a single stream pipeline instance.
    • Multimodal intelligence — Use multiple sources and stream paths for neural network models.
  • Methods to construct media streams with neural network models using the de facto media stream framework, GStreamer. GStreamer users can apply neural network models as if they were just another media filter. Neural network developers can manage media streams easily and efficiently.
  • Real-time performance profiling of the full pipeline (CPUs, GPUs, NPUs, DSPs and memory profiling).
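
As a rough illustration of how such a pipeline is assembled, the Python sketch below builds a camera-to-inference GStreamer pipeline with NNStreamer elements. The exact element set, the tensor_filter framework string and the model path depend on the board's GStreamer/NNStreamer build and are assumptions here.

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    Gst.init(None)
    # Camera -> resize to the model's input -> convert frames to tensors -> run inference.
    pipeline = Gst.parse_launch(
        "v4l2src device=/dev/video0 ! videoconvert ! videoscale ! "
        "video/x-raw,width=224,height=224,format=RGB ! "
        "tensor_converter ! "
        "tensor_filter framework=tensorflow-lite model=mobilenet_v1_quant.tflite ! "
        "tensor_sink name=out"
    )
    pipeline.set_state(Gst.State.PLAYING)
    try:
        GLib.MainLoop().run()                 # process frames until interrupted
    except KeyboardInterrupt:
        pipeline.set_state(Gst.State.NULL)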

Figure 7: NXP eIQ ML Development Environment supports optimized machine learning at the edge

This concept can be expanded further by conducting the parallel processing of ML algorithms on a single SoC. Figure 8 shows a voice and audio ML pipeline configured to run on an Arm Cortex-M core on top of a real-time OS while a vision ML pipeline executes on the Arm Cortex-A core on top of a rich OS such as Linux.

Figure 8: Application run time simultaneously leveraging voice and vision pipelines on an NXP i.MX 8M Plus edge computing SoC.

In summary, running ML at the edge requires an awareness of the compute and memory resources available. It also requires modifications to the ML models and the process flow to fit the resource profile. In return, running ML at the edge has many advantages such as improved privacy, reduced or no dependency on a network connection, reduced power dissipation and the capability to make real-time low-latency decisions.

 

About the Authors

Ron Martino is the Executive Vice President and General Manager, Edge Processing, at NXP Semiconductors.

Robert Oshana is Vice President, Edge Processing Software R&D, at NXP Semiconductors, and the technical editor-in-chief.

Natraj Ekambaram is Director, AI and ML Enablement, at NXP Semiconductors.

Ali Osman Örs is Director, AI and ML Strategy and Technologies, at NXP Semiconductors.

Laurent Pilati is Director, ML and Voice Engineering, at NXP Semiconductors.

 
