Artificial-intelligence inference workloads and the computing architectures needed to support them are rapidly evolving. They’re also becoming increasingly diverse in unpredictable ways. This unpredictability has a significant impact on the accuracy, performance, and efficiency of AI inference.
The obvious solution is to throw more computing at the problem: faster and better processors and memory devices, and more of them. However, this brute-strength approach delivers diminishing returns, even when budget is no object and allows for the deployment of technologies such as high-bandwidth memory (HBM) and even processing in memory (PIM). Instead, more targeted and tailored alternatives are emerging, including purpose-built accelerators that move away from traditional computing architectures.
There are several accelerator architectures competing to best serve AI workloads, but at-memory computing is fundamentally the best architecture for not only today’s AI workloads but also those of tomorrow, even as they continue to evolve and splinter in unpredictable ways.
AI inference spans many different use cases and industries. Some are quite simple and can be found in the average person’s smartphone. However, the ones challenging traditional computing architectures are those handling high volumes of data in industries where decisions must be made quickly to support business outcomes, equipment reliability, and even functional safety.
High transaction volumes are a hallmark of the financial services industry, but AI is throwing new data-processing challenges at the sector as it looks to leverage natural-language processing (NLP) to extract data from both structured and unstructured documents. A subfield of linguistics, computer science, and AI, NLP is enabling banks to automate and optimize tasks such as searching documents and collecting customer information. It’s also helping them evaluate performance drivers and build better forecasts of markets by assessing a variety of text and speech data from different contexts.
AI is also finding its way to the factory floor and other industrial environments to transform and automate processes in several areas, including supply chain, product development and manufacturing, and field operations by employing deep learning. In some cases, industrial AI leverages data from the internet of things and other edge devices. As AI hardware has evolved and semiconductor technologies have improved, there are increasingly more dedicated offerings for industrial AI use cases that enable more autonomous equipment on the factory floor.
Similarly, AI inference plays a key role in the modern vehicle (whether or not it’s fully autonomous) and transportation systems more broadly. AI’s ability to process and predict data can allow for efficient and reliable scheduling of private and public transportation. On the car itself, accurate AI inference is critical if semi-autonomous or autonomous vehicles are to interpret their environment to safely navigate based on traffic signals, the behavior of other drivers, and unexpected obstacles on the road.
All this diversity of AI inference workloads is leading to even more diverse neural-net architectures, even within a single industry. The rapid evolution and fragmentation of these workloads and development of new architectures means even more unpredictability, which compounds the existing challenges facing AI inference and creates new ones.
Coming up with new use cases and workloads for AI inference has never been a problem, as industries such as financial services, manufacturing, and automotive have demonstrated. The challenge is making them technologically feasible, and even then, technology needs to be cost-effective and scalable.
The algorithms that make up deep learning and machine learning are power-hungry, so the obvious solution to AI inference challenges is to throw more computing capabilities at the problem by leveraging high-performance computing, the latest and greatest processors and DRAM, and even HBM. But even with the availability of on-demand computing resources through the cloud, the more advanced systems needed to do AI inference can be costly. And even if money is no object, throwing more horsepower in the form of CPUs, GPUs, and bleeding-edge memory isn’t always the best answer.
The CPU does have its role in AI workloads; it is particularly well-suited for sequential algorithms and tasks that take a relatively small piece of data and transform it in sequential steps. Real-world examples include image recognition and simultaneous localization and mapping (SLAM) for autonomous vehicles and drones, or home devices with simple NLP functions. However, modern neural networks are so computationally expensive in other layers that CPUs can’t keep up with performance and efficiency requirements.
GPUs, meanwhile, have come a long way from their roots in arcade games and graphics display processing in PCs. They are well-suited for AI workloads that can exploit the massive parallelism GPUs provide. They do present challenges, however: GPUs demand a massive amount of power, and if the computation is too big for the GPU’s main memory, latency increases because of batching.
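The batching tradeoff can be sketched with a toy model: batching amortizes a fixed per-launch overhead to raise throughput, but every request in a batch waits for the whole batch to finish, so per-request latency grows. A minimal illustration (the overhead and per-image timings are hypothetical, not measured figures):

```python
# Toy model of GPU batching: fixed launch/transfer overhead is amortized
# across the batch, but each request waits for the full batch to complete.
OVERHEAD_MS = 5.0      # hypothetical fixed cost per batch (launch + transfer)
PER_IMAGE_MS = 0.5     # hypothetical compute time per image

def batch_stats(batch_size):
    batch_time = OVERHEAD_MS + PER_IMAGE_MS * batch_size
    throughput = batch_size / batch_time * 1000  # images per second
    latency = batch_time                         # every request waits for the batch
    return throughput, latency

for b in (1, 8, 64):
    tput, lat = batch_stats(b)
    print(f"batch={b:3d}  throughput={tput:8.0f} img/s  latency={lat:6.1f} ms")
```

The model makes the tension explicit: throughput climbs with batch size while per-request latency climbs with it, which is why batch-1 performance matters for latency-sensitive inference.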
Adding more memory can increase overall system throughput, but there are limitations with that approach as well. The proliferation and diversification of AI workloads has accelerated the evolution of DRAM specifications (both DDR and LPDDR) and has led to increased interest in HBM, as well as PIM. DRAM remains the fastest and best-understood memory, and low-power DRAM is likely to see adoption over the longer term for edge AI inference as well as automotive applications. HBM remains a premium memory, and it’s still early days for PIM, as recent developments are only just making it easier to integrate into systems without requiring a great deal of changes to software. There’s still a lot of work to be done to define PIM and make it commercially viable.
The challenges presented by AI workloads that involve ever-increasing volumes of data have also led to the development of purpose-built accelerators that put data where it needs to be, reducing the need to move it between processors, memory, and storage. Heterogeneous computing, wherein the system can easily access the right mix of compute, memory, and storage for a given workload, as well as the emergence of the Compute Express Link (CXL) specification, which enables access to disaggregated pools of memory, are eliciting a great deal of interest as ways to address the bottlenecks that hamper AI workloads.
In the meantime, challenges to successful AI inference deployments can be expected to emerge unabated, affecting accuracy, performance, and efficiency. Changing AI workloads are leading to more diverse neural-net architectures, and these architectures will evolve in unknown ways in the future. The goal is to make sure an algorithm can run successfully on the chosen network architecture. Maintaining accuracy is also a challenge, as deep quantization can lead to accuracy loss and analog techniques can drift.
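The accuracy risk from deep quantization is easy to demonstrate: mapping 32-bit floats to int8 leaves only 256 representable levels, and quantizing to fewer bits loses more. A hedged sketch of simple symmetric linear quantization (an illustrative scheme, not any particular vendor's method):

```python
import numpy as np

def quantize(x, bits):
    """Symmetric linear quantization to a signed integer grid, then back."""
    levels = 2 ** (bits - 1) - 1           # e.g. 127 representable magnitudes for int8
    scale = np.max(np.abs(x)) / levels
    q = np.round(x / scale)                # snap values to the integer grid
    return q * scale                       # dequantize to measure the error

rng = np.random.default_rng(0)
weights = rng.standard_normal(10_000).astype(np.float32)

for bits in (8, 4, 2):
    err = np.mean((weights - quantize(weights, bits)) ** 2)
    print(f"int{bits}: mean squared error = {err:.2e}")
```

The error grows sharply as the bit width shrinks, which is exactly why aggressive quantization must be validated against accuracy targets before deployment.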
Even if accuracy is maintained, performance can degrade if the architecture can’t achieve throughput and latency targets. Maximizing images and queries per second as well as minimizing batch sizes while achieving those targets is a key success metric for AI inference workloads. In turn, performance targets should not be achieved at the detriment of efficiency. The chosen architecture must get the optimum performance from the silicon while balancing power consumption — images and queries per second per watt must be optimized while bearing in mind any capital costs to setting up the system, as well as its total cost of ownership.
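These efficiency metrics reduce to simple arithmetic. A sketch with entirely hypothetical numbers shows how images per second per watt and a rough total cost of ownership might be compared across candidate systems:

```python
# Hypothetical accelerator comparison on images/s/W and a rough 3-year TCO.
# All figures below are illustrative assumptions, not vendor measurements.
systems = {
    "accelerator_a": {"img_per_s": 80_000, "watts": 400, "capex_usd": 10_000},
    "accelerator_b": {"img_per_s": 60_000, "watts": 600, "capex_usd": 7_000},
}

KWH_USD = 0.12            # assumed electricity price per kWh
HOURS_3YR = 3 * 365 * 24  # hours of continuous operation over 3 years

for name, s in systems.items():
    efficiency = s["img_per_s"] / s["watts"]               # images/s per watt
    energy_cost = s["watts"] / 1000 * HOURS_3YR * KWH_USD  # 3-year power bill
    tco = s["capex_usd"] + energy_cost
    print(f"{name}: {efficiency:.0f} img/s/W, 3-year TCO = ${tco:,.0f}")
```

Even this crude model shows how a cheaper card can lose on total cost once sustained power draw is counted, which is the point of weighing efficiency alongside capital cost.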
While in-memory compute may be the obvious choice from a technology perspective, at-memory compute addresses the specific challenges facing AI inference deployments today.
At-memory compute is the sweet spot for AI acceleration
Unlike today’s common near-memory and von Neumann architectures, which are dependent on long, narrow busses and deep and/or shared caches, an at-memory compute architecture employs short, massively parallel direct connections using dedicated, optimized memory for efficiency and bandwidth.
A traditional von Neumann architecture will likely have external DRAM, a cache, and a pipeline to access processing elements, whereas an at-memory compute approach has the processing elements directly attached to the memory cells. Untether AI has opted to employ SRAM in our at-memory compute architecture, but that’s a small part of the story. Not only do we have processing elements directly attached to SRAM cells, but we also employ a RISC processor and as many as 512 processing elements, each attached to its own SRAM array. Each RISC CPU is custom-designed to accelerate neural networks.
By placing the entire neural network on-chip, Untether AI provides low latency and high throughput simultaneously. Our ability to drive throughput is due to physical allocation — nearest-neighbor placement increases energy efficiency and reduces latency, while routing and compute are balanced to optimize overall throughput. Cost functions are tuned to hardware parameters so that computation can be densely packed to match the flow of activations. In contrast, a von Neumann architecture employs a CPU or GPU that must swap layers and coefficients from memory while batching data into larger groups to help with throughput, and as a result, latency is negatively affected.
Untether AI’s at-memory compute architecture is a “best of both worlds” approach in that it mixes multiple instruction, multiple data (MIMD) and single instruction, multiple data (SIMD) processing. MIMD allows for spatial optimization with 511 memory banks operating asynchronously, while sequential optimization is achieved through SIMD, with 512 processing elements per memory bank executing a single instruction.
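The MIMD-across-banks, SIMD-within-bank split can be modeled conceptually: each bank runs its own instruction stream over its local data, while within a bank one instruction applies to every element at once. A simplified NumPy analogy (bank and element counts are shrunk for readability; this models the control structure only, not the actual instruction set):

```python
import numpy as np

N_BANKS, PE_PER_BANK = 4, 8   # scaled-down stand-ins for 511 banks x 512 PEs

# MIMD: each bank may execute a *different* operation on its own local data.
bank_ops = [np.negative, np.abs, np.square, np.sign]
local_data = [np.arange(PE_PER_BANK, dtype=np.float32) - 3 for _ in range(N_BANKS)]

results = []
for op, data in zip(bank_ops, local_data):
    # SIMD: within a bank, one instruction (op) applies to all PEs' data at once.
    results.append(op(data))

for i, r in enumerate(results):
    print(f"bank {i} ({bank_ops[i].__name__}): {r}")
```

The outer loop stands in for independent, asynchronous bank programs; the vectorized ufunc call stands in for one instruction broadcast across a bank's processing elements.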
Untether AI’s at-memory compute architecture is optimized for large-scale inference workloads and delivers the ultra-low latency that a typical near-memory or von Neumann architecture can’t. By using integer-only arithmetic units, we can increase the throughput while reducing the cost. Flexibility is maintained to provide broad support for a wide variety of neural networks for AI inference applications that employ NLP, vision-oriented neural networks, and recommender systems in diverse industry segments, including industrial vision, finance, smart retail, and autonomous vehicles, among others.
Our AI Compute Engine is expressed in two hardware offerings. For inference acceleration, Untether AI’s runAI200 devices operate using integer data types and a batch mode of 1, employing our unique at-memory architecture to deliver 502 tera operations per second (TOPS) and efficiency as high as 8 TOPS/W. These devices power our tsunAImi accelerator card, which provides 2 peta operations per second of compute power per card, translating into more than 80,000 frames per second of ResNet-50 v1.5 throughput at batch = 1. For natural-language processing, tsunAImi accelerator cards can process over 12,000 queries per second of BERT-base.
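The relationship between raw operations per second and model throughput can be sanity-checked with back-of-the-envelope arithmetic. Assuming ResNet-50 v1.5 needs roughly 8 giga-operations per image (about 4 GMACs, counting a multiply-accumulate as two operations; a commonly cited approximation, not a vendor figure):

```python
# Back-of-the-envelope: utilization implied by the quoted card-level numbers.
PEAK_OPS = 2e15            # 2 peta operations/s per accelerator card (from the text)
FPS = 80_000               # quoted ResNet-50 v1.5 throughput at batch = 1
OPS_PER_IMAGE = 8e9        # assumed ~8 GOPs per ResNet-50 v1.5 inference

effective_ops = FPS * OPS_PER_IMAGE        # operations actually consumed per second
utilization = effective_ops / PEAK_OPS     # fraction of peak sustained at batch = 1
print(f"effective: {effective_ops / 1e12:.0f} TOPS, utilization = {utilization:.0%}")
```

Under these assumptions the card would be sustaining on the order of a third of its peak rate at batch = 1, a notably high figure for single-sample inference, where architectures that depend on batching often sustain far less.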
But hardware alone is not enough to successfully deploy AI workloads. Untether AI’s hardware offerings are complemented by our imAIgine software development kit, which is compatible with familiar machine-learning frameworks, including TensorFlow and PyTorch with Jupyter Notebook integration. It comprises a compiler for automated, optimized graph lowering; a toolkit, which supports extensive allocation and simulation feedback; and easily integrated communication and health-monitoring software in the form of a runtime.
Untether AI’s at-memory compute-based hardware, coupled with its software development kit, provides high-performance, low-power AI inference across a wide range of networks, making it flexible enough for today’s neural-network architectures while anticipating the diverse and unpredictable AI workloads of the future.
This article was originally published on EE Times.