HBM is still a premium product, but the mainstreaming of AI inference is pulling HBM into the mainstream with it.
High bandwidth memory (HBM) is becoming more mainstream. With the latest iteration’s specifications approved, vendors across the ecosystem are gearing up to make sure it can be implemented so customers can begin to design, test and deploy systems.
The massive growth and diversity in artificial intelligence (AI) means HBM is no longer a niche technology. It’s even become less expensive, but it’s still a premium memory and requires expertise to implement. As a memory interface for 3D-stacked DRAM, HBM achieves higher bandwidth while using less power in a form factor that’s significantly smaller than DDR4 or GDDR5, stacking as many as eight DRAM dies with an optional base die that can include buffer circuitry and test logic.
Like all memory, HBM advances in performance and power consumption with every iteration. A key change in moving from HBM2 to HBM3 will be a 100% improvement in the data transfer rate, from a maximum of 3.2/3.6Gbps to 6.4Gbps per pin, said Jinhyun Kim, principal engineer with Samsung Electronics’ memory product planning team.
A second fundamental change is a 50% increase in the maximum capacity from 16GB (8H) to 24GB (12H). Finally, HBM3 implements on-die error correction code as an industry-wide standard, which improves system reliability, Kim said. “This will be critical for the next generation of artificial intelligence and machine learning systems.”
The nature of HBM is that you can’t simply pull out HBM2 and replace it with the latest and greatest, but each new generation of HBM incorporates many improvements that coincide with launches of the latest and greatest GPUs and ASICs, said Kim. Samsung, which has consistently updated its HBM portfolio over the years, aligns the lifecycle of current and next-generation HBM products with the needs of key partners prior to every engagement. “This helps to fully address the need for backwards compatibility,” he said.
Designers will have to adapt to take advantage of HBM3, which Kim said is the ideal solution to support system architectures that address the growing complexity and size of AI/ML data sets, noting that maximum bandwidth per stack in HBM2 was 409GB/s. “With HBM3, bandwidth has jumped to 819GB/s, while maximum density per HBM stack will increase to 24GB in order to manage larger data sets.”
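The per-stack bandwidth figures Kim cites follow directly from the per-pin data rates. A minimal sketch of that arithmetic, assuming HBM's standard 1024-bit-wide interface per stack:

```python
# Back-of-the-envelope check of the per-stack bandwidth figures above.
# Each HBM stack exposes a 1024-bit-wide interface; peak bandwidth is
# the per-pin data rate multiplied by the bus width.

BUS_WIDTH_BITS = 1024  # interface width per HBM stack (HBM2 and HBM3)

def stack_bandwidth_gbps(pin_rate_gbps: float) -> float:
    """Peak per-stack bandwidth in GB/s given a per-pin rate in Gb/s."""
    return pin_rate_gbps * BUS_WIDTH_BITS / 8  # divide by 8: bits -> bytes

print(stack_bandwidth_gbps(3.2))  # HBM2 at 3.2 Gb/s/pin -> 409.6 GB/s
print(stack_bandwidth_gbps(6.4))  # HBM3 at 6.4 Gb/s/pin -> 819.2 GB/s
```

Doubling the per-pin rate while keeping the same 1024-bit bus is what yields the 409GB/s-to-819GB/s jump quoted above.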
These changes enable system designers to expand the accessibility of various applications that were limited by density constraints. “We will see HBM3 adoption expanding from general-purpose GPUs and FPGAs to the main memory through DRAM-based cache and the CPU in HPC systems,” he said.
The mainstreaming of AI/ML inference is arguably mainstreaming HBM, said Kim. “There is no shortage of new use cases, applications and workloads driving the needs of AI/ML compute power in next generation HPCs and datacenters.”
AI has been a big driver of HBM in GPUs, said Jim Handy, principal analyst with Objective Analysis. “GPUs and AI accelerators have an unbelievable hunger for bandwidth, and HBM gets them where they want to go.” Sometimes HBM makes the most economic sense even at a premium: it used to cost six times as much as standard DRAM, though now it’s down to about three times as much.
“The applications where HBM is being used are applications that need so much computing power that HBM is really the only way to do it,” Handy said. “If you tried doing it with DDR, you’d end up having to have multiple processors instead of just one to do the same job, and the processor cost would end up more than offsetting what you saved in the DRAM.”
A characteristic of AI is that it relies heavily on matrix manipulation, said Handy, and when it became clear that GPUs handle this better than standard x86 architectures, AI became cheaper. The next step in the evolution was adopting FPGAs for access to a dedicated processor. Because AI needs so much bandwidth, GPUs and FPGAs have been incorporating HBM.
A key characteristic of HBM is that it’s not possible to pull out HBM2 and replace it with HBM3 because it’s soldered down rather than socketed, but system designers should be able to make the transition easily, said Handy. “It’s pretty much a linear progression from HBM2E and earlier versions.”
Moving to HBM does require some forethought, and given the specification was just finalized, systems won’t be in production for a while but in the design, testing and validation phase. Avery Design Systems is helping to create a streamlined ecosystem for design and verification to make HBM3 adoption as easy as possible, said VP of sales and marketing Chris Browy.
The company has built a verification platform based on its own tested verification IP (VIP) portfolio to enable pre-silicon validation of design elements, including HBM3, by providing memory models, protocol checkers, performance analysis, and compliance test suites. Its HBM2E and HBM3 speed adapters are also available for FPGA prototyping platforms. Browy said Avery’s memory models are designed to work with different memories from any vendor.
The modelling capabilities Avery offers allow customers to perform debugging and performance analysis, and to better test for error injection and temperature faults. Right now, it’s not possible to use an actual HBM3 device because they don’t physically exist, but Avery offers a memory adapter that presents itself to the design as an HBM3 memory device.
He said early HBM3 customers are looking for Avery’s support to transition from HBM2 for high performance computing, AI, and networking applications. “They’re looking at how to architect their next gen chips for HBM3 because there still is a lead time.”
In late 2021, the company announced its comprehensive support for the HBM3 interface, as well as its partner Rambus using Avery’s HBM3 memory model to verify its HBM3 PHY and controller subsystem. Like Avery, Rambus is laying the groundwork for the adoption of the latest HBM because applications that are likely to make use of it such as ASICs used in AI applications have a design lead time as long as 18 months.
Rambus’ HBM3-ready memory interface consists of a fully integrated physical layer (PHY) and digital memory controller, the latter drawing on intellectual property from its recent acquisition of Northwest Logic. The subsystem supports data rates of up to 8.4 Gbps and delivers as much as 1 terabyte per second of bandwidth, thereby doubling the performance of high-end HBM2E memory subsystems.
Browy said Rambus uses Avery’s memory models to verify its HBM3 PHY and controller, and includes these memory models in its customer deliveries to enable out-of-the-box simulations with the delivered IP. “People are starting to move from architecting for HBM3 to starting chip implementation.” He said AI applications have become a leading adopter of HBM. “Now that there are more AI chips coming online and the competition is fierce, you know, everybody’s looking to take advantage of the latest memory architectures.”
In a recent webinar outlining the company’s HBM technology, Rambus’ senior director of product marketing for IP cores, Frank Ferro, said the neural networks in AI applications require a significant amount of data both for processing and training, with training sets alone growing about 10 times annually now. The data is going to grow, he said, with many companies offloading their local processing up to the cloud. “In addition to providing more infrastructure, we have to start looking at how to make that existing infrastructure much more efficient.”
One way to do that is to start to analyze the actual data for potential efficiency, Ferro said. This can be accomplished by reducing precision to consolidate storage and reduce the memory bandwidth needed, not just by adding more available bandwidth and processing power. Although many AI machine learning engines today are implemented on general purpose compute hardware such as CPUs and GPUs, they take up a significant portion of the network. As fast as these engines are, he said they are not necessarily the most efficient when it comes to memory bandwidth. “You’re not getting full use of that processing power because you have a bottleneck in the memory.”
With new engines being developed and hyperscale companies looking at how they can make their hardware more efficient, Ferro said a significant number of new ASICs are coming out that are tailored to a specific neural network, with both the compute power and memory bandwidth reflecting the needs of the problem being solved. “We see more ASICs starting to take some of the market share from the general-purpose compute engines.”
These ASICs require a significant amount of bandwidth, said Ferro, and that’s driving the emergence of new types of memory architectures and pushing the limits of DRAM for neural network training and running inference engines. Training can take days to weeks, and the amount of data will affect the quality of the network. As the networks are pushed out to end systems, the inference engines must be more efficient because they are more cost and power sensitive, he said, and different types of memory are needed across the network from the cloud right to the edge with different performance, cost and power requirements.
Until recently, the DDR interface was the only choice, but it hasn’t kept up with the requirements of the many emerging AI engines, said Ferro, and even when GDDR or HBM is chosen, they’re still the same underlying DDR technology with slight variations depending on the application. “Now we see GDDR being used in AI, especially for AI inference and also in automotive ADAS applications.” HBM’s 3D configuration allows for a very wide data bus and additional bandwidth out of the same memory technology.
For AI training and high-performance applications, HBM3 can deliver more than one terabyte per second with two DRAM stacks, said Ferro, and with four DRAM stacks, it hits 3.2 terabytes per second, which is significant memory bandwidth for AI and high-performance computing applications. Just as importantly, HBM3 delivers better power efficiency, he said, because the DRAM stack and the SoC are placed on a single package substrate. “You get very good area efficiency, but you also get very good power efficiency.”
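The multi-stack figures Ferro quotes can be sanity-checked with simple arithmetic. A short sketch, assuming HBM3’s nominal 819.2 GB/s per stack (1024 pins at 6.4 Gb/s):

```python
# Rough check of the multi-stack system bandwidth figures quoted above.
# Assumes each HBM3 stack delivers its nominal peak of 819.2 GB/s
# (1024-bit interface at 6.4 Gb/s per pin).

HBM3_STACK_GBPS = 6.4 * 1024 / 8  # 819.2 GB/s per stack

def system_bandwidth_tbps(num_stacks: int) -> float:
    """Aggregate peak bandwidth in TB/s across num_stacks HBM3 stacks."""
    return num_stacks * HBM3_STACK_GBPS / 1000

print(system_bandwidth_tbps(2))  # ~1.6 TB/s: "more than one terabyte per second"
print(system_bandwidth_tbps(4))  # ~3.3 TB/s, in line with the ~3.2 TB/s cited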
Even with more efficient memory, a lot of power in AI systems still goes to moving data to where it needs to be. Going beyond HBM, a means to become even more efficient might be the use of processing in memory (PIM) along with HBM. Samsung’s Kim said PIM is revolutionizing system design architecture to improve energy efficiency and performance by adding value to existing memory characteristics in accelerating many computing tasks of the CPU and GPU. “An HBM-PIM system is able to deliver around 2.5 times the system performance while reducing energy consumption by 62%.” He said Samsung believes the adoption rate of PIM will accelerate as energy efficiency requirements become more critical, just as they did with edge computing and searches.
Handy said PIM is an amazing tool that has been bandied about since memory chips first arrived, since they have immense internal bandwidth that doesn’t make it out of the package. However, the technique of adding processors to memory chips has a few important hurdles to overcome, he said. “It ships in low volume, and that drives up the cost; the current ecosystem doesn’t support it, and application program support is a necessity; the processors that have so-far been added to these chips use proprietary architectures, so any would-be user has to make a big commitment to a single supplier.”
Over the long run, Handy doesn’t expect to see successful products being sold as PIMs, but similar parts might succeed if they are positioned as processors with gigantic DRAM caches. “This has happened with SRAM,” he said. “Today’s processor chips allocate about half of their die area to SRAM, yet nobody refers to these as PIM chips.”
This article was originally published on EE Times.
Gary Hilson is a general contributing editor with a focus on memory and flash technologies for EE Times.