How AI Impacts Memory & Interconnect Technology

Article By : Gary Hilson

When it comes to AI and machine learning applications, it's increasingly important where data, and the memory that stores it, needs to reside.

Location, location, location is not a mantra limited to real estate. To meet the needs of artificial intelligence (AI) and machine learning applications, it increasingly applies to where data needs to reside, and the memory that stores it.

But it’s not just the job of memory vendors to address these placement challenges; other AI stakeholders have a role to play, and a big part of the solution is memory interconnects, even as memory moves closer to compute. “We all work in different aspects of AI,” said Rambus fellow Steve Woo, who recently hosted an online roundtable discussion at the AI Hardware Summit on the challenges and solutions for memory interconnects.

Steve Woo, Rambus fellow

With events such as Oktoberfest being cancelled due to the COVID-19 pandemic, a beer analogy isn’t a bad one for memory interconnects, said Igor Arsovski, CTO of the ASIC business unit at Marvell, who started his career 17 years ago as an SRAM designer. Attending a beer festival makes a pint easily accessible. “SRAM is like having a beer that’s right next to you. It’s really accessible, it’s low energy to access it, and as long as it’s all you need, then you will have a nice high-performance accelerator.” But if that’s not enough memory, you end up going further afield, and that costs a lot more energy, much like traveling farther to get beer in larger quantities.

Igor Arsovski, CTO of the ASIC business unit at Marvell

In memory terms, that might be High Bandwidth Memory (HBM), which is increasingly being adopted for AI, said Arsovski. “It costs you about 60 times more energy to access that memory. There’s a lot more capacity there, but the bandwidth to access it is also significantly diminished.” The beer analogy can be extended to technologies such as LPDDR, which offers far more capacity than SRAM. “The power is significantly higher, but you can pack even more capacity,” he said. “That’s like going down the road to your favorite bar where there are kegs of beer.”
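The hierarchy Arsovski describes can be made concrete with a back-of-the-envelope calculation. This is only an illustrative sketch: the 60x SRAM-to-HBM gap comes from his remarks above, while the LPDDR figure and capacities are assumed placeholders, not numbers from the article.

```python
# Relative energy-per-bit and capacity for each memory tier.
# SRAM is the baseline (1.0); HBM's ~60x factor is from the article,
# the LPDDR energy and all capacity figures are assumptions.
tiers = {
    "SRAM":  {"energy_per_bit": 1.0,   "relative_capacity": 1},
    "HBM":   {"energy_per_bit": 60.0,  "relative_capacity": 1000},
    "LPDDR": {"energy_per_bit": 120.0, "relative_capacity": 4000},
}

def access_energy(tier: str, bits: int) -> float:
    """Total relative energy to move `bits` of data from the given tier."""
    return tiers[tier]["energy_per_bit"] * bits

# Moving the same megabit costs far more the further out you go.
for name, info in tiers.items():
    print(f"{name:>5}: {access_energy(name, 1_000_000):>13,.0f} relative energy, "
          f"{info['relative_capacity']:>5}x capacity")
```

The point of the sketch is the shape of the tradeoff, not the exact figures: each step out in the hierarchy buys orders of magnitude more capacity at orders of magnitude more energy per access.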

The next generation of accelerators is heading toward bringing those kegs right above the accelerator, packing that memory closer to the compute, said Arsovski. His beer analogy outlines the different avenues available in terms of packaging and where you put different pieces of silicon, added David Kanter, executive director of MLCommons, an organization that provides machine learning standards and inference benchmarks, with members encompassing academia and industry. “That gives us a really nice broad view of the different workloads,” he said. “One of the things that we’re starting to morph the organization to focus on a bit is building out advisory boards to bring in some of the deep expertise in specific application areas.”

David Kanter, executive director, MLCommons

When it comes to memory, said Kanter, the overall system context matters. “You have to think about what you’re trying to do with the system, and that should drive how you think about things.” The die, package, and the board are all elements that must be thought about when it comes to where memory gets placed and connected, he said. “There are a lot of different corners that you can optimize in terms of array structure, in terms of cell type, in terms of how far away it is.”

Understanding where you need bandwidth and non-volatility are also key considerations, said Kanter. “Hopefully that will guide you to the right choice.”

These considerations are critical for companies that weren’t traditionally part of the overall memory system building process. Google software engineer Sameer Kumar is spending a lot of his time working on compilers and scalable systems where network and memory bandwidth are critical for different machine learning models, including the ability to run them at scale with large batch sizes. “AI training has a lot of memory optimization involved,” he said, and memory optimization is the most crucial step in a compiler for achieving very high efficiency, which means memory needs to be smarter.

Sameer Kumar, software engineer, Google

Memory interconnects are increasingly important because data movement is really starting to dominate certain phases of AI applications, said Woo. “It’s a growing problem in terms of the performance and power efficiency.” It’s challenging to increase data rates, as everyone would also like to keep doubling the speed of data movement and keep doubling the power efficiency as well, he said. “Many of the tricks and the techniques we’ve relied on are no longer available to us or they’re slowing down. There is tremendous opportunity to think about new architectures and to innovate in the way we move data.”

Woo said that not only includes innovation in the memory devices themselves, but also in the packages and with new techniques such as stacking, while also keeping in mind data security, something Rambus sees as a growing concern.

Rambus has seen a lot of interest in 3D stacking, but without an increase in the bandwidth commensurate with the increased capacity of the stack, there are limits to usability. (Image source: Rambus)

Arsovski said Marvell is spending a lot of time with customers building AI systems, providing them with information on how much bandwidth they can move per chip edge and how much bandwidth they have to memory. “What we’ve seen so far is that our customers need more memory bandwidth and more I/O bandwidth,” he said. “If you look at how the package-level interconnects scale, there’s a huge mismatch. We’ve hit this bottleneck right now, and there’s a constant push for higher-end chip-to-chip connections.”

From a memory perspective, said Arsovski, for those building AI models that can’t fit on their die, the next step is HBM or GDDR, but there’s also a lot of interest in moving to 3D, stacking up to get more bandwidth, because you can only move so much on a chip edge. “Customers want more and more I/O bandwidth, and we’re hitting a wall with how much we can move on the edge.”

Kanter said it’s important to keep in mind that there is “huge diversity” even within the world of machine learning in terms of what people are doing, which drives both constraints and variations in the ecosystem. A random lookup into an incredibly large data structure isn’t going to fit into the regular DRAM of a single node, which means needing to build incredibly large clusters of systems if you want to actually hold that in memory. “That has a very different character and property than your classic vision-oriented models,” he said. “It’s very important to keep this variety in mind on the memory side.”
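A quick sizing exercise shows why such lookup-heavy workloads outgrow a single node. Every number here is an illustrative assumption (table dimensions, value type, DRAM per node), not data from the article; the point is only the order of magnitude.

```python
# Hypothetical sizing of a large lookup table (e.g., embeddings for a
# recommendation-style model). All figures below are assumed for illustration.
rows = 10_000_000_000      # 10B sparse feature IDs (assumed)
dim = 128                  # vector width per row (assumed)
bytes_per_value = 4        # float32

table_bytes = rows * dim * bytes_per_value
print(f"Table size: {table_bytes / 1e12:.2f} TB")

node_dram_bytes = 512 * 10**9   # assumed DRAM per server node (512 GB)
nodes_needed = -(-table_bytes // node_dram_bytes)  # ceiling division
print(f"Nodes needed just to hold it in DRAM: {nodes_needed}")
```

Under these assumptions the table alone is several terabytes, which is Kanter's point: unlike a vision model whose weights fit comfortably on one accelerator, holding such a structure in memory forces a cluster, and the interconnect between nodes becomes part of the memory system.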

The interconnect comes in when there’s a need to pull together a lot of memory and compute, said Kanter. “To train at scale, you really do need an interconnect that is both appropriate for the customer and appropriate for the problem.” The interconnect is going to be particularly important for those on the leading edge, he said. “If you only want to train with one GPU for a small network, then probably the critical dimension is really the memory bandwidth.”

Kumar said more memory bandwidth may enable different kinds of optimization, but if a model is particularly memory bound, it might make sense to bring in more compute. “If you have more memory throughput available, or maybe even more interconnect throughput available, it might make the model design more flexible and enable new features and different kinds of models altogether.”
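Whether extra bandwidth or extra compute helps, as Kumar describes, is the classic roofline question: it depends on the model's arithmetic intensity relative to the machine's balance point. The sketch below uses assumed hardware numbers purely for illustration.

```python
# Minimal roofline-style check. Hardware figures are assumptions:
# a 100 TFLOP/s accelerator with 1 TB/s of memory bandwidth.
peak_flops = 100e12    # peak compute, FLOP/s (assumed)
peak_bw = 1.0e12       # peak memory bandwidth, bytes/s (assumed)
machine_balance = peak_flops / peak_bw   # FLOPs per byte the machine can feed

def attainable_flops(arithmetic_intensity: float) -> float:
    """Attainable FLOP/s for a kernel doing `arithmetic_intensity` FLOPs per byte."""
    return min(peak_flops, peak_bw * arithmetic_intensity)

# A low-intensity kernel is memory bound (more bandwidth helps);
# a high-intensity one is compute bound (more FLOPs help).
for ai in (10, 500):
    bound = "memory" if ai < machine_balance else "compute"
    print(f"AI={ai:>3} FLOPs/byte -> {attainable_flops(ai) / 1e12:.0f} TFLOP/s ({bound} bound)")
```

On this assumed machine the balance point is 100 FLOPs per byte: below it, adding memory (or interconnect) throughput raises attainable performance directly, which is exactly the flexibility in model design Kumar points to.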

Woo said Rambus has seen a lot of interest in 3D stacking, but one of the challenges is that as you go higher, it’s harder to keep increasing the bandwidth up and down through that stack. “You end up increasing the capacity of the stack, but if you don’t have that commensurate increase in the bandwidth, then there’s a question about exactly how usable that solution really is.”

The Holy Grail, he said, is finding something that has the form factor and power efficiency benefits of stacking while maintaining the bandwidth, so there’s a relatively constant ratio of bandwidth to capacity in the stack.
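Woo's ratio argument is easy to see numerically. In the sketch below, the per-layer capacity and the fixed per-stack bandwidth are assumed illustrative values, not figures from the article or any specific HBM generation.

```python
# Sketch of the 3D-stacking tradeoff: capacity scales with layer count,
# but if bandwidth through the base of the stack is fixed, bandwidth
# per unit of capacity falls. All numbers are illustrative assumptions.
layer_capacity_gb = 16        # DRAM capacity per stacked layer (assumed)
stack_bandwidth_gbs = 800     # fixed bandwidth through the stack base (assumed)

def bw_per_gb(layers: int) -> float:
    """Bandwidth available per gigabyte of capacity, in GB/s per GB."""
    return stack_bandwidth_gbs / (layers * layer_capacity_gb)

for layers in (4, 8, 16):
    print(f"{layers:>2} layers: {layers * layer_capacity_gb:>4} GB capacity, "
          f"{bw_per_gb(layers):.2f} GB/s per GB")
```

Doubling the layer count halves the bandwidth-to-capacity ratio, which is the usability question Woo raises: the extra capacity is only as valuable as your ability to keep it fed.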

Both Kumar and Arsovski see a need for a balanced, scalable system with a well-designed software stack. “We’re describing a human brain-like structure that really scales well,” said Arsovski. It has to be low power with lots of connectivity, and right now, the closest we get to it is through 3D stacking, but there remain power, packaging, and mechanical challenges. “We need to figure out a very parallel system that’s very low power at each of these layers, so you don’t have to worry about thousands of watts that need to be cooled.”

He said it’s time to start looking at what the next base building block is for AI systems. “We’ve been working with transistors and they’ve done a great job up until now. We’re stuck with one technology that we know and love, and we want to keep going with it. It’s almost like we need to rethink the device from the bottom up.”

Gary Hilson is a general contributing editor with a focus on memory and flash technologies for EE Times.
