UT Austin says a chip it has developed can, in simulations, beat the V100 on training neural nets. The claim came from one of several revealing papers at the recent SysML event.
PALO ALTO, Calif. — A researcher from the University of Texas at Austin described a chip for training deep neural networks that he said can outperform an Nvidia V100, even when using low-cost mobile DRAM. At the same event, Arm discussed research on a chip that can significantly increase efficiency for computer vision jobs run on mobile systems.
The papers were among more than 30 at the second annual SysML, a gathering at Stanford of top researchers grappling with systems-level issues in deep learning. Their work showed that it’s still early days for the fast-moving field, with engineers continuing to discover fundamental techniques and applications for this new form of computing.
Speakers showed a willingness to talk candidly about their techniques, prototype chips, and applications in the interests of moving the emerging field forward.
IBM presented techniques for reducing neural-net precision down to 2 bits without significant loss of accuracy. For its part, Facebook showed an approach for saving costs by storing recommendation models in solid-state drives rather than DRAM.
In one of the most noteworthy papers, a researcher from UT Austin described mini-batch serialization (MBS), a method to slash the memory accesses needed to train convolutional neural networks (CNNs) so that more work fits in on-chip buffers. When implemented on the team’s own WaveCore chip, the technique reduced DRAM traffic 75%, improved performance 53%, and saved 26% of system energy compared to conventional approaches and accelerators.
“We expect a single WaveCore with MBS to achieve higher performance than one [Nvidia] V100 GPU … even with a low-cost LPDDR4 DRAM (the same DRAM used for mobile phones), WaveCore can outperform a high-end V100 GPU,” the researchers wrote.
The MBS technique splits batches into small units with higher data reuse across network layers. The WaveCore accelerator, so far available only in a simulator, supports registers for two weights in its processing elements so that a second process can begin as soon as the first is completed.
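As a rough illustration of the serialization idea, the sketch below picks the largest sub-batch whose activations fit in an on-chip buffer and then runs a group of layers over one sub-batch at a time. The layer sizes, the buffer capacity, and the function names are hypothetical and are not taken from the paper, and simple element-wise functions stand in for real convolution layers.

```python
from math import ceil

def largest_sub_batch(per_sample_bytes, buffer_bytes, batch_size):
    """Largest number of samples whose inter-layer activations fit on chip.

    per_sample_bytes: activation footprint of one sample at each layer
    boundary in the group (hypothetical numbers, not from the paper).
    """
    worst = max(per_sample_bytes)          # the widest layer sets the limit
    fit = buffer_bytes // worst            # samples that fit in the buffer
    return max(1, min(batch_size, fit))    # always process at least one sample

def serialize(batch, sub_batch, layers):
    """Run every layer in the group on one sub-batch before the next one."""
    outputs = []
    for start in range(0, len(batch), sub_batch):
        chunk = batch[start:start + sub_batch]
        for layer in layers:               # intermediates cover one chunk only
            chunk = [layer(x) for x in chunk]
        outputs.extend(chunk)
    return outputs

# Assumed example: 32-sample mini-batch, 10 MiB buffer, three layer boundaries.
MIB = 1024 * 1024
sizes = [3 * MIB, 2 * MIB, 1 * MIB]
sub = largest_sub_batch(sizes, 10 * MIB, 32)
iterations = ceil(32 / sub)
result = serialize(list(range(32)), sub, [lambda x: x + 1, lambda x: x * 2])
```

With these assumed numbers, three samples fit in the buffer at once, so the 32-sample batch is processed in 11 serialized passes. The output matches what processing the whole batch layer by layer would produce; only the order of work changes.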
Sangkug Lym, a former SK Hynix engineer now earning a doctorate at UT, said that the team is considering whether to commercialize WaveCore. Overall, “we used a combination of algorithms, scheduling, and architecture to make training more efficient,” he added.
Researchers from Duke and the University of Michigan collaborated with UT on the MBS/WaveCore paper. (Source: SysML)
Arm eases computer vision for mobile systems
Arm, meanwhile, detailed FixyNN, a novel approach to handling computer vision within the budget of a mobile system in as little as 2.18 mm² of silicon. Its key ingredient is a front-end hardware block that accelerates the task of extracting fixed-weight features from CNNs before passing them to a more conventional processing block.
In simulations, FixyNN achieved 26.6 TOPS/W, 4.81× better than similar-sized programmable chips, Arm claimed. Researchers created an open-source tool flow called DeepFreeze that automatically generates and optimizes the fixed-weight feature extractor (FFE) block from a TensorFlow description.
In addition, researchers used transfer learning to create a single FFE to train a number of different back-end models for different data sets. The resulting models had an accuracy loss of less than 1% and still hit 11.2 TOPS/W, twice the level of existing chips.
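The idea of one fixed front end serving several back ends can be sketched in a few lines. This is a toy illustration only: the random front-end weights, the perceptron back ends, and the two synthetic tasks are assumptions made for the example and have nothing to do with Arm's actual networks or with DeepFreeze.

```python
import random

random.seed(0)

# Fixed-weight front end: these weights are never updated (frozen).
FRONT_END = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

def extract(x):
    """Shared feature extractor: fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FRONT_END]

def train_head(samples, labels, epochs=20, lr=0.1):
    """Train a small perceptron back end on top of the frozen features."""
    weights, bias = [0.0] * len(FRONT_END), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = extract(x)
            pred = 1 if sum(w * v for w, v in zip(weights, f)) + bias > 0 else 0
            err = y - pred                 # only the head's parameters change
            weights = [w + lr * err * v for w, v in zip(weights, f)]
            bias += lr * err
    return weights, bias

data = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(50)]
frozen_copy = [row[:] for row in FRONT_END]
head_a = train_head(data, [1 if x[0] > 0 else 0 for x in data])   # task A
head_b = train_head(data, [1 if x[1] > 0 else 0 for x in data])   # task B
```

Because training touches only the back-end weights, the front end could in principle be committed to fixed hardware while each task keeps its own small programmable head.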
Like UT’s WaveCore, Arm’s FixyNN is still a research project, and both achieve their advances through work spanning software and silicon.
“We’re basically trying to understand what the limits are on compute specialization for machine-learning applications,” said Paul Whatmough, who presented the paper. “We think it’s going to take an algorithm-hardware co-design approach to make significant progress, and FixyNN was one of our first projects along these lines.”
Arm’s FixyNN uses a front-end block to extract CNN features. (Source: SysML)
IBM researchers showed ways to clip activations and organize weights to create neural nets in a two-bit precision format “that achieves state-of-the-art classification accuracy (comparable to full-precision networks) across a range of popular models and data sets,” they said.
The paper was one of many at and beyond the event exploring ways to streamline formats, weights, or activations to simplify deep-learning computations. The IBM paper showed examples in which the technique degraded accuracy by less than 3% yet delivered speedups of 2.7× to 3.1×.
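To make the notion of clipped two-bit activations concrete, here is a minimal sketch of a generic uniform quantizer. It is not IBM's method: the clipping threshold is a hand-picked constant here, and the function name and sample values are assumptions chosen for illustration.

```python
def quantize_activation(x, clip, bits=2):
    """Clip x to [0, clip], then snap it to one of 2**bits uniform levels."""
    levels = 2 ** bits - 1                 # 3 steps give 4 levels for 2 bits
    clipped = min(max(x, 0.0), clip)       # clipping bounds the dynamic range
    step = clip / levels
    code = round(clipped / step)           # integer code in 0..levels
    return code, code * step               # stored code and dequantized value

codes = [quantize_activation(v, clip=3.0)[0] for v in [-1.0, 0.4, 1.6, 2.9, 7.5]]
# codes is [0, 0, 2, 3, 3]
```

Every activation ends up as one of four codes, so each value needs only two bits of storage. Values outside the clipping range saturate at the lowest or highest code.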
Facebook described techniques for moving the vector models used in its recommendation systems from higher-cost DRAM into NAND flash. The paper also gave a rare glimpse into the workings of recommendation systems widely used by web services.
To compensate for NAND’s slower response times, the so-called Bandana system stores models likely to be read together in the same physical location. It simulates dozens of small caches to determine which data to keep in DRAM for fast access. The net result boosts NAND read bandwidth by two to three times, a Facebook researcher said.
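The benefit of storing co-accessed data together can be shown with a toy count of block reads. The sketch assumes that reading any vector pulls in its whole storage block; the access trace, the two layouts, and the block size are hypothetical, and the code does not attempt to reproduce how Bandana chooses placements or sizes its DRAM cache.

```python
def block_reads(layout, requests, block_size):
    """Count NAND block reads when each request fetches whole blocks."""
    position = {vec: i // block_size for i, vec in enumerate(layout)}
    return sum(len({position[v] for v in req}) for req in requests)

# Hypothetical access trace: each request reads vectors that tend to co-occur.
requests = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["a", "b"]]

naive = ["a", "c", "b", "d"]        # co-accessed vectors land in different blocks
grouped = ["a", "b", "c", "d"]      # co-accessed vectors share a block

naive_reads = block_reads(naive, requests, block_size=2)      # 10 block reads
grouped_reads = block_reads(grouped, requests, block_size=2)  # 5 block reads
```

In this contrived trace, grouping halves the number of block reads, which is the same effect as doubling the useful bandwidth of each read.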