The memory extension system allows a single wafer-scale engine to train 120 trillion parameter models.
Cerebras showed off a new memory extension scheme aimed at its second-generation CS-2 AI accelerator during the Hot Chips event, allowing a single wafer-scale chip to train 120-trillion-parameter models. By comparison, the human brain has an estimated 100 trillion synapses.
“Our space has undergone some extraordinary transformations in the past two years,” Cerebras CEO Andrew Feldman told EE Times. “We saw more than a thousand times larger models, in terms of parameters, and more than a thousand times increase in the amount of time it takes to do the work.”
Feldman was referring to the evolution of huge natural language processing models, from BERT-base at 110 million parameters to models like GPT-3 at 175 billion and beyond. Training GPT-3 on 1,024 GPUs took four months and megawatts of power, for example.
To keep pace with this rapidly accelerating model growth, Cerebras has developed MemoryX, a memory extension system for the CS-2 compute engine that Feldman says retains a semblance of on-chip performance.
“We’re able to show extraordinarily high utilization on these very large networks through our memory access technology,” he said. MemoryX includes software to precisely schedule and perform weight updates to prevent dependency bottlenecks.
MemoryX, a combination of DRAM and flash storage, offers up to 2.4 petabytes of capacity, enough to hold 120 trillion parameters. This capacity boost enables training today’s largest known models “in a weekend” on a single CS-2 system “the size of a dorm room refrigerator,” he said.
Cerebras also introduced a second execution mode for its hardware.
For existing smaller models, the parameters (weights) are kept on the wafer and activation data is streamed through. In this pipelined execution mode, the entire model is loaded onto the CS-2 and processed with very low latency.
The new execution mode for extremely large models, weight streaming mode, means the activations are instead kept on-chip with parameters/weights streamed in from MemoryX. On the delta pass of training, gradients are streamed from the wafer back to the central store where they are used to update the weights.
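The weight-streaming flow described above can be sketched in plain NumPy. This is a toy illustration under stated assumptions, not Cerebras internals: the layer sizes, the ReLU layers, and the plain gradient-descent update are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Central store (the MemoryX role): all layer weights live off-chip.
weights = [rng.standard_normal((64, 64)) * 0.1 for _ in range(3)]

def forward(x, weights):
    """Activations stay on-chip; weights stream in one layer at a time."""
    acts = [x]
    for w in weights:                      # stream this layer's weights in
        x = np.maximum(x @ w, 0.0)         # ReLU layer (illustrative)
        acts.append(x)
    return acts

def backward_and_update(acts, grad_out, weights, lr=1e-2):
    """Delta pass: gradients stream back to the store, which updates weights."""
    for i in reversed(range(len(weights))):
        grad_out = grad_out * (acts[i + 1] > 0.0)   # ReLU gradient
        grad_w = acts[i].T @ grad_out               # streamed back off-chip
        grad_in = grad_out @ weights[i].T           # propagate before updating
        weights[i] -= lr * grad_w                   # update in the central store
        grad_out = grad_in

x = rng.standard_normal((8, 64))
acts = forward(x, weights)
backward_and_update(acts, acts[-1], weights)        # toy loss: 0.5 * ||out||^2
```

The key property the sketch preserves is that the wafer only ever holds activations plus one layer's weights at a time, while the full weight set and the weight updates live in the external store.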
Cerebras has also extended its on-chip communication fabric, Swarm, off the wafer as SwarmX, enabling clusters of up to 192 CS-2 systems. A cluster can now be used to train a single neural network.
Adding more CS-2s to a cluster scales performance near linearly, Feldman said.
“This is profoundly different than a cluster of GPUs, because the problem doesn’t fit on any one GPU — the problem would need to be cut up into many pieces,” he said. “But because the problem fits on a CS-2, each CS-2 has the same plan of attack for the problem. It’s not doing a subset, it’s doing the whole thing; the only thing that differs is that [two] CS-2s would get a different subset of the training data. As a result, the work gets done exactly twice as fast.”
The simplicity enabled by fitting each whole model layer onto a single CS-2 means upgrading from one to many CS-2s requires no software changes.
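The data-parallel scheme Feldman describes, in which every system runs the identical model on a different slice of the training data, can be sketched as a gradient-averaging loop. The linear model and helper names here are illustrative assumptions, not Cerebras software.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(16)                  # the full model fits on each system

def grad(w, x_batch, y_batch):
    """Least-squares gradient for a linear model (a stand-in for the network)."""
    return x_batch.T @ (x_batch @ w - y_batch) / len(y_batch)

x = rng.standard_normal((128, 16))
y = x @ np.ones(16)

n_systems = 2                                # e.g. two CS-2s in a cluster
shards = np.array_split(np.arange(128), n_systems)

# Every system runs the identical model; only its slice of the data differs.
# Averaging the per-shard gradients reproduces the full-batch gradient.
g = np.mean([grad(w, x[s], y[s]) for s in shards], axis=0)
g_full = grad(w, x, y)
assert np.allclose(g, g_full)

w -= 0.1 * g                                 # one synchronized update step
```

Because each shard is half the data, each system does half the multiply-accumulate work per step, which is the near-linear scaling the article describes.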
The new weight streaming execution mode gives Cerebras an advantage: the system has an intrinsic ability to exploit sparsity in parameters/weights, including unstructured and dynamic sparsity, two types that existing hardware has been unable to leverage.
As models rapidly increase in size, the ability to exploit sparsity is becoming more important. A neural network is “sparse” if some of its activations or weights are zero; since multiplying by zero always yields zero, those multiplications can be skipped, saving compute time and energy.
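The saving is easy to see in a small sketch: multiplications against zero weights contribute nothing to a dot product, so they can be skipped outright. This is a toy NumPy illustration of the arithmetic, not Cerebras’ dataflow logic.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal(1000)
w[rng.random(1000) < 0.9] = 0.0      # ~90% unstructured weight sparsity
x = rng.standard_normal(1000)

dense = np.dot(x, w)                 # dense: multiplies every zero anyway

nz = np.nonzero(w)[0]                # sparse: touch only the nonzero weights
sparse = np.dot(x[nz], w[nz])        # same answer, ~10% of the multiplies

assert np.isclose(dense, sparse)
```

Note the zeros here are scattered at random rather than grouped into blocks; hardware that can only skip structured blocks of zeros would get little benefit from this pattern.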
“By storing the weights and their state in the MemoryX technology and streaming these over the SwarmX technology, we are able to identify zeros without regard to their structure. That is, whether they are organized neatly in blocks or distributed randomly, as well as identifying new zeros that emerge during training,” Feldman said. “By avoiding these zeros in our calculations, we directly accelerate training and reduce the time to train a model.”
Sparsity in weights (as opposed to activation sparsity) has been difficult to leverage in hardware. Feldman said it is not well understood since “it doesn’t work on GPUs or TPUs”.
Cerebras’ wafer-scale engine is based on a fine-grained data flow architecture, which means its compute cores are capable of individually ignoring zeros regardless of the pattern in which they arrive. Current sparsity techniques work on pruning sparse blocks of computations, not individual zeros. This means Cerebras can take advantage of unstructured sparsity.
Unlike activation sparsity, weight sparsity is dynamic. That is, it emerges during training. Cerebras can prune branches with dynamic weight sparsity by algorithmically applying a sparse mask, which involves ignoring irrelevant branches of calculations, during the training process.
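Dynamic weight sparsity via masking can be sketched as follows. The magnitude-based mask used here is one common pruning criterion, assumed purely for illustration; the article does not specify which masking algorithm Cerebras applies.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal((32, 32))

def sparse_mask(w, keep=0.5):
    """Keep only the largest-magnitude weights; everything else becomes zero."""
    threshold = np.quantile(np.abs(w), 1.0 - keep)
    return np.abs(w) >= threshold

for step in range(5):                            # toy training loop
    w -= 0.01 * rng.standard_normal(w.shape)     # stand-in gradient update
    mask = sparse_mask(w, keep=0.5)              # recompute as sparsity emerges
    w *= mask                                    # pruned weights drop out of the math
```

Because the mask is recomputed each step, the set of zero weights changes as training progresses, which is what makes this sparsity “dynamic” rather than fixed ahead of time.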
“We love going bigger, but brute force scaling, even though it’s produced such extraordinarily great results so far, needs augmentation,” Feldman said. “You have to go bigger and you have to go smarter… sparsity enables you to achieve the answer using fewer flops.”
This article was originally published on EE Times.
Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EE Times Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master’s degree in Electrical and Electronic Engineering from the University of Cambridge.