Elon Musk is going all-in on neural networks. Here's an overview of the chips, systems and software for neural network training.
Tesla’s AI Day in mid-August featured the introduction of chips, systems and software for machine learning and neural network training. Together, they will advance the training of models destined for self-driving cars.
Elon Musk and his team of chip and system designers provided a truck-load of technical details during a more than three-hour presentation that can be viewed here. (At last count, the presentation had attracted 1.63 million views.)
Below are the highlights.
Tesla has designed a flexible and expandable distributed computer architecture tailored to neural network training. Tesla’s architecture starts with the special-purpose D1 chip, which contains 354 training nodes, each with a powerful CPU. These training-node CPUs are designed for high-performance NN and ML tasks and have a max performance of 64 GFLOPs for 32-bit floating-point operations.
For the D1 chip, with 354 CPUs, the max performance is 22.6 TFLOPs for 32-bit floating point arithmetic. For 16-bit floating point calculations, the D1 max performance jumps to 362 TFLOPs.
Tesla introduced two systems for neural network training: the Training Tile and the ExaPOD. A training tile packages 25 interconnected D1 chips in a multi-chip module. With 25 D1 chips, a training tile constitutes 8,850 training nodes, each with the high-performance CPU summarized above. The max performance of a training tile is 565 TFLOPs for 32-bit floating-point calculations.
The ExaPOD connects 120 training tiles into a system, or 3,000 D1 chips with 1.062 million training nodes. The max performance of an ExaPOD is 67.8 PFLOPs for 32-bit floating point calculations.
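The scaling arithmetic behind these figures is easy to verify. The short script below uses only the numbers Tesla quoted (per-node FLOPs, nodes per chip, chips per tile, tiles per ExaPOD); the small differences from the quoted 22.6/565/67.8 figures come from rounding at each level.

```python
# Scaling hierarchy from Tesla's stated AI Day figures.
NODE_FP32_GFLOPS = 64       # per training node
NODES_PER_D1 = 354
D1_PER_TILE = 25
TILES_PER_EXAPOD = 120

d1_tflops = NODE_FP32_GFLOPS * NODES_PER_D1 / 1000     # GFLOPs -> TFLOPs
tile_tflops = d1_tflops * D1_PER_TILE
exapod_pflops = tile_tflops * TILES_PER_EXAPOD / 1000  # TFLOPs -> PFLOPs

print(f"D1 chip: {d1_tflops:.1f} TFLOPs FP32")      # 22.7 TFLOPs
print(f"Tile:    {tile_tflops:.0f} TFLOPs FP32")    # 566 TFLOPs
print(f"ExaPOD:  {exapod_pflops:.1f} PFLOPs FP32")  # 68.0 PFLOPs
```

The node counts follow the same pattern: 25 × 354 = 8,850 nodes per tile, and 120 × 25 = 3,000 chips (1,062,000 nodes) per ExaPOD.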
Tesla neural network announcement details
The introduction of the D1 chip and the Dojo neural network training system shows Tesla’s direction. The R&D investment to get these products into production is undoubtedly very high. Tesla is likely to share this technology with other companies to create another revenue stream, similar to the BEV credits it sells to other OEMs.
The next table lists the characteristics of Tesla’s neural network product announcements. The data has been extracted from the video of the August 19 event. I have added my understanding of the chip and system architecture in a few places.
Tesla’s design goal was to scale three characteristics across its chips and systems: compute performance, high bandwidth and low-latency communication between compute nodes. High bandwidth and low latency have always been difficult to scale to hundreds or thousands of compute nodes. It looks like Tesla has succeeded in scaling all three, with the nodes organized in a connected 2D mesh.
The training node is the smallest training unit on the D1 chip. It has a 64-bit CPU with a 4-wide scalar pipeline and 4-way multi-threaded program execution. The CPU also has a 2-wide vector data path with an 8×8 vector multiplication unit.
The instruction set architecture (ISA) of the CPU is tailored to machine learning and neural network training tasks. The CPU supports multiple floating-point formats: 32-bit FP32, 16-bit BFP16 and a new configurable 8-bit format called CFP8.
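Tesla did not publish CFP8’s exact encodings, but the range-versus-precision trade-off behind a configurable 8-bit float can be illustrated with standard IEEE-style arithmetic. The E5M2 and E4M3 splits below are common industry FP8 choices, used here purely as illustrative assumptions, not as Tesla’s actual formats.

```python
def fp8_max_normal(exp_bits: int, man_bits: int) -> float:
    """Largest normal value of an IEEE-style 8-bit float with 1 sign bit,
    exp_bits exponent bits and man_bits mantissa bits (top exponent code
    reserved for inf/NaN, as in IEEE 754)."""
    assert 1 + exp_bits + man_bits == 8
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias          # largest usable exponent
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

# More exponent bits buy dynamic range at the cost of mantissa precision.
print(fp8_max_normal(5, 2))   # E5M2: 57344.0
print(fp8_max_normal(4, 3))   # E4M3: 240.0 (IEEE-style; some FP8 variants reclaim codes)
```

A configurable format lets the hardware pick the split per tensor: wide range (E5M2-like) for gradients, more precision (E4M3-like) for weights and activations.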
The processor has 1.25 MB of high-speed SRAM for program and data storage. The memory uses error-correcting code (ECC) for increased reliability.
To get low latency between training nodes, Tesla picked the farthest distance a signal could travel in one cycle of a 2GHz+ clock. This defined how close together the training nodes had to be and how complex the CPU and its support electronics could be. These parameters also allow a CPU to communicate with its four adjacent training nodes at 512 Gbit per second.
The maximum performance of the training node depends on the arithmetic used; floating-point performance is the usual basis for comparison. The max training-node performance for 32-bit floating point (FP32) is 64 GFLOPs. The max performance for BFP16 or CFP8 arithmetic is 1,024 GFLOPs.
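Dividing the quoted per-node throughput and link-rate figures by the clock gives a rough per-cycle picture of each node. This is a back-of-the-envelope check that assumes the "2GHz+" clock is exactly 2 GHz.

```python
CLOCK_HZ = 2e9   # "2GHz+" clock, taken as exactly 2 GHz for this estimate

# Compute: 64 GFLOPs FP32 and 1,024 GFLOPs BFP16/CFP8 per node.
fp32_per_cycle = 64e9 / CLOCK_HZ     # 32 FP32 operations per cycle
fp8_per_cycle = 1024e9 / CLOCK_HZ    # 512 BFP16/CFP8 operations per cycle

# Communication: 512 Gbit/s to each of the four neighboring nodes.
bytes_per_cycle_per_link = 512e9 / CLOCK_HZ / 8   # 32 bytes per cycle per link

print(fp32_per_cycle, fp8_per_cycle, bytes_per_cycle_per_link)
```

The 16× ratio between the FP32 and 8/16-bit figures is consistent with a vector unit that packs proportionally more narrow operations per cycle.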
The impressive Tesla D1 chip is a special-purpose design for neural network training. Manufactured in a 7nm process, the D1 packs 50 billion transistors in a die measuring 645 square millimeters. The chip has over 11 miles of wires and power consumption in the 400-watt range.
The D1 chip is surrounded by an I/O ring of high-speed, low-power SerDes, 576 lanes in total. Each lane has a transfer rate of 112 Gbps. The maximum D1 on-chip transfer rate is 10 Tbps (10 terabits per second), and the maximum off-chip transfer rate is 4 Tbps for each side of the chip.
With each of the 354 CPUs on a D1 chip having 1.25 MB of SRAM, the chip carries over 442 MB of SRAM in total. The maximum performance of the D1 chip is likewise determined by its array of 354 training nodes.
D1 max performance for 32-bit floating-point calculations reaches 22.6 TFLOPs. Max performance for 16-bit floating-point calculations is 362 TFLOPs.
Tesla’s Training Tile is the building block for scaling AI training systems. A training tile integrates 25 D1 dies onto a wafer, packaged as a multi-chip module (MCM); Tesla believes it may be the largest MCM in the chip industry. The training tile is packaged as one large unit that can be connected to other training tiles via high-bandwidth connectors that retain the tile’s internal bandwidth.
The training tile packaging includes multiple layers of power and control, current distribution, compute plane (25 D1 chips) and cooling system. The training tile is for use in IT centers—not in autonomous vehicles.
The training tile provides 25X the performance of a single D1 chip: up to 9 PFLOPs for 16-bit floating-point calculations and up to 565 TFLOPs for 32-bit floating-point calculations.
Twelve training tiles in a 2x3x2 configuration can be packed into a cabinet, which Tesla calls a Training Matrix.
The largest system Tesla described is the ExaPOD, built from 120 training tiles. That adds up to 3,000 D1 chips and 1.062 million training nodes, housed in 10 cabinets. It is clearly intended for IT center use.
ExaPOD maximum performance is 1.09 Exa FLOPs for 16-bit floating point calculations and 67.8 Peta FLOPs for 32-bit floating point calculations.
Dojo software & DPU
The Dojo software is designed to support training of both large and small neural networks. Tesla has a compiler that creates code leveraging the structure and capabilities of the training nodes, D1 chips, training tiles and ExaPOD systems. It uses the PyTorch open-source machine learning library, with extensions that exploit the D1 chip and Dojo system architecture.
These capabilities allow big neural networks to be partitioned and mapped to exploit model, graph and data parallelism, speeding up large neural network training. The compiler uses multiple techniques to extract parallelism: it can transform networks to achieve fine-grained parallelism and can optimize them to reduce memory footprint.
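Tesla did not show its compiler internals, but the core idea of one of these techniques, splitting a model’s layers across compute nodes for model parallelism, can be sketched in plain Python. The greedy balancing heuristic and the layer costs below are hypothetical stand-ins, not Tesla’s actual algorithm.

```python
def partition_layers(layer_costs, num_nodes):
    """Greedy contiguous split of layers into num_nodes roughly balanced
    groups, a toy stand-in for a compiler's model-parallel mapping pass."""
    total = sum(layer_costs)
    target = total / num_nodes            # ideal work per node
    groups, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        # Close a group once it reaches the target, keeping one group spare.
        if acc >= target and len(groups) < num_nodes - 1:
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)                # remaining layers go to the last node
    return groups

# Ten layers with made-up relative costs, split across 4 nodes.
costs = [1, 1, 4, 4, 2, 2, 8, 8, 2, 2]
print(partition_layers(costs, 4))
```

A real compiler would also weigh communication costs on the 2D mesh and memory footprint per node, which is why Tesla combines model, graph and data parallelism rather than relying on any single mapping.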
The Dojo interface processors are used to communicate with host computers in IT and datacenters. They connect to host computers via PCIe 4.0 and to the D1-based systems via the high-bandwidth links described above. The interface processors also provide high-bandwidth DRAM shared memory for the D1 systems.
D1-based systems can be subdivided and partitioned into units called Dojo Processing Units (DPUs). A DPU consists of one or more D1 chips, an interface processor and one or more host computers. The DPU is a virtual system that can be scaled up or down as needed by the neural network running on it.
The Tesla neural network training chip, system and software are very impressive. There is a lot of innovation, such as retaining tremendous bandwidth and low latency from the chip level up to the system level. The packaging of the Training Tile for power delivery and cooling also looks innovative.
The neural network training systems are for datacenter use and will certainly be used for improving Tesla’s AV software. It is likely that other companies will also use these Tesla neural network training systems.
A key question is how these neural network systems will be used for inference in AVs. The Training Tile’s power consumption looks too high for automotive use in its current version: one picture in the presentation carried a “15 KW Heat Rejection” label for the Training Tile, and a D1 chip is probably in the 400-watt range, per the TDP listed on a slide.
It looks like Tesla is hoping for, and depending on, this neural network training innovation to make Autopilot an L3- or L4-capable system using only camera-based sensors. Is this a good bet? Time will tell, but so far most of Elon Musk’s bets have been good, if sometimes delayed.
This article was originally published on EE Times.
Egil Juliussen has over 35 years’ experience in the high-tech and automotive industries. Most recently he was director of research at the automotive technology group of IHS Markit. His latest research was focused on autonomous vehicles and mobility-as-a-service. He was co-founder of Telematics Research Group, which was acquired by iSuppli (IHS acquired iSuppli in 2010); before that he co-founded Future Computing and Computer Industry Almanac. Previously, Dr. Juliussen was with Texas Instruments where he was a strategic and product planner for microprocessors and PCs. He is the author of over 700 papers, reports and conference presentations. He received B.S., M.S., and Ph.D. degrees in electrical engineering from Purdue University, and is a member of SAE and IEEE.