MOUNTAIN VIEW, Calif. — The rubber is about to meet the road in what’s projected to be a $25 billion market for deep-learning accelerators. Data centers are testing chips in their labs now and expect to deploy some next year, likely picking different accelerators for different workloads.

So far, Graphcore, Habana, ThinCI, and Wave Computing are among the small subset of the roughly 50 vendors in the field whose chips customers are testing in their labs. Representatives from both camps, chip makers and data center operators, staked out their positions at the AI Hardware Summit here.

One issue becoming clear is that “there’s no such thing as a general-purpose compiler — these chip architectures are too different,” said Marc Tremblay, a distinguished silicon engineer in Microsoft’s Azure group, which operates more than a million servers.

The data center giant is developing its own runtime, called Lotus, to map AI graphs onto the hardware. Last week, Facebook announced backing for Glow, its own attempt at a generic deep-learning compiler.

Data centers are hungry for big leaps in AI performance beyond Nvidia’s Volta, the king of training accelerators today. “Some training jobs take 22 days to run on GPUs, and one takes more than two months, but we’d like an answer over lunch,” said Tremblay in a keynote here.

One speech-recognition app uses 48 million parameters. Researchers are also working on neural nets that generate their own models using asymmetrical connections, pushing compute requirements to new levels.

“We need 10 to 50 times more bandwidth to support more esoteric neural nets coming up,” said Tremblay.

Today’s GPUs are pricey and power-hungry at about $400,000 for a 16-chip system that requires heat sinks even for its switch chips. Getting linear scaling on clusters of the chips “sometimes requires work that our engineers don’t want to do,” he said.

For now, Microsoft is using V100 and prior-generation GPUs and “paying attention” to the T4 chip that Nvidia announced last week. It looks promising for running multiple neural networks simultaneously, noted Tremblay.

In addition, Microsoft and other data center giants run many deep-learning jobs on their big banks of x86 CPUs. “For us, it’s often free because the x86 chips are not running all the time,” he said, noting that software optimizations and hardware features such as the new AI instructions in Intel’s Cascade Lake will drive advances for many years.

Looking forward, data centers are likely to adopt multiple accelerators, each mapped to the specific workloads it fits best. Tremblay outlined a variety of speech, vision, language, search, and other AI apps, each with its own latency and throughput requirements.

Keynoter Tremblay outlined the landscape of AI silicon. (Image: Microsoft)

Some apps use as many as 20 types of neural nets, making flexibility across models a requirement. Batch sizes also range from one for latency-sensitive Bing searches to more than 100 for other apps. Thus, Tremblay assigns the chips that he tests a robustness number, a measure of their flexibility.

Among the bugaboos, “startups forget about things like security and virtualization,” he said. “They don’t need to have everything on Day 1, but eventually we have to get into the class of features that we have with mature CPUs and GPUs.”

Overall, the good news in data center AI is that “we have a long way to go, but progress has been incredible … there are lots of innovations coming, and the future is bright for AI,” he concluded.

Wave Computing stood out among the startups for providing details of its architecture. Like rival Cerebras, it will sell full systems because the performance gains it targets require advances beyond the processor itself.

Specifically, Wave’s current 16-nm processor uses the 15-GByte/s ports on HMC memory to link four chips on a board and four boards in a system. The memory and its interconnect are key to streaming graphs through clusters of its processors, avoiding the latency of being fed by a host processor over a relatively narrow PCI Express bus.

Wave chose HMC in part out of expediency. The startup had a strategic alliance with HMC vendor Micron, and rival HBM memory seemed too complex and risky for a relatively small startup.

About six companies in markets such as finance, video-on-demand, and manufacturing are now testing a partial-rack system geared for use in their IT departments. To serve big data centers such as Microsoft’s, the company needs a full-rack system that will be based on a next-generation 7-nm processor using HBM.

The initial Wave system uses HMC to connect four quad-processor boards. (Images: Wave Computing)

The startup is still working out how it will make the shift from the serial HMC to the parallel HBM memory as its key interconnect. While HMC sports multiple ports, HBM is typically configured with one fast port running up to 307 Gbytes/s based on 2.4 Gbits/s from each of its 1,024 I/O pins.
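
As a rough check on those figures, the short Python sketch below works through the arithmetic. The HBM pin count, per-pin rate, and the 15-GByte/s HMC port figure come from the numbers above; the number of serial links per HMC cube is an assumption added here for illustration.

# Back-of-the-envelope check of the memory-bandwidth figures cited above.
# HBM pin count, per-pin rate, and the HMC per-port figure come from the
# article; the HMC link count per cube is an assumption for illustration.

HBM_PINS = 1024            # I/O pins on an HBM stack's wide parallel interface
HBM_GBITS_PER_PIN = 2.4    # Gbits/s per pin

hbm_gbytes_s = HBM_PINS * HBM_GBITS_PER_PIN / 8
print(f"HBM stack: {hbm_gbytes_s:.1f} GBytes/s on one wide port")    # ~307.2

HMC_LINKS = 4              # assumed serial links per HMC cube
HMC_GBYTES_PER_LINK = 15   # per the article

hmc_gbytes_s = HMC_LINKS * HMC_GBYTES_PER_LINK
print(f"HMC cube: {hmc_gbytes_s} GBytes/s spread across {HMC_LINKS} links")

The contrast illustrates the trade-off Wave faces: HBM delivers far more raw bandwidth per stack, but through one wide parallel interface rather than the handful of serial links it now uses to stitch chips and boards together.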

The initial focus on corporate end users forced Wave to evolve into a services business. It set up a 20-person team in the Philippines as part of a center that will help IT departments learn how to develop their own deep-learning models, something that data scientists on staff at big data centers do themselves.

Interestingly, Wave started in 2009 with a team incubated at Tallwood Venture Capital, three years before the deep-learning boom. At the time, it aimed to build a more efficient alternative to FPGAs that could be programmed in high-level languages, competing with the likes of Tabula and Achronix.

As a deep-learning processor, Wave’s chip lets elements of a graph flow through its circuits and execute. Instructions can set the optimal precision format for the task at hand, and circuits return to a sleep state when they finish executing, said Wave co-founder and CTO Chris Nichol in a talk here. A market watcher released a whitepaper on the system architecture timed to coincide with the talk.

Wave clusters its processors to flow graph data through them.

Graphcore piles it all on its Colossus

Graphcore provided a glimpse of its 23.6 billion-transistor Colossus, which aims to hold an entire neural-net model in its 300 Mbytes of on-chip memory. The startup claims that the chip can process 7,000 programs in parallel on its 1,216 cores, each capable of 100 GFlops.

Colossus sports an aggregate internal memory bandwidth of 30 TBytes/s. Externally, it supports 2.5 Tbits/s of chip-to-chip bandwidth split across 80 channels. Two of the chips are packed on a single PCIe Gen4 x16 card delivering 31.5 GBytes/s of I/O.
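
Those figures hang together arithmetically. A minimal Python sketch of the implied totals follows; the core count, per-core rate, and link numbers are taken from the article, while the PCIe Gen4 line rate and 128b/130b encoding come from the published PCIe spec.

# Rough arithmetic on the Graphcore figures quoted above.
cores, gflops_per_core = 1216, 100
print(f"Peak per chip: {cores * gflops_per_core / 1e3:.1f} TFlops")        # ~121.6

chip_to_chip_tbits, channels = 2.5, 80
print(f"Per channel: {chip_to_chip_tbits * 1e3 / channels:.2f} Gbits/s")   # ~31.25

lanes, gt_per_lane = 16, 16      # PCIe Gen4 x16: 16 GT/s per lane
encoding = 128 / 130             # 128b/130b line encoding
pcie_gbytes_s = lanes * gt_per_lane * encoding / 8
print(f"PCIe Gen4 x16: {pcie_gbytes_s:.1f} GBytes/s per direction")        # ~31.5

Dividing the numbers out, the 80 links aggregate to roughly 312 GBytes/s of chip-to-chip bandwidth, about ten times what the card’s PCIe Gen4 x16 slot can move.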

Cerebras chief executive Andrew Feldman declined to describe his startup’s architecture or timeframe, but he defended the need to deliver full systems. “If you arrive on a PCI card, you are constrained by power, cooling, and I/O,” he said in a panel here.

Delivering full systems does not create a scaling hurdle, he argued. Cerebras might use the same contract manufacturers that build server racks for Google or Microsoft, and “we can do a billion dollars in this zip code alone,” said Feldman of the need to build out a sales force.

New hardware will pave the way for new AI workloads, fueling demand. Deep-learning “researchers are afraid of being smoked,” he added. “They have a queue of questions and big ideas, and [today’s relatively slow] computers are in the way.”

As for his products, he said that they will deliver 1,000x performance gains in part by managing neural-net sparsity. They will use no exotic technology; however, they do require a novel core, memory architecture, compiler, fabric, and techniques for cooling and delivering power.

One of the newest startups to debut, SambaNova Systems, provided the fewest details. Like Cerebras, Graphcore, and Wave, it sports a team of veteran architects, in its case marrying a compiler based on Stanford’s Spatial language to a dataflow chip.

— Rick Merritt, Silicon Valley Bureau Chief, EE Times