CUPERTINO, Calif. — This year’s Hot Chips hosted 25 talks, 16 of them focused at least in part on chips handling artificial intelligence jobs. They spanned a broad range from ultra-low-power devices for the Internet of Things and smartphones to power-hungry slabs of silicon for the data center.

Industry consolidation around the x86 made this microprocessor event less interesting for a few years. But with the rise of machine learning, it’s become a hot spot again for engineers who specialize in chip architectures.

Believe it or not, there’s more to the chip world these days than deep learning. One speaker described a contender to replace DRAM and called for more talks on memories at the event given the work in alternative RAMs bubbling under the surface.

For its part, Xilinx showed a major new variant of the FPGA, geared for AI and more. And attendees heard a call to action to design a whole new computing architecture grounded in security.

Keynoter John Hennessy, chairman of Alphabet, noted that the widely used technique of speculative execution had been vulnerable to side-channel attacks for 20 years before computer architects at Google saw the open door.

“It makes you wonder what else we haven’t noticed … it’s amazing given the complexity of these products that they work so well or work at all,” said Nathan Brookwood, analyst at Insight64 and a veteran Hot Chips attendee.

In the following pages, we highlight an array of interesting talks that we did not write about in the immediate aftermath of the event. We start with a handful of them that impressed us as most bold in their ambitions and/or creative in their thinking.

Startup Tachyum was, no doubt, the boldest of all, but informal conversations named it the least likely to succeed. It aims to win sockets as a mainstream server processor and an AI accelerator with its Prodigy chip whose cores it claims are “faster than a Xeon and smaller than an Arm.”

The 7-nm 290-mm2 chip with up to 64 cores will tape out next year, delivering up to 2 TFlops at 4 GHz, claims the company. Initially, it will depend on a combination of server software that the company ported and an emulator to run other code.

Data center operators are not likely to add a startup’s chip and software to their x86 racks without major performance boosts and lots of testing. Analyst Brookwood expressed skepticism about the startup’s use of VLIW, a technique that Intel failed to master with Itanium. If the chip gains any traction, Tachyum is likely to face patent suits from giants such as Intel, he added.

Tachyum’s Prodigy sports nine-stage integer and 14-stage floating-point pipelines. (All images: Hot Chips)

Could Optane draw a legal challenge?

Intel described Cascade Lake, its latest 14-nm Xeon server processor. The company tipped the high-level news on the chip at an event a few weeks ago, but Hot Chips provided more details — and a dash of controversy.

Cascade Lake uses the same mechanical, thermal, and socket interface as Intel’s existing 14-nm Xeon and sports the same core count, cache structure, and I/O speeds. The new bits include 14-nm process tweaks to eke out a bit more performance and less power consumption. In addition, the chip supports a new AI instruction and hardware mitigations for the side-channel attacks exposed by Meltdown/Spectre.

But the big news is that Cascade Lake is the first Xeon with a memory controller that supports Intel’s Optane, aka 3D XPoint memories, opening a door to up to 3 TBytes of main memory per socket — a big capacity gain over DRAM, albeit at some cost in read/write speed.

An Intel engineer giving the talk would not comment on the endurance of the Optane media. However, he did say that the boards use a JEDEC DDR4 electrical bus with a proprietary Intel protocol that won’t be available to rivals for the foreseeable future.

“I don’t think that will stand a legal challenge,” said Brookwood.

“If I was IBM or AMD and Optane DIMMs became popular in data centers and I couldn’t get them, I’d be a little ticked. Intel commands something like 98% of the server market, and that, in my mind, is a monopoly.”

Intel Optane DC

Intel is leading work at the Storage Networking Industry Association to create a software platform for alternative main memories such as Optane.

NEC accelerator lowballs Nvidia V100

NEC described a new vector engine (below) that rides on a PCIe Gen 3 card drawing less than 200 W. The chip is designed for use with both the SX-Aurora supercomputer and an x86 host in a Linux server at prices that it says will be “much cheaper than [an Nvidia] V100.”

NEC claims that its vector chip delivers up to 307 GFlops of double-precision performance per core, or 2.45 TFlops across its eight cores. That’s somewhere in between the performance of a Xeon and a V100 on most benchmarks. However, the NEC chip has slightly more memory bandwidth and nearly as much performance/watt on some workloads as the Nvidia GPU, claims the company.

The 1.6-GHz, 16-nm vector chip has a relatively small 480-mm2 die compared to Nvidia’s V100, nearly a full reticle at 840 mm2. The NEC chip supports a whopping six Hi8 or Hi4 HBM2 memory stacks delivering up to 48 GBytes of total memory.

NEC Vector Engine

Harvard takes AI to new low for IoT

Researchers from Harvard and Arm described an ultra-low-power accelerator for running deep-learning jobs in the Internet of Things. The so-called SMIV chip (below) measures 25 mm2 in a TSMC 16-nm FFC process.

SMIV claims to be the first academic chip to use an Arm Cortex-A core. It employs near-threshold operation in an always-on accelerator cluster and an embedded FPGA block delivering about 80 hardware MACs and 44 Kbits of RAM.

As a result, the chip set a new high-water mark for accuracy at low power among published work to date, showing nearly 10x gains in power and area efficiency over rival approaches.

Harvard SMIV

MIT beats Arm with navigation chip

Researchers from MIT claim that their custom-designed navigation chip for robots and drones uses significantly less energy than an Arm CPU core. Navion (below) carves a visual-inertial odometry engine into a 20-mm2 die in 65-nm CMOS.

The chip delivers 2x to 3x the performance of a standard CPU and can reduce memory footprint up to 5.4x, said researchers. It draws 24 mW in a maximum configuration and as little as 2 mW in an optimized configuration still capable of real-time navigation.

Many Hot Chips talks simply provided more details on devices already announced and sometimes even shipping. In the following pages, we’ll first look at a few AI accelerators and CPUs for client systems, then turn our attention to server processors and accelerators.

Navion MIT

Arm flexes muscle of its new ML core

Arm provided a deep dive on its ML core expected to appear in chips at the end of the year. It will deliver about 4 Tera-operations/second at 1 GHz and more than 3 TOPS/W in a 2.5-mm2 core made in a 7-nm process. Its multiply-accumulate unit supports eight 16-bit wide dot products.
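Arm’s headline figures can be sanity-checked with a little back-of-envelope arithmetic. The sketch below is our own, not Arm’s, and assumes the usual convention of counting a multiply-accumulate (MAC) as two operations:

```python
# Back-of-envelope arithmetic on Arm's stated figures for its ML core.
# Assumption (ours, not Arm's): a multiply-accumulate counts as two
# operations, the usual convention behind TOPS marketing numbers.
clock_hz = 1.0e9                 # 1 GHz
tops = 4.0                       # ~4 Tera-operations/second claimed
tops_per_watt = 3.0              # "more than 3 TOPS/W"

ops_per_cycle = tops * 1e12 / clock_hz   # 4,000 operations per cycle
macs_per_cycle = ops_per_cycle / 2       # ~2,000 parallel MACs implied
implied_power_w = tops / tops_per_watt   # ~1.3 W at full throughput

print(int(macs_per_cycle), round(implied_power_w, 2))  # 2000 1.33
```

In other words, the claims imply roughly 2,000 MACs firing every cycle in a 2.5-mm2 block drawing on the order of a watt.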

ARM machine learning performance

Arm details results of feature map compression per 8 x 8 block on its ML core.

Samsung pushes up smartphone performance

Samsung gave an example of how clever engineers can deliver significant performance increases with hard work at a time of mediocre improvements in process technology. The 2.7-GHz M3 applications processor now shipping in its smartphones handily beats its predecessor, the M2, typically by 50% or more on a range of benchmarks (below).

The effort included a performance team with a model based on 4,800 traces correlated by a separate team to an RTL model developed by a third team. It uses neural networks in its branch predictor, leveraging academic work from Daniel A. Jiménez, a Texas A&M professor.
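The academic work Samsung cites traces back to Jiménez’s perceptron branch predictor. Below is a minimal, textbook-style sketch of that idea — not Samsung’s actual design: each branch indexes a vector of small integer weights, the dot product with recent branch history yields the prediction, and weights are updated only on a mispredict or a weakly confident hit.

```python
# A minimal perceptron branch predictor in the style of Jimenez's
# academic work. Illustrative only -- not Samsung's M3 implementation.
class PerceptronPredictor:
    def __init__(self, num_perceptrons=64, history_len=16):
        self.n = num_perceptrons
        self.history_len = history_len
        # weights[i][0] is a bias; the rest weight each global-history bit
        self.weights = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
        self.history = [1] * history_len      # +1 = taken, -1 = not taken
        # training threshold heuristic from the perceptron-predictor papers
        self.theta = int(1.93 * history_len + 14)

    def predict(self, pc):
        """Return (taken?, raw dot-product output) for branch at pc."""
        w = self.weights[pc % self.n]
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))
        return y >= 0, y

    def update(self, pc, taken):
        """Train on the actual outcome, then shift it into the history."""
        pred, y = self.predict(pc)
        t = 1 if taken else -1
        if pred != taken or abs(y) <= self.theta:
            w = self.weights[pc % self.n]
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]

predictor = PerceptronPredictor()
for _ in range(60):
    predictor.update(0x4000, True)    # a branch that is always taken
print(predictor.predict(0x4000)[0])   # True
```

The appeal of the scheme is that it scales to much longer branch histories than classic two-bit counter tables while keeping per-branch storage linear in history length.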

The team was allowed to more than double the M3’s die area over the M2’s. However, it used a 10LPP process, a relatively small upgrade from Samsung’s 10LPE.

Samsung M3 performance

Mythic shows processor-in-memory in progress

Mythic described details of its implementation of a processor-in-memory (PIM) for processing images with deep learning that executes 0.5 picojoules/MAC. The total chip, aimed at surveillance and factory cameras, consumes 5 W including all digital control logic.

The PIM concept has been around for years but only recently is being applied to AI. Mythic creates a variable-resistor array based on NOR flash cells, but rather than digitally writing and reading deep-learning weights during inference, it applies voltages to the array lines and sums the resulting currents to read out results, saving energy.
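Functionally, the analog array computes an ordinary matrix-vector product: weights become conductances, inputs become row voltages, and each column’s summed current is one dot product, delivered in a single analog step. A numeric sketch of the equivalent math (sizes and values are illustrative, not Mythic’s actual parameters):

```python
import numpy as np

# Conceptual model of an analog compute-in-memory tile: by Ohm's and
# Kirchhoff's laws, the current on each column is the dot product of the
# row voltages with that column's conductances. Values are illustrative.
rng = np.random.default_rng(0)
weights = rng.uniform(0.0, 1.0, size=(256, 64))  # conductances; one column per output
inputs = rng.uniform(0.0, 0.5, size=256)         # voltages applied to the rows

# 64 dot products of length 256 -- 16,384 MACs -- in "one step"
column_currents = inputs @ weights

# The MAC-by-MAC digital equivalent matches the analog column sums
digital = np.array([sum(v * g for v, g in zip(inputs, weights[:, j]))
                    for j in range(64)])
assert np.allclose(column_currents, digital)
```

The energy win comes from never moving the weights: the multiply and the sum both happen in place in the memory array, with only the column currents digitized afterward.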

The initial chip can handle a limited number of weights, but the tile-based design could scale to five times as many weights on a full-reticle chip. An Arm core could be added to create a programmable device, and multiple chips can work together to run larger apps or run them faster. One downside: The design cannot take advantage of neural-net sparsity.

It claims GPU performance with a 40-nm chip using a fraction of a GPU’s power. The proof of the pudding will come with samples in the middle of next year, with volume production expected at the end of 2019.

Mythic’s PIM aims to deliver GPU performance at MCU power but does not prune sparse neural nets.

Google gives snapshot of Pixel Visual Core

Google gave a tour of the Pixel Visual Core in its latest smartphones. The A53-based device is a programmable engine for running the latest versions of the still-evolving HDR+ algorithms for handset cameras. “It makes your social media pictures not suck,” quipped a Google engineer who was part of its 30-person-plus design team.

Interestingly, one engineer from Samsung’s memory group asked if future generations will abandon a classical image processing pipeline in favor of emerging deep-learning techniques. “We haven’t announced much on AI algorithms in this area yet,” replied the Google engineer.

Google claims that its 28-nm Pixel core runs HDR+ jobs at least 2.8x faster than the CPU in a 10-nm mobile applications processor.

IBM takes a breather at the 14-nm node

In servers, IBM is like Intel, parking for a while at the 14-nm node. It described its latest efforts to bolster I/O and memory bandwidth on systems based on its Power 9 processor, but it will be 2020 or beyond before it delivers a new design in a new process.

IBM Power roadmap
IBM aims to eke more memory bandwidth out of its Power 9 servers while it gears up for designs based on a 7-nm process.

Fujitsu brings Arm core to supercomputers

Fujitsu described the 7-nm A64FX, a design aimed at being one of the first Arm processors in a supercomputer. The 512-bit SIMD chip brings the Scalable Vector Extension to the Arm architecture to run both traditional supercomputing and new AI tasks. The 52-core chip uses 32 GBytes of HBM2 memory to deliver 2.7 TFlops and a whopping 1,024 GBytes/second of memory bandwidth.

Fujitsu A64FX ARM core

The A64FX, Fujitsu’s first post-Sparc design, aims to appear in Japan’s follow-on to the K computer in 2021.

Nvidia shows the prowess of its GPU server

Nvidia made its bid to expand from a chip to a system provider with a tour of its DGX-2 and the NVLink interconnect inside it. It showed several benchmarks, including those below, in which the DGX-2 outperformed a standard dual-GPU system.

NVIDIA DGX2 benchmark figures


Intel, AMD, and Middle East peace

Intel described how it used its Embedded Multi-Die Interconnect Bridge (EMIB) technique to connect its Kaby Lake desktop x86 CPU to an AMD Radeon RX Vega M GPU in a single module (below) for thin and light notebooks.

In one of the lighter moments of the event, analyst Brookwood joked with the Intel presenter, “Whoever negotiated that deal should be assigned next to work on Middle East peace.”

Intel Kaby Lake G

— Rick Merritt, Silicon Valley Bureau Chief, EE Times