Flex Logix Takes on Nvidia with Edge AI Accelerator

Article By : Sally Ward-Foxton

Flex Logix’ InferX X1 AI accelerator packs a punch on Yolo v3 and is priced for high volume...

Flex Logix has launched its InferX X1 AI accelerator chip for edge systems, claiming it outperforms the Nvidia Jetson Xavier on popular object detection model Yolo v3 by 30%. The chip, which will be sampling next quarter, has also been priced deliberately to encourage the widespread adoption of AI inference techniques in high-volume applications.

The InferX X1 is pitched at applications including robotics, industrial automation, medical imaging, gene sequencing, bank security, retail analytics, autonomous vehicles and aerospace.

“All of these people today have inference shipping in their products,” said Geoff Tate, CEO of Flex Logix, in an interview with EE Times. “It adds value to their products, but what they tell us is that while what they’re shipping is good, they want more performance, and they want lower price so they can put it into higher volume versions of their systems.”

Flex Logix InferX X1 board
The InferX X1 comes either as a chip-only or on a half-height half-length PCIe board (Image: Flex Logix)

Typical approaches to the edge AI chip market (outside of Nvidia) have focused on meeting the needs of a particular niche, be it high reliability, low power, low latency, small size or some specific combination of factors. Can one architecture really meet the needs of both autonomous driving and medical equipment? Both bank security and robotics? Tate is adamant that it can.

“We have models and customers representing every one of these categories and more,” he said. “Where you see a sensor [today], you’ll see an inference chip in the future. There are a lot of sensors and products out there. So this market is going to be very broad and there’ll be many, many different kinds of applications that will use [the InferX X1].”

According to Tate, the space between the data center and the world of ultra-low power devices is a sweet spot for Flex Logix’ AI accelerator chip.

“The segment that we’re going after right now is where there’s the biggest need, where they’re the most starved for performance and where there’s the least competition,” he said. “Below us and above us, the markets are much more crowded… There’s a lot more chips in the 1-W range, and there’s a huge number of people in the data center.”

Performance
Flex Logix’ tests have the InferX X1 running Yolo v3 30% faster than the current market leader in the edge space, the Nvidia Jetson Xavier (Flex Logix is pitching the InferX X1 against the Nvidia Jetson Xavier for embedded edge applications and the Nvidia Tesla T4 for bigger edge equipment applications such as medical imaging).

Yolo v3 is currently considered the state-of-the-art in object detection and recognition, and Tate said this is a model customers are very interested in across the board (the algorithm can be retrained to look for the kinds of objects pertinent to the application). It has 62 million weights.

Flex Logix also showed performance figures versus Xavier for two customer models, model X and model Z, which Tate says are representative of typical customer requirements. The InferX X1 ran model Z 50% faster than the Xavier. For model X, the difference was much more pronounced: the InferX X1 beat the Xavier by a factor of 11. This is down to the nature of the model, which uses a newer mathematical operator, 3D convolution. Flex Logix’ architecture, which is inherently adaptable, handles new operators better than fixed architectures can.

The InferX X1 is compact at 54 mm² in TSMC 16nm (compared to the Nvidia Jetson Xavier at 350 mm², though Xavier is an SoC with a host, while InferX X1 is solely an accelerator). Xavier also uses four times as much DRAM.

Flex Logix’ figures even have the InferX X1 beating the Tesla T4 on model X by a factor of two. For Yolo v3 it performed a third as fast, and for model Z, almost half as fast. Not bad, says Tate, considering the T4 is 10x the size (545 mm²) and uses 8x the DRAM.

The chip has a thermal design power (TDP) of between 7 and 13 W, though this is a worst-case figure, with realistic consumption being “about half” that, Tate said. The exact power figure will vary depending on which part number is chosen – there are four different versions running at between 533 and 933 MHz.

Architecture
Flex Logix’ architecture is based around a one-dimensional tensor processing unit (TPU) with four tiles of 16 TPUs on the InferX X1 chip. These 64 TPUs can connect to each other, the inputs, outputs or memory in a highly configurable way (see “Configurable interconnect” below). As each matrix multiply operation is completed, the next input tensor is shifted in while the result is shifted out, for efficiency.
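The benefit of shifting the next input in while the previous result shifts out can be sketched with a simple cycle-count model (the stage timings below are illustrative assumptions, not Flex Logix specifications):

```python
# Hypothetical model of the shift-in/compute/shift-out overlap described
# above. All cycle counts are made-up illustrative numbers.

def serial_cycles(n_ops, t_load, t_compute, t_store):
    # Without overlap: load, compute, and store each tensor in turn.
    return n_ops * (t_load + t_compute + t_store)

def overlapped_cycles(n_ops, t_load, t_compute, t_store):
    # With overlap: while one matrix multiply runs, the next input
    # shifts in and the previous result shifts out, so in steady state
    # only the longest stage limits throughput (plus pipeline fill/drain).
    steady = max(t_load, t_compute, t_store)
    fill_drain = (t_load + t_compute + t_store) - steady
    return fill_drain + n_ops * steady

print(serial_cycles(64, 4, 10, 4))      # 1152 cycles
print(overlapped_cycles(64, 4, 10, 4))  # 648 cycles
```

With these assumed timings, overlapping nearly halves the cycle count, which is the efficiency the architecture is after.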

The memory system has been carefully designed with “layers” of SRAM (not caches). Weights are stored in L0 (closest to the compute), while L1 holds weights for the next layer close by so they can be loaded quickly (once the previous layer finishes executing). L2 holds activations, the intermediate results passed between layers. L3 SRAM acts as a scratchpad, which can be used to effectively cache configurations that may be reused in future.

Configurable interconnect
One of the InferX X1’s key features is the programmable interconnect fabric between the TPUs, the same fabric used in the company’s established eFPGA product. It allows any TPU to connect to any other TPU for data transfer without contention, and can be reconfigured in as little as 4 µs.

Flex Logix configurable interconnect
One of the ingredients in Flex Logix’ secret sauce is its configurable interconnect technology (Image: Flex Logix)

“We’re getting ASIC-like performance because data can flow at full speed from memory through the TPUs, back to memory for all of the paths on a chip with no contention,” Tate said. “This building block approach also gives us the ability to handle operators which come along, like [3D convolution in] model X, which we would not have anticipated when we designed the original architecture… as the market keeps innovating, our advantage for operators that weren’t even contemplated when people designed their chips should give us an increasing advantage over time.”

Tricks
Flex Logix has another couple of tricks up its sleeve. The first is the “fusing” of layers in a neural network so they are processed at the same time. Depending on the exact data rate and other factors, compute efficiency can sometimes be increased by having the TPUs calculate the first and second layers simultaneously.

“Not all layers can be fused, but when they can, the activations are done in soft logic or embedded FPGA LUTs rather than being stored in memory, and then fed directly into the next TPU, so we don’t have to store that intermediate activation,” Tate explained.

Flex Logix fused layer concept
“Fusing” layers together so they can be processed at the same time – without reconfiguring the interconnect – means compute efficiency can be increased (Image: Flex Logix)

The InferX X1 has around 14 MB of SRAM in total, but Yolo v3’s first layer has 64 MB of activation data. If all those activations had to go to (relatively slow off-chip) DRAM, the processor would have to wait for the data to arrive. Fusing the layers and processing them at the same time effectively eliminates that delay.
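The DRAM traffic that fusion avoids can be estimated from the figures above (the 14 MB SRAM and 64 MB activation sizes are from the article; the simple spill model is an illustrative assumption):

```python
# Illustrative estimate of DRAM traffic avoided by layer fusion.
# 14 MB of on-chip SRAM and a 64 MB Yolo v3 first-layer activation
# are from the article; the spill model itself is a simplification.

SRAM_BYTES = 14 * 1024 * 1024

def unfused_dram_traffic(activation_bytes):
    # Intermediate activations that don't fit in SRAM spill to DRAM:
    # one write after the first layer, one read back for the second.
    spill = max(0, activation_bytes - SRAM_BYTES)
    return 2 * spill

def fused_dram_traffic(activation_bytes):
    # Fused layers stream activations straight into the next layer's
    # TPUs, so the intermediate tensor never touches DRAM.
    return 0

act = 64 * 1024 * 1024  # Yolo v3 first-layer activations
print(unfused_dram_traffic(act) // (1024 * 1024))  # 100 MB of traffic avoided
print(fused_dram_traffic(act))                     # 0
```

Under this simplified model, fusing the first two layers saves on the order of 100 MB of slow off-chip traffic per frame.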

The company also uses another memory trick. After each layer has been calculated, the code and weights for the next layer have to be brought to SRAM from DRAM. Flex Logix does this in the background while the previous layer is still being calculated to ensure there is no delay.
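This background loading is classic double buffering, which can be sketched as follows (the thread-based structure, layer names and stand-in functions are illustrative, not how the silicon works):

```python
# Minimal double-buffering sketch of the background weight load described
# above. The DRAM copy and the compute are stand-in functions; on the
# real chip both happen in hardware.
import threading
import time

def load_weights(layer):        # stands in for a DRAM -> SRAM copy
    time.sleep(0.01)
    return f"weights[{layer}]"

def compute(layer, weights):    # stands in for running one layer
    time.sleep(0.01)
    return f"out[{layer}] via {weights}"

def run(layers):
    results = []
    weights = load_weights(layers[0])   # only the first load is exposed
    for i, layer in enumerate(layers):
        nxt = {}
        if i + 1 < len(layers):
            # Prefetch the next layer's weights while this layer computes.
            t = threading.Thread(
                target=lambda: nxt.update(w=load_weights(layers[i + 1])))
            t.start()
        results.append(compute(layer, weights))
        if i + 1 < len(layers):
            t.join()                    # load finished during the compute
            weights = nxt["w"]
    return results

print(run(["conv1", "conv2", "conv3"]))
```

Only the very first weight load is on the critical path; every subsequent load is hidden behind the previous layer’s compute, which is the effect Tate describes.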

“The result is that most of the DRAM traffic that is normally required for other chips is either eliminated by fusion [of layers] or is hidden in the background. So it does not affect computation,” Tate said.

Availability
There will be four part numbers of InferX X1 available which operate between 533 and 933 MHz. The fastest will range in price from $199 in quantities of one thousand to $69 in quantities of one million. The slowest will be $99 in quantities of one thousand or $34 in quantities of one million.

“What this allows us to do is rapidly penetrate the market. People will switch over, but what we really want to do with our million-piece pricing is dramatically expand the market,” said Tate. “Everybody’s talking about how the edge market is going to grow to $10 billion in five years. Well, it won’t get there with modules priced at $300,” he added.

There will be wider temperature range versions available for demanding environments (Tj of -40 to +100 °C for industrial and -40 to +125 °C for aerospace — the regular chip can cope with Tj of 0 to 85 °C).

There will also be a half-height, half-length PCIe board available with one InferX X1 chip on it. Flex Logix plans to launch an M.2 format board in the near future and, next year, the same PCIe board with four InferX X1s on it. This four-chip version is designed to compete with the Nvidia Tesla T4’s performance at a much lower price.

Which market will be first to reach high volume production with millions of units shipping with AI inference capability? Tate suspects it will be cameras, because of their ubiquity, but he also said that the real killer app may not have been developed yet.

“When we started [high-speed memory company Rambus, of which Tate is co-founder and former CEO], we figured our customers would be high-speed graphics and high-performance workstation computing,” he said. “But our first volume application was the Nintendo 64. A toy! But it sold 10 million units in the first year. So it wouldn’t surprise me if the first million-unit application [for the InferX X1] was one that I’m not even thinking about today, just like at Rambus.”

Samples will be available to early engagement customers soon with broader sampling in Q1 2021. Production parts are scheduled for Q2 2021.
