Linley Fall Processor Conference Dominated by AI

Article By : Kevin Krewell

Highlights include Intel's "Tremont" Atom processor; SiFive's U8 architecture; Marvell's push for Arm in data centers; Mellanox's BlueField-2 I/O processor; and Achronix moving FPGAs into data center accelerated computing.

SANTA CLARA, Calif. — Intel, SiFive, Marvell, Mellanox, and Achronix all announced important developments at the Linley Fall Processor Conference. Most of the innovations were either artificial intelligence processors themselves or products that support AI workloads in data centers.

Silicon Valley has more than its share of technical conferences, more than anyone could possibly attend in a year. Even the conferences you do attend have multiple tracks, making it difficult to catch all the presentations. That said, certain conferences stand out every year for new chip announcements, including Hot Chips, the newly formed AI Hardware Summit, and the Linley Processor Conferences. The latter two have sponsors present, so the content can vary, but The Linley Group is good at keeping the content technical and relevant.

Linley Gwennap moderating a panel discussion on AI at the Edge (Image: Kevin Krewell). Linley Gwennap, president and principal analyst at The Linley Group; Flex Logix co-founder Cheng C. Wang; Mythic CTO/founder Dave Fick; BrainChip chief data officer & co-founder Anil Mankar; GrAI Matter Labs chief strategy officer Jonathan Tapson; NovuMind CTO Chien-Ping Lu


Most of these conferences have been inundated by AI hardware startups that have flooded the market with new chip designs in all manner of forms. To capitalize on market changes, established players have also embraced AI in their product lines. While the Linley Fall Processor Conference was not exclusively about AI, most of the presentations revolved around various forms of AI from the cloud to the network edge, and even the extremely low-power IoT edge. As a result, Linley Gwennap, the president of The Linley Group and the conference host, started off the conference by giving a background on the various forms of machine learning, the various trade-offs in designing machine learning processors, and a survey of existing AI chip suppliers.


Despite the preponderance of presentations from AI and machine learning (ML) startups, some of the major news from the Linley Fall Processor Conference came from more traditional CPU vendors like Arm, Intel, Marvell, and SiFive. Without the Intel Developer Forum, Intel has taken to introducing new CPU architectures at various other conferences, including Hot Chips and the Linley conference. Intel chose the Linley conference to reveal the details of its new power-efficient Atom processor, called Tremont. The RISC-V startup SiFive unveiled its new U8 processor architecture, which is its first out-of-order core.

Don’t Belittle Intel’s Tremont
Intel’s new 10nm Tremont core is promoted as a power-efficient core that can be mated with the higher-performing Intel Core processor architectures like Sunny Cove. Tremont is also designed to be modular, with various cache sizes and different instruction decoder options. In addition, it plugs directly into Intel’s mesh network to attach to larger cores. In some ways this is like Arm’s big.LITTLE architecture, where the power-efficient core runs until performance requirements call for a higher-performing, but more power-hungry, processor core. Intel refuses to call Tremont a “little” core, and for good reason: Tremont is not so little in terms of performance and instructions per cycle (IPC). In fact, the Tremont architecture is a wide-issue, three-instruction-decode core with 208 reorder buffer entries and an IPC similar to that of a Skylake processor from just a few years ago.

(Figure: “Performance – Hybrid” slides. Source: Intel)

There were still several trade-offs made in the design of Tremont to lower its overall power consumption and to support lower clock speeds.

The trade-offs include a fixed decoder with no micro-op cache. Instead, Intel chose to include dual three-issue decoders that decode both branch streams, so when a loop is completed, the instruction can execute immediately.

The integer execution pipeline has three ALUs, two address generation units, a load unit, and a store unit. In the interest of lower power and smaller die area, Tremont does not include Intel’s simultaneous multithreading capability or the AVX SIMD instruction extensions. Tremont does include the latest SSE4.2 extensions and hardware crypto processors.

While Tremont may not be optimized for higher clock speeds, it will offer exceptional IPC resulting in good single-thread performance and power efficiency. Unfortunately, the Intel spokesperson would not talk about clock speeds, which may vary depending on application. There was also no indication that Tremont includes or supports burst clock modes because it is designed to run in a power-efficient state and then move execution threads into the higher performing core when burst performance is required.

The first announced application for Tremont is the Lakefield PC platform, which has one “big” CPU core and four Tremont cores. Lakefield uses Intel’s latest die-stacking technology, called Foveros, to create a PC processor plus memory in a very small footprint. Lakefield will power next year’s Microsoft Surface Neo folding tablet. Additional processors using Tremont should also ship in 2020.

SiFive Goes OoO (Out of Order)
The SiFive announcement of a 64-bit out-of-order microarchitecture called U8 was also a big step forward for the company. The initial disclosure centered on a specific instantiation of the U8 architecture, called the U84. The U84 is designed to go head-to-head with Arm’s popular Cortex-A72 CPU core IP.

(Figure: SiFive U8 slides. Source: SiFive)

The U84 represents a nominal high-performing version of the U8 architecture, with sustained three-issue out-of-order execution (it can burst up to six instructions) and a 10- to 12-stage execution pipeline. The reorder buffer has “dozens of entries,” but the presenter did not provide more detail.

The U84 is also completely customizable. The company’s core configurator tool allows customers to create their own versions by modifying the number of issue slots, functional units, cache sizes, floating-point unit, vector instructions, and more. Next year, SiFive expects to introduce the U87 instantiation, which will include SIMD vector extensions that are expected to be approved shortly by the RISC-V Foundation.

The U8 represents a family of 64-bit processors that can be configured for different requirements and has the flexibility to cover many specific Arm Cortex-A implementations. While the company believes it can deliver a U84 at 2.6GHz in 7nm, we’ll have to see some benchmark numbers to judge the delivered performance.

SiFive also revealed a comprehensive secure platform architecture called SiFive Shield, which includes hardware crypto support. The specification is open, and the platform is designed for scalability. It offers system level protection and SiFive will provide support services for root of trust and secure boot.

Processors Matter
To round out the mainstream processing applications, Marvell talked about why Arm processors belong in the data center. The company promises a two-year cadence of new processors, starting with the 7nm ThunderX3 in 2020 and a ThunderX4 in 2022. Each product will have IPC improvements resulting from more and better caches, more ALUs, larger out-of-order structures, better branch predictors, front-end enhancements, higher frequency, and power optimizations.

Mellanox talked about the BlueField-2 I/O processor, with eight Arm Cortex-A72 cores, designed to offload networking workloads such as TCP/IP, software-defined networking, and security. BlueField-2 supports Ethernet, InfiniBand, and RoCE interfaces. The Mellanox SmartNIC with the BlueField-2 can offload much of the network, storage, and security processing from the main CPU for better performance and a more secure data center.

The AI Takeover
Most of the other presentations focused on AI and ML. The second-day keynote from Facebook discussed the cloud scaling issues of machine learning accelerators. The deep learning computations for image recognition, speech recognition, and ranking and recommendation engines, at scale, stress the entire cloud computing system. Improving the throughput of cloud inference requires more than fast matrix math operations.

Following in the footsteps of Altera and Xilinx, Achronix is getting into FPGAs for data center accelerated computing. Its new 7nm Speedster7t product will ship in the VectorPath PCIe accelerator card, but is also available for IP licensing, setting the company apart from its rivals. The company has focused on fast I/O, with PCIe Gen 5 support and SerDes speeds up to 112 Gbps. The machine learning inference performance will exceed 80 TOPS using int8 data processing. The chip will also support int16, int4, fp24, fp16, and BFloat16. Achronix will demo the cards at the forthcoming Supercomputing conference (SC19), and the PCIe cards are available for pre-order, with general Speedster7t sampling early next year.

Mike Davies, director of Intel’s Neuromorphic Computing Lab, talked about the neuromorphic research chips the company is developing. Davies pointed out that the deep learning algorithms that have taken the industry by storm are not biologically inspired. Deep learning has been very effective due to back propagation, which doesn’t happen in the brain. Intel’s neuromorphic research tries to better simulate a biological brain, but with digital logic. The Loihi research chip, a spiking neural network design, has 128 neuromorphic cores and, while digital, is asynchronous. It can support up to 128K neurons and 128M synapses. The adaptive architecture is self-modifying and “plastic,” with no separate training and inference phases. The design is scalable from a block on an SoC to a rack-mounted data center card. For now, it’s just research, but it could have applications for inference in robotics, autonomous systems, security and industrial monitoring, and human-computer interfacing. The biggest challenge will be software.

Habana discussed its “Gaudi” machine learning training accelerator. The Habana Gaudi processor and HLS-1 system go head-to-head against Nvidia’s V100-based cards and Nvidia’s DGX rack systems. The HLS-1 uses a PCIe switch to connect multiple Gaudi processors in the rack, versus the proprietary NVLink bus used by Nvidia. The key to Habana’s performance is the RoCE v2 interface, using ten 100G Ethernet links through a non-blocking Ethernet switch. We’ll see more MLPerf benchmarks from Habana soon.

IP providers Cadence and Arm discussed their low-power accelerators for ML. Cadence has multiple IP offerings: the Tensilica HiFi 5 DSP for neural net-based speech and audio processing, the Tensilica Vision Q7 DSP for vision and AI, and the Tensilica DNA 100 standalone AI processor for AI inference at the edge (node). Arm talked about its Ethos family of ML accelerator IP. The top of the line is the Ethos-N77, with up to 4 TOP/s at 1 GHz and a configurable SRAM array of 1 to 4 MB. The Ethos-N57 supports up to 2 TOP/s at 1 GHz and has 512 KB of internal SRAM. The entry-level Ethos-N37 has up to 1 TOP/s at 1 GHz and 512 KB of SRAM. Arm also had a session on the M-Profile Vector Extension (MVE) for future Cortex-M processors, part of Arm’s Helium technology. The extensions in Armv8.1-M add 128-bit Q registers for SIMD operations.

While Arm and Cadence focused on edge devices, several IP vendors are targeting autonomous vehicles. CEVA is addressing the automotive market with its NeuPro-S AI & Vision Processor. The Synopsys ARC VPX5 DSP Processor is also addressing the automotive market, for sensor processing and sensor fusion with high-precision floating-point processing. A company named Cornami is building a large fabric of systolic array cores to address autonomous driving processing challenges. It wants to distribute its cells at the sensors and in the central compute element. ArterisIP presented its automotive-hardened network-on-chip (NoC) to tie all the CPUs and accelerators together.

All the Interesting Products Live on the Edge
One vendor is targeting “extreme edge” devices, where energy and cost are at a premium. Eta Compute has a microcontroller with a Cortex-M3, a custom DSP, and interfaces that consumes only 500nA in sleep mode with the real-time clock (RTC) running, or 750nA in sleep with RTC and 32KB active. The microcontroller is rated at 13uA/MHz running EEMBC CoreMark. The application example was a shipping pallet tracker. The pallet tracker connects to a Low-Power Wide Area Network (LPWAN) and must run on four AA batteries for five years. The key design factor is a semi-asynchronous design running at near-threshold voltages, where the clock speed of the CPU follows the voltage level. The processor can run a low-level convolutional neural network (CNN) to detect movement, vibrations, etc. with only 0.4 mJ per inference. The chip is sampling today, with production in 2020.
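
To get a feel for the pallet-tracker budget, the following back-of-the-envelope sketch estimates the average power that four AA cells could sustain over five years, and how many 0.4 mJ inferences a given share of that energy would buy. The battery capacity (2,500 mAh at a nominal 1.5 V per cell) and the 10% inference share are assumptions for illustration, not figures from Eta Compute; self-discharge, regulator losses, and radio energy are ignored.

```python
# Rough energy budget for a five-year deployment on four AA cells.
# ASSUMPTIONS (not from the presentation): each AA cell holds about
# 2,500 mAh at a nominal 1.5 V; self-discharge and losses are ignored.

SECONDS_PER_YEAR = 365 * 24 * 3600


def battery_energy_joules(cells=4, capacity_mah=2500.0, volts=1.5):
    """Total stored energy in joules for the assumed battery pack."""
    amp_seconds = capacity_mah / 1000.0 * 3600.0  # mAh -> coulombs
    return cells * amp_seconds * volts


def average_power_watts(energy_j, years=5.0):
    """Average power draw that drains the pack in exactly `years`."""
    return energy_j / (years * SECONDS_PER_YEAR)


def inferences_per_day(energy_j, mj_per_inference=0.4, years=5.0, share=1.0):
    """Daily inferences if `share` of the pack energy goes to inference."""
    total_inferences = energy_j * share / (mj_per_inference / 1000.0)
    return total_inferences / (years * 365)


energy = battery_energy_joules()
print(f"pack energy: {energy:.0f} J")
print(f"average power budget: {average_power_watts(energy) * 1e6:.0f} uW")
# The 10% share is a hypothetical split between inference and everything else.
print(f"inferences/day at 10% share: {inferences_per_day(energy, share=0.1):.0f}")
```

Under these assumptions the whole system has to average a few hundred microwatts, which is why sleep currents in the hundreds of nanoamps matter so much for this class of device.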

Lattice is offering an inference FPGA in a tiny package. The iCE40 UltraPlus comes in a 5.4 mm² package with less than 10mW average power consumption. The applications include image classification, object detection, one-shot learning, multi-sensor fusion, and other vision and audio processing at the sensor edge.

Flex Logix is offering the InferX X1 processor. While the company was founded as an eFPGA company, it has decided to build a dedicated ML chip to prove its capability. Like other FPGA vendors, the company is focused on low-latency inference with very low batching. The InferX X1 processor is expected to tape out in December 2019, with a first public demonstration in April 2020 (if all goes well).

Mythic is developing a unique graph data flow architecture. It uses flash memory for an analog compute-in-memory architecture. In addition to the analog flash cells, the token-based mesh architecture allows efficient synchronization.


BrainChip is developing a fully digital spiking neural net (SNN) chip called Akida. The advantage of the SNN is that it only processes events, which are inherently sparse. This reduces the total number of operations and saves power. The use of quantized weights and low-bit activations (1, 2, or 4 bits) reduces memory requirements. Each NN layer’s computations are performed on allocated neural processor units (NPUs), and all NPUs run in parallel. The intermediate results are stored in on-chip memory to reduce off-chip memory accesses. The Akida chip can run the entire NN without a CPU. Akida also allows inference and incremental learning on edge devices, all within a low power, size, and computation budget.
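
The two savings described above, low-bit weights and event sparsity, can be put into rough numbers. This is a minimal sketch, not BrainChip’s method: the one-million-weight layer, the dense operation count, and the 25% event activity are hypothetical, weights are assumed to pack densely with no metadata, and operation count is assumed to scale linearly with event activity.

```python
# First-order estimates of memory and operation savings.
# ASSUMPTIONS: dense bit-packing of weights, and operations that scale
# linearly with the fraction of inputs that actually produce events.

def weight_bytes(num_weights, bits):
    """Bytes needed to store `num_weights` weights at `bits` bits each."""
    return (num_weights * bits + 7) // 8  # round up to whole bytes


def event_driven_ops(dense_ops, activity):
    """Operations performed when only `activity` (0..1) of inputs fire."""
    return round(dense_ops * activity)


NUM_WEIGHTS = 1_000_000  # hypothetical layer size
for bits in (8, 4, 2, 1):
    print(f"{bits}-bit weights: {weight_bytes(NUM_WEIGHTS, bits):,} bytes")

# Hypothetical: 2 million dense operations, 25% of inputs produce events.
print(f"event-driven ops: {event_driven_ops(2_000_000, 0.25):,}")
```

Halving the bit width halves the weight storage, which makes it easier to keep a whole network in on-chip memory.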

GrAI (pronounced “grey”) Matter Labs announced its first chip, called GrAI One. This chip is designed for high data flow applications, such as live streams, that call for rapid and autonomous reactivity. The GrAI One uses a neuromorphic computing model (like Intel’s Loihi) combined with a dataflow architecture. While this first chip is smaller, the company believes it can scale to much larger chips. The chip is designed to extract information from real-time sensory inputs. The GrAI One processor will have 196 cores and can model 200k neurons. The chip is only 20 mm² and will be available in the first half of 2020.

Finally, another startup, NovuMind, is also targeting video processing applications, a very popular target market for AI startups. The company has a working test chip and is developing a 12nm production chip. At the conference, the company was showing its test chip, and it expects a nearly 30x improvement in performance per watt over the test chip. While most of the new chip designs shun external memory connections, the NovuMind chip uses DRAM for model size flexibility. The first production chip will have eight cores with 2,304 MACs per core. The 12nm chip will consume 5 Watts at 1GHz. The chip is capable of processing 8K Super Resolution at 60 FPS. The target markets include in-camera AI for smart cities.
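
The core count, MAC count, clock, and power figures quoted above imply a theoretical peak throughput, which the short sketch below works out. It assumes every MAC unit completes one multiply-accumulate per cycle and that each MAC counts as two operations; both are common conventions rather than numbers NovuMind stated, and real utilization will be lower.

```python
# Peak-throughput arithmetic from the figures quoted for the 12nm chip.
# ASSUMPTIONS: one multiply-accumulate per MAC unit per cycle, and each
# MAC counted as two operations (one multiply plus one add).

def peak_tops(cores, macs_per_core, clock_hz, ops_per_mac=2):
    """Theoretical peak in tera-operations per second."""
    return cores * macs_per_core * ops_per_mac * clock_hz / 1e12


def tops_per_watt(tops, watts):
    """Peak efficiency for a given power draw."""
    return tops / watts


tops = peak_tops(cores=8, macs_per_core=2304, clock_hz=1e9)
print(f"peak: {tops:.2f} TOPS, {tops_per_watt(tops, 5.0):.2f} TOPS/W")
```

The same formula can be run in reverse on any vendor’s TOPS claim to estimate how many MAC units a design needs at a given clock.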

Many of these unique architectures targeting the extreme edge are designed around power limitations and data sparsity.

The Linley Processor Conference was jam-packed with new announcements and AI startups. The big announcements came from Intel and SiFive, but there are plenty of smaller startups hoping to find a market niche. However, Gwennap, the conference host, believes that as ML functionality eventually becomes pervasive, it will be integrated into chips as a feature rather than remaining a dedicated market. Which companies survive may depend on how approachable the software tools are, not just on the hardware features.

All the slide materials will be available at the Linley Group website later this week.
