A potential AMD-Xilinx deal could make the scrappy CPU/GPU company a real contender in the data center AI market...
AMD is rumored to be in talks to buy Xilinx for close to $30 billion, according to The Wall Street Journal. Does an AMD-Xilinx deal make sense, and why is AMD interested in Xilinx? The obvious answer is that AMD is looking to boost its data center offering with particular emphasis on AI acceleration, a rapidly growing and lucrative market.
Both companies have established AI accelerator product lines, but what could this mean for a potentially combined offering?
AMD, under the leadership of Lisa Su since 2014, has brought itself back from the brink with a stream of new CPU and GPU products to become a serious player once again. The company is gaining ground in high-performance computing (HPC) with high-profile exascale design wins for its Epyc CPUs and Radeon GPUs in the Frontier and El Capitan supercomputers, which will be the most powerful machines on the planet by far when they come online. Currently, AMD Epyc CPUs power ten supercomputers in the TOP500, including four in the top 50.
For AI acceleration in servers of all kinds, AMD offers Radeon Instinct accelerator cards. The most powerful in the family is the Radeon Instinct MI50, launched in November 2018, which offers 13.3 TFLOPS peak single precision performance (FP32) or 26.5 TFLOPS at half precision (FP16) for AI training. For AI inference, the card offers 53 TOPS INT8 performance. It is built on AMD’s Vega 7nm GPUs.
Xilinx’ offering for AI in the data center is based around its Versal ACAP (adaptive compute acceleration platform), a family of SoC-like chips that combine CPU cores, programmable logic and ASIC elements, including a dedicated hard-wired AI accelerator block in the Versal AI Core chip. This ASIC block can achieve up to 133 TOPS for bulk AI inference workloads (at INT8). For data center applications that don’t require this many TOPS, all the ACAP devices have a sizeable programmable logic block which can be programmed to accelerate AI.
On the software side, Xilinx has done a lot of work on Vitis, its tool platform that is designed to make its heterogeneous compute products accessible to hardware developers, software developers and data scientists alike. The company also previously acquired DeePhi, a specialist in pruning and quantization, techniques which reduce the computational requirements of neural networks.
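To make the two techniques concrete, here is a minimal sketch in plain Python of magnitude pruning followed by symmetric INT8 quantization. It is an illustration only and not DeePhi's or Vitis' actual method: the function names, the per-tensor scale, the 50% sparsity target and the toy weight values are all assumptions chosen for the example.

```python
# Illustrative sketch only: hypothetical helper names, toy weights.

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.003]   # assumed toy layer weights
pruned = prune_by_magnitude(weights, sparsity=0.5)  # half the weights become 0
q, scale = quantize_int8(pruned)                    # 8-bit integer codes
restored = dequantize(q, scale)                     # approximation of `pruned`
```

Pruned weights can be skipped entirely, and the surviving ones are stored and multiplied as 8-bit integers rather than 32-bit floats, which is why INT8 throughput figures are the ones quoted for inference hardware. The cost is a small rounding error per weight, bounded here by half the scale factor.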
One of FPGAs’ strengths is that they are both flexible and future-proof, a big plus in the rapidly evolving world of AI. Xilinx’ devices in particular are aimed at low-latency AI inference (they are not currently pitched at AI training applications).
Deployments in Public Cloud
Here’s an interesting snapshot of AMD’s and Xilinx’ market share in the public cloud.
Data from public cloud technology analysts Liftr Insights showed that as of March 2020, AI accelerator chips were attached almost exclusively to Intel Xeon CPUs, with AMD’s Epyc CPUs the exception at Microsoft Azure.
In the public cloud (Alibaba, AWS and Microsoft Azure), Nvidia held 86% market share for AI accelerator chips, while AMD legacy and Radeon GPUs held about 2% each. FPGAs fared slightly better, with Intel Arria 10 holding around 4% and Xilinx UltraScale+ at 5%. Note that as of March 2020, newer FPGA chips such as Versal ACAPs had yet to make their way into the public cloud; hyperscalers’ refresh cycles can be in the range of 3-5 years.
(Read Liftr Insights’ full article for EE Times here).
The incumbent in data center compute, Intel, holds an almost exclusive position with its Xeon CPUs in hyperscale and enterprise data centers. Xeon CPUs still carry out the vast majority of AI inference in these applications, simply because they are already there (though Intel says that some of its customers prefer to preserve flexibility in the types of workload they can process, and therefore stick with CPUs). Intel has added AI-specific features to Xeon CPUs in the last couple of years (DL Boost).
Intel purchased Xilinx’ main competitor, Altera, for $16.7 billion in 2015, and the business unit is successful in the data center. Could AMD be trying to beat Intel at its own game by copying its strategy? It’s possible, but is it too late? After all, 5 years is eons in semiconductor terms. Or has Xilinx, left to its own devices in the last 5 years, been able to develop a superior FPGA-based AI acceleration offering for servers? I believe this is the case: Intel released its first AI-optimized FPGA, the Stratix 10 NX, this summer, while Xilinx has been working on its ACAP silicon for several years, so its AI FPGA offering is much more mature.
Of course, Intel has other offerings for AI in the data center, including Xeon CPUs and dedicated AI accelerator ASICs for the data center courtesy of acquisition Habana Labs, which may have been the reason it was slower to produce optimized FPGAs for this field.
Where Intel is certainly behind is on the data center GPU front. Ponte Vecchio was expected this year but has been delayed until 2021 or 2022.
The main contender taking on Intel in the data center (and HPC) for AI acceleration is Nvidia with its GPU accelerators. Nvidia is going from strength to strength; the company recently made a deal to buy IP giant Arm for $40 billion. Nvidia’s data center GPU business has overtaken its graphics card business in recent months and the company is the “one to beat” when it comes to dedicated AI accelerators in the data center.
Aside from the leading AI training technology on the market, Nvidia has other strings in its bow. The company’s CUDA platform is popular with developers; combine it with Arm’s enormous developer community and that will be a force to be reckoned with.
Nvidia also recently acquired Mellanox, the data center networking IC/SmartNIC maker, in order to expand its data center offering. Nvidia has plans to combine Mellanox’ SmartNICs with Arm CPU accelerators and VLIW acceleration blocks to make what it calls a DPU (data processing unit); on the roadmap for 2023 is a DPU with an integrated GPU accelerator for AI network functions (such as anomaly detection). AMD doesn’t have anything in this area, but Xilinx does: a SmartNIC platform based on its FPGAs was launched this spring.
Is AMD trying to build a complete data center computing platform, similar to what Nvidia is trying to do with Arm? Nvidia GPU + Arm CPU + Nvidia DPU is a compelling combination for building a complete data center offering. AMD’s platform would be AMD CPU + AMD GPU + Xilinx FPGA + Xilinx SmartNIC. Will this recipe be tasty enough to beat Nvidia’s, given Nvidia’s head start?
Of course, the real recipe for success has a lot more elements than just heterogeneous compute architectures and their relative performance: distribution channels, customer relationships, software maturity, performance versus power consumption, and cost. With the possible exception of cost, Xilinx has a lot going for it here too. This includes well-established customer relationships in markets ranging from the cloud and enterprise data centers to industrial, fintech and network infrastructure.
There’s no doubt that we are beginning the age of the complete computing platform. Heterogeneous compute is gaining importance because of AI’s growing pervasiveness, and as the industry undergoes waves of consolidation, empires are being built. Ultimately, if AMD is serious about competing with Intel and Nvidia in the data center, it needs to expand its offering and broaden its platform. Acquiring Xilinx could be a great way to do that.