AMD-Xilinx Debuts First Versal PCIe Accelerator Card

Article by Steve Leibson

The card is part of the Versal ACAP AI Core Series, which is designed to boost the performance of key applications run in servers and data centers.

AMD had just barely announced the completion of its acquisition of FPGA maker Xilinx when the entrance sign to Xilinx's south San Jose campus on Union Street (a site that was once a popular 9-hole golf course) flipped over to display the new owner's corporate name and logo. Now, a week later, AMD-Xilinx has announced its first Data Center Accelerator Card based on a member of the Versal ACAP (adaptive compute acceleration platform) AI Core Series. (ACAP is the name AMD-Xilinx uses for its newest line of SoCs based on FPGA technology.)

The new card, dubbed the Xilinx VCK5000, looks like a typical FPGA-based PCIe accelerator card designed to boost the performance of key applications running in servers and data centers. These include AI and machine learning (ML) workloads as well as varied tasks such as genomics, drug discovery, data analytics, and video transcoding. Of course, Nvidia is the 800-pound gorilla in this space, and the performance benchmarks AMD-Xilinx is using are aimed straight at that competitor.

Over the past several months, using the same Xilinx VCK5000 hardware, AMD-Xilinx has boosted the card's performance by a factor of 2.5x to 3x on one specific ML workload, ResNet-50 v1.5, as measured by two figures of merit: performance/watt and performance/dollar (a proxy for total cost of ownership, or TCO). AMD-Xilinx focuses on these figures of merit because the VCK5000's absolute performance is not at the top of the pack, but when its low power consumption and lower acquisition cost are also considered, the card looks very good.
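To make those figures of merit concrete, here is a minimal Python sketch that ranks accelerators by raw throughput, performance/watt, and performance/dollar. All of the card names and numbers are invented for illustration; they are not measured VCK5000 or GPU figures.

    # Hypothetical accelerators: throughput (images/sec), board power (W),
    # and street price ($). All numbers are invented for illustration.
    cards = {
        "card_a": {"throughput": 6000, "power": 300, "price": 5000},
        "card_b": {"throughput": 4500, "power": 150, "price": 2500},
    }

    for name, c in cards.items():
        perf_per_watt = c["throughput"] / c["power"]     # images/sec per watt
        perf_per_dollar = c["throughput"] / c["price"]   # images/sec per dollar
        print(f"{name}: {c['throughput']} img/s, "
              f"{perf_per_watt:.1f} img/s/W, {perf_per_dollar:.2f} img/s/$")

Note that the slower card_b wins on both efficiency metrics (30.0 vs. 20.0 img/s/W and 1.80 vs. 1.20 img/s/$), which is exactly the kind of result AMD-Xilinx is emphasizing.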

More specifically, AMD-Xilinx claims that the Xilinx VCK5000 can outperform GPUs on these two figures of merit because of superior throughput efficiency: an FPGA-based implementation has an inherent architectural advantage in that it can squeeze out the “data bubbles” common in ML applications.

GPU architectures are fixed designs built to handle data in chunks of fixed sizes, so when chunk sizes vary, they can have difficulty accommodating the changes. The resulting data bubbles lead to computational inefficiency because the GPU's computational elements sit idle, starved of data to crunch, much of the time. By contrast, the reprogrammable nature of the programmable logic in an AMD-Xilinx ACAP allows the device to be reconfigured so that the hardware more closely matches the data formats used in the computation at hand.
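One way to see where those data bubbles come from: if fixed hardware processes work in fixed-size chunks, any request smaller than a chunk still consumes a full chunk's worth of compute. The Python sketch below estimates the wasted fraction for a fixed chunk size; the chunk size and request sizes are made-up illustrative values, not measurements of any real device.

    import math

    CHUNK = 32  # fixed chunk (batch) size the hardware processes -- illustrative

    # Incoming request sizes (e.g., images per inference request) -- made up
    requests = [32, 7, 32, 12, 1, 20]

    used = sum(requests)  # useful work actually requested
    # Each request is padded up to a whole number of fixed-size chunks.
    allocated = sum(math.ceil(r / CHUNK) * CHUNK for r in requests)

    utilization = used / allocated
    print(f"compute utilization: {utilization:.0%}, "
          f"data bubbles: {1 - utilization:.0%}")

With these made-up numbers, nearly half the allocated compute is spent on padding. Reconfigurable logic can, in principle, be tailored to the actual data sizes, shrinking that idle fraction.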

However, ResNet-50 performance, or any benchmark performance for that matter, is not the only story for FPGA-based ML implementations. In real-world ML applications, running the ML model to identify objects is not the be-all and end-all of the application. There are other practical tasks to be accomplished, as the video-analytics example below illustrates.

In that example, the first step, which precedes image recognition by the ML model, is to decode and resize the incoming video stream to match the model's data input requirements. After the model has detected, identified, and classified the object(s) in the video, there are additional tasks to perform, including object cropping, image resizing, and object tracking. The programmable logic in an FPGA or ACAP can be configured to implement these additional tasks, while a GPU is less suited to this sort of work because of its relative lack of algorithmic flexibility; a rough sketch of the pipeline follows.
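Here is a hedged Python sketch of that pipeline, using OpenCV for the decode and resize steps. The detect_objects and track_objects functions are hypothetical, caller-supplied stand-ins for the ML model and the tracker; they are not a real AMD-Xilinx API.

    import cv2  # OpenCV handles the decode and resize steps

    MODEL_INPUT = (224, 224)  # ResNet-style input size; illustrative

    def run_pipeline(video_path, detect_objects, track_objects):
        """detect_objects and track_objects are hypothetical placeholders
        for the ML model and the tracker, not a real vendor API.
        detect_objects returns (x, y, w, h) boxes in model-input coordinates."""
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()                    # 1. decode a frame
            if not ok:
                break
            resized = cv2.resize(frame, MODEL_INPUT)  # 2. resize to model input
            detections = detect_objects(resized)      # 3. detect and classify objects
            # Map boxes from model coordinates back to the full-resolution frame.
            sx = frame.shape[1] / MODEL_INPUT[0]
            sy = frame.shape[0] / MODEL_INPUT[1]
            crops = [frame[int(y * sy):int((y + h) * sy),   # 4. crop each object
                           int(x * sx):int((x + w) * sx)]
                     for (x, y, w, h) in detections]
            track_objects(detections)                 # 5. update object tracks
            yield crops                               # crops feed downstream stages
        cap.release()

On an ACAP, the decode, resize, and crop stages can be implemented in programmable logic alongside the ML model, which is the flexibility argument the article makes.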

The very best metric or benchmark for any processor, AI or otherwise, is the actual application or workload you will run on the device. Standardized benchmarks like ResNet-50 can give you a relative feel for performance among alternatives, but you only know for sure when you run your target application on the processor. Vendors also use TOPS (trillions of operations per second) as an easily derived proxy for performance, but a processor's calculated peak TOPS is probably not the same as the actual TOPS it delivers while running real AI workloads. Different users prefer different metrics for comparing alternative accelerators: one may prefer the fastest execution for specific workloads, another the best performance/watt, and yet another the best performance/dollar. No AI accelerator excels at all three.
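The gap between peak and delivered TOPS follows directly from the arithmetic: peak TOPS is typically derived as MAC units x 2 operations per MAC (a multiply and an add) x clock frequency, and delivered TOPS is that figure scaled by the utilization a real workload actually sustains. The numbers below are placeholders, not the specs of any particular device.

    # Peak TOPS = MACs x 2 ops/MAC x clock frequency -- placeholder values
    mac_units = 4096            # hypothetical number of MAC units
    clock_hz = 1.0e9            # hypothetical 1-GHz clock
    peak_tops = mac_units * 2 * clock_hz / 1e12   # 8.19 peak TOPS

    # Delivered TOPS depends on how busy the MACs stay on a real workload.
    utilization = 0.35          # hypothetical sustained utilization
    delivered_tops = peak_tops * utilization      # ~2.87 TOPS actually delivered

    print(f"peak: {peak_tops:.2f} TOPS, delivered: {delivered_tops:.2f} TOPS")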

This article was originally published on EE Times.

Steve Leibson is a Principal Analyst at Tirias Research. He has 45 years of industry-leading expertise in the development of advanced electronic systems using a wide range of technologies, and he has held managerial and technical positions at several leading electronics companies, including HP, Cadnetix, Tensilica, Cadence Design Systems, Xilinx, and Intel. An industry expert and thought leader since 1985, Steve has written about electronic development for several leading industry publications, including EDN Magazine and Microprocessor Report, and he served as the founding editor of Wind River's Embedded Developer's Journal. He has also published several books and book chapters covering many electronics topics, including the use of processor IP for ASIC development, and he has presented numerous technical seminars and webinars to technical audiences, spoken at major industry events worldwide, and provided strategic consulting to many leading technology companies.

 
