Real World Bifurcated (2x) Gen4x8 PCIe performance in excess of 22GB/s using ADM-PCIE-9H7/ASUS Pro WS X570-ACE/AMD Ryzen

Alpha Data have demonstrated real-world bifurcated (2x) Gen4x8 PCIe performance between an AMD Ryzen 5 3600 CPU and a Xilinx Virtex UltraScale+™ VU37P HBM FPGA, using the ADM-PCIE-9H7 FPGA Accelerator Card on the ASUS Pro WS X570-ACE Motherboard.

Processor: AMD Ryzen 5 3600 6-Core
Motherboard: ASUS Pro WS X570-ACE (BIOS ver.9904)
RAM: 16GB DDR4 @ 3000MHz (dual bank)
VGA: NVIDIA GeForce 210
OS: Windows 10 (x64)
Accelerator: Alpha Data ADM-PCIE-9H7


The Xilinx Virtex UltraScale+™ HBM FPGAs support PCIe Gen4 and contain multiple 8-lane-capable endpoints. The ADM-PCIE-9H7 board connects two of these endpoints to its 16-lane edge connector. The AMD Ryzen CPU and the ASUS Pro WS X570-ACE Motherboard also support PCIe Gen4, and the 9904 BIOS update allows a Gen4x16 slot to be split into two Gen4x8 devices. Compatibility and performance of this combination have now been verified by Alpha Data at their labs in Edinburgh.

Reference Design

A bifurcated reference design has been created to demonstrate these features and is available on request to Alpha Data customers. The design implements two instances of the Xilinx XDMA core, which connect to the high-bandwidth on-chip BlockRAM and UltraRAM. The Alpha Data ADXDMA driver and API allow multi-threaded host access to these endpoints and their DMA engines, so that the maximum practical transfer performance can be achieved. The reference design is available for the full-height, double-width ADM-PCIE-9H7 (VU37P-based) accelerator as well as the low-profile ADM-PCIE-9H3 (VU33P-based) accelerator.
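As a rough illustration of how an aggregate rate over several host threads can be measured, the following sketch runs four threads, sums the bytes they moved, and divides by the wall-clock time. It does not use the ADXDMA API: `stand_in_transfer` is a hypothetical placeholder that only copies host memory, and the chunk size, transfer count and thread count are arbitrary assumptions, so the printed figure says nothing about PCIe performance.

```python
import threading
import time

CHUNK_BYTES = 1 << 20        # 1 MiB per transfer (arbitrary assumption)
TRANSFERS_PER_THREAD = 64    # arbitrary assumption
ENDPOINTS = 2                # the two bifurcated x8 endpoints
THREADS_PER_ENDPOINT = 2     # hypothetical: e.g. one thread per DMA engine


def stand_in_transfer(src: bytes, dst: bytearray) -> int:
    """Hypothetical placeholder for a real DMA call: copies host memory only."""
    dst[:] = src
    return len(src)


def worker(totals: list, index: int) -> None:
    """Run a fixed number of transfers and record the bytes moved."""
    src = bytes(CHUNK_BYTES)
    dst = bytearray(CHUNK_BYTES)
    moved = 0
    for _ in range(TRANSFERS_PER_THREAD):
        moved += stand_in_transfer(src, dst)
    totals[index] = moved    # each thread writes only its own slot


n_threads = ENDPOINTS * THREADS_PER_ENDPOINT
totals = [0] * n_threads
threads = [threading.Thread(target=worker, args=(totals, i)) for i in range(n_threads)]

# Time from the first thread start to the last thread finishing.
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Aggregate rate = all bytes moved by all threads / wall-clock time.
total_bytes = sum(totals)
print(f"{total_bytes / elapsed / 1e9:.2f} GB/s aggregate (host copy stand-in, not PCIe)")
```

Timing the whole group of threads, rather than averaging per-thread rates, is what makes the result an aggregate figure: it accounts for threads that start late or finish early.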

Performance Results

Against a line bit rate of 32GB/s, the software and FPGA design, tested under the Windows 10 Operating System, achieved aggregate rates in a single direction of more than 22GB/s, and in some cases as high as 25GB/s.
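The 32GB/s figure can be reproduced with simple arithmetic. The sketch below assumes the standard PCIe Gen4 rate of 16 GT/s per lane and 128b/130b line encoding; it ignores packet and protocol overhead, which lowers the achievable payload rate further. The variable names are illustrative only.

```python
# Back-of-envelope PCIe Gen4 bandwidth arithmetic (not a measurement).
GT_PER_S_GEN4 = 16.0     # raw transfer rate per lane in GT/s (PCIe Gen4)
LANES_PER_LINK = 8       # each bifurcated endpoint is x8
LINKS = 2                # x16 slot split into two x8 links
ENCODING = 128 / 130     # 128b/130b line encoding efficiency


def raw_rate_gbytes(gt_per_s: float, lanes: int, links: int) -> float:
    """Raw line bit rate in GB/s, one direction, before encoding overhead."""
    return gt_per_s * lanes * links / 8.0   # 8 bits per byte


raw = raw_rate_gbytes(GT_PER_S_GEN4, LANES_PER_LINK, LINKS)
encoded = raw * ENCODING

print(f"raw line rate:   {raw:.2f} GB/s")
print(f"after 128b/130b: {encoded:.2f} GB/s")
for measured in (22.0, 25.0):   # aggregate rates quoted in the text
    print(f"{measured:.0f} GB/s is {measured / raw:.1%} of the raw line rate")
```

On these assumptions the quoted results correspond to roughly 69% and 78% of the raw line rate.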

Using UltraRAM instead of BRAM

Changing the design to implement 4MB of UltraRAM (instead of 1MB of BlockRAM) gave faster transfer speeds:

  • average of 10 runs: 26.21 GB/s
  • peak: 26.65 GB/s
  • valley: 25.76 GB/s
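To put these figures in context, the short calculation below expresses each one as a fraction of the 32GB/s line bit rate quoted earlier and reports the peak-to-valley spread. It uses only the numbers listed above; the dictionary keys are illustrative labels.

```python
LINE_RATE_GB_S = 32.0   # line bit rate quoted in the text

# Reported UltraRAM results in GB/s, copied from the list above.
results = {"average": 26.21, "peak": 26.65, "valley": 25.76}

# Fraction of the line rate reached by each figure.
fractions = {name: value / LINE_RATE_GB_S for name, value in results.items()}

# Run-to-run variation: difference between best and worst result.
spread = results["peak"] - results["valley"]
spread_pct = spread / results["average"]

for name, value in results.items():
    print(f"{name:8s} {value:.2f} GB/s ({fractions[name]:.1%} of line rate)")
print(f"peak-to-valley spread: {spread:.2f} GB/s ({spread_pct:.1%} of average)")
```

The average works out to about 82% of the line rate, and the spread between the best and worst results is under 1 GB/s.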


Using UltraRAM simplifies the clock scheme to a single clock for both memory ports, removing the need for clock-domain-crossing logic, which reduces latency and increases the maximum throughput. This is achieved by setting the Optimization strategy to “Maximize Performance”.



