



Space Tolerant CNN FPGA Deployment Part 4: Deployment on the ADA-SDEV-KIT3

A. C. McCormick

# Space Tolerant CNN FPGA Deployment, Part 4

This paper is the final part of a four-part series of white papers providing an educational overview of the issues surrounding the deployment of Convolutional Neural Network (CNN) solutions on FPGAs in a radiation-susceptible environment. The first part documented a practical CNN processing core, which can be used to implement a wide range of CNN solutions. The second part discussed the space hardening of that core, adding Triple-Mode Redundancy (TMR) to the control path circuitry for tolerance to radiation effects. The third part documented the higher-level control structures necessary to move data to and from the CNN core and dynamically reconfigure its operation to match the functional requirements of the different parts of the model. This final, fourth part documents the deployment of this FPGA solution on the Alpha Data ADA-SDEV-KIT3 Space Development Kit for the Xilinx XQRKU060 FPGA.

## ADA-SDEV-KIT3 Development Board



Figure 1 : ADA-SDEV-KIT3



options to allow customers to configure their IO requirements, including the Config-FMC, which includes a PCIe compatible IPASS connector which will be used to transfer data to and from the DPU design. The board also features DDR3 memory on a SODIMM, which will also be utilized by the DPU design.

The example code structure for this part of the paper is based on the reference designs for the ADA-SDEV-KIT3. It therefore has a more complex directory structure than the other parts of this paper, with directories for FPGA designs, source code and projects. There are also directories for host software, since a PCIe-connected host will be used for overall system control, data transfer and evaluation. The directory structure is as follows:



The code developed for the earlier parts of the paper must be included in the indicated folders to be sourced by the IP packaging scripts.

## DPU IP Packaging

To combine the DPU design developed in the first 3 parts of this paper with the other IP necessary to build a board-level system, the DPU must first be packaged as IP. By encapsulating the VHDL design as a Xilinx IP core, it can be used alongside other off-the-shelf IP in the Xilinx IP Integrator (IPI), the block diagram capture tool that sits within the Vivado design suite.

The design, as left at the end of the simulations in part 3, still requires a few minor modifications to make it suitable for integration with other IP cores. The control and status monitoring signals need to be wrapped up into an AXI4 compatible form and made accessible as registers. This will allow a remote processor to configure and trigger the DPU runs and monitor the performance. The IPI tool and the other IP required to interface to the outside world do not support TMR and are not aware of the TMR types used; therefore the TMR DPU core needs to be wrapped, with its signals resolved to standard logic vector types.

Therefore, 2 new files are added to the source code: reg\_bank\_axi4l.vhd, which provides a generic AXI4-Lite register bank with programmable bit fields that can be connected to the DPU core control and status signals, and dpu\_top.vhd, which connects and wraps up the DPU core and register bank and resolves any TMR signals.

Page 2 ad-an-0119\_v1\_0.pdf

The IP is packaged using the script *mikp.tcl*, which pulls in the existing files from the folders used in previous parts of the paper and combines them with the 2 new files. The DataMover IP cores are also generated and included in the project. This is then all packaged up and saved as an IP core, dpu\_top\_v1\_0.zip, in the ip subfolder for inclusion in the main project.



Figure 2 : Packaging DPU IP Using Vivado

## IP Integration

Now that the DPU is packaged up neatly as an IP core, it can be used with the Xilinx IP Integrator tool within Vivado. The block diagrams can be edited schematically, connecting up signals via the GUI. They can also be built using TCL scripts for a more traceable and repeatable design flow that works easily with version control systems. The projects folder contains a script for generating a project to build the FPGA bitstream, and this in turn calls the script that generates the block diagram.





Figure 3 : Vivado IPI Integrator Block Diagram

The scripts are based on the PCIe reference design for the ADA-SDEV-KIT and use a PCIe configuration based on that example. They also instantiate the DDR3 core and set it up with the same parameters as used in the DDR3 example from the ADA-SDEV-KIT reference designs. The script then adds in the DPU core and connects up the AXI4 buses so that there is high performance access to the DDR3 interface IP block from both the host (via the PCIe core) and the DPU core's memory port. An AXI4 bus connection is also set up to enable the host (via the PCIe core) to access the AXI-Lite register interface on the DPU. Additionally, the block diagram connects some of the LEDs on the board to show the status of the application. One indicates that PCIe is online and another that the DDR3 memory has trained correctly. The other 4 LEDs connect to status bits that show the state of the DPU.

| State                     | LED2 | LED3 | LED4 | LED5 |
|---------------------------|------|------|------|------|
| Idle                      | On   | On   | On   | On   |
| Read Instruction Word Cmd | On   | On   | On   | Off  |
| Wait for Instruction Word | Off  | On   | On   | On   |
| Reading Instruction Word  | On   | On   | Off  | On   |
| Read Weights Cmd          | On   | On   | Off  | Off  |
| Wait for Weights          | Off  | On   | On   | Off  |
| Reading Weights           | On   | Off  | On   | On   |
| Start Writer Cmd          | On   | Off  | On   | Off  |
| Start Writer 2 Cmd        | Off  | On   | Off  | Off  |
| Start Reader 2 Cmd        | Off  | On   | Off  | On   |
| Start DPU Cmd             | On   | Off  | Off  | On   |
| DPU Active                | On   | Off  | Off  | Off  |
| DPU Pausing               | Off  | On   | Off  | Off  |
| Error State               | Off  | Off  | Off  | Off  |

Table 1 : DPU State LED Indications
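When watching the board during a run, it can be convenient to have the LED patterns of Table 1 available on the host as well. The following is a minimal sketch that mirrors the table as a lookup from state name to a 4-bit pattern. The bit packing (LED2 in bit 3 down to LED5 in bit 0, with 1 meaning On) and the names `dpu_led_state`, `dpu_led_table` and `dpu_led_pattern` are illustrative assumptions and not part of the reference design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of Table 1: a DPU state name and its LED2..LED5 pattern.
 * Assumed packing: bit 3 = LED2, bit 2 = LED3, bit 1 = LED4, bit 0 = LED5,
 * with 1 = On. This packing is illustrative, not the hardware register layout. */
typedef struct {
    const char *name;
    unsigned leds;
} dpu_led_state;

static const dpu_led_state dpu_led_table[] = {
    {"Idle",                      0xF}, /* On  On  On  On  */
    {"Read Instruction Word Cmd", 0xE}, /* On  On  On  Off */
    {"Wait for Instruction Word", 0x7}, /* Off On  On  On  */
    {"Reading Instruction Word",  0xD}, /* On  On  Off On  */
    {"Read Weights Cmd",          0xC}, /* On  On  Off Off */
    {"Wait for Weights",          0x6}, /* Off On  On  Off */
    {"Reading Weights",           0xB}, /* On  Off On  On  */
    {"Start Writer Cmd",          0xA}, /* On  Off On  Off */
    {"Start Writer 2 Cmd",        0x4}, /* Off On  Off Off */
    {"Start Reader 2 Cmd",        0x5}, /* Off On  Off On  */
    {"Start DPU Cmd",             0x9}, /* On  Off Off On  */
    {"DPU Active",                0x8}, /* On  Off Off Off */
    {"DPU Pausing",               0x4}, /* Off On  Off Off */
    {"Error State",               0x0}, /* Off Off Off Off */
};

/* Return the 4-bit LED pattern for a state name, or -1 if the name is unknown. */
int dpu_led_pattern(const char *state_name)
{
    assert(state_name != NULL);
    for (size_t i = 0; i < sizeof dpu_led_table / sizeof dpu_led_table[0]; i++) {
        if (strcmp(dpu_led_table[i].name, state_name) == 0)
            return (int)dpu_led_table[i].leds;
    }
    return -1;
}
```

The sketch only maps from state to pattern, because two rows of the table as printed share the same pattern and a reverse lookup would not be unique.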




## Host Software

While the DPU runs autonomously on data and a program placed in SDRAM, some higher level control is still required to set up the contents of SDRAM, generate the DPU program, transport the input and output data and start and monitor the DPU until it completes.

The example host software is, however, a simple example that aims to replicate the behaviour of the test bench tests used in previous papers: it reads in the text-only Caffe file to get the weight and bias definitions for the YoloV3 network and uses them to configure the DPU to run the various different layers.

The example code is split into 3 files: read\_xcaffe\_file.c, create\_dpu\_iword.c and the top level run\_dpu\_test.c

read\_xcaffe\_file.c is a port from VHDL to C of the test bench function with the same name. This function extracts from a text only Caffe file, the weight and bias data required for a specified network layer. This is then stored in the required 16 bit integer format in host memory ready to transfer to the FPGA attached memory. This function does not parse the Caffe file for the network structure. This is hard-coded into the top level code.
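As an illustration of the conversion step, the sketch below turns one floating-point weight into a signed 16-bit integer with rounding and saturation. It assumes a fixed-point representation with a caller-chosen number of fractional bits; the scaling actually used by the DPU is defined in the earlier parts of this series, and the name `weight_to_fixed16` and the `frac_bits` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a floating-point weight to signed 16-bit fixed point.
 * frac_bits is an assumed scaling parameter (value = integer / 2^frac_bits);
 * the real DPU format may differ. Out-of-range values saturate. */
int16_t weight_to_fixed16(double w, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);

    double scaled = w * (double)(1 << frac_bits);
    if (scaled != scaled)
        return 0;                          /* NaN in the input: map to zero */

    /* Round half away from zero, then let the cast truncate toward zero. */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;

    if (scaled >= 32767.0)
        return INT16_MAX;
    if (scaled <= -32768.0)
        return INT16_MIN;
    return (int16_t)scaled;
}
```

Saturating rather than wrapping means a weight that is too large for the chosen scaling degrades gracefully instead of changing sign.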

create\_dpu\_iword.c exports a function of the same name that assembles the 128 byte instruction word for the DPU based on the parameters passed in. These parameters come from the network description hard-coded into the top level.
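The general shape of such a function can be sketched as follows. Only the 128 byte size comes from the text; the field set, field offsets, little-endian byte order and the names `layer_params` and `pack_iword_sketch` are assumptions made purely for illustration, and the real layout is the one defined by the DPU design in part 3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DPU_IWORD_BYTES 128

/* Hypothetical subset of per-layer parameters, for illustration only. */
typedef struct {
    uint32_t input_addr;       /* where the layer input lives in DDR3 */
    uint32_t weights_addr;     /* where the layer weights live in DDR3 */
    uint32_t output_addr;      /* where the layer output is written */
    uint32_t next_iword_addr;  /* next instruction word in the linked list */
    uint16_t input_channels;
    uint16_t neurons;
} layer_params;

static void put_u16_le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_u32_le(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Pack the parameters into a zero-filled 128 byte buffer at assumed offsets. */
void pack_iword_sketch(uint8_t iword[DPU_IWORD_BYTES], const layer_params *lp)
{
    assert(iword != NULL && lp != NULL);
    memset(iword, 0, DPU_IWORD_BYTES);
    put_u32_le(iword + 0,  lp->input_addr);
    put_u32_le(iword + 4,  lp->weights_addr);
    put_u32_le(iword + 8,  lp->output_addr);
    put_u32_le(iword + 12, lp->next_iword_addr);
    put_u16_le(iword + 16, lp->input_channels);
    put_u16_le(iword + 18, lp->neurons);
}
```

Writing each field byte by byte keeps the packed image independent of the host's endianness and of compiler struct padding.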

run\_dpu\_test.c contains the top level main function that tests the DPU network on the FPGA. This design hard-codes the YoloV3 network structure in the order used by the DPU. While this is based on the Caffe model, the information may not be automatically extractable, and there are cases where one layer in the Caffe model maps to multiple sequential runs, each covering a subset of the neurons on that layer. This is currently mapped by hand. The resulting DPU program runs as 29 sequential layers, whereas the Caffe description only features 21 (some of these 21 are Max-Pool layers that are absorbed into the preceding convolutional layer in the DPU). As well as the structure being hard-coded here, the memory map for layer inputs and outputs is also mapped manually. In cases where a Caffe model layer is split into multiple sequential runs of 128 neurons, the memory writes are interleaved.

The test program has some command line parameters to simplify any required debug. These allow the number of DPU layers actually run to be controlled, which can be useful for spotting which layer causes a lock-up due to incorrect specification of the instruction word. A second option allows the layers to be run sequentially by the CPU, rather than as a linked list where the CPU waits for the last layer to complete. This can also isolate bugs to a single layer.
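A minimal sketch of how two such options could be parsed is shown below. The flag names `-n` (number of layers) and `-s` (layer-by-layer mode), the `test_options` structure and the function name are hypothetical; the actual options of the test program are not listed in this paper. The limit of 29 layers is taken from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DPU_MAX_LAYERS 29

typedef struct {
    int num_layers;      /* how many DPU layers to run, 1..29 */
    int layer_by_layer;  /* non-zero: the CPU starts each layer in turn */
} test_options;

/* Parse hypothetical flags "-n <count>" and "-s".
 * Returns 0 on success, -1 on an unknown or malformed argument. */
int parse_test_options(int argc, char **argv, test_options *opt)
{
    assert(argv != NULL && opt != NULL);

    /* Defaults: run the whole network as one linked list. */
    opt->num_layers = DPU_MAX_LAYERS;
    opt->layer_by_layer = 0;

    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-s") == 0) {
            opt->layer_by_layer = 1;
        } else if (strcmp(argv[i], "-n") == 0 && i + 1 < argc) {
            char *end;
            long n = strtol(argv[++i], &end, 10);
            if (*end != '\0' || n < 1 || n > DPU_MAX_LAYERS)
                return -1;
            opt->num_layers = (int)n;
        } else {
            return -1;
        }
    }
    return 0;
}
```

Limiting the run to the first few layers and then increasing the count one layer at a time is a simple way to find the first layer whose instruction word causes a lock-up.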

With this hard-coded network definition in place, and with hard-coded file links to the Caffe model text description for weights and biases and to the input image file, the example code is not very flexible for targeting different models. While most Caffe models should be targetable at the DPU, a bit of manual investigation will be required in most cases to evaluate suitability, and so automatic network definition from Caffe files is not supported. For example, larger networks might benefit from a larger DPU size (e.g. one that is built from 256, 512 or 1024 neurons).

The example code takes the hard-coded network definition and builds up an image in memory to be transferred to the FPGA attached memory. The code loops through all enabled layers, up to 29, reading the weights from the Caffe file and building the instruction word from the hard-coded network information. Weight memory start addresses are dynamically allocated based on the memory already used. Once the network image is completed, the image data is also added to the memory buffer.
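The dynamic allocation of weight start addresses can be pictured as a simple bump allocator over the memory image. The sketch below is one way to do this under stated assumptions: the alignment requirement, the `image_allocator` type and the `image_alloc` function are hypothetical, and the example code may track the used memory differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t next_free;  /* next unallocated byte offset in the image */
    uint32_t limit;      /* total size of the image in bytes */
} image_allocator;

/* Reserve `bytes` at the next `align`-byte boundary (align must be a power
 * of two). Returns the start offset, or UINT32_MAX if the image is full. */
uint32_t image_alloc(image_allocator *a, uint32_t bytes, uint32_t align)
{
    assert(a != NULL);
    assert(align != 0 && (align & (align - 1)) == 0);

    /* 64-bit arithmetic so the rounding and the end address cannot wrap. */
    uint64_t start = ((uint64_t)a->next_free + align - 1) & ~((uint64_t)align - 1);
    uint64_t end = start + bytes;
    if (end > a->limit)
        return UINT32_MAX;

    a->next_free = (uint32_t)end;
    return (uint32_t)start;
}
```

Each layer in the loop would call `image_alloc` once for its weights, and the returned offset would then be written into that layer's instruction word.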

The example application then uses the Alpha Data ADXDMA driver and API to control the data transfer over PCIe to the ADA-SDEV-KIT3 board DDR3 memory. The DMA functions are used to simply copy the 12MB of data across to the DDR3 DIMM on the ADA-SDEV-KIT from where it can be accessed by the DPU.

The DPU is started with register writes to the start address. The CPU then simply polls the DPU registers every 250 µs until completion, displaying the DPU state, the time since the DPU started and the time the DPU has been active. By default this is run just once for all the DPU layers, but optionally each layer can be started in turn with the time for each layer recorded separately. This can be useful in determining the layer efficiency, which can be quite low, at the batch size of 1 used, for certain layers with a high number of weights.
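The polling behaviour described here can be sketched as below. The register read is abstracted behind a callback so that no driver API is assumed; the `DPU_STATUS_DONE` bit, the callback type and the function names are hypothetical, and `nanosleep` is used as one POSIX way to wait 250 µs between reads.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical "done" bit in the DPU status register. */
#define DPU_STATUS_DONE 0x1u

/* Callback that reads the DPU status register; it stands in for whatever
 * register access call the real driver provides. */
typedef uint32_t (*read_status_fn)(void *ctx);

/* Poll the status every 250 us until the done bit is set.
 * Returns the number of reads performed, or -1 after max_polls reads. */
long wait_for_dpu(read_status_fn read_status, void *ctx, long max_polls)
{
    assert(read_status != NULL);
    const struct timespec interval = {0, 250 * 1000};  /* 250 us in ns */

    for (long polls = 1; polls <= max_polls; polls++) {
        if (read_status(ctx) & DPU_STATUS_DONE)
            return polls;
        nanosleep(&interval, NULL);
    }
    return -1;
}

/* Stand-in status source for trying the loop without hardware:
 * reports done once it has been read *ctx times. */
uint32_t countdown_status(void *ctx)
{
    int *remaining = ctx;
    return (--*remaining <= 0) ? DPU_STATUS_DONE : 0u;
}
```

Bounding the loop with `max_polls` means an incorrectly specified instruction word that locks up the DPU leads to a reported timeout rather than a hung test program.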

After the register reads detect that the DPU has completed, DMA functions transfer the data back from the results areas of DDR3 into host memory. This data is then written out as two text files for further processing and analysis.




## Deployment



Figure 4 : ADA-SDEV-KIT3 Connection to PCIe

The equipment required to run the application in hardware is the same as described in Appendix A of the ADA-SDEV-KIT DMA Demonstration FPGA Design, Release 1.0. Specifically, a Linux test machine is required to host the PCIe extender card. This machine (or another located close by) must also be able to carry out the JTAG programming of the FPGA using the Vivado Hardware Manager and a Xilinx USB JTAG cable. An ADA-SDEV-KIT3 (or another ADA-SDEV-KIT variant) powered by an ATX power supply is required, with the Config-FMC fitted to provide access to the IPASS cable. A PCIe extender card (e.g. OSS-PCIe-HIB25-x4-H) with an IPASS cable must be plugged into the test machine to allow the software to transfer data across to the FPGA and its attached DDR3 memory. Figure 4 shows the ADA-SDEV-KIT3 fitted with the Config-FMC module (and another unused FMC card) and also the far-end PCIe card for connecting the other end of the IPASS cable.

Using this test setup, the bitstream was tested by downloading it over JTAG into the KU060 FPGA. The test Linux PC, running Ubuntu 16.04, was then reset, allowing it to identify the PCIe endpoint. The ADXDMA driver was installed on the test Linux PC, allowing it to identify and control the FPGA PCIe endpoint. The test application was compiled and run on this setup with the data and model files copied to the local directory. Testing was performed both in layer-by-layer mode and with a full end-to-end run.

A full end-to-end run results in a total DPU processing time of 38101671 clock cycles and an active time of 30168115 clock cycles. The clock used for the DPU is the 125MHz backend clock from the PCIe core. In theory a faster clock could be used for the DPU and the AXI connections between the DPU and the DDR3 memory, improving the performance but perhaps making the design more difficult to follow. The resulting DPU processing time for the network is 304ms, operating at an efficiency of around 75%, with around 25% of the time being used to load weights. This efficiency could be improved by batching up the processing, assuming latency of response is not critical.
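The conversion from cycle counts to times is simple arithmetic, sketched below for reference. The 125 MHz clock frequency passed in the usage example is an assumption that is consistent with the quoted 304ms for 38101671 cycles; the `dpu_timing` type and function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    double total_ms;         /* wall time of the whole DPU run */
    double active_ms;        /* time the DPU spent actively processing */
    double active_fraction;  /* active cycles divided by total cycles */
} dpu_timing;

/* Convert the cycle counters read back from the DPU into times, given the
 * DPU clock frequency in Hz. */
dpu_timing dpu_timing_from_cycles(uint64_t total_cycles,
                                  uint64_t active_cycles,
                                  double clock_hz)
{
    assert(clock_hz > 0.0);
    assert(total_cycles > 0 && active_cycles <= total_cycles);

    dpu_timing t;
    t.total_ms = 1000.0 * (double)total_cycles / clock_hz;
    t.active_ms = 1000.0 * (double)active_cycles / clock_hz;
    t.active_fraction = (double)active_cycles / (double)total_cycles;
    return t;
}
```

Calling `dpu_timing_from_cycles(38101671, 30168115, 125e6)` gives a total time of about 304.8 ms, an active time of about 241.3 ms and an active fraction of about 0.79.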



## Conclusions and Summary

This paper series has covered the implementation of a DPU suitable for applications in radiation tolerant environments. The first paper covered a basic though rescalable DPU core implementation. The second paper covered the selective use of triple mode redundancy on the control side to make the design robust in those challenging environments. The third paper covered many of the practical data transport and control issues surrounding the core DPU processing, such as fetching the model weights and data from memory for processing, and configuring and scheduling processing runs for each layer in sequence.

This final paper has taken the DPU design, proven in simulation in the first 3 papers, and surrounded it with an appropriate shell design for deployment on real FPGA hardware. The ADA-SDEV-KIT3 is used as it is a reference platform for the KU060 FPGA (a part also available in a Radiation Tolerant package for space flight deployment). To ease the use of the core DPU IP with the other shell infrastructure, the VHDL code was packaged using the Xilinx IP Packager. This then allows a simple TCL script to instantiate the board specific shell components (XDMA PCIe endpoint and DDR3 controller) along with the DPU and other AXI infrastructure in the same Vivado IP Integrator block diagram, which can then be used to generate the FPGA bitstream.

To test the bitstream, a simple C application was written. This included a port of the VHDL testbench code which read in the Caffe model layers from the text version of the Caffe model file. It also includes a function for converting a model layer description into an instruction word for the DPU processor. The main function constructed a memory image to copy to the FPGA DDR3 SDRAM, and then used the ADXDMA driver and API to copy this across over PCIe. The application then triggered the DPU to run and, when the DPU completed, copied back the results data over PCIe for further analysis, demonstrating the operation of the DPU on all the YoloV3 network layers and allowing some performance metrics to be read back.

This series has covered the implementation of a potentially radiation tolerant machine learning CNN solution on the Xilinx KU060 FPGA device. Starting from a simple CNN implementation structure, a general purpose DPU core was developed that can support many different neural network layers. The selective use of triple mode redundancy was employed to make the DPU design more robust to single event upsets. Higher level control and data transport structures have been wrapped around the processing core to step towards a practical implementation, and a few other network specific ad-hoc features were added to specifically support the YoloV3 structure. This core was then targeted at the KU060 device on the ADA-SDEV-KIT3 development board, and wrapped with support for DDR3 external memory, external PCIe communication and readback of the DPU status. The processing core itself is independent of this board specific shell, and therefore could be embedded in a different FPGA system where the data is collected directly via another interface, the running of the DPU is autonomously triggered by this, and the results are acted upon by other hardware modules in the FPGA. The DPU could therefore be used in situations where external CPU control is not available.




## Revision History

| Date     | Revision | Nature of Change |
|----------|----------|------------------|
| 20/01/22 | 0.1      | First draft      |
| 21/03/22 | 1.0      | First release    |

Address: Suite L4A, 160 Dundee Street, Edinburgh, EH11 1DQ, UK. Telephone: +44 131 558 2600. Fax: +44 131 558 2700. Email: sales@alpha-data.com. Website: http://www.alpha-data.com

Address: 10822 West Toller Drive, Suite 250, Littleton, CO 80127. Telephone: (303) 954 8788. Fax: (866) 820 9956 (toll free). Email: sales@alpha-data.com. Website: http://www.alpha-data.com