VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
COMPUTER ENGINEERING DEPARTMENT

PHAN TUAN THANH - 16521807
VŨ HOANG HY - 16520545

GRADUATION THESIS

THE RESEARCH & IMPLEMENTATION OF CNN ALGORITHM (YOLO) ON ZEDBOARD ZYNQ 7000
NGHIEN CUU VA THUC HIEN THUAT TOAN CNN (YOLO) TREN ZEDBOARD ZYNQ 7000

ENGINEER OF COMPUTER ENGINEERING

INSTRUCTOR
PhD. NGUYEN MINH SON

HO CHI MINH CITY, 2021

LIST OF THE THESIS DEFENSE COUNCIL

The graduation thesis grading council was established under Decision No. 70/QD-DHCNTT dated January 27, 2021 of the Rector of the University of Information Technology.

ACKNOWLEDGMENTS

To complete this graduation thesis, we would like to send our sincere thanks to the teachers of the Computer Engineering Department and the University of Information Technology, Vietnam National University Ho Chi Minh City, who have taught us and shared invaluable experience throughout our learning journey.

In particular, we sincerely thank Mr. Nguyen Minh Son, who took the time to guide and instruct us throughout the thesis so that we could complete it.

We also thank our families and friends, who helped us find information during the thesis and accompanied us through the past 5 years of school.

Once again, we would like to sincerely thank everyone for their time and effort in helping us through the graduation thesis process. Mistakes are unavoidable in a thesis, and we apologize for them; we hope that our teachers and readers will forgive them.

Students
Phan Tuan Thanh
Vũ Hoang Hy
Department of Computer Engineering, class MTCL2016

TABLE OF CONTENTS

Overseas situation
Fully Connected Layer
Neural Network for YOLOv1
Intersection Over Union
Disadvantages of YOLOv1
High Resolution Classifier
Use Anchor Box Architecture to Make Predictions
K-mean Clustering for Anchor Selection
Direct Location Prediction
Add Fine-grained Features
Multi-Scale Training
Lightweight Backbone
Petalinux + Xilinx SDK
Training Before Recognizing
Field Programmable Gate Array (FPGA)
Overview of ZedBoard
Advanced Extensible Interface (AXI) Bus
Development Tool and Overall Architecture
Neuron Network Flow
Execution Time
Accelerator Overall Architecture
How It Works in General
SYSTEM DESIGN AND IMPLEMENTATION
Input and Output Module
Implement of Recognition Processing
Implement on Vivado
Implement on ZedBoard
Verified on Some Preliminary Scenarios
Verification and Accuracy
Chapter 4. CONCLUSION AND DEVELOPMENT DIRECTIONS
The status of the entire project
Write_back_output
Weight_mmcpy_everyKxk
Weight_load_reorg
Mmcpy_outputpixel
Mmcpy_inputport
Mmcpy_inputport2
Copy_input2buf_row
Copy_input_weight
What We Gained, Limitation and Direction of Development
What We Gained
Direction of Development

LIST OF FIGURES

Figure 1: The common Neural Network stream to process the input image and classify objects based on value
Figure 2: Neural calculation process
Figure 3: A 5x5 filter is used to detect the angle / edge
Figure 4: ReLU process
Figure 5: How YOLOv1 predicts
Figure 6: YOLOv1 architecture
Figure 7: Frames per second of each algorithm
Figure 8: Example of the only object that the square contains
Figure 9: Each square is responsible for predicting 2 boundary boxes
Figure 10: A vector has 2 boxes and 3 layers for each square
Figure 11: Neural Network model for YOLOv1
Figure 12: IOU example
Figure 13: NN output (i, j) maps to image (i, j)
Figure 14: Down sample
Figure 15: Anchor box refining the anchor box position and size
Figure 16: Predict the bounding box in YOLOv2
Figure 17: YOLOv2 architecture
Figure 18: The Reorg technique in YOLOv2
Figure 19: Darknet-19
Figure 20: WordTree-YOLO9000
Figure 21: Vivado Design HLS flow
Figure 22: Config to let the boot file access the root file system on the SD card
Figure 23: Config bootloader for Zynq ZC702 and ZedBoard
Figure 24: Config to make ZedBoard use the boot file on the SD card
Figure 25: Config to attach bootloader to BOOT
Figure 26: Config to access the kernel file from the SD card
Figure 27: General FPGA architecture
Figure 28: Structure of Configurable Logic Block
Figure 29: Functional Overlay of ZedBoard
Figure 30: Block Diagram of ZedBoard
Figure 31: Interface and interconnect
Figure 32: Channel in AXI interface
Figure 33: Overall Architecture
Figure 34: Execution Schedule
Figure 35: Overall architecture of accelerator
Figure 36: Single channel data transmission
Figure 37: Neuron module (Tn = 4 and Tm = 2)
Figure 38: Pooling module
Figure 39: Reorg module
Figure 40: Overall Architecture implemented on ZedBoard
Figure 41: Block design on Vivado
Figure 42: Result of simulation with close object
Figure 43: Output image of close object
Figure 44: Result of simulation with far object
Figure 45: Output image of far object
Figure 46: Aeroplane detection
Figure 47: Bicycle detection
Figure 48: Bus detection
Figure 49: Car detection
Figure 50: Fire hydrant detection
Figure 51: Giraffe and zebra detection
Figure 52: Result of processing on ZedBoard
Figure 53: Detail of project in Vivado HLS
Figure 54: Result of running bitstream
Figure 55: Power Resources after generating bitstream

LIST OF TABLES

Table 1: Compare small filter and large filter
Table 2: Batch Normalization Transform, applied to activation x over a mini-batch
Table 3: Neuron Network Flow
Table 4: Compare result between software darknet and our implementation
Table 5: Timing (ns) of YOLO2_FPGA module
Table 6: Utilization estimates of YOLO2_FPGA module
Table 7: Timing (ns) of Write_back_output module
Table 8: Utilization estimates of Write_back_output module
Table 9: Timing (ns) of Weight_mmcpy_everyKxk module
Table 10: Utilization estimates of Weight_mmcpy_everyKxk module
Table 11: Timing (ns) of Weight_load_reorg module
Table 12: Utilization estimates of Weight_load_reorg module
Table 13: Timing (ns) of Reorg_yolo module
Table 14: Utilization estimates of Reorg_yolo module
Table 15: Timing (ns) of Pool_yolo module
Table 16: Utilization estimates of Pool_yolo module
Table 17: Timing (ns) of Outputpixel2buf module
Table 18: Utilization estimates of Outputpixel2buf module
Table 19: Timing (ns) of Mmcpy_outputport1 module
Table 20: Utilization estimates of Mmcpy_outputport1 module
Table 21: Timing (ns) of Mmcpy_outputport module
Table 22: Utilization estimates of Mmcpy_outputport module
Table 23: Timing (ns) of Mmcpy_outputpixel module
Table 24: Utilization estimates of Mmcpy_outputpixel module
Table 25: Timing (ns) of Mmcpy_inputport module
Table 26: Utilization estimates of Mmcpy_inputport module
Table 27: Timing (ns) of Mmcpy_inputport1 module
Table 28: Utilization estimates of Mmcpy_inputport1 module
Table 29: Timing (ns) of Mmcpy_inputport2 module
Table 30: Utilization estimates of Mmcpy_inputport2 module
Table 31: Timing (ns) of Mmcpy_inputport3 module
Table 32: Utilization estimates of Mmcpy_inputport3 module
Table 33: Timing (ns) of Copy_input2buf_row module
Table 34: Utilization estimates of Copy_input2buf_row module
Table 35: Timing (ns) of Copy_input_weight module
Table 36: Utilization estimates of Copy_input_weight module
Table 37: Timing (ns) of Compute4 module
Table 38: Utilization estimates of Compute4 module

LIST OF ACRONYMS

ASIC: Application Specific Integrated Circuit
AXI: Advanced Extensible Interface
BN: Batch Normalization
BRAM: Block Random Access Memory
CLB: Configurable Logic Block
COCO: Common Objects in Context
DDR: Double Data Rate
DMA: Direct Memory Access
DSP: Digital Signal Processing
DSP48E: Digital Signal Processing Logic Element
FF: Flip-Flop
FIFO: First In First Out
FPGA: Field Programmable Gate Array
GPU: Graphics Processing Unit
HDF: Hardware Descriptor File
HDL: Hardware Description Language
HLS: High-Level Synthesis
IOU: Intersection Over Union
IP: Intellectual Property
ISE: Integrated Synthesis Environment
LUT: Look-Up Table
mAP: Mean Average Precision
NNW: Neural Network
PL: Programmable Logic
PS: Processing System
RAM: Random Access Memory
RCNN: Region-Based Convolutional Neural Network
ReLU: Rectified Linear Unit
RTL: Register-Transfer Level
SD: Secure Digital
SRAM: Static Random-Access Memory
YOLO: You Only Look Once

SUMMARY OF THESIS

In recent years, Deep Learning models have attracted the interest of many researchers. The Neural Network (NNW) model in particular is a good candidate for solving problems such as object recognition. Many practical applications, however, suffer from high latency in software or require a lot of resources in hardware. In response to this problem, optimization strategies such as implementation on FPGA are adopted.
We plan, design, and implement an FPGA-based Neural Network accelerator architecture that uses the You Only Look Once version 2 (YOLOv2) detection algorithm.

Overseas situation

Many countries lead in the microchip field, such as the United States, India, Germany, China, and Japan. Documents about microchips are therefore easy to find on the Internet, and India is one of the countries with the largest amount of online documentation on the subject. There are many projects, events, and reports about object detection, such as:

- At the Association for the Advancement of Artificial Intelligence (AAAI) Award 2020, Geoffrey Hinton (Google, The Vector Institute, and University of Toronto) presented a new approach to object recognition for neural networks.
- As of late 2017, China has deployed facial recognition and artificial intelligence technology in Xinjiang. [1]
- The project of WalkerLau on GitHub about accelerating Neural Networks with FPGA. This project compares the speed of a CPU-only system and a CPU + FPGA system by executing face recognition with 7 neural layers. The result is that the CPU + FPGA acceleration system works 45x-75x faster. [2]
- The project of Mohamed Atri, Fethi Smach, Johel Miteran and Mohamed Abid about the Design of a Neural Networks Classifier for Face Detection. [3] This work builds on the Multi-layer Perceptron concept and reports 99.67% accuracy on 500 sample images using a Xilinx FPGA.

[1] The 16th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
[2] Acceleration CNN computation with FPGA from WalkerLau.
[3] Neural Networks Classifier for Face Detection.

Situation in the country

In Viet Nam, resources, funding, and technology are limited, so people must study abroad to obtain high-quality training. Most microchip companies in the country only do circuit packaging or act as intermediaries for large companies outside Viet Nam. As a result, most research topics are published by foreign companies. Researching and implementing the YOLO Neural Network algorithm on FPGA is difficult because we could not find any documents or projects made by people in Viet Nam.