Louisiana State University
LSU Digital Commons
LSU Doctoral Dissertations, Graduate School
2016

Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins
Shaoming Chen
Louisiana State University and Agricultural and Mechanical College

Follow this and additional works at: https://digitalcommons.edu/gradschool_dissertations
Part of the Electrical and Computer Engineering Commons

Recommended Citation
Chen, Shaoming, "Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins" (2016). LSU Doctoral Dissertations.edu/gradschool_dissertations/3337

This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contact gradetd@lsu.
INCREASING OFF-CHIP BANDWIDTH AND MITIGATING DARK SILICON VIA SWITCHABLE PINS

A Dissertation

Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy

in

School of Electrical Engineering and Computer Science

by
Shaoming Chen
B., Huazhong University of Science and Technology, Wuhan, China 2008
M., Huazhong University of Science and Technology, Wuhan, China 2011
August 2016

ACKNOWLEDGEMENTS

I would like to dedicate this dissertation to my parents and my friends for their continuous support and encouragement throughout my entire life. This dissertation was completed with the valuable help and support of many people, including my advisor, Dr. He thoughtfully guided me to pick up the emerging topic and offered valuable advice for my study. Ashok Srivastava gave much-appreciated help with circuit design and encouraged me to respond to harsh feedback from reviewers.
Bin Li from the Department of Experimental Statistics provided insightful statistical methods to help me analyze experimental data. David Koppelman and Dr. Rudy Hirschheim, as my committee members, took their valuable time to supervise my dissertation and attend my defense. I sincerely appreciate the professors’ support as the foundation of this work.

I would also like to thank my co-workers in my lab. Zhang provided insightful thoughts that helped me develop the dissertation. Yue devoted enormous effort to circuit design, especially the power delivery network. Zhou elaborated the circuit design and helped me improve it. Sam proofread the work and gave me wonderful feedback. I am thankful to the Department of Electrical and Computer Engineering for providing an assistantship throughout my study. Finally, I would like to thank all the friends I met at LSU for making my life here wonderful and memorable.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
INCREASING OFF-CHIP BANDWIDTH IN MULTI-CORE PROCESSORS WITH SWITCHABLE PINS
  Off-Chip Bus Connection
  Power Delivery Simulation
  Runtime Switch Conditions
  Performance and Energy Efficiency Metrics
  Memory-Intensive Workloads
  Wide-bus mode
  Compute-Intensive Workloads
MITIGATING DARK SILICON VIA SWITCHABLE PINS
  Power Delivery Network
  Dynamic Pin Switching based on Program Phases
  Dim Silicon Result
BOOSTING OFF-CHIP BANDWIDTH WITH PCM VIA SWITCHABLE PINS
  Memory-Intensive Multi-threaded Workloads
  Memory-Intensive Multi-programmed Workloads using PCM
  Memory-Intensive Multi-threaded Workloads using PCM
  Mixed Multi-program Workloads on the memory subsystem using PCM
INCREASING INTER-SOCKET BANDWIDTH VIA SWITCHABLE PINS
  Off-chip connection
  Area Overhead & Propagation Delay
  Performance of the static switching
  Performance of the dynamical switching
  Enhancement from a stride prefetcher
  The bandwidth of the DRAM cache
  The size of DRAM cache
  The frequency of QPI buses
LIST OF TABLES

Table 2-1. Pin allocation of an Intel Processor i5-4670.
Power network model parameters.
Processor power and frequency parameters for different number of buses.
The Configuration of the simulated system.
The selected memory-intensive and compute-intensive workloads.
Pin allocation of the Intel Xeon Processor E5-2450L.
Processor configurations under different cooling techniques.
Parameters of the performance and power models.
Simulated multi-program workloads.
Benchmark memory statistics.
The configuration of the simulated system.
The selected workloads.
The intervals in the multi-link mode and in the single-link mode as well as the times of switching to the multi-link mode and the single-link mode.

LIST OF FIGURES

Figure 1-1. Normalized weighted speedup and off-chip bandwidth of 4 lbm co-running on a processor with 1, 2, 3, 4 memory channels.
Power and memory bandwidth (8 copies of DEALII from SPEC2006).
The latency breakdown of un-core requests in the simulated system with two sockets.
The circuit of the pin switch.
The overview of the hardware design of off-chip bus connection for switching between the Multi-bus mode and the Single-bus mode.
The overview of the hardware design of memory controller for switching between the Multi-bus mode and the Single-bus mode.
Spice models for signal integrity simulation.
The eye diagrams.
RLC power delivery model.
The normalized off-chip latencies and on-chip latencies of workloads against the total execution time.
The normalized weighted speedup of memory-intensive workloads with 2, 3, and 4 buses against each baseline.
The average normalized weighted speedup of memory workloads in geometric mean with multi-bus mode, each normalized to the same configuration with single-bus mode.
The normalized weighted speedup of memory-intensive workloads boosted by Static Switching and Dynamic Switching with 3 buses against the baseline.
The increased bandwidth due to pin switching.
The normalized bandwidth of baseline, static pin switching, and dynamic pin switching.
The improved throughput of Dynamic Switching boosted by a stride prefetcher (degree = 1, 2, 4) for memory-intensive workloads.
The off-chip bandwidth of Dynamic Switching improved by a stride prefetcher (degree = 1, 2, 4) for memory-intensive workloads.
The performance of memory-intensive workloads for the baseline (core frequency of 4GHz and a memory bus of 64 bits) and two configurations of wide bus mode (core frequency of 3.6GHz and a memory bus of 128 bits; core frequency of 2.8GHz and a memory bus of 256 bits).
The off-chip bandwidth of memory-intensive workloads for the baseline (core frequency of 4GHz and a memory bus of 64 bits) and two configurations of wide bus mode (core frequency of 3.6GHz and a memory bus of 128 bits; core frequency of 2.8GHz and a memory bus of 256 bits).
The normalized EPI of Dynamic Switching for memory-intensive workloads with 3 buses, and the EPI from DVFS (running on 2.4GHz with the single bus).
The normalized weighted speedup of mixed workloads boosted by Static Switching and Dynamic Switching.
The improved throughput of Dynamic Switching boosted by a stride prefetcher (degree = 1, 2, 4) for mixed workloads.
The normalized weighted speedup of compute-intensive workloads with Static Switching and Dynamic Switching.
Structure of a packaged chip (8 copies of DEALII from SPEC2006).
Design overview of the proposed scheme.
Layout of wrapped around large transistor.
Circuits when a switchable pin is used for signal transmission.
Received eye diagram.
Workflow of dynamic switching.
Floorplan of the chip multiprocessor.
Performance speedup when the processor is in dim silicon mode.
Number of L2 cache misses per 1K instructions on a processor configured to 8×2.
Prediction accuracy on a processor in dim silicon mode.
Performance evaluation of multi-threaded workloads with Dynamic Switching and prefetching (degree = 1, 4).
Normalized consumption of off-chip bandwidth of multi-threaded workloads using Dynamic Switching and prefetching (degree = 1, 4).
Improved throughput of Dynamic Switching boosted by stride prefetchers (degree = 1, 2, 4) for memory-intensive workloads using PCM.
Normalized off-chip bandwidth of Dynamic Switching boosted by stride prefetchers (degree = 1, 2, 4) for memory-intensive workloads using PCM.
Performance evaluation of multi-threaded workloads using Dynamic Switching and prefetching (degree = 1, 4) on the PCM subsystem.
Normalized consumed off-chip bandwidth of multi-threaded workloads using Dynamic Switching and prefetching (degree = 1, 4) on the PCM subsystem.
The improved throughput of Dynamic Switching boosted by stride prefetchers (degree = 1, 4) for mixed workloads with PCM.
The simulated system running in the single-link mode and the multi-link mode.
The off-chip bus connection in the single-link mode and the multi-link mode.
The memory controller running in the single-link mode and the multi-link mode.
The physical layers of QPI running in the single-link mode and the multi-link mode.
The Spice models for QPI buses and memory buses in single-link mode and the multi-link mode.
The normalized speedup of the static switching and the dynamic switching compared with the baseline.
The latency of un-core requests for the static switching normalized against that of the baseline.
The normalized speedup of the static switching and the dynamic switching compared with baseline for the workloads with moderate or low inter-socket traffic.
The energy consumption in the static switching normalized against the baseline.
The normalized speedup of the static switching with a prefetcher (degree 1, 2, 4) compared with baseline and the prefetcher.
The ratio between the un-core latencies of QPI and the total un-core latencies with the baseline and a prefetcher (degree 1, 2, 4).
The normalized speedup of the static switching with the different bandwidths of DRAM cache.
The normalized speedup of the static switching with the different sizes of DRAM cache.
The normalized latencies of un-core requests in the static switching with the different sizes of DRAM cache.
The normalized speedup of the static switching with the different frequencies of QPI.
The ratio between the un-core latencies of QPI and the total un-core latencies with the baseline and the different frequencies of QPI buses.
ABSTRACT

Off-chip memory bandwidth has long been considered one of the major limiting factors of processor performance, especially for multi-core and many-core processors. Conventional processor designs allocate a large portion of off-chip pins to power delivery, leaving a small number of pins for signal communication. We observe that, in some cases, the processor requires much less power during memory-intensive phases than the pins can supply. In this work, we propose a dynamic pin switching technique to alleviate the bandwidth limitation. The technique dynamically repurposes the power-delivery pins that are surplus during memory-intensive phases to provide extra off-chip bandwidth for program execution, thus significantly boosting performance. We also explore its performance benefit in the era of phase-change memory (PCM) and show that the technique can be applied beyond DRAM-based memory systems.

On the other hand, the end of Dennard scaling has left a large number of transistors on modern chip multiprocessors inactive or significantly under-clocked in order to comply with the power budget and prevent the processors from overheating. This so-called “dark silicon” is one of the most critical constraints that will hinder scaling with Moore’s Law in the future.
While advanced cooling techniques, such as liquid cooling, can effectively decrease the chip temperature and alleviate the power constraints, the peak performance, determined by the maximum number of transistors that are allowed to switch simultaneously, is still confined by the number of power pins on the chip package. In this work, we propose a novel mechanism to power up the dark silicon by dynamically switching a portion of I/O pins to power pins when off-chip communications are less frequent. By enabling extra cores or increasing processor frequency, the proposed strategy can significantly boost performance compared with traditional designs.

Switchable pins can also increase inter-socket bandwidth, which is another performance bottleneck. Multi-socket computer systems are popular in workstations and servers. However, they suffer from the relatively low bandwidth of inter-socket communication, especially for massively parallel workloads that generate many inter-socket requests for synchronization and remote memory accesses. The inter-socket traffic puts heavy pressure on the underlying network, which fully connects all processors with limited bandwidth confined by pin resources.