Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins

The research document Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins combines theory and practice.

Field of study

Electrical Engineering and Computer Science

Uploaded by

Anonymous

Type

dissertation

2016


Detailed table of contents

ACKNOWLEDGEMENTS

1. INCREASING OFF-CHIP BANDWIDTH IN MULTI-CORE PROCESSORS WITH SWITCHABLE PINS

1.1. Off-Chip Bus Connection

1.2. Power Delivery Simulation

1.3. Runtime Switch Conditions

1.4. Performance and Energy Efficiency Metrics

1.5. Memory-Intensive Workloads

1.6. Wide-bus mode

1.7. Compute-Intensive Workloads

2. MITIGATING DARK SILICON VIA SWITCHABLE PINS

2.1. Power Delivery Network

2.2. Dynamic Pin Switching based on Program Phases

2.3. Dim Silicon Result

3. BOOSTING OFF-CHIP BANDWIDTH WITH PCM VIA SWITCHABLE PINS

3.1. Memory-Intensive Multi-threaded Workloads

3.2. Memory-Intensive Multi-programmed Workloads using PCM

3.3. Memory-Intensive Multi-threaded Workloads using PCM

3.4. Mixed Multi-program Workloads on the memory subsystem using PCM

4. INCREASING INTER-SOCKET BANDWIDTH VIA SWITCHABLE PINS

4.1. Off-chip connection

4.2. Area Overhead & Propagation Delay

4.3. Performance of the static switching

4.4. Performance of the dynamical switching

4.5. Enhancement from a stride prefetcher

4.6. The bandwidth of the DRAM cache

4.7. The size of DRAM cache

4.8. The frequency of QPI buses

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

Summary

I. Overview of increasing off-chip bandwidth and mitigating dark silicon

Off-chip bandwidth is a key factor in the performance of modern processors, especially multi-core and many-core designs. Because conventional designs reserve a large share of package pins for power delivery, only a small number remain for signal communication. Switchable pins address both this bandwidth limit and the dark silicon problem by letting the same pins serve either as power pins or as signal pins, depending on what the workload needs.

1.1. Why off-chip bandwidth matters

Off-chip bandwidth directly limits how quickly a processor can move data to and from memory. When bandwidth is insufficient, cores stall on outstanding memory requests and overall system performance drops.
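As a rough illustration of why the number of memory buses matters, the sketch below computes peak off-chip bandwidth as buses × bus width × transfer rate. The 64-bit bus width matches the baseline mentioned in the dissertation's list of figures; the 1600 MT/s transfer rate and the function name `peak_bandwidth_gbs` are assumptions chosen for illustration, not values taken from the dissertation.

```python
def peak_bandwidth_gbs(num_buses: int,
                       bus_width_bits: int = 64,
                       transfers_per_sec: float = 1600e6) -> float:
    """Peak off-chip bandwidth in GB/s for `num_buses` identical memory buses.

    bus_width_bits and transfers_per_sec are illustrative assumptions.
    """
    bytes_per_transfer = bus_width_bits / 8
    return num_buses * bytes_per_transfer * transfers_per_sec / 1e9


if __name__ == "__main__":
    # Peak bandwidth grows linearly with the number of buses.
    for buses in (1, 2, 3, 4):
        print(f"{buses} bus(es): {peak_bandwidth_gbs(buses):.1f} GB/s")
```

Under these assumptions a single bus peaks at 12.8 GB/s and four buses at 51.2 GB/s. Real workloads achieve less than the peak, so this is only an upper bound.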

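1.2. Dark silicon and the role of switchable pins

Dark silicon refers to transistors that must stay inactive or significantly under-clocked so that the chip stays within its power budget and does not overheat. Even with advanced cooling, peak performance is still capped by the number of power pins on the chip package. Switching a portion of the I/O pins to power delivery when off-chip communication is infrequent allows extra cores to be enabled or the processor frequency to be raised.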
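To make the power-pin constraint concrete, the following sketch estimates how many cores can be active for a given number of power pins. It is a back-of-the-envelope model, not the dissertation's power delivery model: the even split between supply and ground pins, the per-pin current limit, the supply voltage, and the per-core and uncore power figures are all hypothetical.

```python
def deliverable_power_w(power_pins: int,
                        amps_per_pin: float = 0.5,
                        vdd: float = 1.0) -> float:
    """Power the package can deliver, in watts.

    Assumes (hypothetically) that half of the power pins carry VDD and half
    are ground, and that each pin is limited to `amps_per_pin`.
    """
    supply_pins = power_pins // 2
    return supply_pins * amps_per_pin * vdd


def max_active_cores(power_pins: int,
                     watts_per_core: float,
                     uncore_watts: float = 0.0) -> int:
    """Number of cores that fit in the deliverable power budget."""
    budget = deliverable_power_w(power_pins) - uncore_watts
    return max(0, int(budget // watts_per_core))


if __name__ == "__main__":
    # Baseline: 200 power pins. Then switch 40 I/O pins to power delivery.
    print(max_active_cores(200, watts_per_core=10.0, uncore_watts=10.0))
    print(max_active_cores(240, watts_per_core=10.0, uncore_watts=10.0))
```

With these made-up numbers, the 40 extra power pins raise the budget from 50 W to 60 W, which is enough to light up one additional core.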

II. Problems and challenges in increasing off-chip bandwidth

Although there are several ways to increase off-chip bandwidth, challenges remain. Temperature, power consumption, and compatibility between components all need careful consideration.

2.1. Temperature and chip performance

High temperatures can degrade chip performance, so thermal management is essential to keep the chip running at its best operating point.

2.2. Power consumption and bandwidth

Power delivery and off-chip bandwidth compete for the same package pins: the more pins a processor needs for power, the fewer remain for signals. Solutions must therefore save power without sacrificing performance.

III. Effective methods for increasing off-chip bandwidth

Methods for increasing off-chip bandwidth include using switchable pins and optimizing circuit design. Applied in the opposite direction, switchable pins also help mitigate dark silicon.

3.1. Using switchable pins

Switchable pins allow off-chip bandwidth to be adjusted to the system's actual demand: during memory-intensive phases, surplus power pins are switched to carry signals and provide extra bandwidth.
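A minimal sketch of such a runtime policy is shown below. It assumes the controller samples last-level cache misses per 1,000 instructions (MPKI) once per interval and uses two thresholds with hysteresis so the pins do not flip back and forth. The mode names follow the Multi-bus and Single-bus modes named in the dissertation's figure list, but the class `PinSwitchController`, the MPKI metric as the trigger, and both threshold values are assumptions; the dissertation's actual switch conditions are not part of this excerpt.

```python
from dataclasses import dataclass

MULTI_BUS = "multi-bus"    # switchable pins carry memory signals
SINGLE_BUS = "single-bus"  # switchable pins deliver power


@dataclass
class PinSwitchController:
    """Hypothetical interval-based policy for choosing the pin mode."""
    high_mpki: float = 20.0  # assumed threshold for entering multi-bus mode
    low_mpki: float = 10.0   # assumed threshold for returning to single-bus
    mode: str = SINGLE_BUS

    def update(self, llc_misses: int, instructions: int) -> str:
        """Update the mode from one interval's counters and return it."""
        mpki = 1000.0 * llc_misses / max(instructions, 1)
        if self.mode == SINGLE_BUS and mpki > self.high_mpki:
            self.mode = MULTI_BUS   # memory-intensive phase: add bandwidth
        elif self.mode == MULTI_BUS and mpki < self.low_mpki:
            self.mode = SINGLE_BUS  # compute-intensive phase: restore power
        return self.mode


if __name__ == "__main__":
    controller = PinSwitchController()
    # Four intervals of one million instructions each (MPKI = 5, 30, 15, 8).
    for misses in (5_000, 30_000, 15_000, 8_000):
        print(controller.update(misses, 1_000_000))
```

The gap between the two thresholds is a design choice: an interval with MPKI of 15 keeps whatever mode is already active, which avoids paying the switching cost on borderline phases.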

3.2. Optimizing circuit design

Smart circuit design reduces the silicon area required while maintaining high performance. This can be achieved with newer technologies such as 3D integrated circuit design.

IV. Practical applications of increased off-chip bandwidth

Increasing off-chip bandwidth benefits practical applications ranging from servers to mobile devices, all of which need high bandwidth to process data quickly and efficiently.

4.1. Applications in servers

Servers need high bandwidth to handle many concurrent requests. Optimizing bandwidth improves performance and reduces response time.

4.2. Applications in mobile devices

Mobile devices also need high bandwidth to process data from applications and online services. More bandwidth improves the user experience.

V. Conclusion and the future of off-chip bandwidth

The future of off-chip bandwidth will depend on technological progress and on growing user demand. Research into new methods will help optimize bandwidth and mitigate dark silicon.

5.1. New technology trends

New technologies such as AI and IoT will drive demand for higher bandwidth, and solutions are needed to meet it.

5.2. The importance of research and development

Research and development will play an important role in finding optimal solutions for off-chip bandwidth in the future.

25/07/2025

Excerpt from the document

Louisiana State University

LSU Digital Commons

LSU Doctoral Dissertations, Graduate School, 2016

Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins

Shaoming Chen, Louisiana State University and Agricultural and Mechanical College

Follow this and additional works at: https://digitalcommons.edu/gradschool_dissertations

Part of the Electrical and Computer Engineering Commons

Recommended Citation: Chen, Shaoming, "Increasing Off-Chip Bandwidth and Mitigating Dark Silicon via Switchable Pins" (2016). LSU Doctoral Dissertations.edu/gradschool_dissertations/3337

This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contact gradetd@lsu.

INCREASING OFF-CHIP BANDWIDTH AND MITIGATING DARK SILICON VIA SWITCHABLE PINS

A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in School of Electrical Engineering and Computer Science

by Shaoming Chen

B., Huazhong University of Science and Technology, Wuhan, China 2008

M., Huazhong University of Science and Technology, Wuhan, China 2011

August 2016

ACKNOWLEDGEMENTS

I would like to dedicate the dissertation to my parents and my friends for their continuous support and encouragement throughout my entire life. This dissertation was completed with the valuable help and support of many people, including my advisor, Dr. He thoughtfully guided me to pick up the emerging topic and offered valuable advice for my study. Ashok Srivastava gave much-appreciated help with circuit design and encouraged me to respond to harsh feedback from reviewers.

Bin Li from the Department of Experimental Statistics provided insightful statistical methods to help me analyze experimental data. David Koppelman and Dr. Rudy Hirschheim, as my committee members, took their valuable time to supervise my dissertation and attend my defense. I sincerely appreciate the professors' support as the foundation of the work.

I would also like to thank my co-workers in my lab. Zhang provided insightful thoughts that helped me to develop the dissertation. Yue devoted enormous effort to circuit design, especially the power delivery network. Zhou elaborated the circuit design and helped me to improve it.

Sam proofread the work and gave me wonderful feedback. I am thankful to the Department of Electrical and Computer Engineering for providing an assistantship throughout my study. Finally, I would like to thank all the friends I met at LSU for making my life here wonderful and memorable.

LIST OF TABLES

Table 2-1. Pin allocation of an Intel Processor i5-4670.

Power network model parameters.

Processor power and frequency parameters for different numbers of buses.

The configuration of the simulated system.

The selected memory-intensive and compute-intensive workloads.

Pin allocation of the Intel Xeon Processor E5-2450L.

Processor configurations under different cooling techniques.

Parameters of the performance and power models.

Simulated multi-program workloads.

Benchmark memory statistics.

The configuration of the simulated system.

The selected workloads.

The intervals in the multi-link mode and in the single-link mode as well as the times of switching to the multi-link mode and the single-link mode.

LIST OF FIGURES

Figure 1-1. Normalized weighted speedup and off-chip bandwidth of 4 lbm co-running on a processor with 1, 2, 3, and 4 memory channels.

Power and memory bandwidth (8 copies of DEALII from SPEC2006).

The latency breakdown of un-core requests in the simulated system with two sockets.

The circuit of pin switch.

The overview of the hardware design of off-chip bus connection for switching between the Multi-bus mode and the Single-bus mode.

The overview of the hardware design of memory controller for switching between the Multi-bus mode and the Single-bus mode.

Spice models for signal integrity simulation.

The eye diagrams.

RLC power delivery model.

The normalized off-chip latencies and on-chip latencies of workloads against the total execution time.

The normalized weighted speedup of memory-intensive workloads with 2, 3, and 4 buses against each baseline.

The average normalized weighted speedup of memory workloads in geometric mean with multi-bus mode. Each is normalized to the same configuration in single-bus mode.

The normalized weighted speedup of memory-intensive workloads boosted by Static Switching and Dynamic Switching with 3 buses against the baseline.

The increased bandwidth due to pin switching.

The normalized bandwidth of baseline, static pin switching, and dynamic pin switching.

The improved throughput of Dynamic Switching boosted by a stride prefetcher (degree = 1, 2, 4) for memory-intensive workloads.

The off-chip bandwidth of Dynamic Switching improved by a stride prefetcher (degree = 1, 2, 4) for memory-intensive workloads.

The performance of memory-intensive workloads for the baseline (core frequency of 4GHz and a memory bus of 64 bits) and two configurations of wide bus mode (core frequency of 3.6GHz and a memory bus of 128 bits; core frequency of 2.8GHz and a memory bus of 256 bits).

The off-chip bandwidth of memory-intensive workloads for the baseline (core frequency of 4GHz and a memory bus of 64 bits) and two configurations of wide bus mode (core frequency of 3.6GHz and a memory bus of 128 bits; core frequency of 2.8GHz and a memory bus of 256 bits).

The normalized EPI of Dynamic Switching for memory-intensive workloads with 3 buses, and the EPI from DVFS (running on 2.4GHz with the single bus).

The normalized weighted speedup of mixed workloads boosted by Static Switching and Dynamic Switching.

The improved throughput of Dynamic Switching boosted by a stride prefetcher (degree = 1, 2, 4) for mixed workloads.

The normalized weighted speedup of compute-intensive workloads with Static Switching and Dynamic Switching.

Structure of a packaged chip (8 copies of DEALII from SPEC2006).

Design overview on the proposed scheme.

Layout of wrapped around large transistor.

Circuits when a switchable pin is used for signal transmission.

Received eye diagram.

Workflow of dynamic switching.

Floorplan of the chip multiprocessor.

Performance speedup when the processor is in dim silicon mode.

Number of L2 cache misses per 1K instructions on a processor configured to 8×2.

Prediction accuracy on a processor in dim silicon mode.

Performance evaluation of multi-threaded workloads with Dynamic Switching and prefetching (degree = 1, 4).

Normalized consumption of off-chip bandwidth of multi-threaded workloads using Dynamic Switching and prefetching (degree = 1, 4).

Improved throughput of Dynamic Switching boosted by stride prefetchers (degree = 1, 2, 4) for memory-intensive workloads using PCM.

Normalized off-chip bandwidth of Dynamic Switching boosted by stride prefetchers (degree = 1, 2, 4) for memory-intensive workloads using PCM.

Performance evaluation of multi-threaded workloads using Dynamic Switching and prefetching (degree = 1, 4) on the PCM subsystem.

Normalized consumed off-chip bandwidth of multi-threaded workloads using Dynamic Switching and prefetching (degree = 1, 4) on the PCM subsystem.

The improved throughput of Dynamic Switching boosted by stride prefetchers (degree = 1, 4) for mixed workloads with PCM.

The simulated system running in the single-link mode and the multi-link mode.

The off-chip bus connection in the single-link mode and the multi-link mode.

The memory controller running in the single-link mode and the multi-link mode.

The physical layers of QPI running in the single-link mode and the multi-link mode.

The Spice models for QPI buses and memory buses in the single-link mode and the multi-link mode.

The normalized speedup of the static switching and the dynamic switching compared with the baseline.

The latency of un-core requests for the static switching normalized against that of the baseline.

The normalized speedup of the static switching and the dynamic switching compared with the baseline for the workloads with moderate or low inter-socket traffic.

The energy consumption in the static switching normalized against the baseline.

The normalized speedup of the static switching with a prefetcher (degree 1, 2, 4) compared with the baseline and the prefetcher.

The ratio between the un-core latencies of QPI and the total un-core latencies with the baseline and a prefetcher (degree 1, 2, 4).

The normalized speedup of the static switching with the different bandwidths of DRAM cache.

The normalized speedup of the static switching with the different sizes of DRAM cache.

The normalized latencies of un-core requests in the static switching with the different sizes of DRAM cache.

The normalized speedup of the static switching with the different frequencies of QPI.

The ratio between the un-core latencies of QPI and the total un-core latencies with the baseline and the different frequencies of QPI buses.

ABSTRACT

Off-chip memory bandwidth has been considered one of the major limiting factors to processor performance, especially for multi-cores and many-cores. Conventional processor design allocates a large portion of off-chip pins to deliver power, leaving a small number of pins for processor signal communication. We observed that, in some cases, the processor requires much less power than can be supplied during memory-intensive stages. In this work, we propose a dynamic pin switch technique to alleviate the bandwidth limitation issue. The technique dynamically repurposes surplus power-delivery pins during memory-intensive phases to provide extra bandwidth for program execution, thus significantly boosting performance. We also explore its performance benefit in the era of phase-change memory (PCM) and show that the technique can be applied beyond DRAM-based memory systems.

On the other hand, the end of Dennard Scaling has led to a large number of inactive or significantly under-clocked transistors on modern chip multi-processors in order to comply with the power budget and prevent the processors from overheating. This so-called "dark silicon" is one of the most critical constraints that will hinder scaling with Moore's Law in the future. While advanced cooling techniques, such as liquid cooling, can effectively decrease the chip temperature and alleviate the power constraints, the peak performance, determined by the maximum number of transistors that are allowed to switch simultaneously, is still confined by the number of power pins on the chip package. In this paper, we propose a novel mechanism to power up the dark silicon by dynamically switching a portion of I/O pins to power pins when off-chip communications are less frequent. By enabling extra cores or increasing processor frequency, the proposed strategy can significantly boost performance compared with traditional designs.

Switchable pins can also increase inter-socket bandwidth, which is another performance bottleneck. Multi-socket computer systems are popular in workstations and servers. However, they suffer from the relatively low bandwidth of inter-socket communication, especially for massively parallel workloads that generate many inter-socket requests for synchronization and remote memory accesses. The inter-socket traffic puts heavy pressure on the underlying network that fully connects all processors, whose bandwidth is limited by pin resources.
