PARALLEL PROGRAMMING USING THREAD-LEVEL SPECULATION A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Manohar Karkal Prabhu December 2005 UMI Number: 3197497 Copyright 2006 by Prabhu, Manohar Karkal All rights reserved. INFORMATION TO USERS The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted.
Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. ® UMI UMI Microform 3197497 Copyright 2006 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 300 North Zeeb Road P. Box 1346 Ann Arbor, MI 48106-1346 © Copyright by Manohar K. Prabhu 2006 All Rights Reserved ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Olukotun I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
A Christos Kozyrakis / I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. ted, Mark Horowitz \ Approved for the University Committee on Graduate Studies. iii Abstract As the performance increases of single-threaded processors diminish, consumer desktop processors are moving toward multi-core designs. Thread-level speculation (TLS) increases the space of applications that can benefit from these designs.
With TLS, a sequential application is divided into fairly independent tasks that are speculatively executed in parallel, while the hardware dynamically enforces data dependencies to provide the appearance of sequential execution. This thesis demonstrates that support for TLS greatly eases the task of manual parallel programming. Because TLS provides a sequential programming interface to parallel hardware, it enables the programmer to focus only on issues of performance, rather than correctness. The dissertation starts by demonstrating the parallelization of a microbenchmark to introduce a number of techniques for manual TLS parallelization.
Several of the advanced techniques leverage programmer expertise to surpass the capabilities of current advanced, automated parallelizers; the research presented here can provide guidance for the future development of such tools. Following this, the use of these techniques to parallelize seven of the SPEC CPU2000 applications is described. TLS parallelization yielded an average 120% speedup on four floating point applications and 70% speedup on three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These strong parallel performance results generated with relatively modest programmer effort support the inclusion of TLS in future chip multiprocessor designs.
iv For each application parallelized, a detailed description is provided of how and where parallelism was located, the impediments to extracting it using TLS, and the code transformations that were required to overcome these impediments. The results on these applications demonstrate that using advanced manual techniques is essential to effectively parallelize integer benchmarks. This leads to a discussion of common hindrances to TLS parallelization, and a subsequent description of methods of programming that help expose the parallelism in applications to TLS systems. These programming guidelines can help uniprocessor programmers create applications that can be easily ported to future TLS systems and yield good performance.
In closing, the dissertation reviews the many advantages of manual TLS parallel programming and specifies potential future research areas. Acknowledgments I would like to thank the many people who have provided me the support and encouragement to complete this dissertation and my Ph. There are so many family members, friends and associates that it is hard to know where to stop, but I do know where to start the list. I would like to thank my daughter Vaishali, first and foremost.
While many would argue that students with children take longer to complete, no distraction could be quite so grand as dear little Vaishali. Whether she was a baby sitting and cooing on my lap while I was debugging code, or was instead demanding I take time off to pay her some attention as she grew older, she has always made working from home the best way to get the job done. She is my other advisor, my live-in advisor (and is much more demanding, I might add!). I would like to thank so many of my family members, as well.
My mom, my brother and my two sisters have provided much of the inspiration that has led me down this path. Since moving to sunny California, there have been a host of other relatives who have provided fabulous fun and cheer, including Anita, Vivek, Farzaneh, Pandu, Mala and a bunch more. And many a friend has brightened my way through grad school, as have so many workmates. I am always indebted to “Uncle Lance,” who has earned his title not by being here at Stanford for more years than me, but from the non-stop fun and action he provides Vaishali on her every visit to the lab.
His presence in the great halls of Gates will be sorely missed by many. And likewise, it has been fun hanging out with Murali ee ee eee ean vi and Tara, the Hydra gang of old and new and the Future/Alumni Professors of Manufacturing. I am indebted to the various people who have worked behind the scenes to make my education possible, Darlene Hadding, Charlie Orgish and Marianne Marx, to name just a few. And, out in the “real” world, I owe a heap of gratitude to my many managers and work associates at HP, including most of all Ray, Bob, Emmanuelle and Steve.
But of course, the list would be incomplete without expressing my profound appreciation for the many advisors who have helped steer my path through to the light at the end of the tunnel. I thank Christos Kozyrakis and Mark Horowitz for the interest they have taken in my research and in providing me feedback on my conference presentations, my orals and this dissertation. I wish to thank Rick Reis and a variety of other professors at Stanford and beyond, who have motivated me to pursue a career in academia. But most of all, I | wish to thank Kunle Olukotun, my doctoral advisor, for being a continuing and unwavering support through the many twists and turns of the Ph.
Kunle has not only been an advisor, but also a friend, and I feel fortunate to have done my doctorate under an advisor whomI hold in such high regard. vii Table of Contents 1 Introduction and BackgrOUunnd. -- -« -- + + xn cknHHnHnnH HnH nHngEg 1 1.1 Evolution of HardWare. sàn HH HH HH TH HH HH tàng 3 1.
Increasing difficulties of hardware design. | Methods for reducing hardware design cornpÏ€XIẨV.2 Design of Parallel SOẨYWATG.- TT Hà HH HH TH TH Hết 6 1. Granularities of paraÏleliSim. -- << HT TH 9n nh 7 1.
Ability to automate paralle]1Zat1O. Challenges to extracting parallelism. su kén HH nhu 10 1. Approaches to parallelization of appÏiCatiOnS.
Contributions of This Dissertation over Related Research. Objectives and approach.-- -- «+ + kh HH HH HH tưng ngâm 18 1.- sọ TH HH Họ HH nHkt 20 1. Measurement and sampling strategy. esses ce ngàng 21 1.
sọ HH HH TT gu Hit 23 2 Thread-Level Speculation (TT S).---- vn ng HT ki 26 2.1 Ideal TLS SYSt€mS. HH HT HH Ki He 26 2.2 Practical implementations of TLS SYSẦ€TNS.- ng HH HH ky 30 2.3 Performance limiters of TLS SVS€INS. Gì HT HHnKkHưct 33 2. Primary performance ÌiTtIf€FS.- sọ TH HH ng nh như 34 2.
Secondary performance limiters .- - cá vn ng nà 4l 2. Measuring and understanding performance losses .4 TLS CMP hardware simulaf€d. sàng HH TH HH Hàn 47 Manual Programming With TLS .1 Parallel programming process using TLS. --- << HH HH TH HT HH HH 57 3.
Heap Sort Exarmple.- -s «su nh Hư ch tr 58 3. — Parallelizing with TLS.- 2-5 + SH TH TH HH nh 62 3. Ease of TLS Parallelization. Performance of TLS Paralle]izatiOH.
Optimizing TLS PerÍOrmance. Complex Value PrediCtIO'. Algorithm AdJustme€ni(.- - 5 ch nh TH HT HH gàng 73 3. Additional Automatic Optirn1zatiO'S.- ác se SH ng, 75 3.
-- -- ch nh ng 76 4 Manual TLS Parallelization of Whole ApplÌicatiOnS.1 SPEC2000, benchmark selection and execution sampÏing. cà HH HH TH Hi ng ng nh thiệp 86 4. LH HH HH Hà TH HH kh 89 4. nàn TH ky kế 90 4.- ch HT HH nh tre 92 4.
HH HH HH HH ng rh rcư 94 4.- -Ă- ác HH ng ng ng 95 4. ch ng nh nen 59k 96 4.< SH HH KH ch 97 4.3 Performance-related ODs€rVAfIOTS.-- Gà TH TH HH HH Hư 100 4.4 Additional simulator r€SUÏ{S. - ĂS HH TH HT nh nHtt 103 4.5 Programmer effort r€QUIT€Ở.- - Ăn nh nh ng 106 5 Observations and ConcÏUS1OTS.-- --- << HT nu ngu ch 110 5.1 Hindrances to TLS paralle]1ZafIOTI.- - Ăn vn ng như kt 110 5.2 TLS-friendly uniprocessor DrOBTaInTHInE,.- -- Gà HH kt 112 =. HH HT nh Km TT nh Ti nh tà EEE EE ERE nh ES 123 List of Tables Table 2-1: Memory system specifications .ccecseseeeesceseeeecesecsececeeseesseeenerseeneetaeenaees 48 Table 2-2: Loop-only TLS oVerh€aÌS.
--- --- 5 + HH Hà ng ng 49 Table 4-1: Benchmarks comprising SPEC CPU2000. 83 Table 4-2: Source code lengths of the SPEC CPU2000 benchmarks selected. 84 Table 4-3: Code transfOTf4f1OfAS. - TH TH TH HT Tu nu ch ta S6 Table 4-4: Speedup resulting from each additional transformatfIon.
--‹- «se s«+2 88 Table 4-5: Speculative thread lengths, regions and COV€TA. 104 Table 4-6: Breakdown of parallelized execution times .- cà Sex 105 Table 4-7: Lines of code added to parallelize appÌicatiOns.- óc s« se 107 xi List of Figures Figure 2-1: Thread-level speCuÏ4tiOT. - Án TH HT HH nh kh 28 Figure 2-2: Hydra chip mulfIDTOC€SSOT. - Gà TH ng nh nh He 47 Figure 3-1: Organization of the heap aT†TAy.
-á- ác HH HT ng Tu ng ghe 58 Figure 3-2: Top node removal and update of the heap .-- 5 àcnsseeseeerre 60 Figure 3-3: Code for top node removal and heap update.- --- «cty 61 Figure 3-4: Performance of incremental OpITT1ZAf1OTNS. án ngư 70 Figure 3-5: Original code with independent tasks sana, ". 77 Figure 3-6: Speculatively pipelined code ready for loop-only TLS .~- 77 Figure 4-1: Execution pattern and violations of 177.- co teen 91 Figure 4-2: Thread formulation for 1§§. ng ng HH ngư 93 Figure 4-3: Whole application speedups under various memory and TLS models.
103 Figure 5-1: Good and bad thread length sequences. teeeeeeseseeseeesenetseteneeeeeteeeeees 112 xii 1 Introduction and Background Workloads run on modern computer systems exhibit a large degree of inherent parallelism, which means that significant portions of the workloads can be executed concurrently. Computers can greatly improve their computational performance by exploiting inherent parallelism, which often exists at many different levels. At one extreme, instruction-level parallelism (ILP) occurs between the individual computer instructions which were intended to be executed sequentially.
At the other extreme, process-level parallelism allows multi-tasking operating systems to execute separate, possibly unrelated instruction streams on the same computer hardware at different times, thereby tolerating latency and allowing more efficient use of a computer system's resources. Between these extremes lie various forms of thread-level parallelism (TLP).