OpenIntro Statistics
Third Edition

David M Diez
Quantitative Analyst
david@openintro.org

Christopher D Barr
Graduate Student
Yale School of Management
chris@openintro.org

Mine Çetinkaya-Rundel
Assistant Professor of the Practice
Department of Statistics
Duke University
mine@openintro.

Printing: August 1st, 2015.

This textbook is available under a Creative Commons license. Visit openintro.org for a free PDF, to download the textbook’s source files, or for more information about the license.

Contents

1 Introduction to data
    1.1 Case study: using stents to prevent strokes
    1.3 Overview of data collection principles
    1.4 Observational studies and sampling strategies
    1.6 Examining numerical data
    1.7 Considering categorical data
    1.8 Case study: gender discrimination (special topic)

    2.3 Sampling from a small population (special topic)

3 Distributions of random variables
    3.2 Evaluating the normal approximation
    3.5 More discrete distributions (special topic)

4 Foundations for inference
    4.1 Variability in estimates
    4.4 Examining the Central Limit Theorem
    4.5 Inference for other estimators

5 Inference for numerical data
    5.1 One-sample means with the t-distribution
    5.3 Difference of two means
    5.4 Power calculations for a difference of means (special topic)
    5.5 Comparing many means with ANOVA (special topic)

6 Inference for categorical data
    6.1 Inference for a single proportion
    6.2 Difference of two proportions
    6.3 Testing for goodness of fit using chi-square (special topic)
    6.4 Testing for independence in two-way tables (special topic)
    6.5 Small sample hypothesis testing for a proportion (special topic)

7 Introduction to linear regression
    7.1 Line fitting, residuals, and correlation
    7.2 Fitting a line by least squares regression
    7.3 Types of outliers in linear regression
    7.4 Inference for linear regression

8 Multiple and logistic regression
    8.1 Introduction to multiple regression
    8.3 Checking model assumptions using graphs
    8.4 Introduction to logistic regression

A End of chapter exercise solutions

B Distribution tables
    B.1 Normal Probability Table
    B.3 Chi-Square Probability Table

Preface

This book may be downloaded as a free PDF at openintro.
We hope readers will take away three ideas from this book in addition to forming a foundation of statistical thinking and methods. (1) Statistics is an applied field with a wide range of practical applications. (2) You don’t have to be a math guru to learn from real, interesting data. (3) Data are messy, and statistical tools are imperfect.
But, when you understand the strengths and weaknesses of these tools, you can use them to learn about the real world.

Textbook overview

The chapters of this book are as follows:

1. Introduction to data. Data structures, variables, summaries, graphics, and basic data collection techniques.
2. Probability. The basic principles of probability. An understanding of this chapter is not required for the main content in Chapters 3-8.
3. Distributions of random variables. Introduction to the normal model and other key distributions.
4. Foundations for inference. General ideas for statistical inference in the context of estimating the population mean.
5. Inference for numerical data. Inference for one or two sample means using the t-distribution, and also comparisons of many means using ANOVA.
6. Inference for categorical data. Inference for proportions using the normal and chi-square distributions, as well as simulation and randomization techniques.
7. Introduction to linear regression. An introduction to regression with two variables. Most of this chapter could be covered after Chapter 1.
8. Multiple and logistic regression. A light introduction to multiple regression and logistic regression for an accelerated course.

OpenIntro Statistics was written to allow flexibility in choosing and ordering course topics.
The material is divided into two pieces: main text and special topics. The main text has been structured to bring statistical inference and modeling closer to the front of a course. Special topics, labeled in the table of contents and in section titles, may be added to a course as they arise naturally in the curriculum.

Videos for sections and calculators

The icon indicates that a section or topic has a video overview readily available.
The icons are hyperlinked in the textbook PDF, and the videos may also be found at www.org/stat/videos.php

Examples, exercises, and appendices

Examples and Guided Practice throughout the textbook may be identified by their distinctive bullets:

Example 0.1 Large filled bullets signal the start of an example. Full solutions to examples are provided and may include an accompanying table or figure.

Guided Practice 0.2 Large empty bullets signal to readers that an exercise has been inserted into the text for additional practice and guidance. Students may find it useful to fill in the bullet after understanding or successfully completing the exercise. Solutions are provided for all Guided Practice in footnotes.[1]

There are exercises at the end of each chapter for practice or homework assignments. Odd-numbered exercise solutions are in Appendix A. Probability tables for the normal, t, and chi-square distributions are in Appendix B.

OpenIntro, online resources, and getting involved

OpenIntro is an organization focused on developing free and affordable education materials. OpenIntro Statistics is intended for introductory statistics courses at the college level. We offer another title, Advanced High School Statistics, for high school courses. We encourage anyone learning or teaching statistics to visit openintro.org and get involved. We also provide many free online resources, including free course software. Data sets for this textbook are available on the website and through a companion R package.[2] All of these resources are free and may be used with or without this textbook as a companion.

We value your feedback. If there is a particular component of the project you especially like or think needs improvement, we want to hear from you. You may find our contact information on the title page of this book or on the About section of openintro.

Acknowledgements

This project would not be possible without the passion and dedication of all those involved. The authors would like to thank the OpenIntro Staff for their involvement and ongoing contributions. We are also very grateful to the hundreds of students and instructors who have provided us with valuable feedback over the last several years.

[1] Full solutions are located down here in the footnote!
[2] Diez DM, Barr CD, Çetinkaya-Rundel M. openintro: OpenIntro data sets and supplement functions.com/OpenIntroOrg/openintro-r-package.
Chapter 1

Introduction to data

Scientists seek to answer questions using rigorous methods and careful observations. These observations – collected from the likes of field notes, surveys, and experiments – form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. It is helpful to put statistics in the context of a general process of investigation:

1. Identify a question or problem.
2. Collect relevant data on the topic.
3. Analyze the data.
4. Form a conclusion.

Statistics as a subject focuses on making stages 2-4 objective, rigorous, and efficient.
That is, statistics has three primary components: How best can we collect data? How should it be analyzed? And what can we infer from the analysis? The topics scientists investigate are as diverse as the questions they ask. However, many of these investigations can be addressed with a small number of data collection techniques, analytic tools, and fundamental concepts in statistical inference. This chapter provides a glimpse into these and other themes we will encounter throughout the rest of the book. We introduce the basic principles of each branch and learn some tools along the way.
We will encounter applications from other fields, some of which are not typically associated with science but nonetheless can benefit from statistical study.

1.1 Case study: using stents to prevent strokes

Section 1.1 introduces a classic challenge in statistics: evaluating the efficacy of a medical treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the text. The plan for now is simply to get a sense of the role statistics can play in practice.

In this section we will consider an experiment that studies effectiveness of stents in treating patients at risk of stroke.[1] Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death.

[1] Chimowitz MI, Lynn MJ, Derdeyn CP, et al. Stenting versus Aggressive Medical Therapy for Intracranial Arterial Stenosis. New England Journal of Medicine 365:993-1003.org/doi/full/10. NY Times article reporting on the study: www.com/2011/09/08/health/research/08stent.
Many doctors have hoped that there would be similar benefits for patients at risk of stroke. We start by writing the principal question the researchers hope to answer:

Does the use of stents reduce the risk of stroke?

The researchers who asked this question collected data on 451 at-risk patients. Each volunteer patient was randomly assigned to one of two groups:

Treatment group. Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help in lifestyle modification.

Control group. Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In this study, the control group provides a reference point against which we can measure the medical impact of stents in the treatment group.
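As a rough illustration of what random assignment into groups of 224 and 227 can look like, here is a minimal Python sketch. The text does not describe the procedure the researchers actually used, so the function name `assign_groups`, the patient ID numbering, and the seed are hypothetical choices made only for this example.

```python
import random

def assign_groups(patient_ids, n_treatment, seed=None):
    """Randomly split patient IDs into a treatment and a control group.

    Hypothetical helper for illustration; not the study's actual procedure.
    """
    rng = random.Random(seed)      # seeded generator so the split is repeatable
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)          # every ordering of patients is equally likely
    treatment = sorted(shuffled[:n_treatment])
    control = sorted(shuffled[n_treatment:])
    return treatment, control

# 451 patients with assumed IDs 1..451; 224 go to treatment, the rest to control.
patient_ids = range(1, 452)
treatment, control = assign_groups(patient_ids, n_treatment=224, seed=1)

print(len(treatment), len(control))  # 224 227
```

Because the split depends only on the shuffle, no patient characteristic can influence which group a patient lands in, which is the property that lets the control group serve as a fair reference point.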
Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment. The results of 5 patients are summarized in Table 1.1. Patient outcomes are recorded as “stroke” or “no event”, representing whether or not the patient had a stroke at the end of a time period.

Patient   group       0-30 days   0-365 days
1         treatment   no event    no event
2         treatment   stroke      stroke
3         treatment   no event    no event
...
450       control     no event    no event
451       control     no event    no event

Table 1.1: Results for five patients from the stent study.

Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once. Table 1.2 summarizes the raw data in a more helpful way. In this table, we can quickly see what happened over the entire study.
For instance, to identify the number of patients in the treatment group who had a stroke within 30 days, we look on the left side of the table at the intersection of the treatment and stroke: 33.

              0-30 days             0-365 days
              stroke   no event     stroke   no event
treatment     33       191          45       179
control       13       214          28       199
Total         46       405          73       378

Table 1.2: Descriptive statistics for the stent study.

Guided Practice 1.1 Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year. Using these two numbers, compute the proportion of patients in the treatment group who had a stroke by the end of their first year. (Please note: answers to all in-text exercises are provided using footnotes.)[2]

We can compute summary statistics from the table. A summary statistic is a single number summarizing a large amount of data.[3] For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment and control groups.

Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20.
Proportion who had a stroke in the control group: 28/227 = 0.12.
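The two proportions can be checked with a few lines of Python. This is only a sketch: the counts are the 0-365 day counts from Table 1.2, while the dictionary layout and the helper name `stroke_proportion` are assumptions made for this example.

```python
# One-year (0-365 days) counts taken from Table 1.2.
counts = {
    "treatment": {"stroke": 45, "no event": 179},
    "control": {"stroke": 28, "no event": 199},
}

def stroke_proportion(group_counts):
    """Share of patients in a group who had a stroke (hypothetical helper)."""
    total = sum(group_counts.values())   # group size: 224 or 227
    return group_counts["stroke"] / total

p_treatment = stroke_proportion(counts["treatment"])
p_control = stroke_proportion(counts["control"])
difference = p_treatment - p_control

print(f"treatment:  {p_treatment:.2f}")  # 0.20
print(f"control:    {p_control:.2f}")    # 0.12
print(f"difference: {difference:.2f}")   # 0.08
```

Each proportion is one summary statistic: a single number that stands in for an entire column of per-patient outcomes.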
These two summary statistics are useful in looking for differences in the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the groups? This second question is subtle. Suppose you flip a coin 100 times.