Gary King: A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data is published by Princeton University Press and copyrighted, 1997, Princeton University Press. All rights reserved. This text may be used and shared in accordance with the fair-use provisions of US copyright law, and it may be archived and redistributed in electronic form, provided that this notice is carried, Princeton University Press is notified, the entire original is distributed without modification, and no fee is charged for access. Archiving, redistribution, or republication of this text on other terms, in any medium, requires the consent of Princeton University Press.
For more information, send e-mail to permissions@pupress.edu

A Solution to the Ecological Inference Problem
Reconstructing Individual Behavior from Aggregate Data
Gary King
PRINCETON UNIVERSITY PRESS
PRINCETON, NEW JERSEY

Copyright © 1997 by Princeton University Press
Published by Princeton University Press, 41 William Street, Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, Chichester, West Sussex
All Rights Reserved

Library of Congress Cataloging-in-Publication Data
King, Gary.
A solution to the ecological inference problem: reconstructing individual behavior from aggregate data / Gary King.
Includes bibliographical references and index.
Political science—Statistical methods.072—dc20 9632986 CIP

This book has been composed in Palatino
Princeton University Press books are printed on acid-free paper and meet the guidelines for permanence and durability of the Committee on Production Guidelines for Book Longevity of the Council on Library Resources
Printed in the United States of America by Princeton Academic Press

For Ella Michelle King

Contents

List of Figures
List of Tables
Preface

Part I: Introduction
1 Qualitative Overview
1.1 The Necessity of Ecological Inferences
1.5 The Method
2 Formal Statement of the Problem

Part II: Catalog of Problems to Fix
3 Aggregation Problems
3.1 Goodman’s Regression: A Definition
3.2 The Indeterminacy Problem
3.3 The Grouping Problem
3.4 Equivalence of the Grouping and Indeterminacy Problems
3.5 A Concluding Definition
4 Non-Aggregation Problems
4.1 Goodman Regression Model Problems
4.2 Applying Goodman’s Regression in 2 × 3 Tables
4.3 Double Regression Problems
4.4 Concluding Remarks

Part III: The Proposed Solution
5 The Data: Generalizing the Method of Bounds
5.1 Homogeneous Precincts: No Uncertainty
5.2 Heterogeneous Precincts: Upper and Lower Bounds
5.1 Precinct-Level Quantities of Interest
5.2 District-Level Quantities of Interest
5.3 An Easy Visual Method for Computing Bounds
6 The Model
6.1 The Basic Model
6.1 Observable Implications of Model Parameters
6.2 Parameterizing the Truncated Bivariate Normal
6.3 Computing 2p Parameters from Only p Observations
6.4 Connections to the Statistics of Medical and Seismic Imaging
6.5 Would a Model of Individual-Level Choices Help?
7 Preliminary Estimation
7.2 The Likelihood Function
7.5 Summarizing Information about Estimated Parameters
8 Calculating Quantities of Interest
8.1 Simulation Is Easier than Analytical Derivation
8.1 Definitions and Examples
8.2 Simulation for Ecological Inference
8.2 Precinct-Level Quantities
8.3 District-Level Quantities
8.4 Quantities of Interest from Larger Tables
8.1 A Multiple Imputation Approach
8.2 An Approach Related to Double Regression
8.5 Other Quantities of Interest
9 Model Extensions
9.1 What Can Go Wrong?
9.2 Incorrect Distributional Assumptions
9.2 Avoiding Aggregation Bias
9.1 Using External Information
9.2 Unconditional Estimation: X_i as a Covariate
9.3 Tradeoffs and Priors for the Extended Model
9.4 Ex Post Diagnostics
9.3 Avoiding Distributional Problems
9.2 A Nonparametric Approach

Part IV: Verification
10 A Typical Application Described in Detail: Voter Registration by Race
10.3 Computing Quantities of Interest
10.3 Other Quantities of Interest
11 Robustness to Aggregation Bias: Poverty Status by Sex
11.1 Data and Notation
11.2 Verifying the Existence of Aggregation Bias
11.3 Fitting the Data
11.4 Empirical Results
12 Estimation without Information: Black Registration in Kentucky
12.3 Fitting the Data
12.4 Empirical Results
13 Classic Ecological Inferences
13.2 Black Literacy in 1910

Part V: Generalizations and Concluding Suggestions
14 Non-Ecological Aggregation Problems
14.1 The Geographer’s Modifiable Areal Unit Problem
14.1 The Problem with the Problem
14.2 Ecological Inference as a Solution to the Modifiable Areal Unit Problem
14.2 The Statistical Problem of Combining Survey and Aggregate Data
14.3 The Econometric Problem of Aggregating Continuous Variables
14.4 Concluding Remarks on Related Aggregation Research
15 Ecological Inference in Larger Tables
15.1 An Intuitive Approach
15.2 Notation for a General Approach
15.4 The Statistical Model
15.6 Calculating the Quantities of Interest
15.7 Concluding Suggestions
16 A Concluding Checklist

Part VI: Appendices
A Proof That All Discrepancies Are Equivalent
B Parameter Bounds
B.2 Heterogeneous Precincts: β’s and θ’s
B.3 Heterogeneous Precincts: λ_i’s
C Conditional Posterior Distribution
C.1 Using Bayes Theorem
C.2 Using Properties of Normal Distributions
D The Likelihood Function
E The Details of Nonparametric Estimation
F Computational Issues

Glossary of Symbols
References
Index

Figures

1.1 Model Verification: Voter Turnout among African Americans in Louisiana Precincts
1.2 Non-Minority Turnout in New Jersey Cities and Towns
3.1 How a Correlation between the Parameters and X_i Induces Bias
4.1 Scatter Plot of Precincts in Marion County, Indiana: Voter Turnout for the U.S. Senate by Fraction Black, 1990
4.2 Evaluating Population-Based Weights
4.3 Typically Massive Heteroskedasticity in Voting Data
5.1 A Data Summary Convenient for Statistical Modeling
5.2 Image Plots of Upper and Lower Bounds on β_i^b
5.3 Image Plots of Upper and Lower Bounds on β_i^w
5.4 Image Plots of Width of Bounds
5.5 A Scattercross Graph of Voter Turnout by Fraction Hispanic
6.1 Features of the Data Generated by Each Parameter
6.2 Truncated Bivariate Normal Distributions
6.4 Truncated Bivariate Normal Surface Plot
7.1 Verifying Individual-Level Distributional Assumptions with Aggregate Data
7.2 Observable Implications for Sample Parameter Values
7.3 Likelihood Contour Plots
8.1 Posterior Distributions of Precinct Parameters β_i^b
8.2 Support of the Joint Distribution of θ_i^b and β_i^b with Bounds Specified for Drawing λ_i^b
9.1 The Worst of Aggregation Bias: Same Truth, Different Observable Implications
9.2 The Worst of Distributional Violations: Different True Parameters, Same Observable Implications
9.3 Conclusive Evidence of Aggregation Bias from Aggregate Data
9.5 Controlling for Aggregation Bias
9.6 Extended Model Tradeoffs
9.7 A Tomography Plot with Evidence of Multiple Modes
9.8 Building a Nonparametric Density Estimate
9.9 Nonparametric Density Estimate for a Difficult Case
10.1 A Scattercross Graph for Southern Counties, 1968
10.2 Tomography Plot of Southern Race Data with Maximum Likelihood Contours
10.3 Scatter Plot with Maximum Likelihood Results Superimposed
10.4 Posterior Distribution of the Aggregate Quantities of Interest
10.5 Comparing Estimates to the Truth at the County Level
10.7 Verifying Uncertainty Estimates
10.8 275 Lines Fit to 275 Points
11.1 South Carolina Tomography Plot
11.2 Posterior Distributions of the State-Wide Fraction in Poverty by Sex in South Carolina
11.3 Fractions in Poverty for 3,187 South Carolina Block Groups
11.4 Percentiles at Which True Values Fall
12.1 A Scattercross Graph of Fraction Black by Fraction Registered
12.2 Tomography Plot with Parametric Contours and a Nonparametric Surface Plot
12.3 Posterior Distributions of the State-Wide Fraction of Blacks and Whites Registered
12.4 Fractions Registered at the County Level
12.5 80% Posterior Confidence Intervals by True Values
13.1 Fulton County Voter Transitions
13.2 Aggregation Bias in Fulton County Data
13.3 Fulton County Tomography Plot
13.4 Comparing Voter Transition Rate Estimates with the Truth in Fulton County
13.5 Alternative Fits to Literacy by Race Data
13.6 Black Literacy Tomography Plot and True Points
13.7 Comparing Estimates to the County-Level Truth in Literacy by Race Data

Tables

1.1 The Ecological Inference Problem at the District Level
1.2 The Ecological Inference Problem at the Precinct Level
1.3 Sample Ecological Inferences
2.1 Basic Notation for Precinct i
2.2 Alternative Notation for Precinct i
2.3 Simplified Notation for Precinct i
4.1 Comparing Goodman Model Parameters to the Parameters of Interest in the 2 × 3 Table
9.1 Consequences of Spatial Autocorrelation: Monte Carlo Evidence
9.2 Consequences of Distributional Misspecification: Monte Carlo Evidence
10.1 Maximum Likelihood Estimates
10.2 Reparameterized Maximum Likelihood Estimates
10.3 Verifying Estimates of ψ
11.1 Evidence of Aggregation Bias in South Carolina
11.2 Goodman Model Estimates: Poverty by Sex
12.1 Evidence of Aggregation Bias in Kentucky
12.2 80% Confidence Intervals for ψ̆ and ψ
15.1 Example of a Larger Table
15.2 Notation for a Large Table

Preface

In this book, I present a solution to the ecological inference problem: a method of inferring individual behavior from aggregate data that works in practice. Ecological inference is the process of using aggregate (i.e., “ecological”) data to infer discrete individual-level relationships of interest when individual-level data are not available. Existing methods of ecological inference generate very inaccurate conclusions about the empirical world, which thus gives rise to the ecological inference problem.
Most scholars who analyze aggregate data routinely encounter some form of this problem. The ecological inference problem has been among the longest standing, hitherto unsolved problems in quantitative social science. It was originally raised over seventy-five years ago as the first statistical problem in the nascent discipline of political science, and it has held back research agendas in most of its empirical subfields. Ecological inferences are required in political science research when individual-level surveys are unavailable (for example, local or comparative electoral politics), unreliable (racial politics), insufficient (political geography), or infeasible (political history).
They are also required in numerous areas of major significance in public policy (for example, for applying the Voting Rights Act) and other academic disciplines, ranging from epidemiology and marketing to sociology and quantitative history.[1] Because the ecological inference problem is caused by the lack of individual-level information, no method of ecological inference, including that introduced in this book, will produce precisely accurate results in every instance. However, potential difficulties are minimized here by models that include more available information, diagnostics to evaluate when assumptions need to be modified, and realistic uncertainty estimates for all quantities of interest. For political methodologists, many opportunities remain, and I hope the results reported here lead to continued research into and further improvements in the methods of ecological inference. But most importantly, the solution to the ecological inference problem presented here is designed so that empirical researchers can investigate substantive questions that have heretofore proved intractable. Perhaps it will also lead to new theories and empirical research in areas where analysts have feared to tread due to the lack of reliable ecological methods or individual-level data.

[1] What is “ecological” about the aggregate data from which individual behavior is to be inferred? The name has been used at least since the late 1800s and stems from the word ecology, the science of the interrelationship of living things and their environments. Statistical measures taken at the level of the environment, such as summaries of geographic areas or other aggregate units, are widely known as ecological data. Ecological inference is the process of using ecological data to learn about the behavior of individuals within these aggregates.