A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data

Research document (Part1) combining theory and practice, intended to support research and practical application.

University

Princeton University

Field of study

Political Science

Uploaded by

Anonymous

Document type

thesis

1997


Detailed table of contents

Preface

Part I. Introduction

1. Qualitative Overview

1.1. The Necessity of Ecological Inferences

1.5. The Method

2. Formal Statement of the Problem

Part II. Catalog of Problems to Fix

3. Aggregation Problems

3.1. Goodman’s Regression: A Definition

3.2. The Indeterminacy Problem

3.3. The Grouping Problem

3.4. Equivalence of the Grouping and Indeterminacy Problems

3.5. A Concluding Definition

4. Non-Aggregation Problems

4.1. Goodman Regression Model Problems

4.2. Applying Goodman’s Regression in 2 × 3 Tables

4.3. Double Regression Problems

4.4. Concluding Remarks

Part III. The Proposed Solution

5. The Data: Generalizing the Method of Bounds

5.1. Homogeneous Precincts: No Uncertainty

5.2. Heterogeneous Precincts: Upper and Lower Bounds

5.2.1. Precinct-Level Quantities of Interest

5.2.2. District-Level Quantities of Interest

5.3. An Easy Visual Method for Computing Bounds

6. The Model

6.1. The Basic Model

6.2.1. Observable Implications of Model Parameters

6.2.2. Parameterizing the Truncated Bivariate Normal

6.2.3. Computing 2p Parameters from Only p Observations

6.2.4. Connections to the Statistics of Medical and Seismic Imaging

6.2.5. Would a Model of Individual-Level Choices Help?

7. Preliminary Estimation

7.2. The Likelihood Function

7.5. Summarizing Information about Estimated Parameters

8. Calculating Quantities of Interest

8.1. Simulation Is Easier than Analytical Derivation

8.1.1. Definitions and Examples

8.1.2. Simulation for Ecological Inference

8.2. Precinct-Level Quantities

8.3. District-Level Quantities

8.4. Quantities of Interest from Larger Tables

8.4.1. A Multiple Imputation Approach

8.4.2. An Approach Related to Double Regression

8.5. Other Quantities of Interest

9. Model Extensions

9.1. What Can Go Wrong?

9.1.2. Incorrect Distributional Assumptions

9.2. Avoiding Aggregation Bias

9.2.1. Using External Information

9.2.2. Unconditional Estimation: Xi as a Covariate

9.2.3. Tradeoffs and Priors for the Extended Model

9.2.4. Ex Post Diagnostics

9.3. Avoiding Distributional Problems

9.3.2. A Nonparametric Approach

Part IV. Verification

10. A Typical Application Described in Detail: Voter Registration by Race

10.3. Computing Quantities of Interest

10.3.3. Other Quantities of Interest

11. Robustness to Aggregation Bias: Poverty Status by Sex

11.1. Data and Notation

11.2. Verifying the Existence of Aggregation Bias

11.3. Fitting the Data

11.4. Empirical Results

12. Estimation without Information: Black Registration in Kentucky

12.3. Fitting the Data

12.4. Empirical Results

13. Classic Ecological Inferences

13.2. Black Literacy in 1910

Part V. Generalizations and Concluding Suggestions

14. Non-Ecological Aggregation Problems

14.1. The Geographer’s Modifiable Areal Unit Problem

14.1.1. The Problem with the Problem

14.1.2. Ecological Inference as a Solution to the Modifiable Areal Unit Problem

14.2. The Statistical Problem of Combining Survey and Aggregate Data

14.3. The Econometric Problem of Aggregating Continuous Variables

14.4. Concluding Remarks on Related Aggregation Research

15. Ecological Inference in Larger Tables

15.1. An Intuitive Approach

15.2. Notation for a General Approach

15.4. The Statistical Model

15.6. Calculating the Quantities of Interest

15.7. Concluding Suggestions

16. A Concluding Checklist

Part VI. Appendices

A Proof That All Discrepancies Are Equivalent

B Parameter Bounds

B.2 Heterogeneous Precincts: β’s and θ’s

B.3 Heterogeneous Precincts: λi’s

C Conditional Posterior Distribution

C.1 Using Bayes Theorem

C.2 Using Properties of Normal Distributions

D The Likelihood Function

E The Details of Nonparametric Estimation

F Computational Issues

Glossary of Symbols

References

Index

Summary

I. Introduction to Solving the Ecological Inference Problem

The ecological inference problem is a long-standing one in sociological and political research. It concerns using aggregate data to draw inferences about individual behavior when detailed individual-level information is missing. Reconstructing individual behavior from aggregate data is a major challenge, especially when individual-level data are unavailable. Gary King proposes a method for this problem that helps researchers draw more accurate conclusions from aggregate data.
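To make the problem concrete, it may help to write down the accounting identity that links the observed aggregate quantities to the unobserved individual-level ones in a single precinct. The symbols below follow the β and Xi notation that appears in the table of contents; the group labels (b and w) and the numbers in the example are assumptions chosen only for illustration.

```latex
% X_i : fraction of precinct i's population in group b        (observed)
% T_i : fraction of precinct i's population with the outcome,
%       for example turnout                                   (observed)
% \beta_i^b, \beta_i^w : outcome rates inside groups b and w  (unobserved)
\begin{align*}
  T_i &= \beta_i^b X_i + \beta_i^w (1 - X_i)
\end{align*}

% Illustrative numbers: with X_i = 0.4 and T_i = 0.5, very different
% individual-level rates reproduce exactly the same aggregate data.
\begin{align*}
  (\beta_i^b, \beta_i^w) = (0.5, 0.5): &\quad 0.5(0.4) + 0.5(0.6) = 0.5 \\
  (\beta_i^b, \beta_i^w) = (0.2, 0.7): &\quad 0.2(0.4) + 0.7(0.6) = 0.5 \\
  (\beta_i^b, \beta_i^w) = (0.8, 0.3): &\quad 0.8(0.4) + 0.3(0.6) = 0.5
\end{align*}
```

One equation with two unknowns per precinct is the reason aggregate data alone cannot pin down individual behavior without further information or assumptions.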

1.1. The Importance of Ecological Inference

Ecological inference is an important tool in political and social research. It lets researchers learn about the behavior of groups of individuals from aggregate data, which is essential in studies where individual-level data are not available.

1.2. Research Objective

The objective of this research is to find an effective method for reconstructing individual behavior from aggregate data and thereby address the ecological inference problem. Doing so improves the accuracy of existing studies and opens new directions for future research.

II. Problems and Challenges in Ecological Inference

One of the biggest challenges in ecological inference is the inaccuracy of inferences drawn from aggregate data. Problems such as inaccurate data analysis, sample bias, and unmeasurable factors can all lead to misleading results. King shows that understanding these problems is essential to developing more accurate methods.

2.1. Data Analysis Problems

Data analysis problems include sample bias and unmeasurable factors. They can lead to wrong conclusions about individual behavior when those conclusions are drawn from aggregate data.
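A small numerical sketch can show how such wrong conclusions arise. It uses Goodman's regression, which the table of contents lists among the aggregation problems, fitted to made-up precinct data. The function name, the data, and the assumption of equal-sized precincts are all hypothetical; this is not code from the book.

```python
# Illustrative sketch only: data are invented and names are hypothetical.

def goodman_regression(x, t):
    """Least-squares fit of t_i = Bb * x_i + Bw * (1 - x_i), no intercept.

    x: fraction of each precinct belonging to group b
    t: fraction of each precinct showing the outcome (e.g. turnout)
    Returns (Bb, Bw), the estimated outcome rates of groups b and w.
    """
    z = [1.0 - xi for xi in x]
    # Sums needed for the 2x2 normal equations.
    sxx = sum(xi * xi for xi in x)
    szz = sum(zi * zi for zi in z)
    sxz = sum(xi * zi for xi, zi in zip(x, z))
    sxt = sum(xi * ti for xi, ti in zip(x, t))
    szt = sum(zi * ti for zi, ti in zip(z, t))
    det = sxx * szz - sxz * sxz
    if abs(det) < 1e-12:
        raise ValueError("x has no variation; the two rates are not identified")
    bb = (sxt * szz - szt * sxz) / det
    bw = (szt * sxx - sxt * sxz) / det
    return bb, bw


x = [0.1, 0.3, 0.5, 0.7, 0.9]  # group-b share in five equal-sized precincts

# Case 1: both groups behave the same way in every precinct.
t_const = [0.3 * xi + 0.7 * (1 - xi) for xi in x]

# Case 2: group b's rate rises with its share of the precinct, so the
# parameter is correlated with x (aggregation bias).
beta_b = [0.2 + 0.6 * xi for xi in x]
t_bias = [b * xi + 0.7 * (1 - xi) for b, xi in zip(beta_b, x)]

# True district-wide rate for group b, weighting precincts by group size.
true_bb = sum(b * xi for b, xi in zip(beta_b, x)) / sum(x)

print("constant rates:", goodman_regression(x, t_const))
print("correlated rates:", goodman_regression(x, t_bias), "true Bb:", true_bb)
```

In the first case the regression recovers the rates that generated the data. In the second, the estimate for group b is too high and the estimate for group w is too low, even though the aggregate data contain no measurement error.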

2.2. The Impact of Ecological Inference

Ecological inferences can have far-reaching effects because they inform political and social decisions. Understanding these effects helps researchers propose more effective solutions.

III. A Method for Solving the Ecological Inference Problem

Gary King developed a method that addresses the ecological inference problem by reconstructing individual behavior from aggregate data. The method uses statistical models to analyze the data and draw more accurate conclusions, which improves accuracy and reduces errors of inference.
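One ingredient named in the table of contents is the method of bounds: before any statistical model is fitted, the aggregate data already restrict each precinct's unknown rates to an interval. The sketch below computes those deterministic bounds for one precinct. The function name and the example values are hypothetical, and the sketch covers only heterogeneous precincts (0 < x < 1).

```python
# Illustrative sketch only: names and numbers are assumptions.

def method_of_bounds(x, t):
    """Deterministic bounds on the group rates in one precinct.

    x: fraction of the precinct in group b, strictly between 0 and 1
    t: fraction of the precinct showing the outcome, between 0 and 1
    Returns ((lower_b, upper_b), (lower_w, upper_w)).
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must be strictly between 0 and 1")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    # Group b's rate is lowest when group w contributes as much as it can,
    # and highest when group w contributes nothing (capped at 1).
    lower_b = max(0.0, (t - (1.0 - x)) / x)
    upper_b = min(1.0, t / x)
    # The same argument with the roles of the two groups swapped.
    lower_w = max(0.0, (t - x) / (1.0 - x))
    upper_w = min(1.0, t / (1.0 - x))
    return (lower_b, upper_b), (lower_w, upper_w)


# A precinct that is 80% group b with 50% turnout: group b's rate is
# narrowly bounded, while group w's rate is not restricted at all.
print(method_of_bounds(0.8, 0.5))

# A precinct that is 10% group b with 50% turnout: the situation reverses.
print(method_of_bounds(0.1, 0.5))
```

The bounds are tight for whichever group dominates the precinct and wide for the other, which is why they vary so much from one precinct to the next.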

3.1. A Model of Individual Behavior

The model of individual behavior is built from aggregate data and lets researchers infer how individuals in different groups behave. This improves the accuracy of research results.

3.2. Analyzing Aggregate Data

Analyzing aggregate data is an important part of King’s method. It lets researchers identify the factors that influence individual behavior from aggregate data and so reach more accurate conclusions.

IV. Practical Applications of the Method

King’s method has been applied in many fields, from political science to sociology. Reconstructing individual behavior from aggregate data has helped researchers reach more accurate conclusions about the behavior of groups of individuals, which can in turn inform public policy and political decisions.

4.1. Research on Voter Registration

A typical application is the study of voter registration by race. It shows that King’s method can help identify the factors that affect registration rates in different groups.

4.2. Analyzing Poverty Status

The analysis of poverty status by sex is another important application. The method has helped identify the factors associated with poverty in different groups, which supports more effective solutions.

V. Conclusion and the Future of the Research

Research on ecological inference and on King’s method opens many opportunities for future work. Reconstructing individual behavior from aggregate data improves the accuracy of studies and suggests new directions in other fields as well. This line of research promises to be valuable to both researchers and policymakers.

5.1. New Research Directions

New research will focus on improving methods of data analysis in order to raise the accuracy of research results. This will help researchers reach more accurate conclusions about individual behavior.

5.2. Impact on Public Policy

This research can influence public policy by helping policymakers base their decisions on more accurate data, which in turn can contribute to improving people’s quality of life.

27/07/2025


Excerpt from the document

Gary King: A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data is published by Princeton University Press and copyrighted, © 1997, Princeton University Press. All rights reserved. This text may be used and shared in accordance with the fair-use provisions of US copyright law, and it may be archived and redistributed in electronic form, provided that this notice is carried, Princeton University Press is notified, the entire original is distributed without modification, and no fee is charged for access. Archiving, redistribution, or republication of this text on other terms, in any medium, requires the consent of Princeton University Press.

For more information, send e-mail to permissions@pupress.edu

A Solution to the Ecological Inference Problem

Reconstructing Individual Behavior from Aggregate Data

Gary King

Princeton University Press

Princeton, New Jersey

Copyright © 1997 by Princeton University Press

Published by Princeton University Press, 41 William Street, Princeton, New Jersey 08540

In the United Kingdom: Princeton University Press, Chichester, West Sussex

All Rights Reserved

Library of Congress Cataloging-in-Publication Data

King, Gary. A solution to the ecological inference problem: reconstructing individual behavior from aggregate data / Gary King. Includes bibliographical references and index. Political science—Statistical methods. 072—dc20 9632986 CIP

This book has been composed in Palatino

Princeton University Press books are printed on acid-free paper and meet the guidelines for permanence and durability of the Committee on Production Guidelines for Book Longevity of the Council on Library Resources

Printed in the United States of America by Princeton Academic Press

For Ella Michelle King

Figures

1.1 Model Verification: Voter Turnout among African Americans in Louisiana Precincts

1.2 Non-Minority Turnout in New Jersey Cities and Towns

3.1 How a Correlation between the Parameters and Xi Induces Bias

4.1 Scatter Plot of Precincts in Marion County, Indiana: Voter Turnout for the U.S. Senate by Fraction Black, 1990

4.2 Evaluating Population-Based Weights

4.3 Typically Massive Heteroskedasticity in Voting Data

5.1 A Data Summary Convenient for Statistical Modeling

5.2 Image Plots of Upper and Lower Bounds on β_i^b

5.3 Image Plots of Upper and Lower Bounds on β_i^w

5.4 Image Plots of Width of Bounds

5.5 A Scattercross Graph of Voter Turnout by Fraction Hispanic

6.1 Features of the Data Generated by Each Parameter

6.2 Truncated Bivariate Normal Distributions

6.4 Truncated Bivariate Normal Surface Plot

7.1 Verifying Individual-Level Distributional Assumptions with Aggregate Data

7.2 Observable Implications for Sample Parameter Values

7.3 Likelihood Contour Plots

8.1 Posterior Distributions of Precinct Parameters β_i^b

8.2 Support of the Joint Distribution of θ_i^b and β_i^b with Bounds Specified for Drawing λ_i^b

9.1 The Worst of Aggregation Bias: Same Truth, Different Observable Implications

9.2 The Worst of Distributional Violations: Different True Parameters, Same Observable Implications

9.3 Conclusive Evidence of Aggregation Bias from Aggregate Data

9.5 Controlling for Aggregation Bias

9.6 Extended Model Tradeoffs

9.7 A Tomography Plot with Evidence of Multiple Modes

9.8 Building a Nonparametric Density Estimate

9.9 Nonparametric Density Estimate for a Difficult Case

10.1 A Scattercross Graph for Southern Counties, 1968

10.2 Tomography Plot of Southern Race Data with Maximum Likelihood Contours

10.3 Scatter Plot with Maximum Likelihood Results Superimposed

10.4 Posterior Distribution of the Aggregate Quantities of Interest

10.5 Comparing Estimates to the Truth at the County Level

10.7 Verifying Uncertainty Estimates

10.8 275 Lines Fit to 275 Points

11.1 South Carolina Tomography Plot

11.2 Posterior Distributions of the State-Wide Fraction in Poverty by Sex in South Carolina

11.3 Fractions in Poverty for 3,187 South Carolina Block Groups

11.4 Percentiles at Which True Values Fall

12.1 A Scattercross Graph of Fraction Black by Fraction Registered

12.2 Tomography Plot with Parametric Contours and a Nonparametric Surface Plot

12.3 Posterior Distributions of the State-Wide Fraction of Blacks and Whites Registered

12.4 Fractions Registered at the County Level

12.5 80% Posterior Confidence Intervals by True Values

13.1 Fulton County Voter Transitions

13.2 Aggregation Bias in Fulton County Data

13.3 Fulton County Tomography Plot

13.4 Comparing Voter Transition Rate Estimates with the Truth in Fulton County

13.5 Alternative Fits to Literacy by Race Data

13.6 Black Literacy Tomography Plot and True Points

13.7 Comparing Estimates to the County-Level Truth in Literacy by Race Data

Tables

1.1 The Ecological Inference Problem at the District Level

1.2 The Ecological Inference Problem at the Precinct Level

1.3 Sample Ecological Inferences

2.1 Basic Notation for Precinct i

2.2 Alternative Notation for Precinct i

2.3 Simplified Notation for Precinct i

4.1 Comparing Goodman Model Parameters to the Parameters of Interest in the 2 × 3 Table

9.1 Consequences of Spatial Autocorrelation: Monte Carlo Evidence

9.2 Consequences of Distributional Misspecification: Monte Carlo Evidence

10.1 Maximum Likelihood Estimates

10.2 Reparameterized Maximum Likelihood Estimates

10.3 Verifying Estimates of ψ

11.1 Evidence of Aggregation Bias in South Carolina

11.2 Goodman Model Estimates: Poverty by Sex

12.1 Evidence of Aggregation Bias in Kentucky

12.2 80% Confidence Intervals for ψ̆ and ψ

15.1 Example of a Larger Table

15.2 Notation for a Large Table

Preface

In this book, I present a solution to the ecological inference problem: a method of inferring individual behavior from aggregate data that works in practice. Ecological inference is the process of using aggregate (i.e., “ecological”) data to infer discrete individual-level relationships of interest when individual-level data are not available. Existing methods of ecological inference generate very inaccurate conclusions about the empirical world—which thus gives rise to the ecological inference problem.

Most scholars who analyze aggregate data routinely encounter some form of this problem. The ecological inference problem has been among the longest standing, hitherto unsolved problems in quantitative social science. It was originally raised over seventy-five years ago as the first statistical problem in the nascent discipline of political science, and it has held back research agendas in most of its empirical subfields. Ecological inferences are required in political science research when individual-level surveys are unavailable (for example, local or comparative electoral politics), unreliable (racial politics), insufficient (political geography), or infeasible (political history).

They are also required in numerous areas of major significance in public policy (for example, for applying the Voting Rights Act) and other academic disciplines, ranging from epidemiology and marketing to sociology and quantitative history.[1] Because the ecological inference problem is caused by the lack of individual-level information, no method of ecological inference, including that introduced in this book, will produce precisely accurate results in every instance. However, potential difficulties are minimized here by models that include more available information, diagnostics to evaluate when assumptions need to be modified, and realistic uncertainty estimates for all quantities of interest. For political methodologists, many opportunities remain, and I hope the results reported here lead to continued research into and further improvements in the methods of ecological inference. But most importantly, the solution to the ecological inference problem presented here is designed so that empirical researchers can investigate substantive questions that have heretofore proved intractable. Perhaps it will also lead to new theories and empirical research in areas where analysts have feared to tread due to the lack of reliable ecological methods or individual-level data.

[1] What is “ecological” about the aggregate data from which individual behavior is to be inferred? The name has been used at least since the late 1800s and stems from the word ecology, the science of the interrelationship of living things and their environments. Statistical measures taken at the level of the environment, such as summaries of geographic areas or other aggregate units, are widely known as ecological data. Ecological inference is the process of using ecological data to learn about the behavior of individuals within these aggregates.


Topics

computational social science

Statistical Inference and Methodology

Data Aggregation and Disaggregation Methods