A Study of Imbalanced Data in Classification: The Case of Credit

Field

Statistics

Posted by

Anonymous

Category

Doctoral Dissertation

2024

Detailed table of contents

STATEMENT OF AUTHENTICATION

ACKNOWLEDGMENT

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

TÓM TẮT (ABSTRACT IN VIETNAMESE)

1. INTRODUCTION

1.1. Overview of imbalanced data in classification

1.2. Motivations

2. LITERATURE REVIEW OF IMBALANCED DATA

2.1. Imbalanced data in classification

2.1.1. Description of imbalanced data

2.1.2. Obstacles in imbalanced classification

2.1.3. Categories of imbalanced data

2.2. Performance measures for imbalanced data

2.2.1. Performance measures for labeled outputs

2.2.2. Performance measures for scored outputs

2.2.2.1. Area under the Receiver Operating Characteristics Curve
2.2.2.2. Kolmogorov-Smirnov statistic

2.2.3. Conclusion of performance measures in imbalanced classification

2.3. Approaches to imbalanced classification

2.3.1. Algorithm-level approach

2.3.1.1. Modifying the current classifier algorithms
2.3.1.2. Cost-sensitive learning
2.3.1.3. Comments on algorithm-level approach

2.3.2. Data-level approach

2.3.2.1. Under-sampling method
2.3.2.2. Over-sampling method
2.3.2.4. Comments on data-level approach

2.3.3. Ensemble-based approach

2.3.3.1. Integration of algorithm-level method and ensemble classifier algorithm
2.3.3.2. Integration of data-level method and ensemble classifier algorithm
2.3.3.3. Comments on ensemble-based approach

2.3.4. Conclusions of approaches to imbalanced data

2.4. Meaning of credit scoring

2.5. Inputs for credit scoring models

2.6. Interpretability of credit scoring models

2.7. Approaches to imbalanced data in credit scoring

2.8. Recent credit scoring ensemble models

3. IMBALANCED DATA IN CREDIT SCORING

3.1. Classifiers for credit scoring

3.1.1. Lasso-Logistic regression

3.1.2. Support vector machine

3.1.3. Artificial neural network

3.1.4. Heterogeneous ensemble classifiers

3.1.5. Homogeneous ensemble classifiers

3.1.6. Conclusions of statistical models for credit scoring

3.2. The proposed credit scoring ensemble model based on Decision tree

3.2.1. The proposed algorithms

3.2.1.1. Algorithm for balancing data - OUS(B) algorithm

3.2.2. Algorithm for constructing ensemble classifier - DTE(B) algorithm

3.2.3. Empirical data sets

3.2.4. The optimal Decision tree ensemble classifier

3.2.5. Performance of the proposed model on the Vietnamese data sets

3.2.6. Performance of the proposed model on the public data sets

3.2.7. Conclusions of the proposed credit scoring ensemble model based on Decision tree

3.3. The proposed algorithm for imbalanced and overlapping data

3.3.1. The proposed algorithms

3.3.1.1. Algorithm for dealing with noise, overlapping, and imbalanced data
3.3.1.2. Algorithm for constructing ensemble model

3.3.2. Empirical data sets

3.3.2.1. Computation protocol of the Lasso Logistic ensemble
3.3.2.2. Computation protocol of the Decision tree ensemble

3.3.3. The optimal ensemble classifier

3.3.4. Performance of LLE(B)

3.3.5. Performance of DTE(B)

3.3.6. Conclusions of the proposed technique

4. A MODIFICATION OF LOGISTIC REGRESSION WITH IMBALANCED DATA

4.1. The proposed works

4.1.1. The modification of the cross-validation procedure

4.1.2. The modification of Logistic regression

4.2. Weighted likelihood estimation (WLE)

4.3. Penalized likelihood regression (PLR)

4.4. Empirical data sets

4.5. Discussions and Conclusions

4.5.1. Summary of contributions

4.5.2. The interpretable credit scoring ensemble classifier

4.5.3. The technique for imbalanced data, noise, and overlapping samples

4.5.4. The modification of Logistic regression

4.5.5. Limitations and suggestions for further research

LIST OF PUBLICATIONS

REFERENCES

Appendices

A. Distance functions

B. Pseudo-code of popular ensemble classifiers

C. Empirical data sets

C.1. German credit data set (GER)

C.2. Vietnamese 1 data set (VN1)

C.3. Vietnamese 2 data set (VN2)

C.4. Taiwanese credit data set (TAI)

C.5. Bank personal loan data set (BANK)

C.6. Hepatitis C patients data set (HEPA)

C.7. The Loan schema data from Lending Club (US)

C.8. Vietnamese 3 data set (VN3)

C.9. Australian credit data set (AUS)

C.10. Credit risk data set (Credit 1)

C.11. Credit card data set (Credit 2)

C.12. Credit default data set (Credit 3)

C.13. Vietnamese 4 data set (VN4)

Imbalanced data in classification: a case study of credit scoring
