Imbalanced Data in Classification: A Case Study of Credit Scoring

A case study of imbalanced data in classification, focusing on credit scoring and on techniques that improve classifier performance.

University

University of Economics Ho Chi Minh City

Major

Statistics

Document type

Doctoral Dissertation

2024

173

Detailed table of contents

STATEMENT OF AUTHENTICATION

ACKNOWLEDGMENT

LIST OF ABBREVIATIONS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

TÓM TẮT

1. INTRODUCTION

1.1. Overview of imbalanced data in classification

1.2. Motivations

2. LITERATURE REVIEW OF IMBALANCED DATA

2.1. Imbalanced data in classification

2.1.1. Description of imbalanced data

2.1.2. Obstacles in imbalanced classification

2.1.3. Categories of imbalanced data

2.2. Performance measures for imbalanced data

2.2.1. Performance measures for labeled outputs

2.2.2. Performance measures for scored outputs

2.2.2.1. Area under the Receiver Operating Characteristics Curve

2.2.2.2. Kolmogorov-Smirnov statistic

2.2.3. Conclusion of performance measures in imbalanced classification

2.3. Approaches to imbalanced classification

2.3.1. Algorithm-level approach

2.3.1.1. Modifying the current classifier algorithms

2.3.1.2. Cost-sensitive learning

2.3.1.3. Comments on algorithm-level approach

2.3.2. Data-level approach

2.3.2.1. Under-sampling method

2.3.2.2. Over-sampling method

2.3.2.4. Comments on data-level approach

2.3.3. Ensemble-based approach

2.3.3.1. Integration of algorithm-level method and ensemble classifier algorithm

2.3.3.2. Integration of data-level method and ensemble classifier algorithm

2.3.3.3. Comments on ensemble-based approach

2.3.4. Conclusions of approaches to imbalanced data

2.4. Meaning of credit scoring

2.5. Inputs for credit scoring models

2.6. Interpretability of credit scoring models

2.7. Approaches to imbalanced data in credit scoring

2.8. Recent credit scoring ensemble models

3. IMBALANCED DATA IN CREDIT SCORING

3.1. Classifiers for credit scoring

3.1.1. Lasso-Logistic regression

3.1.2. Support vector machine

3.1.3. Artificial neural network

3.1.4. Heterogeneous ensemble classifiers

3.1.5. Homogeneous ensemble classifiers

3.1.6. Conclusions of statistical models for credit scoring

3.2. The proposed credit scoring ensemble model based on Decision tree

3.2.1. The proposed algorithms

3.2.1.1. Algorithm for balancing data - OUS(B) algorithm

3.2.2. Algorithm for constructing ensemble classifier - DTE(B) algorithm

3.2.3. Empirical data sets

3.2.4. The optimal Decision tree ensemble classifier

3.2.5. Performance of the proposed model on the Vietnamese data sets

3.2.6. Performance of the proposed model on the public data sets

3.2.7. Conclusions of the proposed credit scoring ensemble model based on Decision tree

3.3. The proposed algorithm for imbalanced and overlapping data

3.3.1. The proposed algorithms

3.3.1.1. Algorithm for dealing with noise, overlapping, and imbalanced data

3.3.1.2. Algorithm for constructing ensemble model

3.3.2. Empirical data sets

3.3.2.1. Computation protocol of the Lasso Logistic ensemble

3.3.2.2. Computation protocol of the Decision tree ensemble

3.3.3. The optimal ensemble classifier

3.3.4. Performance of LLE(B)

3.3.5. Performance of DTE(B)

3.3.6. Conclusions of the proposed technique

4. A MODIFICATION OF LOGISTIC REGRESSION WITH IMBALANCED DATA

4.1. The proposed works

4.1.1. The modification of the cross-validation procedure

4.1.2. The modification of Logistic regression

4.2. Weighted likelihood estimation (WLE)

4.3. Penalized likelihood regression (PLR)

4.4. Empirical data sets

4.5. Discussions and Conclusions

4.5.1. Summary of contributions

4.5.2. The interpretable credit scoring ensemble classifier

4.5.3. The technique for imbalanced data, noise, and overlapping samples

4.5.4. The modification of Logistic regression

4.5.5. Limitations and suggestions for further research

LIST OF PUBLICATION

REFERENCES

Appendices

A. Distance functions

B. Pseudo-code of popular ensemble classifiers

C. Empirical data sets

C.1. German credit data set (GER)

C.2. Vietnamese 1 data set (VN1)

C.3. Vietnamese 2 data set (VN2)

C.4. Taiwanese credit data set (TAI)

C.5. Bank personal loan data set (BANK)

C.6. Hepatitis C patients data set (HEPA)

C.7. The Loan schema data from lending club (US)

C.8. Vietnamese 3 data set (VN3)

C.9. Australian credit data set (AUS)

C.10. Credit risk data set (Credit 1)

C.11. Credit card data set (Credit 2)

C.12. Credit default data set (Credit 3)

C.13. Vietnamese 4 data set (VN4)

Summary

I. Overview of Imbalanced Data in Classification

Classification plays an important role in many fields, such as medicine (cancer diagnosis), finance (fraud detection), business administration (customer churn prediction), information retrieval (oil spill tracking, telecommunication fraud), and image recognition (face recognition). Classification is the problem of predicting the class label of a given sample. A classification algorithm learns the features of the samples in a training data set, which contains samples with different labels, in order to recognize the pattern behind each label. These patterns, expressed as a fitted classification model, are then used to predict the labels of new samples.

Classification comes in two types: binary and multi-class. Binary classification deals with two-class problems, whereas multi-class classification handles more than two class labels. A multi-class problem is sometimes recast as a binary one: one class corresponds to the label of interest and the other collects all remaining labels. In binary classification, the data set is split into a positive class and a negative class. The positive class is the class of interest, the one the classification task must identify. This dissertation focuses on binary classification.

1.1. Detailed Description of the Binary Classification Problem

A data set with $k$ input features for binary classification is a set of samples $S \subset X \times Y$, where $X \subset \mathbb{R}^k$ is the feature domain and $Y = \{0, 1\}$ is the set of labels. The subset of samples labeled 1 is called the positive class, denoted $S^+$. The remaining subset is called the negative class, denoted $S^-$. A sample $s \in S^+$ is called a positive sample; otherwise it is a negative sample. A binary classifier is a function that maps the feature domain $X$ to the label set $\{0, 1\}$.

Consider a data set $S$ and a classifier $f: X \to \{0, 1\}$. For a sample $s_0 = (x_0, y_0) \in S$, there are four possibilities:

If $f(x_0) = y_0 = 1$, $s_0$ is a true positive.
If $f(x_0) = y_0 = 0$, $s_0$ is a true negative.
If $f(x_0) = 1$ and $y_0 = 0$, $s_0$ is a false positive.
If $f(x_0) = 0$ and $y_0 = 1$, $s_0$ is a false negative.

1.2. Performance Measures for Classification Models

The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively. Common criteria for evaluating a classifier are accuracy, the true positive rate (TPR), the true negative rate (TNR), the false positive rate (FPR), and the false negative rate (FNR). When the positive and negative classes are balanced, accuracy is usually the primary target of a classifier. However, the class of interest (the positive class) sometimes consists of abnormal or rare events, and it contains too few samples for the classifier to learn to recognize them. In such situations, errors on the positive class are very costly. Accuracy is therefore no longer the most important performance criterion; a measure tied to TP, such as TPR, takes its place.
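
The counts and rates named above can be computed in a few lines. The sketch below uses the standard textbook definitions (TPR = TP / (TP + FN), and so on); the function names `confusion_counts` and `rates` are illustrative and do not come from the dissertation.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for y, p in zip(y_true, y_pred):
        if p == 1 and y == 1:
            tp += 1
        elif p == 0 and y == 0:
            tn += 1
        elif p == 1 and y == 0:
            fp += 1
        else:  # p == 0 and y == 1
            fn += 1
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Return accuracy and the four class-wise rates as a dict."""
    pos = tp + fn  # number of actual positives
    neg = tn + fp  # number of actual negatives
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / (pos + neg) if pos + neg else nan,
        "TPR": tp / pos if pos else nan,
        "FNR": fn / pos if pos else nan,
        "TNR": tn / neg if neg else nan,
        "FPR": fp / neg if neg else nan,
    }


tp, tn, fp, fn = confusion_counts([1, 1, 0, 0, 0, 1], [1, 0, 0, 1, 0, 1])
print(tp, tn, fp, fn)
print(rates(tp, tn, fp, fn))
```

TPR and FNR share the denominator TP + FN, so they sum to 1; the same holds for TNR and FPR.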

II. How Imbalanced Data Affects the Model

When the class of interest is rare, the two kinds of error carry very different costs. In credit scoring, for example, customers are divided into "bad" and "good" classes. Because credit regulations are public and customers are pre-screened before they apply for a loan, a credit data set typically consists of a large majority of good customers and a small fraction of bad ones. The loss from misclassifying a "bad" customer as "good" is usually far greater than the loss from misclassifying "good" as "bad". Identifying bad customers is therefore usually considered more important than the other tasks.

2.1. The Role of Accuracy in Imbalanced Data Problems

Consider a list of credit customers made up of 95% good and 5% bad customers. If we pursue high accuracy alone, we can pick a trivial classifier that labels every customer as good. Its accuracy is 95%, but its TPR is 0%: it cannot identify a single bad customer. A classifier with lower accuracy but a higher TPR may be a better choice than this trivial one.

Cancer diagnosis is another example of classifying a rare class. Here the data set has two classes, "malignant" and "benign". Malignant patients are typically far fewer than benign patients. Yet detecting malignancy is the first goal of any cancer diagnosis procedure, because missing a cancer patient has severe consequences. Relying on accuracy to evaluate a cancer diagnosis classifier is therefore unreasonable.
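
The 95%/5% example can be reproduced directly. The snippet below is a minimal sketch on a synthetic list of 100 labels (an assumption made for illustration, not a data set from the dissertation), with "bad" coded as the positive class.

```python
# 1 = bad customer (positive class), 0 = good customer (negative class).
y_true = [1] * 5 + [0] * 95


def trivial_classifier(_features):
    """Label every customer as good, regardless of the input."""
    return 0


y_pred = [trivial_classifier(None) for _ in y_true]

accuracy = sum(p == y for p, y in zip(y_pred, y_true)) / len(y_true)
true_positives = sum(1 for p, y in zip(y_pred, y_true) if p == 1 and y == 1)
tpr = true_positives / sum(y_true)  # sum(y_true) counts the actual positives

print(f"accuracy = {accuracy:.2f}, TPR = {tpr:.2f}")
# prints: accuracy = 0.95, TPR = 0.00
```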

2.2. Definition of the Imbalanced Ratio (IR)

A skewed class distribution in the training data set for classification is called imbalanced data. Let $S = S^+ \cup S^-$ be a data set, where $S^+$ and $S^-$ are the positive and negative classes, respectively. If $|S^+|$ is much smaller than $|S^-|$, $S$ is called an imbalanced data set. The imbalanced ratio (IR) of $S$ is defined as the ratio of the sizes of the negative and positive classes: $\mathrm{IR} = |S^-| / |S^+|$.
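
As a quick check of the definition, the sketch below computes IR from a list of binary labels. The function name `imbalanced_ratio` is illustrative, and the 95/5 label list reuses the proportions of the credit example in Section 2.1.

```python
def imbalanced_ratio(labels):
    """IR = |S-| / |S+| for binary labels (1 = positive, 0 = negative)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("IR is undefined when the positive class is empty")
    return n_neg / n_pos


labels = [1] * 5 + [0] * 95
print(imbalanced_ratio(labels))  # prints: 19.0
```

A balanced data set has IR = 1; the larger the IR, the rarer the positive class.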

III. Solution 1: A Decision Tree Ensemble for Credit Scoring

This dissertation proposes solutions for imbalanced classification and applies them to a credit scoring case study. The solutions are derived from three papers published in scientific journals. The first paper presents an interpretable decision tree ensemble model for imbalanced credit scoring data sets. The second paper introduces a novel technique for handling imbalanced data, particularly when samples overlap or are noisy. The final paper proposes a modification of Logistic regression that focuses on optimizing the F-measure, a popular metric in imbalanced classification.

3.1. Advantages of Ensemble Models in Credit Scoring

The proposed classifiers were trained on a range of public and private data sets with high imbalance and overlapping classes. The main results show that the proposed methods outperform both traditional models and some recent ones. Ensemble models can reduce variance (as in bagging) and bias (as in boosting), which tends to improve predictive performance.
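
One common way to pair resampling with an ensemble is to draw several class-balanced samples and train one base classifier on each. The sketch below only builds the balanced samples, combining random over-sampling (ROS) of the positive class with random under-sampling (RUS) of the negative class. It is a generic illustration under stated assumptions, not the dissertation's OUS(B) or DTE(B) algorithm, whose details are not given on this page; the function names and the choice of target size are hypothetical.

```python
import random


def balanced_resample(samples, labels, rng):
    """Return a class-balanced copy of the data using ROS and RUS together."""
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or len(pos) > len(neg):
        raise ValueError("expects a non-empty positive class no larger than the negative class")
    target = (len(pos) + len(neg)) // 2  # assumed target size for each class
    new_pos = [rng.choice(pos) for _ in range(target)]  # ROS: draw with replacement
    new_neg = rng.sample(neg, target)                   # RUS: draw without replacement
    return new_pos + new_neg, [1] * target + [0] * target


def make_balanced_bags(samples, labels, n_bags, seed=0):
    """Build n_bags balanced samples; each one would train one base classifier."""
    rng = random.Random(seed)
    return [balanced_resample(samples, labels, rng) for _ in range(n_bags)]


samples = list(range(100))       # stand-in feature vectors
labels = [1] * 5 + [0] * 95      # IR = 19
for bag_samples, bag_labels in make_balanced_bags(samples, labels, n_bags=3):
    print(len(bag_samples), sum(bag_labels))
```

Each bag contains 50 positive and 50 negative samples here. The predictions of the base classifiers would then be combined, for example by majority vote.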

3.2. Easier Interpretation of Results with Decision Trees

A decision tree displays its classification rules explicitly, which helps users understand how the model reaches a decision. This matters especially in credit scoring, where transparency and explainability are required.
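
To make the point concrete, the sketch below walks a small hand-written tree and prints one if-then rule per leaf. The tree, its feature names (`debt_to_income`, `late_payments`), and its thresholds are invented for illustration and are not taken from the dissertation.

```python
# A toy decision tree as nested dicts. Internal nodes test "feature <= threshold";
# leaves carry a class label. All names and numbers here are hypothetical.
toy_tree = {
    "feature": "debt_to_income",
    "threshold": 0.4,
    "left": {"label": "good"},          # debt_to_income <= 0.4
    "right": {                          # debt_to_income > 0.4
        "feature": "late_payments",
        "threshold": 2,
        "left": {"label": "good"},
        "right": {"label": "bad"},
    },
}


def extract_rules(node, conditions=()):
    """Return one readable if-then rule for every leaf of the tree."""
    if "label" in node:
        premise = " and ".join(conditions) or "always"
        return [f"if {premise} then {node['label']}"]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right = extract_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left + right


for rule in extract_rules(toy_tree):
    print(rule)
```

Each printed rule is a complete path from the root to a leaf, which is the form in which a credit officer could read and audit the model.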

IV. Solution 2: A New Technique for Overlapping and Noisy Data

The second paper introduces a novel technique for handling imbalanced data, particularly when samples overlap or are noisy. The technique focuses on cleaning the data: it removes noisy samples and reduces the overlap between classes, so that the learning algorithm can concentrate on the informative samples and classify more accurately.

4.1. The Importance of Data Cleaning in Classification

Noisy and overlapping data can substantially degrade the performance of classification models. Removing mislabeled or noisy samples can improve a model's accuracy and its ability to generalize.

4.2. Methods for Reducing Overlap Between Classes

Reducing the overlap between classes helps the model separate samples of different classes more clearly, which improves classification performance. Available methods include resampling algorithms and feature selection.
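
Tomek links are one well-known resampling-based way to find overlapping samples: two samples from opposite classes form a Tomek link when each is the other's nearest neighbor. The sketch below detects such pairs with Euclidean distance. It illustrates the general idea only and is not the technique proposed in the dissertation; the function names and the toy points are assumptions.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different class labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
print(tomek_links(points, labels))  # prints: [(2, 3)]
```

A typical cleaning step removes the negative (majority) member of each link, here the sample at index 2, which pushes the two classes apart near the boundary.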

V. Solution 3: Modifying Logistic Regression to Optimize the F-measure

The final paper proposes a modification of Logistic regression that focuses on optimizing the F-measure, a popular metric in imbalanced classification. The F-measure combines precision and recall, giving a more complete picture of a model's performance on imbalanced data.

5.1. Why the F-measure Matters for Imbalanced Data

On imbalanced data, accuracy may not be a good measure of performance. The F-measure gives a more balanced assessment by taking both precision and recall into account.
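
For reference, the sketch below computes the F-measure from TP, FP, and FN using the standard definition, F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), where recall is the same quantity as TPR. The function name and the example counts are illustrative.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts; beta = 1 gives the usual F1."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Trivial "everyone is good" classifier on 95 good / 5 bad customers:
# accuracy is 0.95, yet no bad customer is found, so F1 is 0.
print(f_measure(tp=0, fp=0, fn=5))

# A classifier that finds 4 of the 5 bad customers at the cost of 8 false alarms.
print(f_measure(tp=4, fp=8, fn=1))
```

The second classifier has lower accuracy (91 of 100 correct) than the trivial one, but its F1 of 8/17 reflects that it actually detects the positive class.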

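5.2. How to Modify Logistic Regression to Optimize the F-measure

The modifications proposed in the paper adjust the parameters of Logistic regression so as to maximize the F-measure. In the dissertation, they take the form of an F-measure-oriented cross-validation procedure (F-CV) and an F-measure-oriented Lasso-Logistic regression (F-LLR). The maximization itself can be carried out with various optimization algorithms.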
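
The details of F-CV and F-LLR are not given on this page. As a much simpler stand-in for the same goal, the sketch below tunes only the decision threshold applied to predicted scores so that F1 is maximized on a validation set. This is a different and more basic technique than the dissertation's; the function names, scores, and labels are made up for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when samples with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels):
    """Try every observed score as a cut-off and keep the one with highest F1.
    Ties are resolved in favor of the lowest threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical validation scores (e.g. predicted probabilities) and true labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8]
labels = [0, 0, 1, 0, 1, 1]
t = best_threshold(scores, labels)
print(t, f1_at_threshold(scores, labels, t))
```

On this toy data the default cut-off of 0.5 would miss the positive sample scored 0.35, while the tuned threshold recovers it at the cost of one false positive.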

VI. Conclusion: Contributions and Research Directions for Imbalanced Data

The classifiers were evaluated on public and private data sets with imbalanced, overlapping classes. The results show that the proposed models outperform traditional models and some recently proposed models. This research contributes to improving the performance of classification models on imbalanced data, particularly in credit scoring.

6.1. Summary of the Main Contributions of the Dissertation

The dissertation proposes an interpretable model (a decision tree ensemble), introduces a new technique for imbalanced data, particularly data with overlapping classes and noise, and proposes a modification of Logistic regression that focuses on maximizing the F-measure.

6.2. Suggestions for Future Research

Future research could extend the proposed techniques to multi-class problems and explore new methods for imbalanced data of higher complexity.

23/05/2025

Excerpt from the document

MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY

BUI THI THIEN MY

IMBALANCED DATA IN CLASSIFICATION: A CASE STUDY OF CREDIT SCORING

DOCTORAL DISSERTATION

Major: Statistics
Major ID: 946.01
Academic advisors: 1. Le Xuan Truong 2. Ta Quoc Bao

Ho Chi Minh City - 2024

STATEMENT OF AUTHENTICATION

I certify that the Ph.D. dissertation, “Imbalanced data in classification: A case study of credit scoring”, is solely my own research.

This dissertation is only used for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH), and no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.

Ho Chi Minh City, April 2, 2024

ACKNOWLEDGMENT

First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the process of conducting this Ph.D.

I sincerely thank the teachers of the UEH’s doctoral training program for imparting valuable knowledge, and the teachers at the Department of Mathematics and Statistics, UEH for their sincere comments on my dissertation. I sincerely thank Dr. Le Thi Thanh An for her moral and academic support so that I could complete the research. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.

Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.

Ho Chi Minh City, April 2, 2024

TABLE OF CONTENTS

STATEMENT OF AUTHENTICATION
ACKNOWLEDGMENT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
TÓM TẮT

1 INTRODUCTION
Overview of imbalanced data in classification
Research gap identifications
Gaps in credit scoring
Gaps in the approaches to solving imbalanced data
Gaps in Logistic regression with imbalanced data
Research objectives, research subjects, and research scopes
Research data and research methods
Contributions of the dissertation

2 LITERATURE REVIEW OF IMBALANCED DATA
Imbalanced data in classification
Description of imbalanced data
Obstacles in imbalanced classification
Categories of imbalanced data
Performance measures for imbalanced data
Performance measures for labeled outputs
Performance measures for scored outputs
Area under the Receiver Operating Characteristics Curve
Kolmogorov-Smirnov statistic
Conclusion of performance measures in imbalanced classification
Approaches to imbalanced classification
Algorithm-level approach
Modifying the current classifier algorithms
Cost-sensitive learning
Comments on algorithm-level approach
Data-level approach
Under-sampling method
Over-sampling method
Comments on data-level approach
Ensemble-based approach
Integration of algorithm-level method and ensemble classifier algorithm
Integration of data-level method and ensemble classifier algorithm
Comments on ensemble-based approach
Conclusions of approaches to imbalanced data
Meaning of credit scoring
Inputs for credit scoring models
Interpretability of credit scoring models
Approaches to imbalanced data in credit scoring
Recent credit scoring ensemble models

3 IMBALANCED DATA IN CREDIT SCORING
Classifiers for credit scoring
Lasso-Logistic regression
Support vector machine
Artificial neural network
Heterogeneous ensemble classifiers
Homogeneous ensemble classifiers
Conclusions of statistical models for credit scoring
The proposed credit scoring ensemble model based on Decision tree
The proposed algorithms
Algorithm for balancing data - OUS(B) algorithm
Algorithm for constructing ensemble classifier - DTE(B) algorithm
Empirical data sets
The optimal Decision tree ensemble classifier
Performance of the proposed model on the Vietnamese data sets
Performance of the proposed model on the public data sets
Conclusions of the proposed credit scoring ensemble model based on Decision tree
The proposed algorithm for imbalanced and overlapping data
The proposed algorithms
Algorithm for dealing with noise, overlapping, and imbalanced data
Algorithm for constructing ensemble model
Empirical data sets
Computation protocol of the Lasso Logistic ensemble
Computation protocol of the Decision tree ensemble
The optimal ensemble classifier
Performance of LLE(B)
Performance of DTE(B)
Conclusions of the proposed technique

4 A MODIFICATION OF LOGISTIC REGRESSION WITH IMBALANCED DATA
Weighted likelihood estimation (WLE)
Penalized likelihood regression (PLR)
The proposed works
The modification of the cross-validation procedure
The modification of Logistic regression
Empirical data sets
Important variables for output
Important variables for F-LLR fitted model
Important variables of the Vietnamese data set
Discussions and Conclusions
Summary of contributions
The interpretable credit scoring ensemble classifier
The technique for imbalanced data, noise, and overlapping samples
The modification of Logistic regression
Limitations and suggestions for further research

LIST OF PUBLICATION
REFERENCES

Appendices
A Distance functions
B Pseudo-code of popular ensemble classifiers
C Empirical data sets
C.1 German credit data set (GER)
C.2 Vietnamese 1 data set (VN1)
C.3 Vietnamese 2 data set (VN2)
C.4 Taiwanese credit data set (TAI)
C.5 Bank personal loan data set (BANK)
C.6 Hepatitis C patients data set (HEPA)
C.7 The Loan schema data from lending club (US)
C.8 Vietnamese 3 data set (VN3)
C.9 Australian credit data set (AUS)
C.10 Credit risk data set (Credit 1)
C.11 Credit card data set (Credit 2)
C.12 Credit default data set (Credit 3)
C.13 Vietnamese 4 data set (VN4)

LIST OF ABBREVIATIONS

ADASYN: Adaptive synthetic sampling
ANN: Artificial neural network
AUC: Area under the ROC curve
AUS: Australian credit data set
BANK: Bank personal loan data set
CART: Classification and regression tree algorithm
CHAID: Chi-square automatic interaction detector algorithm
CNN: Condensed nearest neighbors
Credit 1: Credit risk data set
Credit 2: Credit card data set
Credit 3: Credit default data set
CSL: Cost-sensitive learning
CV: Cross-validation procedure
DA: Discriminant analysis
DT: Decision tree
DTE: Decision tree ensemble classifier
F-CV: F-measure-oriented cross-validation procedure
FICO: Fair Isaac Corporation
FLAC: Firth’s logistic regression with added covariate
FLIC: Firth’s logistic regression with intercept-correction
F-LLR: F-measure-oriented Lasso-Logistic regression
FIR: Firth-type, a version of Penalized likelihood regression
FN, FNR: False negative, False negative rate
FP, FPR: False positive, False positive rate
GER: German credit data set
HEPA: Hepatitis patient data set
HEOM: Heterogeneous Euclidean-Overlap metric
HVDM: Heterogeneous value difference metric
ID: Imbalanced data
IR: Imbalanced ratio
KNN: K-nearest neighbor classifier
KS: Kolmogorov-Smirnov statistic
LDA: Linear discriminant analysis
LLE: Lasso-Logistic regression ensemble classifier
LR: Logistic regression
LLR: Lasso-Logistic regression
MLE: Maximum likelihood estimate
NCL: Neighborhood cleaning rule
OSS: One-sided selection
OUS: Over-Under sampling, the proposed algorithm for balancing data
PLR: Penalized likelihood regression
QDA: Quadratic discriminant analysis
RF: Random forest
ROC: Receiver Operating Characteristics Curve
ROS: Random over-sampling
RPART: Recursive Partitioning and Regression Tree algorithm
RUS: Random under-sampling
SMOTE: Synthetic Minority Over-sampling technique
SVM: Support vector machine
TAI: Taiwanese credit data set
TN, TNR: True negative, True negative rate
TP, TPR: True positive, True positive rate
TOUS: Tomek-link Over-Under sampling technique
UCI: University of California, Irvine
US: Loan schema data set from lending club
VACM: Vietnam Asset Management Company
VN1: Vietnamese credit 1 data set
VN2: Vietnamese credit 2 data set
VN3: Vietnamese credit 3 data set
VN4: Vietnamese credit 4 data set
WLE: Weighted likelihood estimation

LIST OF FIGURES

Examples of circumstances of imbalanced data
Illustration of ROCs
Illustration of KS metric
Illustration of RUS technique
Illustration of CNN rule
Illustration of Tomek-links
Illustration of ROS technique
Illustration of SMOTE technique
Approaches to imbalanced data in classification
Illustration of a Decision tree
Illustration of a decision boundary of SVM
Illustration of a two-hidden-layer ANN
Importance level of features of the Vietnamese data sets
Computation protocol of the proposed ensemble classifier
Illustration of F-CV
Illustration of F-LLR

LIST OF TABLES

General implementation protocol in the dissertation
Representatives employing the algorithm-level approach to ID
Cost matrix in Cost-sensitive learning
Summary of SMOTE algorithm
Representatives employing the data-level approach to ID
Representatives employing the ensemble-based approach to ID
Representatives of classifiers in credit scoring
Description of empirical data sets
Computation protocol of empirical study on DTE
Performance measures of DTE(B) on the Vietnamese data sets
Performance of ensemble classifiers on the Vietnamese data sets
Performance of ensemble classifiers on the German data set
Performance of ensemble classifiers on the Taiwanese data set
Description of empirical data sets
Average testing AUC of the proposed ensembles
Average testing AUC of the models based on LLR
Average testing AUC of the ensemble classifiers based on trees
Cross-validation procedure for Lasso Logistic regression
F-measure-oriented Cross-Validation Procedure
Algorithm for F-LLR classifier
Description of empirical data sets
Implementation protocol of empirical study
Average testing performance measures of classifiers
Average testing performance measures of classifiers (cont.)
The number of wins of F-LLR on empirical data sets
Important features of the Vietnamese data set
Important features of the Vietnamese data set (cont.)
Algorithm of Bagging classifier
Algorithm of Random Forest
Algorithm of AdaBoost
Summary of the German credit data set
Summary of the Vietnamese 1 data set
Summary of Vietnamese 2 data set
Summary of the Taiwanese credit data set (a)
Summary of the Taiwanese credit data set (b)
Summary of the Bank personal loan data set
Summary of the Hepatitis C patients data set
Summary of the Loan schema data from lending club (a)
Summary of the Loan schema data from lending club (b)
Summary of the Loan schema data from lending club (c)
Summary of the Vietnamese 3 data set
Summary of the Australian credit data set
Summary of the Credit 1 data set
Summary of the Credit 2 data set
Summary of the Credit 3 data set
Summary of the Vietnamese 4 data set

ABSTRACT

In classification, imbalanced data occurs when the classes of the training data set differ greatly in size. This problem frequently arises in various fields, for example, credit scoring and medical diagnosis.

With imbalanced data, predictive modeling for real-world applications is challenging because most machine learning algorithms are designed for balanced data sets. Addressing imbalanced data has therefore attracted much attention from researchers and practitioners. In this dissertation, we propose solutions for imbalanced classification. Furthermore, these solutions are applied to a credit scoring case study.

The solutions are derived from three papers published in scientific journals.
• The first paper presents an interpretable decision tree ensemble model for imbalanced credit scoring data sets.
• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.
• The final paper proposes a modification of Logistic regression focusing on the optimization of the F-measure, a popular metric in imbalanced classification.

These classifiers have been trained on a range of public and private data sets with highly imbalanced status and overlapping classes. The primary results demonstrate that the proposed works outperform both traditional and some recent models.

TÓM TẮT

When solving classification problems, imbalanced data occurs if the classes in the training set differ considerably in the number of samples.

Topics

Research on imbalanced data

Classification in machine learning

Applications in credit scoring

Techniques for improving classification models