FUZZY-GRANULAR BASED DATA MINING FOR EFFECTIVE DECISION SUPPORT IN BIOMEDICAL APPLICATIONS by YUANCHEN HE Under the Direction of Raj Sunderraman and Yan-Qing Zhang ABSTRACT Due to complexity of biomedical problems, adaptive and intelligent knowledge discovery and data mining systems are highly needed to help humans to understand the inherent mechanism of diseases. For biomedical classification problems, typically it is impossible to build a perfect classifier with 100% prediction accuracy. Hence a more realistic target is to build an effective Decision Support System (DSS). In this dissertation, a novel adaptive Fuzzy Association Rules (FARs) mining algorithm, named FARM-DS, is proposed to build such a DSS for binary classification problems in the biomedical domain.
Empirical studies show that FARM-DS is competitive to state-of- the-art classifiers in terms of prediction accuracy. More importantly, FARs can provide strong decision support on disease diagnoses due to their easy interpretability. This dissertation also proposes a fuzzy-granular method to select informative and discriminative genes from huge microarray gene expression data. With fuzzy granulation, information loss in the process of gene selection is decreased.
As a result, more informative genes for cancer classification are selected and more accurate classifiers can be modeled. Empirical studies show that the proposed method is more accurate than traditional algorithms for cancer classification. And hence we expect that genes being selected can be more helpful for further biological studies. INDEX WORDS: Data Mining, Knowledge Discovery, Computational Intelligence, Granular Computing, Fuzzy Association Rule Mining, Decision Support System, Binary Classification, Bioinformatics FUZZY-GRANULAR BASED DATA MINING FOR EFFECTIVE DECISION SUPPORT IN BIOMEDICAL APPLICATIONS by YUANCHEN HE A Dissertation Submitted in Partial Fulfillment of Requirements for the Degree of Doctor of Philosophy in the College of Arts and Sciences Georgia Stage University 2006 UMI Number: 3243236 Copyright 2006 by He, Yuanchen All rights reserved.
UMI Microform 3243236 Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company 300 North Zeeb Road P.
Box 1346 Ann Arbor, MI 48106-1346 Copyright by Yuanchen He 2006 FUZZY-GRANULAR BASED DATA MINING FOR EFFECTIVE DECISION SUPPORT IN BIOMEDICAL APPLICATIONS by YUANCHEN HE Major Professor: Rajshekhar Sunderraman Yan-Qing Zhang Committee: Saeid Belkasim Yichuan Zhao Electronic Version Approved: Office of Graduate Studies College of Arts and Sciences Georgia State University December 2006 iv Acknowledgments Firstly, my specific thanks go to my co-advisors, Dr. Rajshekhar Sunderraman and Dr. Yan-Qing Zhang, for their careful guidance and precise advisement during the process of my PhD dissertation. The dissertation would not have been possible without their helps.
Secondly, I would like to thank my committee members, Dr. Saeid Belkasim and Dr. Yichuan Zhao for their well-appreciated support and assistance. Finally, I want to thank my family and friends for their support and beliefs.
v TABLE OF CONTENTS LIST OF TABLES .VIII LIST OF ACRONYMS.2 METRICS FOR CLASSIFICATION .1 KNOWLEDGE DISCOVERY, DATA MINING, AND DATA WAREHOUSING .2 ASSOCIATION RULE MINING .2 THE APRIORI ALGORITHM .3 ASSOCIATION RULE MINING FOR CLASSIFICATION .4 SOFT COMPUTING AND FUZZY LOGIC .1 FUZZY CONCEPT IN THE DATA MINING DOMAIN.2 FUZZY DATA MODELING .2 PROBABILITY DISTRIBUTION AND FUZZY SETS .3 DATA MINING AND QUANTITATIVE DATA.1 TRANSFORMING QUANTITATIVE DATA .2 FUZZY DATA MINING .3 FINDING FUZZY SETS.5 FUZZY ASSOCIATION RULE MINING .6 FUZZY ASSOCIATION RULE MINING FOR CLASSIFICATION .8 CLUSTERING AND DATA ABSTRACTION .2 REPRESENTATION OF CLUSTERS. 32 CHAPTER 3 FUZZY ASSOCIATION RULE MINING FOR DECISION SUPPORT .1 STEP 1: FUZZY INTERVAL PARTITIONING .2 STEP 2: DATA ABSTRACTING .3 STEP 3: GENERATING FUZZY DISCRETE TRANSACTIONS .4 STEP 4: MINING ASSOCIATION RULES. 46 CHAPTER 4 FARM-DS FROM MEDICAL DATA.2 RESULTS ANALYSIS ON EFFECTIVENESS .3 RESULT ANALYSIS ON EFFICIENCY .4 RESULT ANALYSIS ON INTERPRETABILITY. 52 CHAPTER 5 FARM-DS FROM MICROARRAY EXPRESSION DATA .2 CHALLENGES FOR BIOINFORMATICS SCIENTISTS .3 SIMULATION ENVIRONMENT AND DATASETS.4 PERFECT GENE SUBSETS .5 GENE-CANCER KNOWLEDGE DISCOVERY .6 FUZZY ASSOCIATION RULES.
64 CHAPTER 6 FUZZY-GRANULAR GENE SELECTION FROM MICROARRAY EXPRESSION DATA. TRADITIONAL ALGORITHMS FOR GENE SELECTION. SVM FOR CANCER CLASSIFICATION. CORRELATION-BASED FEATURE RANKING ALGORITHMS FOR GENE SELECTION 68 6.
A NEW FUZZY-GRANULAR BASED ALGORITHM FOR GENE SELECTION. FUZZY C-MEANS CLUSTERING. FUZZY-GRANULAR BASED GENE SELECTION. 80 CHAPTER 7 CONCLUSIONS AND FUTURE WORKS.
83 vii LIST OF FIGURES Figure 1. confusion matrix 2 Figure 1. Sample of Area under ROC curve 5 Figure 1. Sample of Area under Precision/Recall 7 Figure 2.
Apriori algorithm 12 Figure 2. discrete interval method 19 Figure 2. creating overlapping regions 20 Figure 2. fuzzy partition 20 Figure 3.
a sketch of FARM-DS 34 Figure 3. an example to project a sample onto a feature 44 Figure 4. an example to decide the optimal 50 Figure 6. positive-related gene, negative-related gene, both, neither 71 Figure 6.
Fuzzy-Granular gene selection 74 viii LIST OF TABLES TABLE 2.1 Notation for mining algorithm 11 TABLE 4.1 characteristics of datasets used for experiments 48 TABLE 4.2 farm-ds modeling results with trapezoidal-shaped membership functions by 5-fold cross validation 51 TABLE 4.3 validation error comparison by 5-fold cross validation 51 TABLE 4.4 running time comparison with 5-fold cross validation 52 TABLE 4.5 The feature information of the wisconsin Breast cancer data set 52 TABLE 4.6 12 wrongly classified samples on wisconsin breast cancer dataset 53 TABLE 4.7 the most general and the most specific fired rules for the 1st sample in fold 1 on wisconsin breast cancer dataset 54 TABLE 4.8 activation frequency of features on the wisconsin Breast cancer data 55 TABLE 5.1 characteristics of datasets 60 TABLE 5.2 a perfect gene subset selected on the aml/all dataset 61 TABLE 5.3 a perfect gene subset selected on the colon cancer dataset 62 TABLE 5.4 a perfect gene subset selected on the prostate cancer dataset 62 TABLE 5.5 classification errors of the four models 62 TABLE 5.6 auc of the four models 63 TABLE 5.7 rule numbers of the four models 63 TABLE 5.8 average rule lengths of the four models 63 ix TABLE 5.9 5 fuzzy association rules for aml/all dataset 64 TABLE 5.10 8 fuzzy association rules for colon dataset 64 TABLE 5.11 15 fuzzy association rules for prostate dataset 65 TABLE 6.1 leave-one out validation performance on the prostate cancer dataset 79 TABLE 6.632 bootstrapping performance on the prostate cancer dataset 79 TABLE 6.3 leave-one out validation performance on the colon cancer dataset 79 TABLE 6.632 bootstrapping performance on the colon cancer dataset 79 x LIST OF ACRONYMS Fuzzy Association Rule Mining FARM Decision Support DS 1 CHAPTER 1 INTRODUCTION In the last decade, with the advent of genomic and proteomic technologies, more and more biomedical databases have been created and have been growing in an exponential rate. Developing intelligent data analysis tools is essential to extract knowledge from these databases to ease biomedical decision-making process. The knowledge extracted from these databases is expected to be as accurate as possible. However, due to complexity and huge sizes of biomedical databases, it is difficult or even impossible to find 100% accurate knowledge.
Therefore, a more realistic goal is to build an intelligent data analysis tool as an effective Decision Support System (DSS). That is, the role of such a data analysis tool is not to replace human experts, but only to assist human experts to make decisions more reliably.1 Binary classification In this dissertation, we focus on binary classification modeling. Although binary classification is the simplest classification problem, many works show that the models for it can be naturally extended to multiple classification or regression problems. (This extension itself is an interesting research topic and will not be covered in this dissertation.) A general binary classification problem is defined as follows: • Given l independent and identically distributed (i.) samples ( x1 , y1 ), ( x2 , y2 ), K, ( xl , yl ) where xi ∈ R d , for i = 1,2,L, l is a feature vector 2 of length d and yi = {+1,−1} is the class label (+1 for the positive class, and -1 for the negative class) for data point xi , • Assume the classes are mutually exclusive and exhaustive, which means every sample has one and only one class label, • Find a classifier with the decision function f ( x,θ ) such that y = f ( x,θ ) , where y is the class label for x, θ is a vector of unknown parameters in the function.
These l samples are called “training data”. real negatives real positives predicted (TN) true (FN) false negatives negatives negatives predicted (FP) false (TP) true positives positives positives Figure.2 Feature selection Some binary classification problem is more natural to be modeled as a binary ranking modeling. Protein homology prediction task is a good example. The target is to predict if a protein sequence is homologous to another pre-specified natural protein sequence.
Because of biological complexity, it is difficult and arbitrary to say two protein sequences are absolutely homologous or not (1 or -1 is output); an output with "confidence" may be more helpful. In this way, many protein sequences could be ranked by their confidence to be homologous to the pre-specified protein sequence. As a result, biologists could quickly 3 prioritize a list of protein sequences for further study and thus their working efficiencies can be enhanced. A binary ranking problem is similar to a binary classification problem.
The differences are • the output is a real number in the field of [-1,1], and • the absolute value of the output is useless. Intuitively, a good model should rank the unseen positive samples (in case of protein homology prediction, they are homologous protein sequences) close to the top and rank unseen negative samples (in case of protein homology prediction, they are non-homologous protein sequences) close to the bottom of the list.3 Feature selection Feature selection is another important task usually correlated with a classification problem. Given a dataset, some input features may be irrelevant to classification. Furthermore, some features may be redundant or even noise due to complex correlations among them to hide real data distribution.
Hence, relevance analysis may be performed on the data with the aim of removing any irrelevant, redundant or noisy features from the learning process. In machine learning, this process is known as feature selection to filter out features, which may otherwise slow down, and possibly mislead, the learning step. Relevance analysis is closely related to binary classification. Suppose there are d input features in the original dataset, the target of feature selection is to select d i informative features while removing d n non-informative features.
The target is that the classifier modeled on the subset of d i features has better performance than the classifier modeled in the original feature set.2 Metrics for classification The performance of the classifier is usually measured in terms of misclassification error on unseen “testing data” which is defined in Eq.1) ⎩ 1 otherwise Based on the confusion matrix in Fig.1, many other metrics have been used for performance evaluation on classification. • Accuracy is the fraction of correctly classified samples over all samples. TN + TP accuracy = .2) TN + FN + FP + TP The overall accuracy metric at Eq.2) represents the same meaning as misclassification error. Both of them are used to evaluate classification performance on the whole dataset.
Besides them, two other kinds of metrics have been proposed for different purposes. The first kind of metrics is concern with balanced classification ability. Sensitivity at Eq.3) and specificity at Eq.4) are usually adopted to monitor classification performance on two classes, separately. • Sensitivity is the fraction of the real positives that actually are correctly predicted as positives.
• Specificity is the fraction of the real negatives that actually are correctly predicted as negatives.3) TP + FN TN specificity = .4) TN + FP 5 Notice that sensitivity is sometimes called true positive rate or positive class accuracy, while specificity called true negative rate or negative class accuracy, in different research communities.