Rowan University Rowan Digital Works Theses and Dissertations 9-2-2014 Optimization algorithms for inference and classification of genetic profiles from undersampled measurements Belhassen Bayar Follow this and additional works at: https://rdw.edu/etd Part of the Electrical and Computer Engineering Commons Recommended Citation Bayar, Belhassen, "Optimization algorithms for inference and classification of genetic profiles from undersampled measurements" (2014). Theses and Dissertations.edu/etd/410 This Thesis is brought to you for free and open access by Rowan Digital Works. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Rowan Digital Works. For more information, please contact graduateresearch@rowan.
OPTIMIZATION ALGORITHMS FOR INFERENCE AND CLASSIFICATION OF GENETIC PROFILES FROM UNDERSAMPLED MEASUREMENTS by Belhassen Bayar A Thesis Submitted to the Department of Electrical & Computer Engineering College of Engineering In partial fulfillment of the requirement For the degree of Master of Science at Rowan University June 2014 Thesis Chair: Nidhal Bouaynaya © 2014 Belhassen Bayar ACKNOWLEDGEMENTS I want to express my sincere gratitude to Dr. Nidhal Bouaynaya, my supervisor who has always bothered to offer me the best working conditions possible. I thank her for her wide availability, her high scientific qualifications and her guidance, illuminating discussions related to this work and beyond, encouragement, moral and financial support in this research. I express my appreciation and gratitude to Dr.
Roman Shterenberg , Associate Professor at the University of Alabama at Birmingham USA, for the time he spent with me, his availability even when he was abroad and the valuable advice he has given me throughout my research. I also would like to express my deep and sincere gratitude to Dr. Robi Polikar, Professor & Chair at the ECE Department, for the high quality courses he teaches, his availability and eagerness to provide the best learning experience for students at the department. Many thanks to all the students who accompanied me during these years and have continued to create a good working atmosphere within the laboratory.
Deepest thanks to my dear parents and grandmother to whom I owe so much. I would have neither the means nor the strength to accomplish this work without them. I also want to express my gratitude to my friends who have continued to give me the moral and intellectual support throughout my work during all the good and bad moments. They always say the best is for the end, that’s why I dedicate this project to my dear sister, my little light that gave me energy and courage.
iii Abstract Belhassen Bayar OPTIMIZATION ALGORITHMS FOR INFERENCE AND CLASSIFICATION OF GENETIC PROFILES FROM UNDERSAMPLED MEASUREMENTS 2014/06 Nidhal Bouaynaya, Ph. Master of Science in Electrical & Computer Engineering In this thesis, we tackle three different problems, all related to optimization tech- niques for inference and classification of genetic profiles. First, we extend the de- terministic Non-negative Matrix Factorization (NMF) framework to the probabilistic case (PNMF). We apply the PNMF algorithm to cluster and classify DNA microar- rays data.
The proposed PNMF is shown to outperform the deterministic NMF and the sparse NMF algorithms in clustering stability and classification accuracy. Sec- ond, we propose SMURC: Small-sample MUltivariate Regression with Covariance estimation. Specifically, we consider a high dimension low sample-size multivariate regression problem that accounts for correlation of the response variables. We show that, in this case, the maximum likelihood approach is senseless because the likeli- hood diverges.
We propose a normalization of the likelihood function that guaran- tees convergence. Simulation results show that SMURC outperforms the regularized likelihood estimator with known covariance matrix and the state-of-the-art sparse Conditional Graphical Gaussian Model (sCGGM). In the third Chapter, we derive a new greedy algorithm that provides an exact sparse solution of the combinatorial `0 - optimization problem in an exponentially less computation time. Unlike other greedy approaches, which are only approximations of the exact sparse solution, the proposed greedy approach, called Kernel reconstruction, leads to the exact optimal solution.
iv Table of Contents List of Figures vii List of Tables viii 1 Introduction 1 1.3 Organization 2 2 PNMF: Theory & Application To Microarray Data Analysis 4 2.2 Non-negative Matrix Factorization 10 2.3 Probabilistic Non-negative Matrix Factorization 13 2.4 PNMF-based Data Classification 15 2.5 Application to Gene Microarrays 19 2.6 Conclusion and Discussion 34 3 High-Dimension SMURC Estimation 36 3.2 The Normalized-Likelihood 41 3.3 Application: Genetic Regulatory Networks 53 3.4 Conclusion and Discussion 60 v 4 Kernel Reconstruction V.4 Conclusion 79 Bibliography 80 A Appendix 86 vi List of Figures 2.1 Clustering results for the Leukemia dataset 20 2.2 Metagenes expression patterns versus the samples for k = 4 21 2.3 Clustering results for the Medulloblastoma dataset 22 2.4 Clustering Percentage Error versus Nbr.5 The cophenetic coefficient versus the standard deviation 29 2.6 Cophenetic versus SNR in dB in Leukemia dataset 30 2.7 Cophenetic versus SNR in dB in Medulloblastoma dataset 31 3.1 Approximation of the optimization problem in Proposition 4 49 3.2 Approximation error ||S − S ∗ ||F /||S||F versus n 51 3.3 Performance comparison of SMURC with sCGGM and RMLE 53 3.4 The known undirected gene interactions in the Drosophila 57 3.5 Estimated gene regulatory networks of the Drosophila 57 4.1 Performance comparison of KR with `1 -based and `2 -based CS for N = 10 78 4.2 Performance comparison of KR with `1 -based and `2 -based CS for N = 20 78 vii List of Tables 2.1 Smallest SNR value for ρ ≥ 0.1 Detection of the known gene interactions in Flybase 60 viii Chapter 1 Introduction 1.1 Research Objectives We outline the goal of this research through the following objectives: 1. Study and analyse the Non-negative Matrix Factorization (NMF) and propose a probabilistic extension to NMF (PNMF) for data corrupted by noise. Build a PNMF-based classifier and apply it for tumor classification from gene expression data. Derive a convex optimization algorithm for the solution of an under-determined multivariate regression problem.
Apply the proposed algorithm to infer genetic regulatory networks from gene expression data. Derive a greedy algorithm for exact reconstruction of sparse signals from a limited number of observations.2 Research Contribution This work contributes to the field of computational bioinformatics and biology through the application of the signal processing algorithms aiming to study and analyze the microarray data. Our work shifts the focus of the genomic signal processing commu- nity from analyzing the genes expression patterns and samples clusters to considering 1 the mathematical aspect of the algorithm and deriving its application in the stochas- tic work. We also focus on solving under-determined multivariate regression systems in order to infer gene regulatory networks.
These networks are known to be sparse, therefore, we have a great interest in studying the compressive sensing approach which recovers sparse signal from linear model. Specific contributions of this work include: The improvement of the mathematical proof for the NMF algorithm by provid- ing a general evidence (see Appendix preposition 2). The development of a new NMF algorithm for the noisy Microarray data in order to improve the basic NMF approach and to predict some hidden data features. Solving under-determined multivariate regression systems to infer gene regula- tory networks using our new SMURC algorithm.
Recover k-sparse signal using our new approach, called Kernel Reconstruction, that guarantees an exact reconstruction and less computational time comparing to the `0 -based compressive sensing approach [18].3 Organization This thesis is organized as follows. In Chapter 2, we study and analyze the Non-negative Matrix Factorization and de- rive its probabilistic approach that we call PNMF algorithm and then we derive its corresponding update rules. The proof of the developed approaches is provided in 2 the Appendix chapter. We compare the performance of our PNMF approach with its homologues in clustering as well as classification.
In Chapter 3, we develop a new approach, called Small-sample MUltivariate Re- gression with Covariance Estimation (SMURC), to solve under-determined multivari- ate regression systems. We use this approach to infer gene regulatory networks. We compare our algorithm to other techniques cited in related works and using a syn- thetic data. Subsequently, we apply our approach to infer the know interactions in the Drosophila’s 11-gene wing muscle network.
Finally, in Chapter 4 we provide a complete review of the compressive sensing technique. We also come up with a new approach that performs an exact reconstruc- tion of a sparse signal. We call this approach, Kernel Reconstruction, and we compare it with what has been suggested in the related work. 3 Chapter 2 Probabilistic Non-negative Matrix Factorization: Theory and Application to Microarray Data Analysis 2.1 Introduction Extracting knowledge from experimental raw data and measurements is an important objective and challenge in signal processing.
Often data collected is high dimensional and incorporates several inter-related variables, which are combinations of underly- ing latent components or factors. Approximate low-rank matrix factorizations play a fundamental role in extracting these latent components [14]. In many applica- tions, signals to be analyzed are non-negative, e., pixel values in image processing, price variables in economics and gene expression levels in computational biology. For such data, it is imperative to take the non-negativity constraint into account in or- der to obtain a meaningful physical interpretation.
Classical decomposition tools, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Blind Source Separation (BSS) and related methods do not guarantee to maintain the non-negativity constraint. Non-negative matrix factorization (NMF) represents non-negative data in terms of lower-rank non-negative factors. NMF proved to be a powerful tool in many applications in biomedical data processing and analysis, 4 such as muscle identification in the nervous system [54], classification of images [29], gene expression classification [10], biological process identification [32] and transcrip- tional regulatory network inference [38]. The appeal of NMF, compared to other clustering and classification methods, stems from the fact that it does not impose any prior structure or knowledge on the data.
Brunet et al. successfully applied NMF to the classification of gene expression datasets [10] and showed that it leads to more accurate and more robust clustering than the Self-Organizing Maps (SOMs) and Hierarchical Clustering (HC). Analytically, the NMF method factors the original non-negative matrix V into two lower rank non-negative matrices, W and H such that V = W H + E, where E is the residual error. Lee and Seung [33] derived algorithms for estimating the optimal non-negative factors that minimize the Euclidean distance and the Kullback-Leibler divergence cost functions.
Their algorithms, guaranteed to converge, are based on multiplicative update rules, and are a good compromise be- tween speed and ease of implementation. In particular, the Euclidean distance NMF algorithm can be shown to reduce to the gradient descent algorithm for a specific choice of the step size [33]. Lee and Seung’s NMF factorization algorithms have been widely adopted by the community [6, 10, 19, 59]. The NMF method is, however, deterministic.
That is, the algorithm does not take into account the measurement or observation noise in the data. On the other hand, data collected using electronic or biomedical devices, such as gene expression profiles, are known to be inherently noisy and therefore, must be processed and analyzed by systems that take into account the stochastic nature of the data. Furthermore, the ef- fect of the data noise on the NMF method in terms of convergence and robustness has 5 not been previously investigated. Thus, questions about the efficiency and robustness of the method in dealing with imperfect or noisy data are still unanswered.
In this chapter, we extend the NMF framework and algorithms to the stochastic case, where the data is assumed to be drawn from a multinomial probability den- sity function. We call the new framework Probabilistic NMF or PNMF. We show that the PNMF formulation reduces to a weighted regularized matrix factorization problem. We generalize and extend Lee and Seung’s algorithm to the stochastic case; thus providing PNMF updates rules, which are guaranteed to converge to the optimal solution.