Thuật toán tối ưu cho phân tích và phân loại hồ sơ di truyền từ dữ liệu đo lường không đầy đủ

Chuyên khảo phân tích Optimization algorithms for inference and classification of genet, đánh giá các khía cạnh quan trọng, đề xuất hướng nghiên cứu tiếp theo.

Trường đại học

Rowan University

Chuyên ngành

Electrical & Computer Engineering

Người đăng

Ẩn danh

Thể loại

thesis

2014

Phí lưu trữ

35 Point

Mục lục chi tiết

ACKNOWLEDGEMENTS

Abstract

1. Introduction

1.1. Research Objectives

1.2. Research Contribution

1.3. Organization

2. PNMF: Theory & Application To Microarray Data Analysis

2.1. Introduction

2.2. Non-negative Matrix Factorization

2.3. Probabilistic Non-negative Matrix Factorization

2.4. PNMF-based Data Classification

2.5. Application to Gene Microarrays

2.6. Conclusion and Discussion

3. High-Dimension SMURC Estimation

3.1. Introduction

3.2. The Normalized-Likelihood

3.3. Application: Genetic Regulatory Networks

3.4. Conclusion and Discussion

4. Kernel Reconstruction

4.1. Conclusion

Bibliography

Appendix

List of Figures

List of Tables

Tóm tắt

I. Tổng quan về thuật toán tối ưu cho phân tích di truyền

Thuật toán tối ưu cho phân tích và phân loại hồ sơ di truyền là một lĩnh vực quan trọng trong nghiên cứu di truyền học. Các thuật toán này giúp xử lý và phân tích dữ liệu di truyền phức tạp, từ đó rút ra những thông tin giá trị về gen và các yếu tố di truyền khác. Việc áp dụng các phương pháp tối ưu hóa không chỉ nâng cao độ chính xác trong phân loại mà còn giúp phát hiện các mẫu gen tiềm ẩn.

1.1. Khái niệm về thuật toán tối ưu trong di truyền

Thuật toán tối ưu trong di truyền học thường được sử dụng để phân tích dữ liệu gen và phát hiện các mối quan hệ giữa các gen. Các phương pháp như NMF (Non-negative Matrix Factorization) đã chứng minh hiệu quả trong việc phân loại và phân tích dữ liệu di truyền.

1.2. Tầm quan trọng của phân tích hồ sơ di truyền

Phân tích hồ sơ di truyền giúp hiểu rõ hơn về các bệnh di truyền và các yếu tố ảnh hưởng đến sức khỏe con người. Điều này có thể dẫn đến những phát hiện mới trong y học và cải thiện các phương pháp điều trị.

II. Thách thức trong phân tích và phân loại hồ sơ di truyền

Mặc dù có nhiều tiến bộ trong lĩnh vực này, nhưng vẫn tồn tại nhiều thách thức trong việc phân tích và phân loại hồ sơ di truyền. Dữ liệu di truyền thường rất lớn và phức tạp, điều này đòi hỏi các thuật toán phải có khả năng xử lý hiệu quả và chính xác. Ngoài ra, sự hiện diện của nhiễu trong dữ liệu cũng là một vấn đề lớn cần được giải quyết.

2.1. Vấn đề về dữ liệu không đầy đủ

Dữ liệu di truyền thường không đầy đủ hoặc bị thiếu sót, điều này có thể ảnh hưởng đến độ chính xác của các thuật toán phân loại. Việc phát triển các phương pháp tối ưu hóa để xử lý dữ liệu thiếu là rất cần thiết.

2.2. Nhiễu trong dữ liệu di truyền

Nhiễu trong dữ liệu di truyền có thể đến từ nhiều nguồn khác nhau, bao gồm cả lỗi trong quá trình thu thập dữ liệu. Các thuật toán cần phải có khả năng phân biệt giữa tín hiệu và nhiễu để đảm bảo độ chính xác cao.

III. Phương pháp tối ưu hóa trong phân tích hồ sơ di truyền

Có nhiều phương pháp tối ưu hóa được áp dụng trong phân tích hồ sơ di truyền, bao gồm NMF, SMURC và các thuật toán học máy khác. Những phương pháp này không chỉ giúp cải thiện độ chính xác mà còn tăng cường khả năng phát hiện các mẫu gen phức tạp.

3.1. Non negative Matrix Factorization NMF

NMF là một phương pháp mạnh mẽ trong phân tích dữ liệu di truyền, cho phép phân tách dữ liệu thành các thành phần không âm. Điều này giúp dễ dàng hơn trong việc phân tích và phân loại các mẫu gen.

3.2. SMURC Phương pháp hồi quy đa biến nhỏ mẫu

SMURC là một phương pháp mới giúp giải quyết các vấn đề hồi quy đa biến trong điều kiện mẫu nhỏ. Phương pháp này đã cho thấy hiệu quả cao trong việc phân tích các mạng điều hòa gen.

IV. Ứng dụng thực tiễn của thuật toán tối ưu trong di truyền

Các thuật toán tối ưu hóa đã được áp dụng rộng rãi trong nhiều lĩnh vực khác nhau, từ y học đến nông nghiệp. Việc phân tích hồ sơ di truyền không chỉ giúp phát hiện các bệnh di truyền mà còn hỗ trợ trong việc phát triển các giống cây trồng mới.

4.1. Ứng dụng trong y học

Trong y học, các thuật toán tối ưu hóa giúp phát hiện các bệnh di truyền và xác định các yếu tố nguy cơ. Điều này có thể dẫn đến các phương pháp điều trị cá nhân hóa hiệu quả hơn.

4.2. Ứng dụng trong nông nghiệp

Trong nông nghiệp, phân tích hồ sơ di truyền giúp phát triển các giống cây trồng có khả năng chống chịu tốt hơn với bệnh tật và điều kiện môi trường khắc nghiệt.

V. Kết luận và tương lai của thuật toán tối ưu trong di truyền

Tương lai của thuật toán tối ưu trong phân tích hồ sơ di truyền rất hứa hẹn. Với sự phát triển của công nghệ và các phương pháp mới, khả năng phân tích và phân loại hồ sơ di truyền sẽ ngày càng chính xác và hiệu quả hơn. Điều này sẽ mở ra nhiều cơ hội mới trong nghiên cứu và ứng dụng thực tiễn.

5.1. Xu hướng phát triển công nghệ

Công nghệ mới như học sâu và trí tuệ nhân tạo sẽ tiếp tục cải thiện khả năng phân tích dữ liệu di truyền, giúp phát hiện các mẫu gen phức tạp hơn.

5.2. Tác động đến y học và nông nghiệp

Sự phát triển của các thuật toán tối ưu hóa sẽ có tác động lớn đến y học và nông nghiệp, giúp cải thiện sức khỏe con người và năng suất cây trồng.

25/07/2025

Bạn đang xem trước tài liệu:

Optimization algorithms for inference and classification of genet

Tải đầy đủ

Trích đoạn nội dung tài liệu

Rowan University Rowan Digital Works Theses and Dissertations 9-2-2014 Optimization algorithms for inference and classification of genetic profiles from undersampled measurements Belhassen Bayar Follow this and additional works at: https://rdw.edu/etd Part of the Electrical and Computer Engineering Commons Recommended Citation Bayar, Belhassen, "Optimization algorithms for inference and classification of genetic profiles from undersampled measurements" (2014). Theses and Dissertations.edu/etd/410 This Thesis is brought to you for free and open access by Rowan Digital Works. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Rowan Digital Works. For more information, please contact graduateresearch@rowan.

OPTIMIZATION ALGORITHMS FOR INFERENCE AND CLASSIFICATION OF GENETIC PROFILES FROM UNDERSAMPLED MEASUREMENTS by Belhassen Bayar A Thesis Submitted to the Department of Electrical & Computer Engineering College of Engineering In partial fulfillment of the requirement For the degree of Master of Science at Rowan University June 2014 Thesis Chair: Nidhal Bouaynaya © 2014 Belhassen Bayar ACKNOWLEDGEMENTS I want to express my sincere gratitude to Dr. Nidhal Bouaynaya, my supervisor who has always bothered to offer me the best working conditions possible. I thank her for her wide availability, her high scientific qualifications and her guidance, illuminating discussions related to this work and beyond, encouragement, moral and financial support in this research. I express my appreciation and gratitude to Dr.

Roman Shterenberg , Associate Professor at the University of Alabama at Birmingham USA, for the time he spent with me, his availability even when he was abroad and the valuable advice he has given me throughout my research. I also would like to express my deep and sincere gratitude to Dr. Robi Polikar, Professor & Chair at the ECE Department, for the high quality courses he teaches, his availability and eagerness to provide the best learning experience for students at the department. Many thanks to all the students who accompanied me during these years and have continued to create a good working atmosphere within the laboratory.

Deepest thanks to my dear parents and grandmother to whom I owe so much. I would have neither the means nor the strength to accomplish this work without them. I also want to express my gratitude to my friends who have continued to give me the moral and intellectual support throughout my work during all the good and bad moments. They always say the best is for the end, that’s why I dedicate this project to my dear sister, my little light that gave me energy and courage.

iii Abstract Belhassen Bayar OPTIMIZATION ALGORITHMS FOR INFERENCE AND CLASSIFICATION OF GENETIC PROFILES FROM UNDERSAMPLED MEASUREMENTS 2014/06 Nidhal Bouaynaya, Ph. Master of Science in Electrical & Computer Engineering In this thesis, we tackle three different problems, all related to optimization tech- niques for inference and classification of genetic profiles. First, we extend the de- terministic Non-negative Matrix Factorization (NMF) framework to the probabilistic case (PNMF). We apply the PNMF algorithm to cluster and classify DNA microar- rays data.

The proposed PNMF is shown to outperform the deterministic NMF and the sparse NMF algorithms in clustering stability and classification accuracy. Sec- ond, we propose SMURC: Small-sample MUltivariate Regression with Covariance estimation. Specifically, we consider a high dimension low sample-size multivariate regression problem that accounts for correlation of the response variables. We show that, in this case, the maximum likelihood approach is senseless because the likeli- hood diverges.

We propose a normalization of the likelihood function that guaran- tees convergence. Simulation results show that SMURC outperforms the regularized likelihood estimator with known covariance matrix and the state-of-the-art sparse Conditional Graphical Gaussian Model (sCGGM). In the third Chapter, we derive a new greedy algorithm that provides an exact sparse solution of the combinatorial `0 - optimization problem in an exponentially less computation time. Unlike other greedy approaches, which are only approximations of the exact sparse solution, the proposed greedy approach, called Kernel reconstruction, leads to the exact optimal solution.

iv Table of Contents List of Figures vii List of Tables viii 1 Introduction 1 1.3 Organization 2 2 PNMF: Theory & Application To Microarray Data Analysis 4 2.2 Non-negative Matrix Factorization 10 2.3 Probabilistic Non-negative Matrix Factorization 13 2.4 PNMF-based Data Classification 15 2.5 Application to Gene Microarrays 19 2.6 Conclusion and Discussion 34 3 High-Dimension SMURC Estimation 36 3.2 The Normalized-Likelihood 41 3.3 Application: Genetic Regulatory Networks 53 3.4 Conclusion and Discussion 60 v 4 Kernel Reconstruction V.4 Conclusion 79 Bibliography 80 A Appendix 86 vi List of Figures 2.1 Clustering results for the Leukemia dataset 20 2.2 Metagenes expression patterns versus the samples for k = 4 21 2.3 Clustering results for the Medulloblastoma dataset 22 2.4 Clustering Percentage Error versus Nbr.5 The cophenetic coefficient versus the standard deviation 29 2.6 Cophenetic versus SNR in dB in Leukemia dataset 30 2.7 Cophenetic versus SNR in dB in Medulloblastoma dataset 31 3.1 Approximation of the optimization problem in Proposition 4 49 3.2 Approximation error ||S − S ∗ ||F /||S||F versus n 51 3.3 Performance comparison of SMURC with sCGGM and RMLE 53 3.4 The known undirected gene interactions in the Drosophila 57 3.5 Estimated gene regulatory networks of the Drosophila 57 4.1 Performance comparison of KR with `1 -based and `2 -based CS for N = 10 78 4.2 Performance comparison of KR with `1 -based and `2 -based CS for N = 20 78 vii List of Tables 2.1 Smallest SNR value for ρ ≥ 0.1 Detection of the known gene interactions in Flybase 60 viii Chapter 1 Introduction 1.1 Research Objectives We outline the goal of this research through the following objectives: 1. Study and analyse the Non-negative Matrix Factorization (NMF) and propose a probabilistic extension to NMF (PNMF) for data corrupted by noise. Build a PNMF-based classifier and apply it for tumor classification from gene expression data. Derive a convex optimization algorithm for the solution of an under-determined multivariate regression problem.

Apply the proposed algorithm to infer genetic regulatory networks from gene expression data. Derive a greedy algorithm for exact reconstruction of sparse signals from a limited number of observations.2 Research Contribution This work contributes to the field of computational bioinformatics and biology through the application of the signal processing algorithms aiming to study and analyze the microarray data. Our work shifts the focus of the genomic signal processing commu- nity from analyzing the genes expression patterns and samples clusters to considering 1 the mathematical aspect of the algorithm and deriving its application in the stochas- tic work. We also focus on solving under-determined multivariate regression systems in order to infer gene regulatory networks.

These networks are known to be sparse, therefore, we have a great interest in studying the compressive sensing approach which recovers sparse signal from linear model. Specific contributions of this work include: The improvement of the mathematical proof for the NMF algorithm by provid- ing a general evidence (see Appendix preposition 2). The development of a new NMF algorithm for the noisy Microarray data in order to improve the basic NMF approach and to predict some hidden data features. Solving under-determined multivariate regression systems to infer gene regula- tory networks using our new SMURC algorithm.

Recover k-sparse signal using our new approach, called Kernel Reconstruction, that guarantees an exact reconstruction and less computational time comparing to the `0 -based compressive sensing approach [18].3 Organization This thesis is organized as follows. In Chapter 2, we study and analyze the Non-negative Matrix Factorization and de- rive its probabilistic approach that we call PNMF algorithm and then we derive its corresponding update rules. The proof of the developed approaches is provided in 2 the Appendix chapter. We compare the performance of our PNMF approach with its homologues in clustering as well as classification.

In Chapter 3, we develop a new approach, called Small-sample MUltivariate Re- gression with Covariance Estimation (SMURC), to solve under-determined multivari- ate regression systems. We use this approach to infer gene regulatory networks. We compare our algorithm to other techniques cited in related works and using a syn- thetic data. Subsequently, we apply our approach to infer the know interactions in the Drosophila’s 11-gene wing muscle network.

Finally, in Chapter 4 we provide a complete review of the compressive sensing technique. We also come up with a new approach that performs an exact reconstruc- tion of a sparse signal. We call this approach, Kernel Reconstruction, and we compare it with what has been suggested in the related work. 3 Chapter 2 Probabilistic Non-negative Matrix Factorization: Theory and Application to Microarray Data Analysis 2.1 Introduction Extracting knowledge from experimental raw data and measurements is an important objective and challenge in signal processing.

Often data collected is high dimensional and incorporates several inter-related variables, which are combinations of underly- ing latent components or factors. Approximate low-rank matrix factorizations play a fundamental role in extracting these latent components [14]. In many applica- tions, signals to be analyzed are non-negative, e., pixel values in image processing, price variables in economics and gene expression levels in computational biology. For such data, it is imperative to take the non-negativity constraint into account in or- der to obtain a meaningful physical interpretation.

Classical decomposition tools, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Blind Source Separation (BSS) and related methods do not guarantee to maintain the non-negativity constraint. Non-negative matrix factorization (NMF) represents non-negative data in terms of lower-rank non-negative factors. NMF proved to be a powerful tool in many applications in biomedical data processing and analysis, 4 such as muscle identification in the nervous system [54], classification of images [29], gene expression classification [10], biological process identification [32] and transcrip- tional regulatory network inference [38]. The appeal of NMF, compared to other clustering and classification methods, stems from the fact that it does not impose any prior structure or knowledge on the data.

Brunet et al. successfully applied NMF to the classification of gene expression datasets [10] and showed that it leads to more accurate and more robust clustering than the Self-Organizing Maps (SOMs) and Hierarchical Clustering (HC). Analytically, the NMF method factors the original non-negative matrix V into two lower rank non-negative matrices, W and H such that V = W H + E, where E is the residual error. Lee and Seung [33] derived algorithms for estimating the optimal non-negative factors that minimize the Euclidean distance and the Kullback-Leibler divergence cost functions.

Their algorithms, guaranteed to converge, are based on multiplicative update rules, and are a good compromise be- tween speed and ease of implementation. In particular, the Euclidean distance NMF algorithm can be shown to reduce to the gradient descent algorithm for a specific choice of the step size [33]. Lee and Seung’s NMF factorization algorithms have been widely adopted by the community [6, 10, 19, 59]. The NMF method is, however, deterministic.

That is, the algorithm does not take into account the measurement or observation noise in the data. On the other hand, data collected using electronic or biomedical devices, such as gene expression profiles, are known to be inherently noisy and therefore, must be processed and analyzed by systems that take into account the stochastic nature of the data. Furthermore, the ef- fect of the data noise on the NMF method in terms of convergence and robustness has 5 not been previously investigated. Thus, questions about the efficiency and robustness of the method in dealing with imperfect or noisy data are still unanswered.

In this chapter, we extend the NMF framework and algorithms to the stochastic case, where the data is assumed to be drawn from a multinomial probability den- sity function. We call the new framework Probabilistic NMF or PNMF. We show that the PNMF formulation reduces to a weighted regularized matrix factorization problem. We generalize and extend Lee and Seung’s algorithm to the stochastic case; thus providing PNMF updates rules, which are guaranteed to converge to the optimal solution.

Nội dung được bảo vệ bản quyền — Tải xuống đầy đủ