University of New Orleans
ScholarWorks@UNO

University of New Orleans Theses and Dissertations
Dissertations and Theses

Summer 8-4-2011

SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways

Thair Judeh
University of New Orleans, tjudeh@uno.edu

Follow this and additional works at: https://scholarworks.edu/td
Part of the Computer Sciences Commons

Recommended Citation
Judeh, Thair, "SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways" (2011). University of New Orleans Theses and Dissertations.edu/td/463

This Thesis-Restricted is protected by copyright and/or related rights. It has been brought to you by ScholarWorks@UNO with permission from the rights-holder(s). You are free to use this Thesis-Restricted in any way that is permitted by the copyright and related rights legislation that applies to your use.
For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself. This Thesis-Restricted has been accepted for inclusion in University of New Orleans Theses and Dissertations by an authorized administrator of ScholarWorks@UNO. For more information, please contact scholarworks@uno.

SEA: a novel computational and GUI software pipeline for detecting activated biological sub-pathways

A Thesis

Submitted to the Graduate Faculty of the University of New Orleans in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science
Bioinformatics

by

Thair Judeh

B. Loyola University New Orleans, 2005

August, 2011

Copyright 2011, Thair Judeh

Acknowledgments

I thank God who gave me the perseverance to complete this thesis and to Whom I owe all good in this life. Furthermore, I thank my major professor Dr. Dongxiao Zhu whom I hope to one day emulate in his dedication to his work and his advisees. Without a doubt I have greatly benefited from his guidance and expertise.
I also thank the other committee members, Dr. Adlai DePano and Dr. Christopher Summa, for their invaluable advice and stimulating discussions. I also thank my colleague Lipi Acharya, with whom I have collaborated on many interesting research projects.
I also thank the Research Institute for Children and Tulane University for the generous funding they provided in support of the research that Dr. Zhu and I undertook, and the Department of Computer Science at UNO for providing me with an assistantship to support my graduate studies. Special thanks are due to my family. I thank my mother, who has always sought to instill in my siblings and me a sense of responsibility.
I thank my father who sacrificed greatly to ensure the quality of the education that I received throughout my life. Finally, I thank my beloved wife Honida who has constantly pushed me to excel in my research and in life in general.

Table of Contents

List of Figures
Chapter 1: Background and Introduction
Chapter 2: Network Reconstruction
Chapter 3: Network Partitioning
.4 Kernighan-Lin Algorithm
.5 Girvan-Newman Algorithm
.6 Clique Percolation Method
.2 The Work of Chen et al.
.8 Goals and Original Contributions
.10 Retrieving NCBI Gene IDs
.11 Decomposing the Pathways
.2 Nonlinear Regulatory Modules
.13 Scoring the Sub-pathways
.14 The Graphical User Interface (GUI)
.1 Updating the List of Organisms
.2 Selecting or Updating an Organism
.3 Loading Profile Data
.4 Selecting a Subset of Sub-pathways
.5 Ranking the Sub-pathways
.7 Saving and Loading Results

List of Figures

1.1 The Big Picture
.1 LPA Problem Statement
.2 LPA Input Generation
.5 LPA Growth Stage
.6 LPA Pruning Stage
.7 LPA Intersection Stage
.2 Directed Versus Undirected Communities
.3 Zachary’s Karate Network
.3 Duplicates in KEGG Pathways
.4 Root to Leaf Linear Path Illustration
.5 SEA Quick Start Guide

Abbreviations

API      Application Programming Interface
BIC      Bayesian Information Criterion
BNT      Bayes Net Toolbox
COSINE   COndition-SpecIfic sub-Network
CPD      Conditional Probability Distribution
CPM      Clique Percolation Method
CPT      Conditional Probability Table
DAG      Directed Acyclic Graph
DFS      Depth First Search
DNA      DeoxyriboNucleic Acid
FTP      File Transfer Protocol
GenMAPP  Gene Map Annotator and Pathway Profiler
GSGS     Gene Set Gibbs Sampler
GUI      Graphical User Interface
KEGG     Kyoto Encyclopedia of Genes and Genomes
KGML     KEGG Markup Language
LPA      Linear Path Augmentation
MLE      Maximum Likelihood Estimator
mRNA     messenger RNA
NCBI     National Center for Biotechnology Information
PPI      Protein-Protein Interaction
RNA      RiboNucleic Acid
SEA      Structure Enrichment Analysis
SOAP     Simple Object Access Protocol
TPM      Transitional Probability Matrix
WSDL     Web Service Definition Language
XML      Extensible Markup Language

Abstract

With the ever-increasing amount of high-throughput molecular profile data, biologists need versatile tools that enable them to quickly and succinctly analyze their data.
Furthermore, pathway databases have grown increasingly robust, with the KEGG database at the forefront. Previous tools have color-coded the genes on different pathways using differential expression analysis. Unfortunately, they do not adequately capture the relationships among the genes. Structure Enrichment Analysis (SEA) thus seeks to take biological analysis to the next level.
SEA accomplishes this goal by highlighting for users the sub-pathways of a biological pathway that best correspond to their molecular profile data in an easy-to-use GUI.

Keywords: Network Partitioning, Network Reconstruction, Structure Enrichment Analysis, Community Detection Algorithms, Biological Networks, KEGG

Chapter 1: Background and Introduction

Biological systems form a vast and complex web of regulatory processes and biomolecular interactions. An underlying goal for biologists is to arrive at a theory that sheds light on the complicated interaction patterns in living organisms. These interaction patterns give rise to various biological phenomena, and recognizing them can provide much needed insight into biomolecular activities.
Capturing these biomolecular activities, however, is a daunting task due to the complexity of the systems at hand as well as the lack of data needed to fully capture the underlying biomolecular activities. Thus, two problems have recently received a considerable amount of attention: (1) inferring biological pathway structures from gene expression data and gene sets and (2) decomposing different biological pathway structures into functional units.

A revolution in the understanding of biomolecular interaction mechanisms has occurred in large part due to the rapid and significant advances in high-throughput technologies. Technologies such as microarrays and second-generation sequencing now enable a systematic study of biomolecular activities thanks to the copious genome-wide measurements they produce.
These genome-wide measurements continue to be accumulated into numerous databases by research labs across the world. Unfortunately, gaining biological insights from large-scale gene expression data is a daunting task due to the curse of dimensionality. To overcome this difficulty, many computational and experimental models have been developed to group genes into various sets based on either structural or functional similarity. This led to the birth of gene sets as a new source of data, which in turn led to a burst of novel algorithms that infer biological pathway structures from gene sets.
These two types of data, gene expression data and gene sets, will now be examined in more detail.

First, gene expression data is represented as a matrix of numerical values. Each row corresponds to a gene while each column corresponds to an experiment. Each entry of the matrix corresponds to the gene expression level for a given gene under a given experiment.
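The matrix layout described above can be sketched in a few lines of Python. This is only an illustration: the gene names, experiment names, expression values, and the helper functions `expression_level` and `gene_profile` are all hypothetical and do not come from any real data set or from the SEA software.

```python
# Hypothetical toy data: 3 genes (rows) x 4 experiments (columns).
genes = ["geneA", "geneB", "geneC"]
experiments = ["exp1", "exp2", "exp3", "exp4"]

# expression[i][j] is the expression level of genes[i] under experiments[j].
expression = [
    [2.1, 2.4, 0.3, 0.2],
    [1.9, 2.2, 0.4, 0.1],
    [0.5, 0.4, 3.0, 3.2],
]


def expression_level(gene, experiment):
    """Return the single matrix entry for one gene under one experiment."""
    return expression[genes.index(gene)][experiments.index(experiment)]


def gene_profile(gene):
    """Return one full row: the gene's level in every experiment."""
    return dict(zip(experiments, expression[genes.index(gene)]))
```

For instance, `expression_level("geneC", "exp3")` looks up row `geneC` and column `exp3`, while `gene_profile("geneA")` returns the whole row keyed by experiment name.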
Gene expression profiling has thus allowed the simultaneous measurement of the expression levels of thousands of genes. A systematic study of biomolecular interaction mechanisms is now possible on a genomic scale. One typical example of gene expression data is microarray data. For microarray data one usually has a glass slide that is coated with oligonucleotides corresponding to specific gene coding regions.
Purified RNA is then labeled and hybridized to the slide. A laser scans the washed microarray slide to obtain gene expression data. Ways to obtain genome-wide measurements have also grown. There is a wide array of microarray platforms, and genome-wide measurements can be obtained via conventional hybridization-based microarrays [14, 20, 31] or deep sequencing experiments [32, 33].
Some representative microarray platforms include Agilent Microarray, Affymetrix GeneChip, and Illumina BeadArray.

Moving on to gene sets, a gene set is defined as a group of genes that share biological similarities. Gene sets are a rich source of data for reconstructing the structure of biological pathways, as their member genes tend to participate in the same biological process. Gene sets are derived from a variety of sources including PubMed text, ChIP-chip, co-localization along a chromosome, and gene expression data.
Various methods exist to rank gene sets, with GSEA-P [34] being one of the most popular. A major advantage of working with gene sets is that they readily incorporate higher-order interaction patterns. They are also more robust to noise than gene expression data and are capable of integrating data from a variety of sources. Given the ways a gene set may be derived, one must keep in mind that not every gene set necessarily represents a network structure.

An important underlying assumption when trying to reconstruct a biological pathway structure using gene sets or gene expression data is that these sets of data were originally emitted from unobserved signaling pathways. There are various algorithms based on this assumption that attempt to reconstruct the structure of biological pathways using gene sets and/or gene expression data. First, a biological pathway structure is a graph G(V, E), where V is the set of vertices (or nodes) and E is the set of edges.
In the case of biological pathways, a vertex v ∈ V may be either a gene or a protein, whereas an edge e ∈ E joining two such vertices represents the biological properties connecting them. The final underlying network may be either directed or undirected, and both types of networks occur naturally in biological systems. For example, signal transduction is a typical example of a directed network in biological systems. According to the Central Dogma of Molecular Biology, DNA encodes the genetic information of living organisms.
DNA directs protein synthesis via the formation of messenger RNA (mRNA) [4]. Signal transduction is thus the primary means by which DNA is decoded into mRNA and then into protein. For signal transduction to occur, cytokines or chemokines bind to transmembrane proteins, which in turn triggers a sequential activation of signaling molecules leading to a biological end-point. In this case, a directed edge represents one event in a signal transduction cascade activating another, and a signaling pathway is thus composed of a web of gene regulatory wiring or different transduction events.
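As a rough illustration of the graph G(V, E) view, the following Python sketch stores a pathway as an adjacency map. The function name `build_graph` and all vertex names are hypothetical placeholders chosen for this example. A directed edge records only "source activates target"; an undirected edge, as in a physical interaction network, is stored in both directions.

```python
def build_graph(edges, directed=True):
    """Build an adjacency map {vertex: set of neighbours} from (u, v) pairs.

    Illustrative helper only; it is not part of any published tool.
    """
    adjacency = {}
    for source, target in edges:
        if not directed and source == target:
            # The undirected interaction networks in the text have no self-loops.
            raise ValueError("self-loops are not allowed in an undirected network")
        adjacency.setdefault(source, set()).add(target)
        # Make sure vertices without outgoing edges still appear in V.
        adjacency.setdefault(target, set())
        if not directed:
            adjacency[target].add(source)
    return adjacency


# Directed toy cascade: each event activates the next one.
signaling = build_graph([
    ("ligand", "receptor"),
    ("receptor", "kinase"),
    ("kinase", "transcription_factor"),
])

# Undirected toy interaction network: an edge means a physical interaction.
ppi = build_graph([("proteinX", "proteinY"), ("proteinY", "proteinZ")],
                  directed=False)
```

In `signaling`, the edge from `receptor` to `kinase` has no reverse entry, whereas in `ppi` every interaction is visible from both of its endpoints.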
Undirected networks, on the other hand, are typically exemplified by Protein-Protein Interaction (PPI) networks [35]. These networks have no self-loops, and every vertex is a protein. An edge exists between two proteins if they can physically interact.

Once a biological pathway structure has been reconstructed, one needs to examine it at a finer level, as usually only part of the structure is involved in a biological process of interest.
Thus, decomposing different biological pathway structures into sub-pathways is a must. By retrieving the sub-pathways, one is able to accomplish two major goals: predicting gene functionality and identifying the sub-pathways relevant to different phenotypes. For example, if gene A is clustered with other genes responsible for apoptosis, one may infer that gene A also plays a role in apoptosis. This leads to predicting a new functionality for gene A that may have been previously unknown.
As another example, one may possess cancer molecular profile data. By “enriching” the sub-pathways, one may extract new biological insights about the sub-pathways most relevant to cancer. Figure 1.1 succinctly summarizes the relationships among the various topics discussed in this introduction.

Figure 1.1: The big picture. Gene expression data and gene sets may be converted from one to the other. Furthermore, given gene expression data or gene sets, one can reconstruct different biological pathway structures.
Given that only a sub-pathway is usually activated for a particular biological process, decomposing a biological pathway structure into sub-pathways is a must. From these sub-pathways, one may extract useful biological insights.
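
To make the idea of decomposition concrete, the sketch below enumerates every root-to-leaf linear path of a small directed pathway using depth first search. It is one simple way to split a pathway into candidate sub-pathways, shown under the assumption that the pathway is acyclic; it is not presented as the decomposition procedure used by SEA. The function name `root_to_leaf_paths` and the toy pathway are hypothetical.

```python
def root_to_leaf_paths(adjacency):
    """List all root-to-leaf paths of a directed acyclic adjacency map.

    Roots are vertices with no incoming edge; leaves have no outgoing edge.
    """
    targets = {t for successors in adjacency.values() for t in successors}
    roots = sorted(v for v in adjacency if v not in targets)
    paths = []

    def dfs(vertex, path):
        successors = sorted(adjacency.get(vertex, ()))
        if not successors:
            paths.append(path)  # reached a leaf: record the linear path
            return
        for nxt in successors:
            # Input is assumed acyclic; this guard only prevents
            # infinite recursion if a cycle slips in.
            if nxt not in path:
                dfs(nxt, path + [nxt])

    for root in roots:
        dfs(root, [root])
    return paths


# Hypothetical toy pathway: A activates B and C, and so on.
toy_pathway = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D", "E"},
    "D": set(),
    "E": set(),
}
linear_paths = root_to_leaf_paths(toy_pathway)
```

Each returned path is a candidate sub-pathway that could then be scored against molecular profile data; how such scoring is done is outside the scope of this sketch.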