SUPPORTING ON-THE-FLY DATA INTEGRATION FOR BIOINFORMATICS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Xuan Zhang, M.

*****

The Ohio State University
2007

Dissertation Committee:
Gagan Agrawal, Adviser
Hakan Ferhatosmanoglu
Yusu Wang

Approved by
Adviser
Graduate Program in Computer and Information Science

UMI Number: 3246116

© Copyright by Xuan Zhang 2007

ABSTRACT

The use of computational tools and on-line data knowledgebases has changed the way biologists conduct their research. The fusion of biology and information science is expected to continue. Data integration is one of the challenges faced by bioinformatics.
In order to build an integration system for modern biological research, three problems have to be solved. First, a large number of existing data sources have to be incorporated, and when new data sources are discovered, they should be usable right away. Second, the variety of biological data formats and access methods has to be addressed. Finally, the system has to be able to understand the rich and often fuzzy semantics of biological data.

Motivated by the above challenges, a system and a set of tools have been implemented to support on-the-fly integration of biological data. Metadata about the underlying data sources are the backbone of the system. Data mining tools have been developed to help users write the metadata descriptors semi-automatically. Using an automatic code generation approach, we have developed several tools for bioinformatics integration needs. An automatic data wrapper generation tool is able to transform data between heterogeneous data sources. Another code generation system can create programs to answer projection, selection, cross product, and join queries from flat-file data.

Real bioinformatics requests have been used to test our system and tools. These case studies show that our approach can reduce the human effort involved in an information integration system. Specifically, it makes the following contributions.
1) Data mining tools allow new data sources to be understood with ease and integrated into the system on-the-fly.
2) Changes in data format are localized by using the metadata descriptors. System maintenance cost is low.
3) Users interact with our system through high-level declarative interfaces. Programming effort is reduced.
4) Our tools process data directly from flat files and require no database support. Data parsing and processing are done implicitly.
5) Request analysis and request execution are separated, and our tools can be used in a data grid environment.

This is dedicated to the ones I love. To my parents, who believe in women in engineering. To my husband, who never stops criticizing. And to my daughter, whose smile is the best reward in the world.

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to my advisor, Professor Gagan Agrawal. He has been a great mentor and a wonderful colleague to me. I am so fortunate to have had the opportunity to learn from him not only how to conduct research but also how to be a better person.
I also want to sincerely thank Professor Hakan Ferhatosmanoglu and Professor Yusu Wang for serving on my dissertation committee.

., Biological Science and Biotechnology
July, 1999 ., Biological Science and Biotechnology
March, 2003 ., Electrical and Computer Engineering
2003-present. Graduate Research Associate, Ohio State University.

PUBLICATIONS

Xuan Zhang, Ruoming Jin, Gagan Agrawal. “Assigning Schema Labels Using Ontology And Heuristics”. In Proceedings of IEEE Symposium on Bioinformatics and Bioengineering (BIBE’06), October 2006.

Xuan Zhang, Gagan Agrawal. “A Tool for Supporting Integration Across Multiple Flat-File Datasets”. In Proceedings of IEEE Symposium on Bioinformatics and Bioengineering (BIBE’06), October 2006.

Xuan Zhang, Gagan Agrawal. “Enabling Information Integration and Workflows in a Grid Environment with Automatic Wrapper Generation”. In Proceedings of IEEE/ACM International Workshop on Grid Computing (GRID2005), November 2005.

Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal. “Using data mining techniques to learn layouts of flat-file biological datasets”. In Proceedings of IEEE Symposium on Bioinformatics and Bioengineering (BIBE’05), October 2005.

Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal. “Learning layouts of biological datasets semi-automatically”. In Proceedings of International Workshop on Data Integration in the Life Sciences (DILS’05), July 2005.

Xuan Zhang, Xiaoyang Gao, Gagan Agrawal. “Integrated Retrieval from Biological Databases Using an SQL Extension”. In Proceedings of Workshop on Bioinformatics and Computational Biology (BCB2003), December 2003.

Leonid Glimcher, Xuan Zhang, and Gagan Agrawal. “Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware”. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS2004), April 2004.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in Bioinformatics Integration System: Prof.
Gagan Agrawal

TABLE OF CONTENTS

Abstract
List of Tables
List of Figures
.1 Biological Information Integration Systems
.2 Grid Projects on Bioinformatics
.6 Semantic and Ontology
.1 Overall Context, Challenges, and System Overview
.1 Challenges in Schema Mining
.2 Summary of the Steps
.1 Data Cleaning and Summarization
.3 Mining with Ontology
.4 Mining with Heuristics
Automatic Wrapper Generation
.2 Technical Issues and Challenges
.3 Metadata Description Language
.4 System Implementation and Key Algorithms
.1 Wrapper Generation System
.5 Case Studies and Experimental Results
.1 TRANSFAC-to-Reference
.2 SWISSPROT-to-FASTA
Query Multiple Flat-File Datasets
.1 Challenges and Our Approach
.1 POST-BLAST QUERY
.2 CHIP-SUPPLEMENT QUERY
Query Flat-File Datasets Using Indices
.1 Challenges and System Overview
.1 Indexing Biological Data
.2 Algorithms and System Implementation
.3 Query Execution: The Query-Proc Program
.1 General Database Search with Index
.2 Similarity Search on Sequence Databases
.2 Case Study I: Gene Name Nomenclature
.I: Nomenclature Across Species
.II: Nomenclature Over Time
.3 Case Study II: Correlation Between Gene’s Function and Location
7.1 Understandability and Usability
.1 Ontology for bioinformatics tools
.2 Reason about workflows

LIST OF TABLES

3.1 Profile Table for Token Categorization
.2 Schema Mining Algorithm Evaluation
.1 WRAPINFO data structure for the TRANSFAC-to-Reference Example
7.1 Summary of Databases
.2 Usage of Registered Gene Names
.3 Usage of Gene Names in Other Communities
.4 Summary of Major Cellular Component and Molecular Function GO Terms

LIST OF FIGURES

3.1 Overview of Metadata Learning for Biological Data
.2 General Function for Schema Mining Score Calculation
.3 Score Calculation with Heuristics
.5 Pseudo-code of Approximate Frequent Token Mining Algorithm
.6 Score Calculation with Ontology
.7 Results of Attribute Labelling with Ontology
.8 Results of Attribute Labelling with Heuristics
.1 Overview of the Wrapper Generation System
.2 The Descriptor for the Reference Table in the TRANSFAC-to-Reference Example
.3 Automatic Generated Schema Mapping File for the TRANSFAC-to-Reference Example
.4 Logical View of TRANSFAC Data Layout as a Tree
.5 Overview of the Wrapper
.6 The Algorithm for DataReader of Wrapper
.7 The Algorithm for DataWriter of Wrapper
.8 Results from TRANSFAC-to-Reference Problem
.9 Results from SWISSPROT-to-FASTA Problem
.10 The Descriptor for TRANSFAC in the TRANSFAC-to-Reference Example
.1 Overview of the System
.2 Query for POST-BLAST example
.3 Types of Query Specified with Query Language
.4 Internal Representation of the metadata for BLASTP
.5 QUERYINFOR for POST-BLAST Example
.6 Value Buffer for POST-BLAST Example
.7 Performance on POST-BLAST Example
.8 Performance on CHIP-SUPPLEMENT Example
.9 Algorithm for the Synchronizer of query-proc
.1 Overview of the Query System Using Indices
.3 The Metadata Descriptor for Yeast Genome
.4 QUERYINFOR for Example Yeast Genome Query
.5 Performance of Answering BLAST-ENHANCE Query
.6 Performance of CYGD Similarity Search Using Singh’s Algorithm
.7 Performance of GENBANK Similarity Search Using Ferhatosmanoglu’s Algorithm
.8 Algorithm of Example Indexing Functions for Yeast Genome IDs
.9 The Algorithm for the Synchronizer Using Indices
.1 Overview of the On-the-Fly Biological Data Integration System and Tools
.2 The Metadata Descriptor for dictyBase
.3 Performance of Entry Selection by Species
.4 Trends of Nomenclature Between Swiss-Prot and Genome Databases
7.5 Performance of Historical Analysis
.6 Correlation Analysis Workflow
.7 Correlation Between Cellular Component and Molecular Functions
.8 The Modification of Descriptor When Swiss-Prot Format Changes
.9 Identification of Gene Name Attributes Using Schema Labelling Tool

CHAPTER 1

INTRODUCTION

In this dissertation, a framework and a set of tools have been proposed and implemented for the on-the-fly integration of biological data.
They can minimize the human involvement needed to integrate new resources and reduce the maintenance cost when participating autonomous data resources are updated. Our approaches are mainly based on data mining and code generation.

1.1 Motivation

Biologists today spend a large amount of time and effort querying multiple remote or local data sources, running data analysis programs, and interpreting the results. As a result, integration has become an important phase in the biology research process. Integration allows biologists to combine knowledge from multiple disciplines [56, 110, 47, 88, 53] and has become a critical issue in biological research in recent years. However, the explosion of biological data and computation resources has made manual integration no longer feasible.

First, the quantity of biological data is overwhelming. In August 2005, the INSDC announced that the DNA sequence database exceeded 100 gigabases [13]. GenBank statistics (please see http://www.gov/Genbank/) showed that it contained 65,369,091,950 bases in 61,132,599 sequence records in its traditional divisions as of August 2006 [14]. New biological data is being produced at a phenomenal rate. It has been reported that, on average, biological databases grow exponentially and double in size about every 15 months [12]. The number of data depositories is increasing, too. Manually tracking all the data resources is infeasible.

Second, the interoperability between these biological services is poor. These data resources are usually developed autonomously and may represent the same kind of information in different ways. They come in a variety of formats and may be organized in flat files or in relational or object-oriented databases. One main reason for the variety of data representations is that biological concepts are usually complex and the data are semi-structured.
Another reason is that collaboration between different data authorities is limited, so few constraints are imposed when data representation formats are designed. Unlike data in classic database systems, biological data is usually accessible through user-friendly web interfaces and downloadable files. For example, a biologist using microarray technology to uncover the genetic basis of a disease needs to go through the following steps: 1) mapping the site of a reactive spot in the microarray output to its gene sequence, 2) comparing the sequence to known sequences to find protein or DNA homologues, 3) mining information about these homologues, and 4) annotating the unknown sequence with information from the mined sources. The whole process involves querying multiple distributed databases, including sequence databases such as SWISSPROT, annotation databases such as GeneCards, and literature databases such as PubMed. These databases communicate their query results differently. Their formats range from the ASN.1 format for SWISSPROT and the loosely structured HTML format for GeneCards to the structured XML format for PubMed. This microarray research process also involves computational tools, such as BLAST, that require their inputs in particular formats. The heterogeneity between the data layouts prevents the biologist from carrying out the workflow directly.
For example, the biologist cannot run a BLAST search on SWISSPROT directly, because the BLAST program requires sequences to be stored in FASTA format, whereas SWISSPROT data are stored in a different and much more complicated form.

Third, a variety of tools exist that assist biologists in searching, mining, and analyzing biological data. Well-known examples are FASTA [81], BLAST [5], and ClustalW [101]. Most of these tools are free, available either as downloadable source code or through Web interfaces. They are important for many analysis workflows, and an integration system without any tools offers limited support for bioinformatics research. Several collections of computer applications are freely available to the public. Examples include the online list at Bioexplorer.Net 2 and the book Bioinformatics: Methods and Protocols by Stephen Misener and Stephen A.