Algebraic combinatorics for computational biology by Nicholas Kar! Eriksson B. (Massachusetts Institute of Technology) 2001 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Mathematics and the Designated Emphasis in Computational and Genomic Biology in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA, BERKELEY Committee in charge: Professor Bernd Sturmfels, Chair Professor Lior Pachter Professor Elchanan Mossel Spring 2006 UMI Number: 3228316 INFORMATION TO USERS The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted.
Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. ® UMI UMI Microform 3228316 Copyright 2006 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 300 North Zeeb Road P. Box 1346 Ann Arbor, MI 48106-1346 Algebraic combinatorics for computational biology Copyright 2006 by Nicholas Karl Eriksson Abstract Algebraic combinatorics for computational biology by Nicholas Kar! Eriksson Doctor of Philosophy in Mathematics University of California, Berkeley Professor Bernd Sturmfels, Chair Algebraic statistics is the study of the algebraic varieties that correspond to discrete statistical models. Such statistical models are used throughout computational biology, for example to describe the evolution of DNA sequences. This perspective on statistics allows us to bring mathematical techniques to bear and also provides a source of new problems in mathematics.
The central focus of this thesis is the use of the language of algebraic statistics to translate between biological and statistical problems and algebraic and combinato- rial mathematics. The wide range of biological and statistical problems addressed in this work come from phylogenetics, comparative genomics, virology, and the analysis of ranked data. While these problems are varied, the mathematical techniques used in this work share common roots in the field of combinatorial commutative algebra. The main mathematical theme is the use of ideals which correspond to combinatorial objects such as magic squares, trees, or posets.
Biological problems suggest new families of ideals, and the study of these ideals can in some cases be useful for biology. Professor Bernd Sturmfels Dissertation Committee Chair To Nirit il Contents List of Figures iv List of Tables 1 Introduction 11 LÝ:30vs80n 10s".2 Toric ideals and exponential Íamilies.3 Phylogenetic algebraic geometry 2.4 Genomics and phylogenetics. ch ung gi kg và va 1.qaa 2 Markov bases for noncommutative analysis of ranked data 15 21 Election data with five candidates uc uc HQ 0.2 Fourier analysis of group valued data. 0 c k vn vn na ga si xà và 22 2.4 Computing Markov bases for permutation data .5 Structure of the toric ideallg, 2.
HQ Quà gà sa 26 2.6 Statistical analysis of the election data.7 Statistical analysis of an Sy exampDÌ€ uc cv nh va ee 32 3 Toric ideals of homogeneous phylogenetic models 35 3.1 Homogeneous phylogenetic models. 40 4 Tree construction using singular value decomposition 41 The general Markov model. nu vn gà kg kg kia 4.2 Flattenings and rank conditions.3 Singular value decomposition. cu cu ng Qa v kg va 4.4 Tree-construction algorithm.
cv cv vn vu ee 4,5 Building trees with simulated đatâ. cv ng ng g2 va 4.6 Building trees with real data. cv cv ng ngủ es ili 5 Ultra-conserved elements in vertebrate and fly genomes 65 5.1 The data 66 5,2 Ultra-conserved elements. ga kg kg kg xà 69 5.1 Nine-vertebrate algnMenE,.3 Eight-Drosophila alignment.
vu ch v1 kg va 72 5,3 Biology of ultra-conserved element§S.1 Nine-vertebrate alignment. cv ch vn v.3 Eight-Drosophia alignment. cv Quà iu eae 78 5. cv kg kg ki kg va 80 5.4 Statistical significance of ultra-conservation.
82 6 Evolution on distributive lattices 86 6.1 Drug resistancein HIV cv vu ee 87 6.2 The model of evolution. ng gi ga kg àv 88 6.3 Fitness landscapes on distributive lattices .5 Distributive lattices from Bayesian networks.6 Applications to HIV drug resistance.7 Mathematics and computation of the risk polynomial.8 Discussion 112 Bibliography 115 iv List of Figures 1.1 A simple statistical model. cv cv vn 1 2k v gà va 5 1.2 A multiple alignment of 3 DNA sequences.1 Distribution of the projection to S*? for two random walks.1 Polytope for a path with 7 nodes. LH ee es 42 3,2 The polytope of the completely odd binary tree.3 A tree T with 15 nodes where Pr has 34 vertices, 58 edges, and 26 facets.1 Determining the rank of Flat4 g(P) where {A,B} is not asplit.
52 4,2 The 6-taxa tree constructed in Example 4.3 The eight-taxa tree used for simulations.4 Simulation results with branch lengths (a,b) = (0. 61 45 Simulation results with branch lengths (a,b) = (0.6 Two phylogenetic trees for eight mammals.1 Phylogenetic tree for whole genome alignment of 9 vertebrates.2 Phylogenetic tree for whole genome alignment of 8 Drosophila species.3 Frequencies of vertebrate ultra-conserved elements (log;g-scale).4 Frequencies of Drosophila ultra-conserved elements (log;p-scale).5 Functional base coverage of collapsed vertebrate ultra-conserved elements.6 Ultra-conserved sequences found on either side of JRX5.7 Functional base coverage of ultra-conserved elements in ENCODE regions.8 Functional base coverage of ultra-conserved elements in Drosophila.1 HIV protease enzyme with bound inhibitor.2 An event poset, its genotype lattice, and a fitness landscape.3 An event poset whose risk polynomial is of degree 11 in 375 unknowns.4 Mutagenetic trees for ritonavir and indinavir.5 Graded resistance landscapes for ritonavir and indinavir.6 Risk as a function of drug dosage for indinavir and ritonavir. 106 List of Tables 21 American Psychological Association ranked voting data.2 First-order summary: chance of ranking candidate i in position j.3 A Markov basis for S; with 29890 moves in 14 symmetry classes.4 Length of the data projections onto the 7 isotypic subspaces of Ss.5 Second order summary for the APA data.6 Markov bases for S3 and $4 and the size of their symmetry classes.7 Number of generators by degree in a Markov basis for S,.8 Length of the data projections for the APA data and three perturbations.10 First order summary for the S4 ranked datain Table2. 211 Length of the data projections for the S4 data and three perturbations.1 Generators of the toric ideals of binary trees.2 Generators of the toric ideals of paths.3 Statistics for the polytopes of binary trees with at most 23 nodes.4 Statistics for the polytopes of all trees with at most 15 nodes.
41 Comparison of the SVD algorithm and dnaml on ENCODE data.1 Example of the output of Mercator, 2.2 Genomes in the nine-vertebrate alignment.3 Genomes in the eight-Drosophila alignment. vu vu ào 5.4 Ultra-conserved elements in the ENCODE alignments.9 GO annotations of genes associated with vertebrate ultras, .6 ENCODE regions with the greatest number of ultra-conserved elements.7 GO annotations of genes associated with Drosophila ultras.8 Probability of seeing ultra-conserved elements in an independence model. Nội Acknowledgements Above all, thanks to my advisor, Bernd Sturmfels, from whom I have learned much about the mysterious processes of doing and communicating mathematics. As essentially my second advisor, Lior Pachter has been an excellent guide through the rugged terrain that lies between mathematics and computational biology.
I would not be in this position without a host of mentors and teachers, partic- ularly Jim Cusker and Ken Ono, who started me on this path of studying mathematics. Along the way, it has been a pleasure to learn from my amazing coauthors: Niko Beeren- winkel, Persi Diaconis, Mathias Drton, Steve Fienberg, Jeff Lagarias, Garmay Leung, Kristian Ranestad, Alessandro Rinaldo, Seth Sullivant, and Bernd Sturmfels. I am grateful for support from the National Science Foundation (grant EF- 0331494), the DARPA program Fundamental Laws in Biology (HR0011-05-1-0057), and a National Defense Science and Engineering Graduate Fellowship. Due to this support and support from my advisors, I have had the good fortune to travel the world learning and teaching mathematics.
From Palo Alto to Spain to Argentina and many places in between, the people I have met on these trips have enriched my mathematical life. As this thesis depends heavily on computation, I am indebted to the people who have written programs which proved invaluable for my research. In particular, I thank Raymond Hemmecke, whose program 4ti2 was vital for Chapters 2 and 3. Also, thanks to Susan Holmes and Aaron Staple for writing the R code used in Chapter 2.
Most importantly, my parents, sister, and wife are each more responsible for my successes than they or I usually realize. They have always supported, accepted, and nourished me in countless ordinary and extraordinary ways. Chapter 1 Introduction The main theme of this thesis is the interplay between statistical models and algebraic techniques. More and more, the fields of statistics and biology are generating a wealth of interesting mathematical questions.
In return, discrete mathematics provides techniques for the solution of these problems, as well as a theoretical framework from which to ask new questions. From this interplay, the field of algebraic statistics has emerged. Its main purpose is the development of computational and theoretical tech- niques in algebra and combinatorics for applications to practical statistical problems. These techniques supply a valuable mathematical language for the study of computa- tional biology.
Computational biology has been a wonderful source of problems in combina- torics and combinatorial computer science due to the discrete structure of biological objects, notably DNA. For example, counting alignments and counting RNA secondary structures are typical enumerative problems [104!. For other connections between the fields, we note how biology has motivated mathematicians to better understand the struc- ture of the space of trees [16] and how distance measures between signed permutations i41) provide methods for understanding genome rearrangement through evolution. While biology provides a fount of such interesting questions, it is desirable at the end of the day to better understand real data.
And because there is always error in experimental data, this problem requires the use of statistics. Thus, we must form a connection between statistics and mathematics that allows us to use the combinatorial properties of the underlying problems in order to analyze data in a rigorous, robust, and efficient way. In this thesis, we provide a series of interrelated illustrations of how algebraic combinatorics can be used to increase our understanding of statistical and biological problems. We also demonstrate how biological questions can lead to interesting math- ematics.
The examples we study are drawn from statistics, phylogenetics, comparative genomics, and virology. The underlying mathematical philosophy is that statistical mod- els can be viewed as algebraic varieties. Our examples draw from a small set of statistical models which we introduce in this chapter: exponential families, phylogenetic models, and Bayesian networks. In the rest of this introduction, we will briefly outline the new field of algebraic statistics and explain the major algebraic, statistical, and biological ideas that will be used throughout the thesis.
We refer the reader to the book [73] for more details.1 Algebraic statistics Algebraic statistics depends on a set of tools that allow us to translate problems in statis- tics into algebraic language. We assume the reader is familiar with the basic language of algebraic geometry, namely polynomials, ideals, and varieties. In addition, we will use Gröbner bases throughout the thesis as a computational tool. For a friendly introduction to ideals and Gröbner bases, see [27].
Let X be a discrete random variable taking values in the set [n] = {1,2,. We write p; as shorthand for Pr(X = 7), the probability that X is in state 7. Let Aa_ be the (n — 1) dimensional probability simplex, e.Pn) ER" [pi 20, Sop: = 1}. i=l We will write A for the simplex Aa_¡ when the space is understood.
A statistical model for X is simply a family of probability distributions Mc A.