Algebraic Combinatorics for Computational Biology

A doctoral dissertation on algebraic combinatorics applied to computational biology, with in-depth work on phylogenetics, genomics, and virology.

University

University of California, Berkeley

Field

Mathematics

Uploaded by

Anonymous

Type

Dissertation

2006

135

Detailed table of contents

1. Introduction

2. Markov bases for noncommutative analysis of ranked data

2.1. Election data with five candidates

2.2. Fourier analysis of group valued data

2.4. Computing Markov bases for permutation data

2.5. Structure of the toric ideal

2.6. Statistical analysis of the election data

2.7. Statistical analysis of an example

3. Toric ideals of homogeneous phylogenetic models

3.1. Homogeneous phylogenetic models

4. Tree construction using singular value decomposition

4.1. The general Markov model

4.2. Flattenings and rank conditions

4.3. Singular value decomposition

4.4. Tree-construction algorithm

4.5. Building trees with simulated data

4.6. Building trees with real data

5. Ultra-conserved elements in vertebrate and fly genomes

5.1. The data

5.2. Ultra-conserved elements

5.3. Biology of ultra-conserved elements

5.4. Statistical significance of ultra-conservation

6. Evolution on distributive lattices

6.1. Drug resistance in HIV

6.2. The model of evolution

6.3. Fitness landscapes on distributive lattices

6.4. Distributive lattices from Bayesian networks

6.5. Applications to HIV drug resistance

6.6. Mathematics and computation of the risk polynomial

6.7. Discussion

Bibliography

Summary

I. Overview of Applications of Algebraic Combinatorics in Biology

This summary explores the intersection of algebraic combinatorics and computational biology, focusing on applications in phylogenetic analysis, genomics, and virology. Algebraic combinatorics, the area of mathematics that brings algebraic methods to discrete structures and combinatorial methods to algebra, is proving to be a powerful tool for complex problems in biology. From building phylogenetic trees to interpreting very large genomic data sets, its techniques open new ways to understand living systems. The source dissertation stresses that statistics is needed to handle error in experimental data, which creates a link between statistics and mathematics: statistical models can be viewed as algebraic varieties. The ultimate goal is to improve our understanding of statistical and biological problems.

1.1. Introduction to Algebraic Combinatorics and Computational Biology

Algebraic combinatorics, with its focus on discrete structures, provides tools for modeling complex biological systems. Computational biology uses these models to address problems in genomics, phylogenetics, and virology. Combining the two fields helps uncover relationships and structure hidden in biological data. The sections below look at specific applications that illustrate this.

1.2. The Role of Algebraic Combinatorics in Phylogenetics and Genomics

In phylogenetics, algebraic combinatorics helps build and compare phylogenetic trees, which represent evolutionary relationships among species. In genomics, it is used to analyze gene sequences, identify genetic variants, and study gene structure. Combinatorial algorithms make it feasible to process large-scale genomic data, yielding insight into the evolution and function of genes. These techniques are changing how biodiversity is studied.

II. Challenges in Applying Algebraic Combinatorics to Biology

Despite this potential, applying algebraic combinatorics in computational biology faces several challenges. Genomic data sets are typically very large and complex, and they demand algorithms that are both powerful and computationally efficient. Phylogenetic analysis becomes difficult for trees with many branches, since the number of possible tree topologies grows super-exponentially with the number of species. Interpreting mathematical results in biologically meaningful terms also requires a deep understanding of both fields. Meeting these challenges calls for collaboration among mathematicians, biologists, and computer scientists, and algorithms in computational biology must be optimized continually to keep pace with the growth of biological data.

2.1. Handling Large and Complex Genomic Data: Solutions

One of the biggest challenges is handling the sheer volume of genomic data. Techniques from biological data science, together with genetic algorithms, need to be developed to filter the data, reduce its dimensionality, and analyze it efficiently. Parallel and distributed methods can speed up computation, letting researchers explore genomes faster and more accurately. Integration with bioinformatics is also important.

2.2. Interpreting Mathematical Results from a Biological Perspective

Turning mathematical results into meaningful biological insight is an interdisciplinary challenge. Researchers need broad knowledge of both algebraic combinatorics and biology to understand what a mathematical model means biologically, so collaboration among experts from different fields is essential. Tools and methods are also needed that make the results of algebraic-combinatorial analysis accessible and usable for biologists.

III. Effective Phylogenetic Analysis with Algebraic Combinatorics

Algebraic combinatorics provides powerful tools for phylogenetic analysis. By representing evolutionary trees as algebraic structures, we can use mathematical techniques to build, compare, and analyze them. Mathematical modeling in biology helps clarify the evolutionary process and the relationships among species. Phylogenetic trees are built from genomic data and morphological characters and give an overview of the evolutionary history of life. This analysis is also the basis for understanding the evolution of viruses.

3.1. Building Phylogenetic Trees with Graph Theory

Graph theory, a field closely tied to combinatorics, provides the tools to represent and analyze phylogenetic trees. The leaves of the tree represent the species under study, internal vertices represent their common ancestors, and edges represent evolutionary relationships. Graph algorithms can be used to search for the best evolutionary tree given genomic data and other characters. Mathematical models can be used to simulate the evolutionary process and to assess the reliability of phylogenetic trees.
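To make the graph view concrete, here is a minimal Python sketch of the defining property involved: an undirected graph is a tree exactly when it is connected and has one edge fewer than it has vertices. The species names, the ancestor labels `u` and `v`, and the function name `is_tree` are illustrative assumptions, not taken from the dissertation.

```python
from collections import deque


def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is a tree."""
    vertices = list(vertices)
    # A tree on n vertices has exactly n - 1 edges.
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary vertex to test connectivity.
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(vertices)


# Hypothetical four-species tree: leaves are species, "u" and "v" are ancestors.
vertices = ["human", "chimp", "mouse", "rat", "u", "v"]
edges = [("human", "u"), ("chimp", "u"), ("mouse", "v"), ("rat", "v"), ("u", "v")]
print(is_tree(vertices, edges))
```

Removing the edge between `u` and `v` disconnects the graph, and adding an edge between two leaves creates a cycle; in both cases the check fails.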

3.2. Comparing and Evaluating Phylogenetic Trees with Algebraic Combinatorics

Algebraic combinatorics provides methods for comparing different evolutionary trees and assessing their reliability. Techniques such as tree distances and consensus analysis can identify uncertain regions of a tree and measure the support for specific relationships. Combining combinatorial methods with statistical methods allows a more comprehensive analysis of phylogenetic trees.
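One widely used tree distance is the Robinson-Foulds distance, which counts the groupings found in one tree but not the other. The sketch below is a simplified rooted variant that compares clades (sets of leaves below an internal node); the text above does not name a specific distance, so this choice, the nested-tuple tree encoding, and the example species are assumptions made for illustration.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves_below(tree)
    # The root clade contains every leaf, so it carries no information.
    found.discard(all_leaves)
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Two hypothetical trees that disagree only on the placement of "gorilla".
tree_1 = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))
tree_2 = ((("human", "gorilla"), "chimp"), ("mouse", "rat"))
print(rf_distance(tree_1, tree_2))  # 2
```

The two trees share the clade {mouse, rat} and the three-primate clade, and differ in one clade each, which gives a distance of 2.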

IV. Decoding Genomics Effectively with Algebraic Combinatorics

Algebraic combinatorics plays an important role in decoding genomic data. Using techniques such as sequence analysis, gene identification, and variant analysis, we can better understand the structure and function of the genome. Algebraic combinatorics provides tools for processing and analyzing large-scale genomic data, helping to uncover patterns and relationships hidden in the genome. Sequence analysis can in turn detect changes in viral genomes.

4.1. Analyzing Gene Sequences and Identifying Genes with Algorithms

Combinatorial algorithms can be used to analyze gene sequences and identify genes efficiently. Pattern-matching and sequence-alignment algorithms identify regions of similarity between genomes and help locate genes and regulatory elements. These methods are the basis for comparing the genomes of different species and for understanding genome evolution.
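As an illustration of sequence alignment, the following sketch computes a global alignment score with the classic Needleman-Wunsch dynamic program. The text does not specify an algorithm or scoring scheme, so the choice of Needleman-Wunsch and the scores (match +1, mismatch -1, gap -1) are assumptions for the example.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences (Needleman-Wunsch)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align the two symbols
                score[i - 1][j] + gap,               # gap in seq_b
                score[i][j - 1] + gap,               # gap in seq_a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))  # 2: three matches and one gap
```

The table has (len(seq_a) + 1) x (len(seq_b) + 1) entries, so time and memory grow with the product of the sequence lengths; genome-scale comparison relies on heuristics built on top of this idea.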

4.2. Analyzing Genetic Variants and Their Links to Disease

Algebraic combinatorics provides tools for analyzing genetic variants and for identifying links between variants and disease. Techniques such as linkage analysis and genome-wide association studies (GWAS) identify genes associated with heritable diseases. Understanding the genetic mechanisms of disease can lead to new diagnostics and treatments.

V. A Guide to Virology Analysis with Algebraic Combinatorics

In virology, algebraic combinatorics is used to study viral evolution, analyze viral variants, and predict the spread of epidemics. By modeling viruses and their interactions with hosts using algebraic structures, we can better understand infection mechanisms and develop effective treatments. Viral evolution is a complex process that can be captured by mathematical models.

5.1. Studying Viral Evolution with Phylogenetic Trees

Phylogenetic analysis based on algebraic combinatorics can help us understand viral evolution and trace the origin and spread of different viral strains. By building evolutionary trees of viruses from their gene sequences, we can track how a virus changes over time and identify factors that drive its evolution. This information can inform the design of vaccines and antiviral drugs.

5.2. Predicting the Spread of Epidemics with Mathematical Models

Mathematical models based on algebraic combinatorics can be used to predict the spread of epidemics. These models take into account factors such as the infection rate, the recovery rate, and interventions in order to predict how an epidemic develops over time. Such predictions can help policymakers make informed decisions about prevention and control measures.
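A standard model built from an infection rate and a recovery rate is the SIR (susceptible, infected, recovered) compartment model. The sketch below integrates it with simple Euler steps. The text does not name a specific model, so SIR, the parameter names `beta` and `gamma`, and all numeric values are assumptions chosen for illustration only.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Euler integration of the SIR model.

    beta is the infection rate and gamma the recovery rate (both per day).
    Returns a list of (susceptible, infected, recovered) states, one per step.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    recovered = 0.0
    trajectory = [(susceptible, infected, recovered)]
    for _ in range(int(round(days / dt))):
        new_infections = beta * susceptible * infected / population * dt
        new_recoveries = gamma * infected * dt
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        recovered += new_recoveries
        trajectory.append((susceptible, infected, recovered))
    return trajectory


# Hypothetical outbreak: 1000 people, one initial case.
trajectory = simulate_sir(population=1000, initial_infected=1,
                          beta=0.3, gamma=0.1, days=160)
peak_infected = max(infected for _, infected, _ in trajectory)
print(round(peak_infected))
```

An intervention can be explored by lowering `beta`: when `beta` is below `gamma`, the number of infected people never rises above its starting value.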

VI. Practical Applications and Research Results in Algebraic Combinatorics

Applications of algebraic combinatorics in computational biology have produced important research results. From identifying disease-associated genes to predicting the spread of epidemics, its techniques are changing how biology is studied. Recent work shows the potential of algebraic combinatorics for important problems in genomics, phylogenetics, and virology. According to the dissertation's abstract, the work uses the language of algebraic statistics to translate between biological and statistical problems and algebraic and combinatorial mathematics.

6.1. Studying Protein Interactions with Network Theory

Protein interaction networks are analyzed with graph theory, a field closely tied to combinatorics, to understand biological functions and disease processes. Network models help identify proteins that are important in biological pathways and predict the effects of gene mutations. Studying protein interaction networks can lead to new drugs and more effective treatments.
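One simple way to flag potentially important proteins in such a network is degree centrality: the fraction of other proteins a given protein interacts with. The sketch below is a minimal illustration; the protein labels `P1` to `P5` and the interaction list are hypothetical placeholders, not real data, and degree centrality is only one of many possible importance measures.

```python
from collections import defaultdict


def degree_centrality(interactions):
    """Map each protein to the fraction of other proteins it interacts with.

    Self-interactions are ignored.
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    count = len(neighbors)
    if count < 2:
        return {protein: 0.0 for protein in neighbors}
    return {protein: len(partners) / (count - 1)
            for protein, partners in neighbors.items()}


# Hypothetical interaction list (placeholder protein labels).
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
                ("P2", "P3"), ("P4", "P5")]
centrality = degree_centrality(interactions)
hub = max(centrality, key=centrality.get)
print(hub, centrality[hub])  # P1 0.75
```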

6.2. Applications in Genomics-Based Drug Design and Treatment

Genomic data analyzed with algebraic combinatorics can be used to design drugs and treat disease. By identifying disease-associated genes and understanding the genetic mechanisms of disease, we can develop targeted drugs and personalized treatments. This points toward more precise and more effective medicine.

14/05/2025

You are previewing the document:

Doctoral dissertation: Algebraic combinatorics for computational biology

Excerpt from the document

Algebraic combinatorics for computational biology

by Nicholas Karl Eriksson

B. (Massachusetts Institute of Technology) 2001

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Mathematics and the Designated Emphasis in Computational and Genomic Biology in the Graduate Division of the University of California, Berkeley

Committee in charge: Professor Bernd Sturmfels, Chair; Professor Lior Pachter; Professor Elchanan Mossel

Spring 2006

UMI Number: 3228316. UMI Microform 3228316, Copyright 2006 by ProQuest Information and Learning Company.

Algebraic combinatorics for computational biology, Copyright 2006 by Nicholas Karl Eriksson

Abstract

Algebraic statistics is the study of the algebraic varieties that correspond to discrete statistical models. Such statistical models are used throughout computational biology, for example to describe the evolution of DNA sequences. This perspective on statistics allows us to bring mathematical techniques to bear and also provides a source of new problems in mathematics.

The central focus of this thesis is the use of the language of algebraic statistics to translate between biological and statistical problems and algebraic and combinatorial mathematics. The wide range of biological and statistical problems addressed in this work come from phylogenetics, comparative genomics, virology, and the analysis of ranked data. While these problems are varied, the mathematical techniques used in this work share common roots in the field of combinatorial commutative algebra. The main mathematical theme is the use of ideals which correspond to combinatorial objects such as magic squares, trees, or posets.

Biological problems suggest new families of ideals, and the study of these ideals can in some cases be useful for biology.

Professor Bernd Sturmfels, Dissertation Committee Chair

To Nirit

List of Figures

A simple statistical model
A multiple alignment of 3 DNA sequences
Distribution of the projection to S*? for two random walks
Polytope for a path with 7 nodes
The polytope of the completely odd binary tree
A tree $T$ with 15 nodes where $P_T$ has 34 vertices, 58 edges, and 26 facets
Determining the rank of $\mathrm{Flat}_{A,B}(P)$ where $\{A, B\}$ is not a split
The 6-taxa tree constructed in Example 4.…
The eight-taxa tree used for simulations
Simulation results with branch lengths $(a, b) = (0.$…
Simulation results with branch lengths $(a, b) = (0.$…
Two phylogenetic trees for eight mammals
Phylogenetic tree for whole genome alignment of 9 vertebrates
Phylogenetic tree for whole genome alignment of 8 Drosophila species
Frequencies of vertebrate ultra-conserved elements ($\log_{10}$-scale)
Frequencies of Drosophila ultra-conserved elements ($\log_{10}$-scale)
Functional base coverage of collapsed vertebrate ultra-conserved elements
Ultra-conserved sequences found on either side of IRX5
Functional base coverage of ultra-conserved elements in ENCODE regions
Functional base coverage of ultra-conserved elements in Drosophila
HIV protease enzyme with bound inhibitor
An event poset, its genotype lattice, and a fitness landscape
An event poset whose risk polynomial is of degree 11 in 375 unknowns
Mutagenetic trees for ritonavir and indinavir
Graded resistance landscapes for ritonavir and indinavir
Risk as a function of drug dosage for indinavir and ritonavir

List of Tables

American Psychological Association ranked voting data
First-order summary: chance of ranking candidate $i$ in position $j$
A Markov basis for $S_5$ with 29890 moves in 14 symmetry classes
Length of the data projections onto the 7 isotypic subspaces of $S_5$
Second order summary for the APA data
Markov bases for $S_3$ and $S_4$ and the size of their symmetry classes
Number of generators by degree in a Markov basis for $S_5$
Length of the data projections for the APA data and three perturbations
First order summary for the $S_4$ ranked data in Table 2.…
Length of the data projections for the $S_4$ data and three perturbations
Generators of the toric ideals of binary trees
Generators of the toric ideals of paths
Statistics for the polytopes of binary trees with at most 23 nodes
Statistics for the polytopes of all trees with at most 15 nodes
Comparison of the SVD algorithm and dnaml on ENCODE data
Example of the output of Mercator
Genomes in the nine-vertebrate alignment
Genomes in the eight-Drosophila alignment
Ultra-conserved elements in the ENCODE alignments
GO annotations of genes associated with vertebrate ultras
ENCODE regions with the greatest number of ultra-conserved elements
GO annotations of genes associated with Drosophila ultras
Probability of seeing ultra-conserved elements in an independence model

Acknowledgements

Above all, thanks to my advisor, Bernd Sturmfels, from whom I have learned much about the mysterious processes of doing and communicating mathematics. As essentially my second advisor, Lior Pachter has been an excellent guide through the rugged terrain that lies between mathematics and computational biology.

I would not be in this position without a host of mentors and teachers, particularly Jim Cusker and Ken Ono, who started me on this path of studying mathematics. Along the way, it has been a pleasure to learn from my amazing coauthors: Niko Beerenwinkel, Persi Diaconis, Mathias Drton, Steve Fienberg, Jeff Lagarias, Garmay Leung, Kristian Ranestad, Alessandro Rinaldo, Seth Sullivant, and Bernd Sturmfels. I am grateful for support from the National Science Foundation (grant EF-0331494), the DARPA program Fundamental Laws in Biology (HR0011-05-1-0057), and a National Defense Science and Engineering Graduate Fellowship. Due to this support and support from my advisors, I have had the good fortune to travel the world learning and teaching mathematics.

From Palo Alto to Spain to Argentina and many places in between, the people I have met on these trips have enriched my mathematical life. As this thesis depends heavily on computation, I am indebted to the people who have written programs which proved invaluable for my research. In particular, I thank Raymond Hemmecke, whose program 4ti2 was vital for Chapters 2 and 3. Also, thanks to Susan Holmes and Aaron Staple for writing the R code used in Chapter 2.

Most importantly, my parents, sister, and wife are each more responsible for my successes than they or I usually realize. They have always supported, accepted, and nourished me in countless ordinary and extraordinary ways.

Chapter 1

Introduction

The main theme of this thesis is the interplay between statistical models and algebraic techniques. More and more, the fields of statistics and biology are generating a wealth of interesting mathematical questions. In return, discrete mathematics provides techniques for the solution of these problems, as well as a theoretical framework from which to ask new questions. From this interplay, the field of algebraic statistics has emerged. Its main purpose is the development of computational and theoretical techniques in algebra and combinatorics for applications to practical statistical problems. These techniques supply a valuable mathematical language for the study of computational biology.

Computational biology has been a wonderful source of problems in combinatorics and combinatorial computer science due to the discrete structure of biological objects, notably DNA. For example, counting alignments and counting RNA secondary structures are typical enumerative problems [104]. For other connections between the fields, we note how biology has motivated mathematicians to better understand the structure of the space of trees [16] and how distance measures between signed permutations [41] provide methods for understanding genome rearrangement through evolution. While biology provides a fount of such interesting questions, it is desirable at the end of the day to better understand real data.

And because there is always error in experimental data, this problem requires the use of statistics. Thus, we must form a connection between statistics and mathematics that allows us to use the combinatorial properties of the underlying problems in order to analyze data in a rigorous, robust, and efficient way. In this thesis, we provide a series of interrelated illustrations of how algebraic combinatorics can be used to increase our understanding of statistical and biological problems. We also demonstrate how biological questions can lead to interesting mathematics.

The examples we study are drawn from statistics, phylogenetics, comparative genomics, and virology. The underlying mathematical philosophy is that statistical models can be viewed as algebraic varieties. Our examples draw from a small set of statistical models which we introduce in this chapter: exponential families, phylogenetic models, and Bayesian networks. In the rest of this introduction, we will briefly outline the new field of algebraic statistics and explain the major algebraic, statistical, and biological ideas that will be used throughout the thesis.

We refer the reader to the book [73] for more details.

1.1 Algebraic statistics

Algebraic statistics depends on a set of tools that allow us to translate problems in statistics into algebraic language. We assume the reader is familiar with the basic language of algebraic geometry, namely polynomials, ideals, and varieties. In addition, we will use Gröbner bases throughout the thesis as a computational tool. For a friendly introduction to ideals and Gröbner bases, see [27].

Let $X$ be a discrete random variable taking values in the set $[n] = \{1, 2, \dots, n\}$. We write $p_i$ as shorthand for $\Pr(X = i)$, the probability that $X$ is in state $i$. Let $\Delta_{n-1}$ be the $(n-1)$-dimensional probability simplex, i.e.,

$$\Delta_{n-1} = \Big\{ (p_1, \dots, p_n) \in \mathbb{R}^n \;\Big|\; p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1 \Big\}.$$

We will write $\Delta$ for the simplex $\Delta_{n-1}$ when the space is understood.

A statistical model for $X$ is simply a family of probability distributions $\mathcal{M} \subset \Delta$.
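A standard textbook illustration of a statistical model cut out by a polynomial equation is the independence model for two binary random variables. It is not taken from this excerpt, and the function names below are hypothetical. With the four joint probabilities written as (p11, p12, p21, p22), a point of the simplex lies in the model exactly when p11 * p22 - p12 * p21 = 0. The sketch uses exact rational arithmetic to check this for one independent and one dependent distribution.

```python
from fractions import Fraction


def in_simplex(p):
    """Check that p is a probability vector: nonnegative entries summing to 1."""
    return all(x >= 0 for x in p) and sum(p) == 1


def independence_invariant(p):
    """Value of p11*p22 - p12*p21; it vanishes exactly on the independence model."""
    p11, p12, p21, p22 = p
    return p11 * p22 - p12 * p21


def product_distribution(a, b):
    """Joint distribution of two independent binary variables.

    a and b are the probabilities that the first and the second variable
    are in their first state.
    """
    return (a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b))


independent = product_distribution(Fraction(1, 3), Fraction(1, 4))
dependent = (Fraction(1, 2), Fraction(0), Fraction(0), Fraction(1, 2))

print(independence_invariant(independent))  # 0
print(independence_invariant(dependent))    # 1/4
```

Both vectors lie in the simplex, but only the first satisfies the polynomial equation, so only the first belongs to the model.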

Topics

Applications of algebraic combinatorics in biology

Phylogenetic tree analysis

Genomics and gene data analysis

Virology and mathematical modeling