Predicting Gene Structure in Eukaryotic Genomes

A doctoral dissertation on predicting gene structure in eukaryotic genomes, covering the methods and their applications in molecular biology.

University

The Johns Hopkins University

Degree

Doctor of Philosophy

Category

dissertation

2006

215

Detailed table of contents

Abstract

Acknowledgements

1. CHAPTER 1: Introduction

1.1. Computational Framework for Gene Prediction

1.1.1. Generalized Hidden Markov Models

1.1.2. Statistical Sequence Modeling

1.1.3. Integration of Extrinsic Evidence

1.1.4. Using Multiple Genomic Sequences

2. CHAPTER 2: Linear Combiner

3. CHAPTER 3: Statistical Combiner

3.1. Gene Structure Prediction with a GHMM

3.2. Representing Gene Structure Evidence

3.3. Conditioned on Input Evidence

4. CHAPTER 4: Prediction of Alternatively Spliced Exons

4.1. Explicit Classification of Cassette Exons

4.2. Biological Model of Splicing

4.3. Implicit Prediction of Alternative Splicing

4.4. Computational Model for Alternative Splicing

4.5. A Generalized Hidden Markov Model

4.6. A Phylogenetic Generalized Hidden Markov Model

4.7. Recovering Exon Structure

5. CHAPTER 5: Automated Gene Structure Annotation

5.1. TIGR Annotation Pipeline

5.2. Gene Structure Annotation Applications

5.3. Gene Structure Comparison

5.4. Testing on the ENCODE Regions

5.5. Evaluation of Evidence Tracks

6. CHAPTER 6: Alternative Exon Prediction Performance

6.1. Sequence Conservation Patterns

7. CHAPTER 7: Conclusion

Bibliography

Vita

Summary

I. Overview of Eukaryotic Gene Structure Prediction: Concepts and Importance

Predicting gene structure in eukaryotic genomes is a central problem in bioinformatics and genomics. The task is to locate the genes in a eukaryotic genome and determine their structure: exons, introns, promoter regions, and coding regions. It is difficult because introns interrupt the coding sequence and because regulatory elements are numerous and complex. Accurate gene prediction underpins much of biological research, from understanding gene function to developing therapies. Prediction methods draw on several sources of information, including the DNA sequence itself, gene expression data, and homologous genes from other species. According to Allen (2006), integrating these different sources of information is the key to improving gene prediction accuracy.

1.1. The role of gene prediction in eukaryotic genome research

Gene prediction plays a key role in decoding eukaryotic genomes. It provides basic information on how many genes a genome contains, where they lie, and how they are structured. This information supports studies of gene function, gene ontology, pathway analysis, and gene regulation. Accurately identifying genes is also the first step toward understanding genetic differences between individuals and populations, and toward developing diagnostic and therapeutic tools. Bioinformatics tools and genomic databases support this process.

1.2. Main components of eukaryotic gene structure to be predicted

Eukaryotic gene structure comprises several components that must be predicted accurately: exons (segments retained in the mature transcript, which carry the coding sequence and may include untranslated regions), introns (segments removed by splicing), promoter regions (which regulate gene expression), and splicing signals. Pinning down the positions and boundaries of these components is essential for understanding how genes are expressed and what their protein products do. Gene prediction algorithms typically use statistical models and machine learning to recognize the characteristic features of each component.
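As a minimal sketch of what recognizing splicing signals means at the very simplest level, the following lists candidate donor (GT) and acceptor (AG) dinucleotides in a DNA string. The function name and the toy sequence are illustrative assumptions; real predictors score the wider sequence context with statistical models, because most GT and AG dinucleotides in a genome are not true splice sites.

```python
def candidate_splice_sites(seq):
    """Return 0-based positions of GT (donor) and AG (acceptor) dinucleotides.

    Hypothetical helper for illustration only: it matches the canonical
    dinucleotides and does no scoring of the surrounding context.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy sequence (an assumption, not data from the dissertation).
toy = "ATGGTAAGTCCCTTTTCAGGCTAA"
donors, acceptors = candidate_splice_sites(toy)
print("donor candidates:", donors)
print("acceptor candidates:", acceptors)
```

Even this short toy sequence yields several candidates, which is why signal detection alone is not enough and has to be combined with models of exon and intron content.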

II. Challenges in Gene Prediction: Accuracy and Comprehensiveness

Predicting gene structure in eukaryotic genomes faces significant challenges. Large introns and the sheer number of regulatory elements make genes hard to delimit. The accuracy of current methods remains limited, especially for genes with complex structure or low expression. Another challenge is integrating different sources of information effectively. According to Allen (2006), improving gene prediction accuracy, sensitivity, and specificity are important goals in the field.
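To make the two accuracy measures concrete, here is a small sketch at the exon level. It assumes the usual gene-finding convention: sensitivity is the fraction of annotated exons that are predicted exactly, and specificity is the fraction of predicted exons that are exactly correct. The function name and coordinates are made up for illustration.

```python
def exon_sn_sp(true_exons, predicted_exons):
    """Exon-level sensitivity and specificity with exact boundary matching.

    Exons are (start, end) tuples. A prediction counts as correct only if
    both boundaries match an annotated exon.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    correct = len(true_set & pred_set)
    sn = correct / len(true_set) if true_set else 0.0
    sp = correct / len(pred_set) if pred_set else 0.0
    return sn, sp


# Hypothetical annotation and prediction for one gene.
true_exons = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 410), (500, 650)]
sn, sp = exon_sn_sp(true_exons, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

The second predicted exon misses its true boundary by ten bases and therefore counts as wrong, which illustrates how strict exact-match scoring is compared with nucleotide-level scoring.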

2.1. Factors that affect gene prediction accuracy

Several factors affect the accuracy of gene prediction: the quality of the DNA sequence, the complexity of gene structure, the diversity of regulatory elements, and the availability of gene expression data. Different prediction methods differ in sensitivity and specificity, so choosing a method suited to the genome and the available data matters. Genomic databases and bioinformatics tools can also help improve accuracy.

2.2. Integrating data from multiple sources

Integrating data from multiple sources is a major challenge in gene prediction. The sources include DNA sequence, gene expression data (transcriptomics), protein data (proteomics), and homologous genes from other species (comparative genomics). Combining them effectively requires sophisticated statistical and machine learning methods. Integration methods must also cope with bias, noise, and inconsistencies between sources.
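One simple way to picture evidence integration is a weighted vote per nucleotide. The sketch below is an illustration under stated assumptions, not the Linear Combiner described in the dissertation: the track names, the weights, and the threshold are all hypothetical, and each track is reduced to a 0/1 call of noncoding or coding at each position.

```python
def combine_evidence(tracks, weights, threshold=0.5):
    """Label each position coding (1) or noncoding (0) by weighted vote.

    tracks maps a source name to a list of 0/1 calls, one per position.
    weights maps the same names to non-negative weights (assumed values).
    """
    total = sum(weights.values())
    length = len(next(iter(tracks.values())))
    labels = []
    for pos in range(length):
        score = sum(weights[name] * track[pos] for name, track in tracks.items())
        labels.append(1 if score / total > threshold else 0)
    return labels


# Hypothetical evidence over six positions.
tracks = {
    "ab_initio": [0, 1, 1, 1, 0, 0],
    "est_alignment": [0, 0, 1, 1, 1, 0],
    "protein_alignment": [0, 0, 1, 1, 0, 0],
}
weights = {"ab_initio": 1.0, "est_alignment": 2.0, "protein_alignment": 1.5}
labels = combine_evidence(tracks, weights)
print(labels)
```

Positions supported by only one source fall below the threshold here, while positions on which all sources agree are called coding. A practical combiner would learn the weights from training data and would enforce a consistent gene structure rather than labeling positions independently.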

III. Ab Initio Methods: Predicting Genes from DNA Sequence

Ab initio gene prediction analyzes the DNA sequence directly to find the characteristic features of genes, such as promoter regions, splice sites, and start and stop codons. These methods use statistical models and gene prediction algorithms to recognize such features. Ab initio prediction needs no external information, but its accuracy is usually lower than that of methods that use additional evidence. According to Allen (2006), ab initio methods are often used as an initial step in the gene prediction process.
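The role of start and stop codons can be illustrated with a forward-strand open reading frame scan. This is a deliberately simplified sketch: in eukaryotes the coding sequence is interrupted by introns, so an ORF scan alone does not find genes. The function name, the minimum length, and the toy sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) of forward-strand ORFs.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    Each ORF runs from the first ATG in a frame to the next in-frame stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Toy sequence with one ORF in the third reading frame.
orfs = find_orfs("CCATGAAATTTGGGTAACC")
print(orfs)
```

A gene finder for eukaryotes has to extend this idea so that the reading frame is carried across splice junctions, which is one reason state-based models such as HMMs are used.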

3.1. Using Hidden Markov Models (HMMs) in gene prediction

The Hidden Markov Model (HMM) is a powerful tool that is widely used in ab initio gene prediction. An HMM models gene structure as a sequence of hidden states, each corresponding to a component of the gene (for example, exon, intron, or promoter region). The HMM's parameters are estimated from training data, and the trained model is then used to predict the gene structure of new DNA sequences. Variants such as the Generalized Hidden Markov Model (GHMM), in which a state can emit a variable-length segment such as a whole exon rather than a single nucleotide, are also used to improve accuracy.
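The decoding step of an HMM can be sketched with the Viterbi algorithm on a toy two-state model. Everything numeric below is an assumption chosen for illustration (a GC-rich "exon" state and an AT-rich "intron" state); real gene finders use many more states, trained parameters, and usually the generalized form of the model.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq (log-space Viterbi)."""
    # v[s] = best log probability of any path that ends in state s
    v = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    back = []
    for symbol in seq[1:]:
        prev_v, v, pointers = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev_v[p] + math.log(trans_p[p][s]))
            v[s] = (prev_v[best_prev] + math.log(trans_p[best_prev][s])
                    + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: v[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy parameters (assumptions, not trained values).
states = ["exon", "intron"]
start_p = {"exon": 0.5, "intron": 0.5}
trans_p = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
emit_p = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GCGCGCATATAT", states, start_p, trans_p, emit_p)
print(path)
```

On this input the path switches from "exon" to "intron" exactly where the composition changes, because one state switch costs less than six poorly fitting emissions.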

3.2. Strengths and limitations of ab initio methods

The main strength of ab initio methods is that they need no external information, which makes them useful for newly sequenced eukaryotic genomes. Their main limitation is that they are usually less accurate than methods that use additional evidence. Because they rely on the DNA sequence alone, they can struggle to distinguish real genes from non-coding sequences that merely resemble genes.

IV. Evidence-Based Prediction: Integrating Gene Expression and Protein Data

Evidence-based gene prediction uses information from external sources, such as gene expression data (transcriptomics) and protein data (proteomics), to improve accuracy. These methods incorporate results from actual experiments to determine the location and structure of genes. RNA-Seq data, ESTs, and protein sequence alignments are commonly used as supporting evidence. According to Allen (2006), integrating different sources of information is the key to improving gene prediction accuracy.

4.1. Using RNA-Seq and EST data to locate exons

RNA-Seq and EST data give direct information about which regions of a eukaryotic genome are transcribed into RNA. This information can be used to locate exons and splice sites. Evidence-based methods typically use alignment algorithms to map RNA-Seq reads and ESTs onto the genome, and then build gene models from the resulting alignments.
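Once transcript fragments have been aligned to the genome, a first rough view of transcribed regions comes from merging overlapping alignment intervals. The sketch below assumes the alignment has already been done by some external tool and that each alignment block is given as a half-open (start, end) interval; the coordinates are invented.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical genomic coordinates of aligned EST or read blocks.
alignments = [(100, 180), (150, 220), (400, 470), (460, 520), (900, 950)]
regions = merge_intervals(alignments)
print(regions)
```

Merged coverage only suggests where exons may lie. Exact exon boundaries come from spliced alignments, where a single transcript spans an intron and so pins down the donor and acceptor positions.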

4.2. Incorporating protein sequence alignments into gene prediction

Protein sequence alignments can also improve prediction accuracy. Evidence-based methods typically align known protein sequences against the genome and use the matches to locate protein-coding genes. Protein alignments are especially useful for genes with complex structure or low expression, for which transcript evidence may be sparse.

V. Comparative Genomics: Comparing Genomes to Predict Gene Structure

Comparative genomics predicts genes by comparing the genomes of different species to find conserved regions. Conserved regions often correspond to genes or important regulatory elements. Comparative methods align the genomes of different species and then use the alignments to predict gene structure. According to Allen (2006), comparative genomics can help improve gene prediction accuracy, especially for genes with complex structure or low expression.

5.1. Identifying conserved regions between species

Identifying regions conserved between species is a key step in comparative genomics, since such regions often correspond to genes or important regulatory elements. Alignment algorithms are used to compare the genomes of different species, and conserved regions are then identified from the level of sequence similarity.
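Calling conserved regions from sequence similarity can be sketched as a sliding-window identity scan over two sequences that have already been aligned. The window size, the identity cutoff, and the two toy sequences are assumptions for illustration; the alignment itself is taken as given, with "-" marking gaps.

```python
def conserved_windows(aligned_a, aligned_b, window=5, min_identity=0.8):
    """Return start positions of windows whose identity meets the cutoff.

    Both inputs must be rows of the same pairwise alignment, so they have
    equal length. Gap columns never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        if matches / window >= min_identity:
            hits.append(start)
    return hits


# Toy aligned sequences from two hypothetical species.
a = "ATGGCCTTAGAC"
b = "ATGGCCAATCAC"
hits = conserved_windows(a, b)
print(hits)
```

Only windows near the start of the toy alignment pass the cutoff, since the middle of the two sequences has diverged. Methods built on phylogenetic models go further and weigh substitutions by how species are related rather than counting raw identity.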

5.2. Using ortholog information to improve prediction accuracy

Information about orthologs (homologous genes in different species that descend from a single gene in their common ancestor) can be used to improve prediction accuracy. If a gene has been identified in one species, an orthologous gene is likely to exist in related species. This knowledge can guide the prediction process and improve the accuracy of the results.

VI. Applications and Future of Gene Prediction in Research and Medicine

Gene structure prediction in eukaryotic genomes has many important applications in research and medicine. It is used to understand gene function, to develop therapies, and to study the evolution of eukaryotic genomes. Prediction methods are expected to keep improving as new sequencing technologies and more advanced machine learning algorithms become available. According to Allen (2006), effective integration of different sources of information will be the key to further gains in accuracy.

6.1. Applying gene prediction to the study of gene function

Gene prediction is an important step in studying gene function. Knowing where genes are and how they are structured lets researchers investigate how genes are expressed, what their proteins do, and what roles genes play in different biological processes. This information can be used to understand development, physiology, and disease.

6.2. Outlook for gene prediction algorithms

Gene prediction algorithms are expected to keep improving as new sequencing technologies and more advanced machine learning algorithms are developed. Deep learning methods are increasingly used in gene prediction and promise substantial gains in accuracy. Effective integration of different sources of information will remain central to improving accuracy.

27/05/2025

Excerpt from the document

PREDICTING GENE STRUCTURE IN EUKARYOTIC GENOMES

by Jonathan Edward Allen

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland
September, 2006

© Jonathan Edward Allen 2006. All rights reserved.

UMI Number: 3240661

UMI Microform 3240661. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved.

Abstract

Obtaining the complete set of proteins for each eukaryotic organism is an important step in the quest to understand how life evolves and functions. The complex physiology of eukaryotic cells, however, makes direct observation of proteins and their parent genes difficult to achieve. An organism’s genome provides the raw data that contains the set of instructions for generating the complete set of proteins, providing the potential to obtain a complete list of proteins without having to rely exclusively on direct observations in the cell. Computational gene prediction systems, therefore, play an important role in compiling sets of putative proteins for each sequenced genome.

This dissertation addresses the problem of computational gene prediction in eukaryotic genomes, presenting a framework for predicting precise single isoform protein coding genes in long contiguous stretches of DNA. The framework is extended to predict overlapping alternatively spliced exons in known protein coding regions. A main contribution of this work is to apply classifier stacking with sequential inference, for the first time, to the gene finding problem and to develop a phylogenetic generalized hidden Markov model for the alternative splice site prediction problem.

First a linear weighting scheme is developed, which is extended to a statistical prediction model. The statistical model is then transformed to a new sequential inference model to predict alternatively spliced exons.

Prediction accuracy of the single isoform gene prediction methods are tested on three eukaryotic genomes: Arabidopsis thaliana, Oryza sativa and human. Application of the gene prediction methods are examined in other eukaryotic genomes. The alternatively spliced exon prediction model is tested in four Drosophila species under a variety of input conditions.

Incorporating multiple sources of gene structure evidence is shown to substantially improve single isoform gene prediction accuracy with performance beginning to rival the accuracy of expert human annotators. Results from the alternative exon prediction experiments demonstrate the potential to reliably predict new alternatively spliced forms of known genes. The use of cross-species sequence conservation information is shown to enhance the precision of alternatively spliced exon prediction.

Advisor: Steven L. Salzberg
Readers: Steven L. Salzberg and Jason M. Eisner

Acknowledgements

I would like to thank my adviser Steven L. Salzberg, for his guidance, patience and support and for giving me the freedom to pursue challenging research problems.

Salzberg has made many helpful suggestions, which improved the quality of my work over the last several years. Thank you to Jason Eisner for informing me of important related work in machine learning and natural language processing. I would also like to thank other members of Dr. Salzberg’s group including Mihaela Pertea and William H. Majoros with whom I had many enlightening discussions on gene finding work. I also benefited from the many useful discussions on bioinformatics topics with other members of the group including Pawel Gajer, Maria D. Delcher and Mihai Pop. Thanks to many people at The Institute for Genomic Research which were very helpful in providing useful data to work on including Brian Haas, Bernard Suh, Chunhui Yu, Sam Angioli, Ahwui Wang, Robin Buell, Malcolm Gardner, Jane Carlton, Elodie Ghedon and Brendon Loftus.

I would also like to thank Harold Gainer for advice and providing me with an interesting biological problem to work on and I thank S. Rao Kosaraju for his positive supervision on this project.

Thank you to Marvin Cook for many productive study sessions, which helped me get more out of many of the courses we took together. Thank you to my wife Safia Ahmed Omar for her love and support and helping me to keep my life in proper perspective. Thanks to my family Leah Lewis, Karen Kramer, Wise D. Allen and especially my parents Wise and Joan Allen. Without their support and encouragement my educational pursuits would not have been possible.

List of Tables

2.1 Sequence intervals scored by the Linear Combiner. Both strands of a genomic sequence (“+” and “-”) are labeled simultaneously.

.2 Labels for the sequence intervals between the first and last signal in the sequence (divided into two tables). Sigg and Sigg mark the index to the left of the start of the sequence (-1) and the right of the sequence respectively. Gene signals are listed along the top columns.

.1 The set of class labels that describe a local sequence interval used to construct gene models on the positive strand, denoted by the “+” symbol. The non-coding label applies to both strands. Labels reflect partial and complete exons. Each entry asserts whether the condition in that column must be true (1) or false (0). 10 additional class labels are used to represent strand specific labels on the negative strand.

5.1 Performance of the gene predictors on 1783 genes. SC = Statistical Combiner; SC-g = SC combining gene prediction programs only; LC2 = Linear Combiner using sequence alignments; LC1 = Linear Combiner using gene prediction programs only; GA = GlimmerM; GM = GeneMark.hmm; GS = Genscan+. The columns are: number of whole genes correctly predicted (Correct Gene); number of genes completely missed (Missed Gene); correctly predicted exons out of the 7510 total (Correct Exons); number of exons completely missed (ME); predicted exons overlapping a gene region but do not overlap a true exon (Inserted Exons); percentage of protein coding nucleotides correctly detected (Nucl Sn).

.2 Breakdown of combiner predictions when matching exactly 3, 2, 1 or 0 gene prediction programs. The first column (Combiner) refers to the four combiners. The second column (# of GP) refers to the number matching gene prediction programs. The third column and fourth column count the number of times the combiner prediction is correct (CG) and not entirely correct (WG). The fifth column is the percentage of correct predictions.

.3 The number of gene models each gene finder exclusively predicts correctly in test set 2.

.4 Performance for gene predictors including Twinscan and the retrained GlimmerM in addition to the programs listed in Table 5. SC-5: SC using all 5 gene prediction programs; SC-3 = SC using three gene prediction programs; SC-5g = SC using 5 gene prediction programs and no alignment data; LC2-3 = LC2 using three gene prediction programs; LC1-3 = LC1 using three gene prediction programs; TS = Twinscan; GM2 = newer GlimmerM output. The three prediction programs used by SC-3, LC2-3 and LC1-3 are Twinscan, GeneMark.hmm and newer GlimmerM (GM2).

.9 Performance comparison of JIGSAW and SC-5 (from Table 5.

.6 JIGSAW performance in Oryza sativa. Sn (sensitivity) = percentage of test set correctly predicted. Sp (specificity) = percentage of predictions that are correct. Performance measured on three criteria: Genes, Exons and Nucleotides (Nucl). All results shown as percentages.

.7 Gene structure comparison. Each entry contains two values “A/M” with A being the average and M being the median value. The Exon / Intron column is median exon length divided by median intron length.

.8 JIGSAW using gene finders and non-Human EST data. Results show sensitivity (Sn) and specificity (Sp) measured on Genes, Exons and Nucleotides (Nucl). All results shown as percentages.

.9 Results of applying JIGSAW with all available evidence. *KnownGene predicts multiple transcripts per gene locus with a transcript specificity of 47%.

.10 Comparison of EGASP prediction performance for exons and protein coding nucleotides among the different prediction methods. Sensitivity (Sn) and Specificity (Sp) is given. The F-score is shown for the nucleotide predictions.

.11 EGASP prediction performance for Genes and gene transcripts (Gene Trans) measuring sensitivity (Sn) and specificity (Sp). The F-score is given for the Gene predictions. Transcript to Gene Ratio shows the number of transcripts predicted per gene locus.

6. melanogaster annotated di-nucleotides conserved in D. Di-nucleotides are separated according to splicing type: Acceptor (Acc) and Donor (donor) and splicing event type: alternative (Alt) or constitutive (Con). Pseudo splice sites are included for reference.

melanogaster annotated exons missing at least one splice site in D. Percentages are organized by exon type: constitutive exons (CS), cassette exons (CE), exons with multiple splice sites (MS) and exons with intron retention (IR). The second number associated with the MS and IR rows is the percentage of exons where the non-conserved splice site is constitutive (used in all isoforms).

.3 Results are shown for 8 versions of ExAlt using different combinations of informant species plus Genscan. The informant species are D. ExAlt-ab initio uses no informant species.

.4 Prediction performance of ExAlt. Rows 1-2 show ExAlt performance using an input exon and default parameters (ExAlt-Exon) and no informant species (ExAlt-Exon-ab initio). Rows 3-5 show ExAlt performance using an input coding frame with default parameters (ExAlt-Frame), no informant species (ExAlt-Frame-ab initio), and at most 1 exon predicted per test sequence (ExAlt-Frame-Single). Rows 6-8 show ExAlt performance using no gene structure information with default parameters (ExAlt-Default), no informant species (ExAlt-Default-ab initio) and at most 1 exon prediction per test sequence (ExAlt-Default-Single).

.5 Exon prediction accuracy from Table 6.4 separated by exon splicing event.

.6 ExAlt results on the initial training and testing set in percentages. Included next to each measurement is the difference in percentage points compared to performance in the held out set in Table 6.

List of Figures

1.1 A schematic of double stranded DNA. Each nucleotide is represented by an “L” shaped box and includes a 5 carbon sugar molecule (S), a phosphate residue (P) and one of four nitrogenous bases (a, c, t and g). Dashed lines indicate hydrogen bonds between two bases, c-g pairs form three hydrogen bonds and a-t pairs form two hydrogen bonds. The 5’ and 3’ labels denote the orientation of each strand.

.2 Example of protein coding gene structure. Gene contains two exons and one intron. The initial exon includes an untranslated region (5’ UTR) and the translated region.


The document Predicting Gene Structure in Eukaryotic Genomes surveys modern methods for predicting the gene structure of eukaryotic organisms. It stresses that understanding gene structure is a basis for applications in molecular biology, medicine, and agriculture, and it introduces techniques such as genomic data analysis and gene structure modeling that support both research and practical use.
