Machine Learning-Based Methods for Extracting Semantic Relations from Biomedical Text

Research document: "Machine learning-based extraction of semantic relations from biomedical literature". It combines theory and practice and provides specialized knowledge.

Field

Biomedical Natural Language Processing

Type

doctoral dissertation

2022

193

Detailed table of contents

DECLARATION

1. INTRODUCTION TO BIOMEDICAL RELATION EXTRACTION

1.1. Semantic relation extraction

1.2. Biomedical named entity recognition

1.3. Biomedical relation classification

1.4. Literature review of biomedical named entity recognition

1.5. Literature review of biomedical relation extraction

1.6. Datasets for named entity recognition experiments

1.7. Datasets for relation classification experiments

1.8. Named entity recognition evaluation

1.9. Relation classification evaluation

2. AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION

2.1. Distant supervision learning with silverCID corpus

2.2. Proposed UET-CAM system

2.3. Joint model of named entity recognition and normalization (DNER)

2.4. Intra-sentence relation classification with support vector machine

2.5. Experimental results and discussion

2.5.1. Choosing the combining manner of SSI and skip-gram for named entity normalization results

2.5.2. Named entity recognition and normalization results

2.5.3. CID relation classification results

3. AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION

3.1. Introduction to deep learning for named entity recognition

3.2. Proposed D3NER model

3.2.1. Data pre-processing

3.2.2. The TPAC embeddings layer

3.2.3. Context representing biLSTM layer

3.2.4. Conditional random fields layer

3.3. Experimental results and discussion

3.3.1. Experimental environment and model settings

3.3.2. The performance of D3NER model and comparisons

3.3.3. Contribution of the model components

4. HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION

4.1. The shortest dependency path

4.2. A hybrid adaptive deep learning model for biomedical relation extraction

4.3. Experimental corpora and comparative models

4.4. Experimental environment and model settings

4.5. Experimental results and discussion

4.5.1. An attentive augmented deep learning model for biomedical relation extraction

4.5.2. Experimental environment and model settings

4.5.3. Experimental results and discussion

4.5.4. A multi-fragment ensemble deep learning model for biomedical relation extraction

4.5.4.1. Over-fitting problem of deep learning-based models

4.5.4.2. Bagging with bootstrap training data

4.5.4.3. Proposed multi-fragment ensemble architecture

4.5.4.4. Experimental results and discussion

5. GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT

5.1. Inter-sentence relations classification problem

5.2. Proposed graph-based inter-sentence relation classification model

5.3. Document sub-graph construction

5.4. Paths finding, merging and choosing

5.5. Shared-weight convolutional neural network

5.6. Experimental results and discussion

5.6.1. Experimental environment and model settings

5.6.2. Contribution of the added virtual edges in document sub-graph

5.6.3. Different sliding window size w for training and testing

5.6.4. Contribution of the model components

5.6.5. Comparison to comparative model

CONCLUSION

LIST OF PUBLICATIONS

ABBREVIATIONS

LIST OF FIGURES

LIST OF TABLES

Preface

Summary

I. Introduction to Semantic Relation Extraction in Biomedical Text

Semantic relation extraction is an important step in natural language processing (NLP), especially in the biomedical domain. With the rapid growth of biomedical scientific literature, extracting information from these texts has become urgent. Machine learning has been widely applied to automate this process, helping to identify relations between entities such as genes, diseases, and chemicals. Semantic analysis and LSI keywords play an important role in optimizing content and improving the effectiveness of search engines.

1.1. The Importance of Semantic Relation Extraction

In the biomedical domain, extracting relations between entities supports scientific research, disease diagnosis, and drug development. Machine learning and data analysis have become indispensable tools for processing large volumes of biomedical data. Techniques such as Latent Semantic Indexing (LSI) help optimize information extraction, improving its accuracy and efficiency.

1.2. Practical Applications of Semantic Relation Extraction

Practical applications of semantic relation extraction include supporting disease diagnosis, detecting adverse drug effects, and streamlining medical research workflows. Search engines that integrate these techniques help researchers access information quickly and accurately. Keyword research and SEO optimization are also applied to improve search and data retrieval.

II. Machine Learning Methods for Semantic Relation Extraction

Machine learning has been widely used to extract semantic relations from biomedical text. Techniques such as Latent Semantic Indexing (LSI) and deep learning models help improve the accuracy and efficiency of the extraction process. Semantic analysis and related keywords play an important role in optimizing these models.

2.1. Deep Learning Models for Relation Extraction

Deep learning models such as Bidirectional Long Short-term Memory (biLSTM) and Convolutional Neural Network (CNN) have been applied to extract relations between entities in biomedical text. These models can handle complex and varied data and achieve high accuracy. Data analysis and content optimization are important steps in training the models.
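To make the CNN idea above concrete, here is a minimal sketch of how a convolutional layer can score a sentence for a relation class: convolve filters over windows of token embeddings, apply max-over-time pooling, then a softmax. It is an illustration only and not the dissertation's architecture. The function name `cnn_relation_scores`, the dimensions, and the random (untrained) weights are all assumptions made for this example.

```python
import numpy as np

def cnn_relation_scores(token_embeddings, filters, bias, W_out, b_out):
    """Convolve over token windows, max-pool over time, then softmax.

    token_embeddings: (n_tokens, emb_dim)
    filters:          (n_filters, window, emb_dim)
    bias:             (n_filters,)
    W_out:            (n_filters, n_classes)
    b_out:            (n_classes,)
    """
    n_tokens, _ = token_embeddings.shape
    n_filters, window, _ = filters.shape
    n_positions = n_tokens - window + 1
    feats = np.empty((n_positions, n_filters))
    for i in range(n_positions):
        patch = token_embeddings[i:i + window]  # (window, emb_dim)
        # Dot product of every filter with the current window of tokens.
        feats[i] = np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias
    feats = np.maximum(feats, 0.0)       # ReLU activation
    pooled = feats.max(axis=0)           # max-over-time pooling
    logits = pooled @ W_out + b_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Hypothetical toy input: 7 tokens with 8-dimensional random embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 8))
filters = rng.normal(size=(4, 3, 8))     # 4 filters, window of 3 tokens
bias = np.zeros(4)
W_out = rng.normal(size=(4, 2))          # 2 classes: relation / no relation
b_out = np.zeros(2)
probs = cnn_relation_scores(emb, filters, bias, W_out, b_out)
```

With trained weights, `probs` would be read as class probabilities for the entity pair in the sentence. Here the weights are random, so the values carry no meaning beyond showing the data flow.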

2.2. Applications of Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) is an important technique in semantic relation extraction. LSI helps identify related keywords and optimize text content. LSI keywords and related keywords are used to improve the effectiveness of search engines and to support biomedical research.
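As a rough illustration of how LSI finds related terms, the sketch below applies a truncated singular value decomposition to a tiny term-document count matrix and compares terms by cosine similarity in the latent space. The terms, counts, and helper names (`lsi_term_vectors`, `cosine`) are made up for this example and do not come from the dissertation.

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
# Terms and counts are invented for illustration.
terms = ["aspirin", "ibuprofen", "headache", "gene", "protein"]
X = np.array([
    [2, 1, 0, 0],   # aspirin
    [1, 2, 0, 0],   # ibuprofen
    [1, 1, 0, 0],   # headache
    [0, 0, 2, 1],   # gene
    [0, 0, 1, 2],   # protein
], dtype=float)

def lsi_term_vectors(X, k):
    """Return k-dimensional latent term vectors from a truncated SVD of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]   # scale each latent axis by its singular value

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = lsi_term_vectors(X, k=2)
sim_drugs = cosine(vecs[0], vecs[1])   # aspirin vs ibuprofen
sim_cross = cosine(vecs[0], vecs[3])   # aspirin vs gene
```

In this toy matrix the two drug terms occur in the same documents, so they end up close together in the latent space, while a drug term and a gene term, which never co-occur, end up nearly orthogonal.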

III. Challenges and Solutions in Semantic Relation Extraction

Although machine learning has brought considerable progress in semantic relation extraction, many challenges remain. Semantic analysis and content optimization are important factors in improving model effectiveness. Search engines and keyword research also play an important role in addressing these challenges.

3.1. Challenges in Processing Biomedical Data

Biomedical data is often complex and varied, which makes information extraction difficult. Data analysis and machine learning techniques are used to process this data. Content optimization and LSI keywords help improve the effectiveness of the extraction process.

3.2. Solutions for Optimizing Machine Learning Models

To improve the effectiveness of machine learning models, content optimization and semantic analysis are important steps. Search engines that integrate these techniques improve search and data retrieval. Keyword research and related keywords are also applied to optimize these models.

21/02/2025

Excerpt from the document

Declaration

I hereby declare that this Doctoral Dissertation was carried out by me for the degree of Doctor of Philosophy under the guidance and supervision of my supervisors. This dissertation is my own work and includes nothing which is the outcome of work done in collaboration except as specified in the text. It is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university; and no part has already been, or is currently being, submitted for any degree, diploma or other qualification.

Hanoi, January 2022

Author: Le Hoang Quynh

Abbreviations

Acc: Accuracy
Adam: Adaptive Moment Estimation
ANN: Artificial Neural Network
bagging: Bootstrap Aggregating
BB3: Bacteria Biotope Task
BC5 CDR corpus: BioCreative V Chemical-Disease Relation corpus
BERT: Bidirectional Encoder Representations from Transformers
biLSTM: Bidirectional Long Short-term Memory
CBOW: Continuous Bag-of-words
CDR: Chemical Disease Relation
CID: Chemical-induced Disease
CNN: Convolutional Neural Network
CRF: Conditional Random Fields
CTD: Comparative Toxicogenomics Database
DDI: Drug-drug Interaction
DNER: Disease Named Entity Recognition
DNN: Deep Neural Network
DU: Dependency Unit
ELMO: Embeddings from Language Models
FN: False Negative
FP: False Positive
FSU-PRGE: The FSU PRotein GEne Corpus
GD: Gradient Descent
HAScO: Human-Aware Science Ontology
HHEAR: Human Health Exposure Analysis Resource
HMM: Hidden Markov Model
IAA: Inter-annotator Agreement
IE: Information Extraction
KB: Knowledge-base
LSTM: Long Short-term Memory
MASS: Man for All SeasonS
MESH: Medical Subject Headings
mf: Multi-fragment
MLP: Multilayer Perceptron
MUC: Message Understanding Conferences
NCBI: National Center for Biotechnology Information
NCIT: National Cancer Institute Thesaurus
NE: Named Entity
NEN: Named Entity Normalization
NER: Named Entity Recognition
NLP: Natural Language Processing
OOV: Out-Of-Vocabulary
OWL: Orthology Ontology
P: Precision
PMC: Pubmed Central
POS: Part-of-speech
R: Recall
RbSP: Richer-but-Smarter Shortest Dependency Path
RC: Relation Classification
RE: Relation Extraction
ReLU: Rectified Linear Unit
REP: Replacement
RGO: Radiology Gamuts Ontology
RNN: Recurrent Neural Network
SDP: The Shortest Dependency Path
SilverCID: A Silver-standard Corpus for Chemical-induced Disease Relation Extraction
SNOMED: Systematized Nomenclature of Medicine
SSI: Supervised Semantic Indexing
stdev: Standard Deviation
SVM: Support Vector Machine
swCNN: Shared-weight Convolutional Neural Network
TN: True Negative
TP: True Positive
TPAC: the Token-POS tag-Abbreviation-Character Embeddings
UMLS: Unified Medical Language System
w/o REP: Without Replacement

List of Figures

Growth of MEDLINE citations from 1986 to 2019.
Challenges' subtasks/tracks organized based on NLP perspectives [64].
The dissertation outline.
An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species.
Examples of (a) inter-sentence relation and (b) intra-sentence relation.
Examples of relations with specific and unspecific location.
Examples of (a) Promotes - a directed relation and (b) Associated - an undirected relation taken from Phenebank corpus.
Named entity recognition approaches taxonomy.
Relation extraction approaches taxonomy.
The statistics of corpora used in our experiments for relation classification.
Analysis of the Direct Evidence field in the CTD databases.
An example of constructing silverCID corpus.
Architecture of the proposed UET-CAM system.
Advanced SSI model using skip-gram information for NEN.
Hybrid model of SSI and skip-gram model for NEN.
Sequential back-off model of SSI and skip-gram model for NEN.
An example of coreference in text.
An example of using multi-pass sieve for coreference resolution.
The D3NER architecture.
The TPAC embedding architecture of D3NER.
Example of a dependency tree.
Examples of the shortest dependency paths.
Examples of the dependency unit in the shortest dependency paths.
The architecture of MASS model for relation classification.
The multi-channel LSTM for word representation.
Ablation test results for various components and information sources of MASS model.
Examples of SDPs and attached child nodes.
The architecture of RbSP model for relation classification.
The multi-layer attention architecture to extract the augmented information from the children of a token on SDP.
Ablation test results for compositional embeddings of RbSP model.
Ablation test results for augmented information of RbSP model.
Training loss, training accuracy, validation loss and validation accuracy of our RbSP model in BC5 CDR corpus.
The range of RbSP model's results on BC5 CDR test set.
The multi-fragment ensemble architecture.
The changes of multi-fragment ensemble model's results with different size of training data.
The changes in F1 of multi-fragment ensemble model with different vote threshold.
Examples of complicated cross-sentence relations.
The proposed model for inter-sentence relation classification.
Use sliding window to choose adjacent sentences for building document sub-graph.
Examples of a document sub-graph.
Examples of two unexpected problems while generating the instance from document sub-graph.
Example of an abstract with many NER annotations that leads to the explosion of similar paths.
Diagram illustrating a swCNN architecture.
Ablation test results for virtual edges of the document sub-graph.
The change of results with different size of sliding window.

List of Tables

Example sentences labeled using different tagging schema.
Examples for different relation types.
Information about the BC5 CDR, NCBI and FSU-PRGE corpora for NER.
Information about the BC5 CDR, BB3, DDI and Phenebank corpora for relation classification.
Defining the test metrics.
Detailed Input/Output and the objectives of UET-CAM components.
Large-scale feature set used in the intra-sentence relation extraction module of UET-CAM system.
Named Entity Normalization results with different combining architectures.
Disease named entity recognition results on BC5 CDR corpus of UET-CAM system.
Relation classification results on BC5 CDR corpus of UET-CAM system.
Analysis of the contribution of methods and resources used in the UET-CAM system for capturing CID relationships.
Sources of errors by our system on the CDR test set.
Configurations and parameters of D3NER model.
Experimental results of D3NER for 20 runs each with different random initialization on BC5 CDR and NCBI corpora.
Performance of D3NER and compared state-of-the-art models on two benchmark corpora for Disease and Chemical NER.
Experimental results of D3NER for 20 runs each with different random initialization on FSU-PRGE corpus (4-fold cross validation).
Performance of D3NER and compared state-of-the-art model on FSU-PRGE corpus for Gene/protein NER.
Ablation test results for different embeddings of D3NER model.
Impact of fine-tuning embeddings as the D3NER's hyper-parameters.
D3NER confusion matrix on the CDR corpus.
Examples for errors caused by D3NER on the BC5 CDR and FSU-PRGE corpora.
Examples for different relation types.
Configurations and parameters of MASS model.
Results of MASS model on the BC5 CDR corpus.
Results of MASS model on the DDI-2013 corpus.
Results of MASS model on the BB3 corpus.
Results of MASS model on the Phenebank corpus.
Examples of MASS model's errors.
Configurations and parameters of RbSP model.
The RbSP model's performance on BC5 CDR corpus.
Multi-fragment ensemble results on BC5 CDR corpus.
The comparison of our ensemble proposed models with other comparative models on BC5 CDR corpus.
The comparison of our ensemble proposed models with other comparative models on DDI corpus.
Tuned hyper-parameter of proposed model.
Ablation test results for added virtual edges in the document sub-graph.
Results of the document sub-graph based model on BC5 CDR corpus with different size of sliding window for training and testing.
Ablation test results for various components of the document sub-graph based model on BC5 CDR corpus.
The performance of document sub-graph-based model and some comparative models.
The detailed results of the document sub-graph based model.
Examples of errors on the BC5 CDR test set.

Preface

The necessity of the dissertation: In the past several decades, biomedicine and human health care have become one of the major service industries. They have been receiving increasing attention from the research community and society as a whole. In 2011, biomedical research in the United States received 100 billion dollars of investment, with approximately 65% supported by industry, 30% by the government, and the remaining 5% by charities, foundations, or individual donors [137]. To this day, many researchers continue to work hard in the expectation that further advances will support biomedical science and healthcare.

Therefore, there is an inevitable need to understand and analyze the existing information and knowledge bases. As a result, the field of biomedical research has grown rapidly, and the number of biomedical scientific publications is increasing at an extremely high rate. Accessing and processing this data to keep abreast of the state of the art and to make discoveries in biomedical and healthcare research is essential for several types of users, including biomedical researchers, clinicians, database curators, and bibliometricians [77]. More than 3,000 articles are published in biomedical journals every day [64].

MEDLINE®, a biomedical database of the US National Library of Medicine, is one of the most prominent and largest biomedical digital repositories. As of 2019, it contained more than 26 million citations, with a fast-increasing number of articles in the life sciences and a concentration on biomedicine¹. Figure 1 illustrates the growth of MEDLINE from ~1 million citations in 1970 to ~26 million citations in 2019. More impressively, this number has nearly doubled in 14 years, from 2005 (~ 13.

PubMed®² is a free resource developed and maintained by the NCBI which provides free access to MEDLINE and some other databases.

[Figure 1: Growth of MEDLINE citations from 1986 to 2019. The vertical axis shows the number of citations (in millions). For clear visualization, the statistics before 2005 are presented every 5 years.]

¹ https://www.gov/bsd/medline_pubmed_production_stats.html
² https://www.gov/pubmed

According to the statistics reported in November 2019³, the cumulative total of PubMed citations has surpassed 30 million.

The document titled "Extracting Semantic Relations from Biomedical Text with Machine Learning" explores how machine learning techniques are applied to extract and analyze semantic relations in biomedical text. It emphasizes the importance of understanding these relations for improving the quality of medical information, thereby supporting clinical decisions and medical research. Readers will find practical benefits of applying machine learning in the biomedical domain, which helps make data processing and analysis more effective.
