A Study on Automatic Chinese Collocation Extraction (Doctoral Thesis)

A doctoral thesis that investigates methods for automatically extracting Chinese collocations, with applications in natural language processing and corpus analysis.

University

The Hong Kong Polytechnic University

Discipline

Computing

Document type

thesis

2006

214


Summary

I. The Research

This research focuses on the automatic extraction of Chinese collocations from text. Its main goal is to improve the performance of collocation extraction algorithms by adding syntactic and semantic information to statistical methods. The study identifies different types of collocations and designs algorithms suited to each type. It also proposes a new algorithm based on bi-directional word bi-grams to identify collocations that have a low co-occurrence frequency but fixed usage.

1.1. Research Method

The method consists of analyzing existing algorithms and proposing improvements. Separate algorithms target different types of collocations, each using the features and criteria associated with that type. Syntactic and semantic information is incorporated to raise extraction performance further. A large-scale Chinese collocation answer set was built so that the algorithms can be evaluated and compared objectively.

1.2. Research Results

The results show that syntactic patterns can significantly improve collocation extraction performance, especially in filtering out pseudo collocations. The extracted collocations were applied in the post-processing of a handwritten Chinese character recognition system, which demonstrates the practical value of the work.

II. Automatic Extraction

Automatic extraction is the use of algorithms to identify and extract collocations from text without manual work. In this study it is applied to Chinese text. The algorithms target different types of collocations using the features and criteria that suit each type, and extraction is further improved by incorporating syntactic and semantic information.

2.1. Extraction Algorithms

The extraction algorithms are designed to target different types of collocations. A new algorithm based on bi-directional word bi-grams is proposed to identify collocations with low co-occurrence frequency but fixed usage. The algorithms are evaluated and compared against a large-scale Chinese collocation answer set.
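To illustrate why looking at a word pair from both directions helps with rare but fixed combinations, the sketch below scores adjacent word pairs by their conditional probability in each direction. It is a toy under stated assumptions, not the thesis algorithm: the function name `bidirectional_strength`, the restriction to adjacent pairs, and the tiny pre-segmented corpus are all made up for this example.

```python
from collections import Counter

def bidirectional_strength(sentences):
    """Score adjacent word pairs by forward and backward conditional probability.

    sentences: list of pre-segmented sentences (lists of words).
    Returns {(w1, w2): (forward, backward)} where
      forward  = count(w1 w2) / count(w1 as the left word of any pair)
      backward = count(w1 w2) / count(w2 as the right word of any pair)
    """
    pair_counts = Counter()
    left_counts = Counter()   # occurrences of a word as the left member of a pair
    right_counts = Counter()  # occurrences of a word as the right member of a pair
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            left_counts[w1] += 1
            right_counts[w2] += 1
    return {
        (w1, w2): (n / left_counts[w1], n / right_counts[w2])
        for (w1, w2), n in pair_counts.items()
    }

# Hypothetical toy corpus, already segmented into words.
corpus = [
    ["充分", "发挥", "作用"],
    ["发挥", "作用"],
    ["起", "作用"],
    ["发挥", "优势"],
]
scores = bidirectional_strength(corpus)
for pair, (forward, backward) in scores.items():
    print(pair, round(forward, 2), round(backward, 2))
```

A pair that occurs only once can still score 1.0 in both directions when neither word appears with any other partner, which is the kind of low-frequency, fixed-use combination a purely frequency-based threshold would discard.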

2.2. Integrating Syntactic and Semantic Information

Extraction is improved by adding syntactic and semantic information. Syntactic patterns are used to filter out pseudo collocations, and the experimental results show that they can significantly improve extraction performance.

III. Chinese Collocations

Chinese collocations are word combinations that habitually occur together in text and are syntactically or semantically significant. In this study they are categorized by compositionality, substitutability, modifiability and internal association, and the extraction algorithms target each type with its own features and criteria.

3.1. Categorization of Collocations

Collocations are divided into four types based on their compositionality, substitutability, modifiability and internal association. Extraction algorithms are then designed for each of these types.

3.2. Collocation Features

The same four properties (compositionality, substitutability, modifiability and internal association) are the features analyzed when designing the extraction algorithms, so that each algorithm uses the features and criteria that suit the type it targets.

IV. The Doctoral Thesis

The thesis addresses the automatic extraction of Chinese collocations from text. It identifies different types of collocations and designs algorithms suited to each. It also proposes a new algorithm based on bi-directional word bi-grams to identify collocations with low co-occurrence frequency but fixed usage. The extracted collocations were applied in the post-processing of a handwritten Chinese character recognition system, which shows the practical value of the research.

4.1. Research Objectives

The main objective of the thesis is to improve the performance of algorithms for automatic Chinese collocation extraction by identifying collocation types, designing type-specific algorithms, and introducing the bi-directional bi-gram algorithm for low-frequency, fixed-use collocations.

4.2. Practical Application

The extracted collocations were applied in the post-processing of a handwritten Chinese character recognition system. The experiments indicate that collocation information can improve the performance of natural language related applications in real use.

21/02/2025


Excerpt from the document

The Study on Automatic Chinese Collocation Extraction
by Xu Ruifeng

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Department of Computing
The Hong Kong Polytechnic University
Jan 12, 2006

UMI Number: 3241091

UMI Microform 3241091. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved.

CERTIFICATE OF ORIGINALITY

I hereby declare that this thesis is my own work and that, to the best of my knowledge and belief, it reproduces no material previously published or written, nor material that has been accepted for the award of any other degree or diploma, except where due acknowledgement has been made in the text.

Collocation is a lexical phenomenon in which two or more words are habitually combined as a conventional way of saying things. Collocation information is essential to many natural language processing tasks such as word sense disambiguation, machine translation, and information extraction. Most current work on collocation extraction is statistically based and has limited precision and recall, because it cannot reliably distinguish word co-occurrences, which are statistically significant, from true collocations, which are of habitual use and are thus either syntactically or semantically significant.

The objective of this study is to investigate methods to improve the performance of Chinese collocation extraction algorithms. Different types of collocations are identified. Collocation extraction algorithms are then designed to target the different types of collocations, using the features and criteria associated with each type. In addition to improving statistically based collocation extraction algorithms, syntactic and semantic information is incorporated into the algorithm to further improve the performance of collocation extraction.

In the study of the statistically based algorithms, a new algorithm based on bi-directional word bi-grams is designed to help identify collocations that have low co-occurrence frequency but are of fixed use. A large-scale Chinese collocation answer set is established so that collocation extraction algorithms can be evaluated and compared objectively using the same training corpus and corresponding answer set. Collocations are then categorized into four types based on their compositionality, substitutability, modifiability and internal association. Based on the characteristics of each type of collocation, a multi-stage window-based collocation extraction system is built in which the n-gram collocations and the different types of bi-gram collocations are extracted separately in different stages, using different strategies and different discriminative features.
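The window-based counting that such a system starts from can be sketched as follows: for each headword, record which words co-occur within a fixed distance and at which signed offset. The window size of 5, the names `window_cooccurrences` and `pair_frequency`, and the toy sentences are assumptions for illustration; the features and thresholds actually used in the thesis are not given in this excerpt.

```python
from collections import defaultdict

def window_cooccurrences(sentences, window=5):
    """Count co-occurrences of word pairs by relative offset.

    Returns counts[(headword, co_word)][offset], where offset is the signed
    distance of co_word from headword (negative = to the left).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        for i, head in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(head, words[j])][j - i] += 1
    return counts

def pair_frequency(counts, head, co_word):
    """Total co-occurrence frequency of a pair over all offsets in the window."""
    return sum(counts.get((head, co_word), {}).values())

# Hypothetical pre-segmented sentences.
sentences = [
    ["发挥", "重要", "作用"],
    ["发挥", "了", "积极", "作用"],
]
counts = window_cooccurrences(sentences)
print(dict(counts[("发挥", "作用")]))          # offsets at which the pair was seen
print(pair_frequency(counts, "发挥", "作用"))  # total frequency inside the window
```

Keeping the offset distribution, rather than a single total, lets later stages tell pairs that always appear at one fixed distance apart from pairs whose positions are spread across the window.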

A Chinese shallow treebank, referred to as the PolyU Treebank, is annotated manually to provide syntactic and semantic knowledge to further help collocation extraction. This treebank is also used to train a chunker based on a lexicalized Hidden Markov Model (HMM). The chunker provides a way to process running text for collocation extraction. By using the support collocation patterns and reject collocation patterns extracted from the annotated Chinese treebank and from parsed running text, syntactic features are employed to further improve the performance of the window-based collocation extraction system.
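A minimal sketch of pattern-based filtering is given below, assuming each candidate pair already carries a part-of-speech pattern assigned by some chunker. The pattern sets, the tag names (`v`, `n`, `a`, `u`), the function name `filter_by_patterns` and the candidates are hypothetical placeholders; the thesis derives its support and reject patterns from the annotated treebank and parsed text, and its exact decision rule is not stated in this excerpt.

```python
def filter_by_patterns(candidates, support, reject):
    """Split candidates into kept and rejected lists using syntactic patterns.

    candidates: iterable of (word1, word2, pattern) tuples, where pattern is a
                (tag1, tag2) part-of-speech pair observed for the candidate.
    support:    set of patterns taken as evidence for a true collocation.
    reject:     set of patterns taken as evidence of a pseudo collocation.
    A candidate is rejected only if its pattern is in `reject` and not in
    `support`; candidates matching neither set are kept, so the earlier
    statistical decision still stands for them.
    """
    kept, rejected = [], []
    for w1, w2, pattern in candidates:
        if pattern in reject and pattern not in support:
            rejected.append((w1, w2))
        else:
            kept.append((w1, w2))
    return kept, rejected

# Hypothetical pattern sets and candidates, for illustration only.
support_patterns = {("v", "n"), ("a", "n")}   # verb-noun, adjective-noun
reject_patterns = {("u", "n")}                # particle-noun
candidates = [
    ("发挥", "作用", ("v", "n")),
    ("的", "作用", ("u", "n")),
    ("重要", "作用", ("a", "n")),
]
kept, rejected = filter_by_patterns(candidates, support_patterns, reject_patterns)
print(kept)
print(rejected)
```

Here the particle-noun pair is dropped even though it may co-occur frequently, which is how a syntactic filter can remove candidates that pass a purely statistical test.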

Experimental results show that the use of syntactic patterns can significantly improve the performance of collocation extraction, especially for filtering pseudo collocations. The extracted collocations were applied in the post-processing of a handwritten Chinese character recognition system. Experiments indicate that collocation information can be used in real applications to improve the performance of these natural language related applications.

Keywords: Collocation extraction, Treebank, Chunking and parsing

Publications Arising From the Thesis

[1] Guo-hong Fu, Ruifeng Xu, K. Luke and Qin Lu, Chinese Text Chunking Using Lexicalized HMMs, In Proceedings of IEEE International Conference on Machine Learning and Cybernetics, pp. 7-12, Guang Zhou, China, 2005
[2] Qin Lu, Jing Zhou and Ruifeng Xu, Machine Learning Approaches for Chinese Shallow Parsing, In Proceedings of IEEE International Conference on Machine Learning and Cybernetics, vol.2309-2314, Xi'an, China, 2003
[3] Qin Lu, S. Chan, Ruifeng Xu, et al. A Unicode Based Adaptive Segmentor, In Proceedings of 2nd Workshop on ACL SIGHAN, pp. 164-167, Sapporo, Japan, 2003
[4] Qin Lu, Yin Li and Ruifeng Xu, Improving Xtract for Chinese Collocation Extraction, In Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 333-338, Beijing, China, 2003
[5] Qin Lu, S. Chan, Ruifeng Xu, et al. A Unicode Based Adaptive Segmentor, Journal of Chinese Language and Computing, vol. 3, pp. 221-234, 2004
[6] Ruifeng Xu, Qin Lu, Daniel S. Yeung and Xizhao Wang, Distant BI-Gram Model, Collocation and Their Application in Post-processing for Chinese Character Recognition, In Proceedings of IEEE International Conference on Machine Learning and Cybernetics, vol.2251-2255, Beijing, China, 2002
[7] Ruifeng Xu, Qin Lu and Yin Li, An Automatic Chinese Collocation Extraction Algorithm based on Lexical Statistics, In Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 321-326, Beijing, China, 2003
[8] Ruifeng Xu, Qin Lu and Wanyin Li, The Construction of a Chinese Shallow Treebank, In Proceedings of 3rd Workshop on ACL SIGHAN, pp. 94-101, Barcelona, Spain, 2004
[9] Ruifeng Xu, Daniel Yeung and Daming Shi, A Hybrid Post-processing System for Offline Handwritten Chinese Character Recognition based on a Statistical Language Model, International Journal of Pattern Recognition and Artificial Intelligence, vol. 415-428, 2005
[10] Ruifeng Xu, Qin Lu, Yin Li and Wanyin Li, The Design and Construction of the PolyU Shallow Treebank, International Journal of Computational Linguistics and Chinese Language Processing, vol.397-416, 2005
[11] Ruifeng Xu and Qin Lu, Multi-stage Chinese Collocation Extraction, In Proceedings of IEEE International Conference on Machine Learning and Cybernetics, pp. 3254-3259, Guang Zhou, China, 2005
[12] Ruifeng Xu and Qin Lu, Improving Collocation Extraction by Using Syntactic Patterns, In Proceedings of IEEE Conference on Natural Language Processing and Knowledge Engineering 2005, pp. 52-57, WuHan, China, 2005
[13] Ruifeng Xu and Qin Lu, A Multi-stage Collocation Extraction System, The Advances in Machine Learning and Cybernetics, Lecture Notes on Artificial Intelligences (LNAI 3930), (Yeung D.), Springer-Verlag, Berlin Heidelberg: pp. 740-749, 2006
[14] Ruifeng Xu, Qin Lu and Sujian Li, The Design and Construction of a Chinese Collocation Bank, Accepted for publication in Proceedings of Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, 2006
[15] SuJian Li, Wen-jie Li, Qin Lu and Ruifeng Xu, Verifying Person Descriptions with Term-Entity Association, In Proceedings of IEEE International Conference on Machine Learning and Cybernetics, pp. 50-55, Guang Zhou, China, 2005
[16] Sujian Li, Yun Li, Luning Ji and Ruifeng Xu, Use of Dictionary Matching and String Frequency Statistics in Content Analysis of Automatic Indexing, In Proceedings of 8th Joint Symposium on Computational Linguistics, NanJing, China, 2005
[17] Wanyin Li, Qin Lu and Ruifeng Xu, Using Synonym Relations in Chinese Collocation Extraction, In Proceedings of 3rd Workshop on ACL SIGHAN, pp. 86-93, Barcelona, Spain, 2004
[18] Wanyin Li, Qin Lu and Ruifeng Xu, Similarity based Chinese Synonyms Collocation Extraction, International Journal of Computational Linguistics and Chinese Language Processing, vol.

This thesis could not have been done without the help and cooperation of many people, and it is now my great pleasure to take this opportunity to thank them.

First and foremost, I would like to express my deepest thanks to my supervisor, Prof. Qin Lu, for being a consistent source of support and encouragement. I could not imagine having a better advisor and mentor. Without her knowledge and perceptiveness, I would never have finished my Ph.D. I gratefully acknowledge that she gave me enormous freedom to pursue my own interests while at the same time providing just the right amount of guidance to ensure the right research approach.

It is my great pleasure to thank Dr. James Liu, Prof. William Wong and Prof. Maosong Sun for their great efforts, valuable comments and excellent advice, which improved the quality and readability of the earlier version of this thesis. I would like to thank Dr. WenJie Li, who constantly encouraged me and contributed her valuable insight in academic research during the past three years. Another person to whom I would like to express my deep gratitude is Dr. Sujian Li, my close friend, for continuous support and kind help.

I would also like to thank all my friends in our research group, Mrs. Yin Li, Ms. WanYan Li, Mr. Tin-Shing Chiu, Ms. LuNing Ji, Mr. Ming-Li Wu, Dr. BaoLi Li, Mr. Wei Li and Mr. Qing Chen, who have always been a great support to me and have made this group a wonderful place to learn and have fun. I will treasure their friendship for the rest of my life.

At last, I would like to express my deepest appreciation to my father Yu-Shu Xu and my mother Shao-Lan Li, my wife Shu-Qi Jiang, my aunt Shu-Jun Xu and my uncle Bao-Ku Su, my elder brother Rui-Song Xu and his wife Di An, for their endless love and unwavering support.

This thesis is dedicated to my mother Shao-Lan Li who gave me life, to my wife Shu-Qi Jiang who gives me true love, and to my baby daughter Carol Xu, the hope.

Table of Contents

Certificate of Originality
Publications Arising from the Thesis
Table of Contents
List of Figures
Basic Concepts and Thesis Scope
Motivation and Problem Statement
Research Objectives and Thesis Scope
Review of Automatic Collocation Extraction Techniques
  Window-based Statistical Collocation Extraction Approach
  Syntax-based Collocation Extraction Approach
  Collocation Extraction using Semantic Information
Review of Automatic Shallow Parsing
  Statistic-based Shallow Parsing
  Rule-based Shallow Parsing
4 Collocation Extraction Based on Lexical Statistics
  Preparation of Training Corpus and Answer Set
  Applying Xtract to Chinese Collocation Extraction: CXtract
  Improving CXtract
  A New Collocation Extraction System: CXtract2
    The Framework Design
    Construct a Word Co-occurrence Database for CXtract2
    Evaluation of CXtract2
  Evaluations of Statistical Collocation Extraction Algorithms
5 Multi-Stage Collocation Extraction
  Categorization of Chinese Collocations
  Characteristic Analysis of Typical Collocations
  The Design of a Multi-Stage Collocation Extraction System
    Additional Feature Selections
    Applying the Heuristic Rules to Eliminate Pseudo Collocations
    The New Multi-stage Extraction Algorithm
    Parameter Optimization based on Perceptron Training Rule
  Experimental Results and Evaluations
    Experimental Data Preparation
    Experiments on Type 1 and Type 2 Collocation Extraction in Stage 3
    Experiments on Weight Parameter Optimization
    Experiments on Multi-stage Collocation Extraction of Stage 1-3 and 5
    Experiments on Pseudo Collocation Filtering by Using Heuristic Rules
    Experiments on Evaluating the Complete Collocation Extraction System
  Chapter Summarization
6 The Design and Development of Chinese Shallow Treebank and Automatic Chunkers
  The Design and Development of PolyU Treebank
    Basic Concepts and Background of Shallow Treebank
    Annotation Guideline Design
    Implementation of the PolyU Treebank
    Quality Assurance and Annotation Progress
    Contributions of PolyU Treebank
  The Design and Development of Automatic Chunkers
    Chunking Scope and Representation
    Chunking with POS Features
    Chunking with Lexicalized Features
    Experiments and Evaluations


A Study on Automatic Chinese Collocation Extraction is an in-depth work on methods for automatically extracting Chinese collocations from text. It aims to improve the effectiveness of natural language processing, and it is a useful resource for readers interested in natural language processing and artificial intelligence.



DETAILED INFORMATION

Author: Xu Ruifeng

Supervisor: Prof. Qin Lu

University: The Hong Kong Polytechnic University

Discipline: Computing

Title: The Study on Automatic Chinese Collocation Extraction

Document type: thesis

Year of publication: 2006

Place: Hong Kong
