Master's thesis: Extracting Vietnamese collocations from text corpora

A VNU UET master's thesis that studies the extraction of Vietnamese collocations from text corpora, contributing to linguistics and natural language processing.

University

Vietnam National University

Major

Information Technology

Uploaded by

Anonymous

Type

Thesis

2010

Detailed table of contents

1. Introduction

1.1. Overview of Named Entity Recognition (NER)

1.2. NER approach

1.2.1. Rule based approach

1.2.2. Machine learning approach

1.3. Thesis contribution

1.4. Thesis structure

2. Related Work

2.1. Overview of our problem

2.2. Research on building NER corpora

2.3. Research on the corpus building process

2.4. Overview of annotation tools

2.5. Summary

3. Corpus building process

3.1. Objective

4. Online Annotation Framework

4.1. Online annotation interface

4.2. Automated file distribution to annotators

4.3. Automated saving and management of files

4.4. Explaining unusual entities

4.5. Inter-annotator agreement

4.6. Offline corpus evaluation

4.7. Named entity recognition system

6. Conclusion and future work

6.1. Creating a bigger, higher-quality corpus

6.2. Improving the online annotation framework

6.3. Building a statistical NER system

Named entity guideline

A.1. Entity and entity name

A.2. Instance of entity

A.3. List of entities

A.4. Entity recognition rules

Summary

I. Overview of research on extracting Vietnamese collocations from text corpora

Research on extracting Vietnamese collocations from text corpora is an important area of natural language processing. Recognizing and analyzing collocations improves a computer's ability to understand language. Current methods include semantic mining and text analysis, which aim to optimize the recognition of named entities in text.

1.1. The concept of semantic mining in Vietnamese

Semantic mining is the process of recognizing and classifying named entities in text. This includes identifying the names of people, places and organizations, which helps the computer better understand the context of the text.

1.2. The importance of text analysis

Text analysis not only helps recognize entities but also supports the extraction of important information. It can be applied in many fields such as information retrieval, machine translation and data analysis.

II. Problems and challenges in research on extracting Vietnamese collocations

Despite much progress in semantic mining, many challenges remain. One of the biggest is the lack of high-quality data for Vietnamese, which limits the accuracy of machine learning models for entity recognition.

2.1. Lack of high-quality data

Building a large, high-quality dataset for Vietnamese is very difficult. Many current studies still do not have enough data to train machine learning models effectively.

2.2. Difficulties in semantic recognition

Semantic recognition in Vietnamese is hard because words are often polysemous and contexts are rich. Models therefore have to be optimized to handle complex cases.

III. Main methods in research on extracting Vietnamese collocations

Several methods are used in research on extracting Vietnamese collocations, including rule-based methods and machine learning methods. Each has its own advantages and disadvantages, which affect the final results.

3.1. Rule-based methods

This method uses grammar rules and dictionaries to recognize entities. Although it is easy to deploy, its accuracy is usually low in complex cases.

3.2. Machine learning methods

Machine learning makes it possible to build more complex models that learn from data. However, it requires a large amount of labeled data to be effective.

IV. Practical applications of research on extracting Vietnamese collocations

Research on extracting Vietnamese collocations has many practical applications in areas such as information retrieval, data analysis and machine translation. These applications improve performance and also enhance the user experience.

4.1. More effective information retrieval

Accurately recognizing the entities in a text improves information retrieval and so gives users more accurate results.

4.2. Applications in machine translation

Semantic mining can improve the quality of machine translation by helping the computer better understand the context and meaning of phrases in the text.

V. Conclusion and the future of research on extracting Vietnamese collocations

Research on extracting Vietnamese collocations is developing rapidly. With advances in machine learning and the growth of available data, the field promises many positive results.

5.1. Potential for future development

As technology develops, semantic recognition and analysis will become increasingly accurate, opening up new opportunities for research and applications.

5.2. New directions for research

Research should focus on building high-quality datasets and developing more advanced machine learning models to make semantic mining more effective.

22/07/2025

Excerpt from the document

Towards a Framework for Building an Annotated Named Entities Corpus

Hoang Huu Son
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Dr. Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June 2010

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at Coltech or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at Coltech lab or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Table of Contents

1 Introduction
1.1 Overview of Named Entity Recognition (NER)
Rule based approach
Machine learning approach
Overview of our problem
Research on building NER corpora
Research on the corpus building process
Overview of annotation tools

3 Corpus building process
3.1 Corpus building process
Building the annotation guideline
Building a Vietnamese NER corpus with offline tools
Building the annotation guideline
Discussion of the Vietnamese NER corpus building process
4 Online Annotation Framework
4.1 Online annotation interface
Automated file distribution to annotators
Automated saving and management of files
Explaining unusual entities
Inter-annotator agreement
Offline corpus evaluation
Named entity recognition system
6 Conclusion and Future Work
6.1 Creating a bigger, higher-quality corpus
Improving the online annotation framework
Building a statistical NER system
A Name Entity guideline
A.1 Entity and entity name
Instance of entity
List of entities
Entity recognition rules

List of Figures

3.1 Process of building the annotation guideline
Comparing two users' corpora
Online annotation process
Online annotation tool interface
Annotation guideline form interface
Review tool interface
Interface for comparing two documents
Inter-annotator agreement results for two users
Accuracy rate for each entity kind
Accuracy rate of the online corpus for each entity kind
Named entity recognition system architecture
JAPE rule to recognize the Person entity
Performance on the training data using strict criteria
Performance on the test data using strict criteria
Performance on the test data using lenient criteria

List of Tables

5.1 An example of a par corpus annotated by two users (User A and User B)
Frequency of annotated documents
Inter-annotator agreement in online annotation
User corpus accuracy rate in the online method
Time spent on corpus quality control
Time spent during the annotation process
Quality control time in the online framework

Chapter 1
Introduction

1.1 Overview of Named Entity Recognition (NER)

The ability to determine the named entities in a text has been established as an important task for several natural language processing areas, including information retrieval, machine translation, information extraction and language understanding. The term "Named Entity", widely used in Natural Language Processing (NLP), was coined for the Sixth Message Understanding Conference (MUC-6).

At the time, MUC was focusing on Information Extraction (IE) tasks, where structured information on computer activities and defense-related activities is extracted from unstructured text such as newspaper articles. In defining the tasks, people noticed that it is essential to recognize information units such as names (person, organization and location names) and numeric expressions (time, date, money and percent expressions). Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called "Named Entity Recognition and Classification".

The computational research aiming at automatically identifying named entities in texts forms a vast and heterogeneous pool of strategies, methods and representations. One of the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh IEEE Conference on Artificial Intelligence Applications. In general, every NER study has to address four issues: language, input format, kind of entity, and learning method.

Languages: NER has been applied to several languages. English NER is well researched, and some of that work has addressed language independence and multilingualism. German is well studied in CONLL-2003 and in earlier works. Similarly, Spanish and Dutch are strongly represented, boosted by a major devoted conference: CONLL-2002 (Collins, 2002).

Chinese is studied in several works (Wang et al., 1992), (Computer et al., 1996), (Yu et al., 1998), and so are French (Petasis et al., 2001), (And, 2003), Greek (Karkaletsis et al., 1999) and Italian (Black et al.). Many other languages have received some attention as well: Basque (Whitelaw & Patrick, 2003), Bulgarian (Silva et al., 2004) and Catalan (Carreras et al.). Portuguese was examined by (Palmer et al.). For Vietnamese, some NER research has been carried out, for example the VN-KIM IE system (Nguyen & Cao, 2007).

Input format: NER research has been applied to many document formats (general text, email, scientific text, journalistic text, etc.) and many domains (sport, business, literature, etc.).

Each system usually targets a specific format and domain: Maynard et al. designed a system for email, scientific texts and religious texts, and Minkov & Wang (2005) created a system specifically designed for email documents. Nowadays, studies aim to cover newer kinds of formats and domains.

For example, both the MUC-6 collection, composed of newswire texts, and a proprietary corpus made of manual translations of phone conversations and technical emails have been used.

Kinds of entity: Although the list of entities depends on the problem and the domain, NER systems usually recognize Person, Location, Organization, Date, Time, Money and Percent. Ambiguity arises mostly among Person, Location and Organization, and less often for the other kinds. Each domain targets its own specific entities; in the medical domain, for instance, an entity can be the name of a disease or of a medicine.

1.2 NER Approach

As with other NLP problems, NER research has developed along two main approaches:

• Rule based approach.

• Machine learning approach.

1.2.1 Rule based approach

Using expert knowledge to build a rule system is the traditional approach, applied in NLP in general and in NER in particular. A rule system is a set of rules built by people (usually experts) for a particular target. Rules are built from features such as part of speech, context (the words and phrases before and after a word), surface properties (uppercase, lowercase) and special dictionaries. For example:

President Busto leave Iraq said Monday's talks will include discussion on security, a timetable for U.S forces

In this example, "Busto" appears after "President", so "Busto" is annotated as a Person entity. Similarly, "Iraq" appears after the verb "leave", so it is treated as a Location entity. This approach does not need an annotated corpus: entities can be identified and classified immediately by the set of rules.
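The two context rules in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule system used in the thesis (whose list of figures mentions JAPE rules): the trigger lists `PERSON_TITLES` and `LOCATION_VERBS` and the function `tag_entities` are hypothetical names introduced here.

```python
# Hypothetical trigger lists; a real rule system would use curated dictionaries.
PERSON_TITLES = {"President", "Mr.", "Mrs.", "Dr."}
LOCATION_VERBS = {"leave", "leaves", "left", "visit", "visits", "visited"}


def tag_entities(sentence):
    """Return (word, label) pairs found by two simple context rules."""
    tokens = sentence.split()
    entities = []
    for i, token in enumerate(tokens):
        word = token.strip(".,")
        # Both rules only consider capitalized words as entity candidates.
        if not word[:1].isupper():
            continue
        previous = tokens[i - 1] if i > 0 else ""
        if previous in PERSON_TITLES:
            # Rule 1: a capitalized word right after a title is a Person.
            entities.append((word, "Person"))
        elif previous in LOCATION_VERBS:
            # Rule 2: a capitalized word right after a movement verb is a Location.
            entities.append((word, "Location"))
    return entities


example = "President Busto leave Iraq said Monday's talks will include discussion on security"
print(tag_entities(example))
```

Each rule looks only at the word immediately before the candidate, so new rules can interact in unexpected ways as the lists grow, which is the rule-organization difficulty this section discusses.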

The advantage of this approach is that a rule-based system is easy to build, which is why many NER systems have been rule-based since the early period. However, it is difficult to improve the accuracy rate, because organizing the set of rules is difficult. If the rules are not organized appropriately, they overlap each other and the system cannot identify and classify entities correctly.

1.2.2 Machine learning approach

Nowadays, machine learning is a common approach to NLP problems. In NER, it is used to improve accuracy. Models that have been applied include support vector machines, hidden Markov models and decision trees. Three kinds of learning methods are used in machine learning: unsupervised, supervised and semi-supervised.

However, unsupervised and semi-supervised systems are rarely used for NER problems. Only a few studies apply these methods: for example, Collins built a system that uses an unannotated corpus (Collins & Singer, 1999), and Kim built a system that uses proper names and an unannotated corpus. Supervised systems are more popular for NER problems.

For example: Bikel with a hidden Markov model (Black et al., 1998), and Borthwick with Maximum Entropy (Borthwick et al.). In machine learning systems, we must build three sets: a training set, a test set and a practice set.

• A training set consists of an input vector and an answer vector, and is used together with a supervised learning method to train the system's knowledge. In NER, a training set is a corpus annotated with standard labels.
• A test set is similar to a training set, but its purpose is to check and evaluate the system's accuracy. In NER, the test set is a corpus similar to the training set.
• A practice set is the data to which the machine learning system is applied to identify and classify entities automatically. Processing the practice set is the goal of building the system.

1.2.3 Comparing

Annotation-based learning has some advantages over manually written rules:

• Annotation-based learning can continue indefinitely, over weeks and months, with relatively self-contained annotation decisions at each point. In contrast, rule writing must remain cognizant of potential interdependencies with previous rules when adding and revising rules, ultimately bounding continued rule-system growth by cognitive load.
• Annotation-based learning can combine the efforts of multiple people more effectively. The tagged sentences from different data sets can simply be concatenated to form larger data sets with broader coverage.
• Rule writers require considerable skill, including not only linguistic knowledge for annotation but also competence with regular expressions and the ability to grasp the complex interactions within a rule list. In machine learning approaches, annotators only need to be fluent in the language.
• The performance of systems built by rule writers tends to exhibit considerably more variance, while machine learning systems tend to give much more consistent results.

Although the machine learning approach has many advantages, we face one main barrier: machine learning needs a high-quality corpus. The problem is therefore how to build a high-quality corpus.
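To make the training and test sets described in Section 1.2.2 more concrete, the following Python sketch shows one possible way to represent an annotated corpus and split it. Everything in it is an assumption for illustration: the toy sentences, the label names (with "O" for tokens outside any entity), and the helper functions `split_corpus` and `token_accuracy` do not come from the thesis.

```python
import random

# Hypothetical annotated corpus: one list of (token, label) pairs per sentence.
# "O" marks tokens outside any entity; the label names are assumptions.
CORPUS = [
    [("President", "O"), ("Busto", "Person"), ("leave", "O"), ("Iraq", "Location")],
    [("Talks", "O"), ("start", "O"), ("Monday", "Date")],
    [("Hanoi", "Location"), ("is", "O"), ("large", "O")],
    [("Coltech", "Organization"), ("opened", "O"), ("a", "O"), ("lab", "O")],
    [("Prices", "O"), ("rose", "O"), ("5%", "Percent")],
]


def split_corpus(sentences, test_ratio=0.2, seed=0):
    """Shuffle the sentences and return (training_set, test_set)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def token_accuracy(gold, predicted):
    """Share of tokens whose predicted label equals the gold label."""
    pairs = [(g[1], p[1]) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


train_set, test_set = split_corpus(CORPUS)
# A trivial baseline that labels every token "O", scored on the held-out test set.
baseline = [[(tok, "O") for tok, _ in sent] for sent in test_set]
print(len(train_set), len(test_set), token_accuracy(test_set, baseline))
```

The test set is held out from training so that the accuracy it reports reflects sentences the system has not seen, and the quality of both sets depends directly on how carefully the corpus was annotated.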

For Vietnamese, no NER corpus has been published. Although some systems have been built with the machine learning approach, their authors do not share their corpora, so it is difficult for other researchers to improve the accuracy of NER systems. For this reason, my thesis focuses on:

• Solutions for building a Vietnamese NER corpus.
• Quality control and evaluation of the corpus.

• Applying the corpus to the NER problem.

1.3 Thesis contribution

The contributions of the thesis are:

• We release a corpus building process.
• We apply the process to build an NER corpus with the offline tools method, a manual approach that uses desktop programs such as the Callisto tool.
• To overcome the disadvantages of offline tools, we build an online annotation framework with the following features:
– Annotation is carried out over the Internet (annotate anytime, anywhere).
– All steps in the process are automated: managing files, distributing them to annotators, etc.
– A larger number of annotators is supported.
– The corpus is quality-controlled at many levels.
• We apply the corpus to evaluate our NER system.

Topics

Vietnamese natural language processing

Master's thesis in information technology

Named entity recognition (NER)

Corpus building for machine learning