VIETNAM NATIONAL UNIVERSITY HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

THI-THANH-TAM DO

TAGSET EVALUATION AND AUTOMATICAL ERROR VERIFICATION IN POS TAGGED CORPUS

Branch of knowledge: Information technology
Major: Computer science
Code: 60 48 01

MASTER THESIS (Natural language processing)

Supervisor: Dr. Nguyen Phuong Thai

Ha Noi - 2012

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
INTRODUCTION AND MOTIVATION
Characteristics of Vietnamese language
Vietnamese part of speech
Criteria to classify
The ways to build up tagset
Organization of the thesis
EVALUATING DISTRIBUTIONAL PROPERTIES - CONVERSION POSSIBILITY OF TAGSETS
A method for evaluating distributional properties of tagsets
Result of tagset evaluation
Possibility of tagsets convertibility
Result of tagset convertibility
AUTOMATIC ERROR VERIFICATION OF POS-TAGGED CORPUS
Concept related to variation n-gram method
Types of Vietnamese tagging error
An algorithm for detecting errors
Result of detecting errors in POS tagging
Word in Vietnamese
N-gram in word segmentation
Result of detecting errors in word segmentation
CONCLUSION AND SUMMARY
The Vietnamese treebank tagset
Syntax function tags in VTB
Adverbial classification tag of verb in VTB
Phrase tagset in VTB
Clause tagset in VTB

LIST OF FIGURES
Figure 1. The features of Vietnamese type
Purity as external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and , 3 (cluster 3).
N-gram and variation nuclei in VTB corpus with n up to 29

LIST OF TABLES
Table 1. The expression of grammatical meaning in Vietnamese
Corpus with VnQtag tagset annotation
Principal differences between Vietnamese and English
Some frames found in corpus
Result of tagset evaluation method
Some properties in tagset convertibility method in Hoangtube
Statistics of ambiguous word types in VnQtag corpus
Statistics of ambiguous tokens in VnQtag corpus
Detailed statistics of ambiguous word types in VnQtag corpus
Statistics of errors in corpus
Details of n-grams in tagged corpus
Errors and ambiguity statistics in the word segmentation algorithm
Table 13: Detail of context and variation in VTB corpus

CHAPTER 1
INTRODUCTION AND MOTIVATION

1. Characteristics of Vietnamese language

Every language in the world has its own features, and so does Vietnamese.
To understand Vietnamese better, we list some of its prominent features and compare it with other languages such as Chinese and English. According to Ferlus and other domestic and international researchers, Vietnamese is a language of native origin that belongs to the South Asian (Austroasiatic) languages, Mon-Khmer family, and is closely related to the Muong language. Besides, Vietnamese is an isolating language with three prominent features. Firstly, the syllable is the foundation unit for forming words and sentences. A syllable may be a single word, or an element of a complex word, a compound word, or a reduplicative word.
Secondly, Vietnamese words are not inflected. In particular, there is no difference between singular and plural nouns; for example, “hai cuốn sách” (two books) and “một cuốn sách” (one book). Thirdly, grammatical meaning is expressed mainly through word order and expletives (function words). Consider the expletives “sẽ”, “đã”, “không” and the sentence “Tôi ra ngoài”.
From this input we can make three sentences with different meanings: “Tôi sẽ ra ngoài” (I will go out); “Tôi đã ra ngoài” (I went out); “Tôi không ra ngoài” (I do not go out).

[Figure 1. The features of Vietnamese type. The characteristics of Vietnamese: (1) the syllable is the foundation unit to form a word or sentence; (2) the Vietnamese word is not inflectional; (3) grammatical meaning is expressed mainly through word order and the expletive method.]

Some other languages in the world, such as Chinese and Thai, are also isolating languages. English, French, and Russian are inflectional languages.
These language types therefore differ in some features, as a comparison of Vietnamese, Chinese, and English sentences shows.

Table 1. The expression of grammatical meaning in Vietnamese

              Vietnamese              Chinese        English
Word order    Tôi yêu anh ấy          Wo ai ta       I love him
              Anh ấy yêu tôi          Ta ai wo       He loves me
Expletive     Tôi không yêu anh ấy    Wo bu ai ta    I do not love him

Unlike in Vietnamese and Chinese, in the English sentences above the object pronoun turns into the subject pronoun when the word order changes (him → he).

Vietnamese part of speech

1.
Criteria to classify

In European languages, the notion of POS is tied to morphological categories such as gender, number, mood, and so on. In Vietnam, there are two views. Firstly, POS does not exist in Vietnamese because Vietnamese has no morphological modification (Le Quang Trinh, Nguyen Hien Le, Ho Huu Tung). Secondly, like European languages, Vietnamese also has POS, but classifying words into tags, or defining the POS of a word, must be based on certain criteria. So far, Vietnamese linguists have largely agreed on the following criteria (Diep Quang Ban, Hoang Van Thung, 2010):

a. General meaning: “The meaning of a POS is the general meaning of a group of words, based on the generalization of vocabulary to form a common grammatical category (lexical-grammatical category)”. POSs therefore fit the definition of a classification category: they are groups containing very large numbers of words, and each group shares a classifying feature such as object, quality, action or state, and so on. Therefore, nhà, bàn, chim, học sinh, con, quyển, sự, and so on are classified as nouns because their lexical meaning is generalized and abstracted as objects. Their grammatical category is noun.

b. Combination ability: Given their general meaning, words can take part in a meaningful combination: some words can replace each other in a certain position of a combination, while the rest of the combination provides the context in which this replacement is possible. For example, nhà, bàn, chim, cát, and so on can appear and replace each other in combinations of the type nhà này, chim này, cát này, etc., and are therefore classified as nouns.

c. Syntax function: When taking part in sentence composition, words that can stand in one or several certain positions in a sentence, or replace each other in those positions, and that express the same syntactic relation with the other parts of the sentence, can be classified into one POS. For instance, words such as nhà, bàn, chim, cát are nouns. They can be subjects in sentences, and the subject function is a syntactic function used to classify them as nouns.

The ways to build up tagset

Nowadays, two kinds of POS tagsets have been developed, of which the first has received much more attention from linguistic researchers.
The first kind is based on 8 basic POS tags that are widely used in dictionaries and linguistic materials: noun, verb, adjective, pronoun, adverb, conjunction, interjection, and emotive word. From these 8 basic tags, finer sets of POS tags are built. Each researcher relies on certain criteria to refine the tagset (the criteria are discussed in the section “Criteria to classify” above).
Notably, the VnPos tagset of Tran Thi Oanh contains 14 tags; VietTreeBank consists of 17 tags; VnQtag has 59 tags (see appendix). The second kind is built by mapping a tagset from another language onto Vietnamese, based on the association between words of the two languages (Dinh Dien and Hoang Kiem, 2003).

1. Corpora

Annotated corpora are large bodies of text with linguistically informative mark-up. They play an important role in current work in computational linguistics, so great attention has gone into developing such corpora.
Many countries have their own corpora as well. Some common corpora are the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), the German NEGRA Treebank (Skut et al., 1997), and the Lancaster Corpus of Mandarin Chinese (Tony McEnery and Richard Xiao, 2005). In Vietnam, the notable corpora are VnQtag, VnPos, and VTB. To build a corpus, some obligatory criteria need to be ensured (McEnery and Wilson, 2001, p.
Sampling and representativeness: the elements of a corpus must be general, diverse, and plentiful. A sample is representative if what we find for the sample also holds for the general population.
Finite size: the bigger a corpus is, the more it is valued, but its size is still finite.
Machine-readable form
Standard reference

We must admit that building a large corpus manually takes much time because it requires extensive linguistic knowledge. Even when a large corpus is built manually, its quality is not guaranteed. Therefore, this thesis investigates corpus quality and ways to improve it. The two corpora used in our experiments are VietTreeBank and VnQtag. We now discuss in more detail how these corpora were built.
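As a rough illustration of why annotation quality matters, the sketch below counts word types that carry more than one tag in a tagged corpus. It is only a minimal sketch under stated assumptions: the `word/TAG` token format, the tag names `P`, `V`, `N`, and the two toy sentences (including the deliberately inconsistent tag on "sách") are hypothetical and are not taken from VietTreeBank or VnQtag.

```python
from collections import defaultdict


def parse_tagged_sentence(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    The 'word/TAG' format is an assumption made for this sketch.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def ambiguous_word_types(sentences):
    """Map each word type that occurs with more than one tag to its tag set."""
    tags_by_word = defaultdict(set)
    for line in sentences:
        for word, tag in parse_tagged_sentence(line):
            tags_by_word[word].add(tag)
    return {word: tags for word, tags in tags_by_word.items() if len(tags) > 1}


# Hypothetical toy corpus: "sách" is tagged N once and V once on purpose.
corpus = [
    "Tôi/P đọc/V sách/N",
    "Tôi/P mua/V sách/V",
]
ambiguous = ambiguous_word_types(corpus)
print(ambiguous)
```

A word type with several tags is not necessarily an error, since many words are genuinely ambiguous; the count only indicates where manual annotation may need a second look.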
VietTreeBank

VietTreeBank is the result of the national project VLSP and was developed by the VTB group (Nguyen Phuong Thai, Vu Luong, Nguyen Thi Minh Huyen, and annotators). The corpus includes 142 documents on political and social topics from the Youth newspaper, corresponding to 10,000 Vietnamese sentences with syntactic annotation (word segmentation, POS tagging, syntactic structure). The group used MEM and CRF machine learning models to assign POS tags. The accuracy of the model is over 93%.
VTB was developed to support the building of tools for word segmentation, POS tagging, syntactic parsing, and so on. The VTB group chose two criteria to classify POS: combination ability and syntactic function. For instance, a noun acts as subject or object in a sentence. Besides, a noun can combine with numerals (three, four) and attributes (each, every).
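The combination-ability criterion can be pictured as a simple substitution test: words that fill the same slot of a fixed frame are grouped together. The following is a minimal sketch, not the VTB group's procedure; the function name `slot_fillers` is hypothetical, the frame is the "X này" pattern from the noun example given earlier, and splitting on whitespace assumes that every word in the toy input is a single syllable.

```python
def slot_fillers(phrases, frame_word):
    """Return the words that occur directly before `frame_word`.

    Whitespace tokenization is an assumption of this sketch; it only
    works here because each toy word is a single syllable.
    """
    fillers = set()
    for phrase in phrases:
        tokens = phrase.split()
        for left, right in zip(tokens, tokens[1:]):
            if right == frame_word:
                fillers.add(left)
    return fillers


# Combinations of the type "X này" taken from the noun example.
phrases = ["nhà này", "chim này", "cát này"]
nouns = slot_fillers(phrases, "này")
print(sorted(nouns))
```

All three words share the slot before "này", which is the distributional evidence the criterion relies on to place them in the same class.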
One POS tag can contain information about the basic word class (noun, verb, adjective, and so on), morphological information (countable or uncountable), subcategory (verb taking a noun, verb taking a clause, etc.), semantic information, or other syntactic information. The VTB group built its tagset based only on the basic word class, without other information such as morphological information or subcategory (see tagset in appendix). In addition to POS information, the group describes basic syntactic elements such as phrases and clauses. Syntax tags are the most fundamental information in a syntax tree; they form the spine of the tree.
A7 and A8 in the appendix list the phrase and clause tagsets, respectively. The function tag of a syntactic element expresses its role within the higher-level syntactic element. These tags are assigned to the main elements of the sentence, such as subject, predicative, and object. They provide information that helps us identify basic grammatical relationships, as follows.
Subject – Predicative
Predicative
Combination
Complement
……

The tagging process for each sentence in the corpus consists of three steps: word segmentation, POS tagging, and syntactic parsing.

VnQtag

The VnQtag tagset was built within the KC01 national project by a development group including Nguyen Thi Minh Huyen, Vu Xuan Luong, and Le Hong Phuong. The group based their work on a printed dictionary (the Vietnamese dictionary of the Institute of Linguistics, 2000). First, they segmented sentences into words using a syllable automaton and a lexical automaton.
Then, they used the Qtag tagger to assign POS labels to Vietnamese words. There are 59 POS labels (see appendix). In addition to grammatical information, the group added semantic information (the general meaning of a word) to classify words into the 59 word class labels. For example, words are considered verbs when they express the general meaning of a process.
The process meaning is expressed directly in the action feature of an object; this is the action meaning.
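The word segmentation step mentioned above, in which syllables are grouped into words with the help of a lexicon, can be illustrated with a greedy longest-match segmenter. This is a swapped-in simplification for illustration only, not the syllable and lexical automata used by the VnQtag group; the function name `segment`, the `max_len` parameter, and the one-entry toy lexicon are assumptions.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation of a syllable sequence.

    At each position the longest syllable span found in `lexicon` is taken
    as a word; a single syllable is used as a fallback when nothing matches.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon with one two-syllable word.
lexicon = {"học sinh"}
sentence = "học sinh ra ngoài"
words = segment(sentence.split(), lexicon)
print(words)
```

Greedy matching cannot resolve overlapping candidates, which is one reason word segmentation output may itself contain errors that need to be checked.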