Lexicalized Statistical Parsing for Vietnamese Pham Thi Minh Thu Faculty of Information Technology Hanoi University of Engineering and Technology Vietnam National University, Hanoi Supervised by Doctor Le Anh Cuong A thesis submitted in fulfillment of the requirements for the degree of Master of Computer Science June, 2010 TIEU LUAN MOI download : skknchat@gmail.com Table of Contents Acknowledgements ii 1 Introduction 1 1.1 What is syntactic parsing? .2 Current Studies in Parsing .3 Vietnamese syntactic parsing .4 Objective of the Thesis .1 Context Free Grammar (CFG) .1 Top-down parsing .2 Bottom-up parsing .3 Comparison between top-down parsing and bottom-up parsing .3 Probabilistic context-free grammar (PCFGs) .1 The concept of PCFG .2 Disadvantages of PCFGs .4 Lexical Probabilistic Context Free Grammar (LPCFGs) .2 The concept of Lexical Probabilistic Context Free Grammar (LPCFGs) 16 2.3 Three models of Collins. 18 3 Vietnamese parsing and our approach 21 3. 22 iii TIEU LUAN MOI download : skknchat@gmail.com TABLE OF CONTENTS iv 3.2 The POS tagset and Syntax tagset for Vietnamese .4 Our approach in building a Vietnamese parser .1 Adapting Bikel's parser for Vietnamese .2 Analyze error and propse using heuristic rules. 30 4 Experiments and Discussion 33 4.2 Bikel's parsing tool .3 Adaptating Bikel's tool to Vietnamese .1 Investigate different configurations .4 Evaluation of the parser .4 Experimental results on using heuristic rules.
42 5 Conclusions and Future Work 46 5. 47 TIEU LUAN MOI download : skknchat@gmail.com List of Figures 1.1 The parse tree of sentence "I go to school" .2 A parse tree in Vietnamese .1 The parse tree of the Vietnamese sentence "mÌo b¾t chuét" .2 Two derivations of the sentence "T«i hiÓu Lan h¬n Nga" .3 A parse tree of Vietnamese in LPCFG .4 A tree with the "C" suffix used to identify .1 Set of tag in Penn Treebank .2 A sample of labeled data in Penn Treebank before manually treatment .3 A sample of labeled data in Penn Treebank after manually treatment .4 Tagset of Penn Treebank .5 A sample of complete data in English and Vietnamese .1 The Bikel's system overview .2 Result of testing standard Collins' model 2 with training data's size change from 60% to 100% of the full data. Where series 1 and series 2 stand for testing on sentences with length less equal 40 and 100 respectively. 43 v TIEU LUAN MOI download : skknchat@gmail.com List of Tables 2.1 Analysis table with CYK algorithm .1 POS tagset in Viet Treebank .2 Phrase tagset in Viet Treebank .3 Clause tagset in Viet Treebank .4 Syntax function tagset in Viet Treebank .1 The initial results on Viet Treebank with different configurations.
Key:CB = average crossing brackets, 0CB = zero crossing brackets, ≤ 2CB =≤ 2 crossing brackets. All results are percentages, except for those in the CB column .2 Number of sentence for training .3 The results with the change of the training data set .4 The error rate. We use 520 sentences for development testing. Then filtering sentences which have the F-score less than 70 %.
As the result, we collect 147 sentences into the set of error sentences. The Percentage of a error is calculated by the number of sentences commit this error divide 147. Because a sentence may be some errors so the total percentage may exceed 100 .5 The obtained results after applying some proposal rules to correct some wrong syntactic parsing. 44 vi TIEU LUAN MOI download : skknchat@gmail.com Chapter 1 Introduction For a long time, human being have always dreamed of an intelligent machine which can listen to, understand and implement humans' requirements.
Many scientists have tried to make that dream and devoted many achievements for the science of artificial intelligence. In artificial intelligence, natural language processing (NLP) is a field which studies on how to understand and generate automatically human language. NLP has many practical applications such as machine translation, information extraction, discourse analysis, text summarization. These applications have the same basic problems such as lexical analysis, syntactic parsing and semantic analysis.
In which, syntactic parsing is the central role and it is also the goal of this thesis.1 What is syntactic parsing? Syntactic parsing (parsing or syntactic analysis ) is the process of analyzing a given se- quence of tokens (i. a sentence) to identify their grammatical structure with respect to a given grammar. The grammatical structure is often represented in the form which displays visually the dependence of components as a tree (is called parse tree or syntactic tree). In other words, parsing is the problem to get a given sequence of words as input and output is the parse trees corresponding to that sequence.1 shows examples for parse tree: a) a English parse tree in usual form and b) a Vietnamese tree in other form.
Parsing is the major module of a grammar checking system. In order to check gram- mar, we need to parse input sentences, then examine the correctness of the structures in the output. Furthermore, a sentence which cannot be parsed may have grammatical errors. 1 TIEU LUAN MOI download : skknchat@gmail.
What is syntactic parsing? 2 Figure 1.1: The parse tree of sentence "I go to school" Figure 1.2: A parse tree in Vietnamese Parsing is also the important intermediate stage of representation for semantic analysis, and thus plays an important role in applications like machine translation, question an- swering, and information extraction. For example, in transfer-based machine translation the system will analyze the source sentence to output a parse tree and then construct the equivalent parse tree in the target language. The output sentence will be generated mainly based on this equivalent parse tree. It is to understand that in a question answering system we need parsing to find out which is the subject, object, or action.
It is also interesting that parsing can help speech processing. It supports to correct the fault of the speech recognition process. On the other hand, in speech synthesis parsing help put stress on the correct position in the sentence. TIEU LUAN MOI download : skknchat@gmail.
Current Studies in Parsing 3 Through these above example we can see that construct an accurate and effective parser will bring great benefits to many applications of natural language processing.2 Current Studies in Parsing As one of the basic and central problem of NLP, parsing attracts many studies. They belong to one of the two approaches: rule-based and statistics-based. In conventional parsing systems, a grammar is hand-crafted, often involves a large amount of lexically specific information in the form of sub-categorization information. In there, ambiguity, a major problem in parsing, is solved through selectional restrictions.
For example, a lexicon might specify that "eat" must take an object with the feature + "food". In (Collins, 1999), the author has showed several problems with selectional restrictions such as increasing the volume of information required when the vocabulary size becomes so large. In the other word, the biggest challenge is the large amount of vocabulary to require both selectional restrictions and structural preference should be encoded as the soft preferences instead of hard constraints. To overcome these obstacles, the researchers began to explore machine-learning ap- proaches to parsing problem, primary through statistical models.
In these approaches, a set of example pairs of sentence and the corresponding syntactic tree is annotated by hand and used to train parsing models. A set of trees is called a " treebank". Several parts of the treebank are reserved as test data for evaluating the model's accuracy. Early works investigate the use of probabilistic context free grammar (PCFG).
Using PCFG is considered as the next generation of parsing and is also as a beginning step in statistical parsing. In a PCFG, each grammar rule is associated with a probability. The probability of a parse tree is the product of the probabilities of all rules used in that tree. In the case, parsing is essentially the process of searching the tree that has the maximum probability.
However, a simple PCFG often fail due to its lack of sensitivity to lexical information and structural preferences. Then some solutions were proposed to resolve this problem. Several directions were listed in (Collins, 1999) such as: towards probabilistic version of lexicalized grammars; using supervised training algorithms; to construct models that had increased structural sensitivity; to look into history-based models. Among them lexical- ized probabilistic context free grammar (LPCFG) is a promising approach.
It can solve many ambiguity phenomena in parsing. Some works that based on this approach achieved high performance, such as in (Collins, 1997). After this research, Daniel M. Bikel and TIEU LUAN MOI download : skknchat@gmail.
Vietnamese syntactic parsing 4 his coworker have developed Collins models and designed a parser for multiple lan- guages. It has been applied successfully for some languages as English, Chinese and Arabic (Bikel, 2004). The concrete results of the parser for these languages is reported in (Bikel, 2004): For English, F-measure is 90.01 %; for Chinese, F-measure is 81.2 % and in Arabic F-measure is 75. According to these results and the comparison between current parsers (e.
Charniak parser, Bekelley parser, Standford parser), Bikels parser is still rated one of the best parser at present. Recently, this approach of using LPCFG continues being applied for many languages. Moreover, a number of new strategies have proposed to improve the accuracy of parsers. In several researches, using semi-supervised training methods becomes a promising ap- proach.
Their experimental results show that this approach outperforms the supervised one, without much additional computational cost. Some other studies has integrated se- mantic information into parsing in order to fully exploit the benefits of lexical resources and upgrade the parser, such as in (Xiong et al. (Xiong et al., 2005) described the way of incorporating semantic knowledge as follow: Firstly, they used two Chinese electronic semantics dictionaries and heuristic rules in order to extract semantic categories. Then they built a selection preference sub-model based on extracted semantic categories.
Similarly, in (Agirre & Baldwin, 2008), the sense infor- mation was added to parsing by substituting the original words with their semantic tags which correspond with their semantic classes, for example knife and scissors belong to TOOL class, cake and pork are assigned to FOOD class. In addition, some other sug- gested tactics have been enhanced of the performance of parsing as a powerful learning technique (sample selection) for reducing the amount of human-labeled training data (Carreras et al., 2008); or, a strategy for utilizing POS tags resources to annotate parser input in (Watson et al. Through the review of approaches in parsing and especially some recent studies, we found that LPCFG appears in all of state-of-the-art parsing systems. Therefore in our opinion, LPCFG is a good choice for Vietnamese parsing.3 Vietnamese syntactic parsing In Vietnam, works in natural language processing (i.
computational linguistics) in general and in parsing in particular have been only motivated very recently. A few of the parsers which follow the knowledge-based approach are constructed with the manual TIEU LUAN MOI download : skknchat@gmail. Objective of the Thesis 5 grammar rules. Since the construction of grammar rules is manual, the accuracy of the parser is not high.
It only analyzes a limited number of sentences generated by the grammar. The approach using statistics has been also studied, but also only at brief and has no experimental results. For example, (Quoc-The & Thanh-Huong, 2008) presented about LPCFG but surprisingly it did not provide any experiment, only some examples were provided to illustrate the syntactic ambiguity of Vietnamese. With such restricted results, no Vietnamese parser has been published widely.
It can say that while many countries in the world has gone forward a long way in parsing, Vietnam has just been at the stage to start. The precondition for deployment these models for Vietnamese is a corpus containing parsed sentences which is the crucial resources for statistical parsing. Since lack of corpus, the previous works on Vietnamese parsing have not had the significant experimental results.