Research on Improving Machine Translation for Low-Resource Languages

A study of machine translation for low-resource languages, exploring the methods and challenges involved in improving translation quality.

University

Japan Advanced Institute of Science and Technology

Major

Information Science


Category

thesis

2017

115


Detailed table of contents

Acknowledgements

Abstract

1. Introduction

1.2. MT for Low-Resource Languages

1.2.1. Statistical Machine Translation

1.2.2. Phrase-based SMT

1.2.3. Length-Based Methods

1.2.4. Word-Based Methods

1.2.5. Triangulation: The Representative Approach in Pivot Methods

1.2.6. Neural Machine Translation

3. Building Bilingual Corpora

3.1. Dealing with Out-Of-Vocabulary Problem

3.1.1. Word Similarity Models

3.2. Improving Sentence Alignment Using Word Similarity

3.3. Building A Multilingual Parallel Corpus

3.4. Experiments on Machine Translation

4. Pivoting Bilingual Corpora

4.1. Semantic Similarity for Pivot Translation

4.1.1. Semantic Similarity Models

4.1.2. Semantic Similarity for Triangulation

4.1.3. Experiments on Japanese-Vietnamese

4.1.4. Experiments on Southeast Asian Languages

4.2. Grammatical and Morphological Knowledge for Pivot Translation

4.2.1. Grammatical and Morphological Knowledge

4.2.2. Combining Features to Pivot Translation

4.3. Using Other Languages for Pivot

4.4. Rectangulation for Phrase Pivot Translation

5. Combining Additional Resources to Enhance SMT for Low-Resource Languages

5.1. Enhancing Low-Resource SMT by Combining Additional Resources

5.2. Experiments on Japanese-Vietnamese

5.3. Experiments on Southeast Asian Languages

5.4. Experiments on Turkish-English

5.5. Exploiting Informative Vocabulary

6. Neural Machine Translation for Low-Resource Languages

6.1. Neural Machine Translation

6.2. Byte-pair Encoding

6.3. Phrase-based versus Neural-based Machine Translation on Low-Resource Languages

6.4. NMT on Low-Resource Settings

6.5. Improving SMT and NMT Using Comparable Data

6.6. A Discussion on Transfer Learning for Low-Resource Neural Machine Translation

7. Conclusion

Summary

I. Overview of Machine Translation for Low-Resource Languages: Introduction

Translation between languages is an important human need, and the arrival of digital computers made the dream of building machines that translate automatically seem attainable. Almost as soon as electronic computers appeared, people tried to build automatic translation systems, which opened a new field: machine translation. Machine translation (MT) is defined as "computerized systems responsible for the production of translation from one natural language to another, with or without human assistance". The field has a long history. Several approaches have been explored: direct translation (using rules to map input to output), transfer methods (analyzing syntactic and morphological information), and interlingual methods (using representations of abstract meaning). The dominant approaches today are statistical machine translation (SMT) and neural machine translation (NMT), both of which rely on translated texts and belong to the trend of data-driven methods. Instead of relying on hand-written rules, these methods use a collection of translated texts to learn the correspondences between languages automatically. This trend has produced state-of-the-art results in recent research and is used in the widely deployed MT system from Google. The translated texts, called bilingual corpora, are therefore one of the main factors that determine translation quality.

1.1. Introduction to statistical machine translation (SMT)

Statistical machine translation (SMT) translates text using statistical models learned from bilingual corpora. Common formulations include the noisy-channel model and the log-linear model. SMT needs a large amount of bilingual data to perform well. The thesis by Trieu Long Hai (2017) emphasizes the importance of bilingual corpora for SMT.
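
The noisy-channel and log-linear models mentioned here are usually written as follows. This is the common textbook form, not an equation taken from this page; the symbols f (source sentence), e (target sentence), h_m (feature functions), and \lambda_m (feature weights) are notation assumed for illustration.

```latex
% Noisy-channel model: pick the target sentence e that best explains the source f
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

% Log-linear model: weighted combination of M feature functions
\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
```

With two features, h_1 = \log P(f \mid e) and h_2 = \log P(e), and equal weights, the log-linear model reduces to the noisy-channel model.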

1.2. Introduction to neural machine translation (NMT)

Neural machine translation (NMT) uses deep neural networks to model the translation process and has made substantial progress in recent years. However, NMT also needs a large amount of data to train effectively, so the lack of data is a major challenge for NMT on low-resource languages. The thesis also investigates NMT for these languages.

II. The Data Problem: Challenges for Machine Translation of Low-Resource Languages

There have been many efforts to build large bilingual corpora, such as Europarl (a bilingual corpus covering 21 European languages), English-Arabic, and English-Chinese. Building corpora of this size takes considerable effort. As a result, apart from the European languages and a few other pairs, large bilingual corpora exist for very few of the world's language pairs. This creates a bottleneck for machine translation on the many pairs that lack large bilingual corpora, which are called low-resource languages. The thesis defines low-resource languages as language pairs that have no bilingual corpus or only a small one (fewer than one million sentence pairs). Improving MT for low-resource languages has therefore become an essential task that demands considerable effort and attracts growing interest.

2.1. Difficulties in collecting bilingual data for rare languages

Collecting high-quality bilingual data for rare languages is very difficult: the process is costly in time, effort, and money. Automatic or semi-automatic methods for creating bilingual data are therefore urgently needed. Several methods have been proposed, including back-translation and the use of synthetic data.

2.2. How scarce data affects machine translation quality

When bilingual data is scarce, machine translation models, whether statistical or neural, struggle to learn accurate translation rules. The result is poor translation quality, especially for complex sentences or sentences containing rare words. Evaluating machine translation quality under data scarcity is itself a major challenge.

III. A Guide to Augmenting Machine Translation Data: Effective Solutions

Several solutions have been proposed for the shortage of bilingual corpora. There are two main strategies: building new bilingual corpora and exploiting existing ones. Under the first strategy, corpora can be built manually or automatically. Building a large bilingual corpus by hand can guarantee its quality, but it is expensive in labor and time, so automatic construction can be a feasible alternative. This task relies on a subfield, sentence alignment, in which sentences that are translations of each other are extracted automatically. The effectiveness of the sentence alignment algorithm affects the quality of the resulting bilingual corpus.

3.1. Improving sentence alignment to obtain more bilingual data

Sentence alignment is the process of identifying corresponding sentence pairs in two texts that are translations of each other. Improving its accuracy makes it possible to extract higher-quality bilingual data. Improvements include the use of dictionary information, grammatical information, and word embeddings.
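
As a rough illustration of how dictionary information can guide sentence alignment, the sketch below scores a candidate sentence pair by the share of source words that have a dictionary translation in the target sentence, then keeps the best-scoring target for each source sentence. It is a deliberately simplified stand-in, not the thesis's aligner: the function names, the `threshold` parameter, and the toy English-Vietnamese lexicon are all hypothetical, and practical aligners search over whole documents rather than matching greedily.

```python
def overlap_score(src_sentence, tgt_sentence, lexicon):
    """Fraction of source words with at least one dictionary translation in the target."""
    src_words = src_sentence.lower().split()
    tgt_words = set(tgt_sentence.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / len(src_words)


def best_matches(src_sentences, tgt_sentences, lexicon, threshold=0.5):
    """Greedily pair each source sentence with its best-scoring target sentence."""
    pairs = []
    for s in src_sentences:
        scored = [(overlap_score(s, t, lexicon), t) for t in tgt_sentences]
        if not scored:
            continue
        score, best = max(scored)
        if score >= threshold:  # discard pairs with too little lexical evidence
            pairs.append((s, best))
    return pairs


# Hypothetical toy lexicon and sentences, for illustration only.
LEXICON = {
    "i": {"tôi"}, "love": {"yêu"}, "cats": {"mèo"},
    "it": {"trời"}, "raining": {"mưa"},
}
SRC = ["I love cats", "It is raining"]
TGT = ["trời đang mưa", "tôi yêu mèo"]

if __name__ == "__main__":
    for src, tgt in best_matches(SRC, TGT, LEXICON):
        print(src, "<->", tgt)
```

A word missing from the lexicon simply counts as a miss here, which is one reason out-of-vocabulary words hurt dictionary-based alignment.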

3.2. Building a multilingual parallel corpus for low-resource languages

Building a multilingual parallel corpus can supply more data for machine translation of low-resource languages. Possible sources include Wikipedia, international organizations, and bilingual websites. Synthetic data is another potential solution.

3.3. Using monolingual data to improve sentence alignment

Monolingual data is a promising way to improve sentence alignment, especially when bilingual data is limited. Word embeddings learned from monolingual data can be used to estimate the similarity between words and between sentences, which in turn improves alignment accuracy.
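
A minimal sketch of the word-similarity idea: given word vectors learned from monolingual text, cosine similarity can suggest an in-vocabulary substitute for a word the alignment dictionary does not cover. The vectors below are made-up toy values and the helper names are hypothetical; how the thesis actually integrates similarity into the aligner is not described on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, candidates):
    """Return the candidate whose vector is closest to `word`, or None if unavailable."""
    if word not in embeddings:
        return None
    scored = [
        (cosine(embeddings[word], embeddings[c]), c)
        for c in candidates
        if c in embeddings
    ]
    return max(scored)[1] if scored else None


# Toy 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    # "automobile" is treated as out-of-vocabulary; back off to a known word.
    print(most_similar("automobile", TOY_EMBEDDINGS, ["car", "banana"]))
```

In this sketch the substitute word would then be looked up in the dictionary in place of the unknown word.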

IV. Pivot Methods: Making Full Use of Bilingual Corpora

Existing bilingual corpora can be used to extract translation rules for a language pair through what are called pivot methods. Specifically, one or more pivot languages connect the translation from the source language to the target language when source-pivot and pivot-target bilingual corpora are available. The thesis by Trieu Long Hai (2017) proposes two methods to improve pivot methods.
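
The bridging step can be sketched with the usual triangulation rule, in which a source-target phrase probability is obtained by summing over shared pivot phrases: p(t|s) = sum over p of p(p|s) * p(t|p). The code below is an illustration of that conventional rule under simplifying assumptions (a single probability per phrase pair, exact string match on the pivot side); the function name and the toy Japanese-English-Vietnamese tables are hypothetical and do not come from the thesis.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Build a source-target table via p(t|s) = sum_p p(p|s) * p(t|p)."""
    src_tgt = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_pivot.items():
        for p, prob_p_given_s in pivots.items():
            # Pivot phrases missing from the pivot-target table contribute nothing.
            for t, prob_t_given_p in pivot_tgt.get(p, {}).items():
                src_tgt[s][t] += prob_p_given_s * prob_t_given_p
    return {s: dict(targets) for s, targets in src_tgt.items()}


# Hypothetical toy phrase tables: Japanese -> English (pivot) -> Vietnamese.
SRC_PIVOT = {"犬": {"dog": 0.8, "hound": 0.2}}
PIVOT_TGT = {
    "dog": {"con chó": 0.9, "chó": 0.1},
    "hound": {"chó săn": 1.0},
}

if __name__ == "__main__":
    for src, targets in triangulate(SRC_PIVOT, PIVOT_TGT).items():
        for tgt, prob in sorted(targets.items(), key=lambda kv: -kv[1]):
            print(f"{src} -> {tgt}: {prob:.2f}")
```

Because the match on the pivot side is exact, any pivot phrase that appears in only one of the two tables is lost, which is the kind of gap that similarity-based matching aims to reduce.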

4.1. Using semantic similarity for effective pivot translation

Semantic similarity improves pivot translation by selecting phrases that are semantically equivalent even when they are not lexically identical. Semantic similarity models can be trained on monolingual or bilingual data.

4.2. Integrating grammatical and morphological knowledge into pivot translation

Integrating grammatical and morphological knowledge can improve the accuracy of pivot translation. For example, part-of-speech (POS) tags and lemma forms can help select phrases that correspond more closely in syntax and meaning.

4.3. An improved triangulation method for translating rare languages

The triangulation method is improved by using semantic similarity to address the problem of missing information, and by integrating grammatical and morphological knowledge into conventional triangulation.

V. A Combined Model: Enhancing Machine Translation for Low-Resource Languages

The thesis proposes a hybrid model that significantly improves MT for low-resource languages by combining the two strategies: building bilingual corpora and exploiting existing ones. To evaluate the proposed method, experiments were carried out on three sets of language pairs: Japanese-Vietnamese, Southeast Asian languages, and Turkish-English.

5.1. Combining additional resources to improve SMT for low-resource languages

Combining additional data sources, such as monolingual data, comparable data, and pivot data, can improve the performance of statistical machine translation (SMT) for low-resource languages. Combination methods include transfer learning and domain adaptation.

5.2. Applying the combined model to the Japanese-Vietnamese pair

The thesis by Trieu Long Hai (2017) applies the combined model to Japanese-Vietnamese. The results show that the combined model significantly improves translation quality over conventional SMT models. The model draws on both bilingual and monolingual data.

VI. Neural Machine Translation for Low-Resource Languages: Promising Research

Several empirical investigations using NMT were carried out on low-resource language pairs, providing an empirical basis for further improving the method for low-resource languages.

6.1. Comparing NMT and SMT on low-data languages: strengths and weaknesses

The thesis compares NMT and SMT when using the Wikipedia corpus, and compares phrase-based and neural machine translation on low-resource languages.

6.2. A discussion of transfer learning for NMT on rare languages

This part discusses the use of transfer learning for neural machine translation (NMT) on rare languages. Related topics include domain adaptation for low-resource MT, cross-lingual transfer learning, and zero-shot translation.

24/05/2025

You are previewing the document:

A study on machine translation for low resource languages


Excerpt from the document

A STUDY ON MACHINE TRANSLATION FOR LOW-RESOURCE LANGUAGES

By TRIEU, LONG HAI

Submitted to Japan Advanced Institute of Science and Technology, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Written under the direction of Associate Professor Nguyen Minh Le

September, 2017

A STUDY ON MACHINE TRANSLATION FOR LOW-RESOURCE LANGUAGES

By TRIEU, LONG HAI (1420211)

A thesis submitted to School of Information Science, Japan Advanced Institute of Science and Technology, in partial fulfillment of the requirements for the degree of Doctor of Information Science

Graduate Program in Information Science

Written under the direction of Associate Professor Nguyen Minh Le

and approved by
Associate Professor Nguyen Minh Le
Professor Satoshi Tojo
Professor Hiroyuki Iida
Associate Professor Kiyoaki Shirai
Associate Professor Ittoo Ashwin

July, 2017 (Submitted)

Copyright © 2017 by TRIEU, LONG HAI

Abstract

Current state-of-the-art machine translation methods are neural machine translation and statistical machine translation, which are based on translated texts (bilingual corpora) to learn translation rules automatically. Nevertheless, large bilingual corpora are unavailable for most languages in the world, called low-resource languages, which causes a bottleneck for machine translation (MT). Therefore, improving MT on low-resource languages has become one of the essential tasks in MT. In this dissertation, I present my proposed methods to improve MT on low-resource languages by two strategies: building bilingual corpora to enlarge training data for MT systems and exploiting existing bilingual corpora by using pivot methods.

For the first strategy, I proposed a method to improve sentence alignment based on word similarity learnt from monolingual data to build bilingual corpora. Then, a multilingual parallel corpus was built using the proposed method to improve MT on several Southeast Asian low-resource languages. Experimental results showed the effectiveness of the proposed alignment method to improve sentence alignment and the contribution of the extracted corpus to improve MT performance. For the second strategy, I proposed two methods based on semantic similarity and using grammatical and morphological knowledge to improve conventional pivot methods, which generate source-target phrase translation using pivot language(s) as the bridge from source-pivot and pivot-target bilingual corpora.

I conducted experiments on low-resource language pairs such as the translation from Japanese, Malay, Indonesian, and Filipino to Vietnamese and achieved promising results and improvements. Additionally, a hybrid model was introduced that combines the two strategies to further exploit additional data to improve MT performance. Experiments were conducted on several language pairs: Japanese-Vietnamese, Indonesian-Vietnamese, Malay-Vietnamese, and Turkish-English, and showed a significant improvement. In addition, I utilized and investigated neural machine translation (NMT), the recently proposed state-of-the-art method in machine translation, for low-resource languages.

I compared NMT with phrase-based methods on low-resource settings, and investigated how the low-resource data affects the two methods. The results are useful for further development of NMT on low-resource languages. I conclude with how my work contributes to current MT research, especially for low-resource languages, and enhances the development of MT on such languages in the future.

Keywords: machine translation, phrase-based machine translation, neural-based machine translation, low-resource languages, bilingual corpora, pivot translation, sentence alignment

Acknowledgements

The three years I have spent working on this topic have been my first long journey, one that has drawn me to the academic world.

It is also one of the biggest challenges that I have ever dealt with. This work has given me a lot of interesting knowledge and experience, as well as difficulties that have demanded my best efforts. As I write this dissertation as a summary of the PhD journey, I am reminded of the support I received from many people. This work could not have been completed without their support.

First of all, I would like to thank my supervisor, Associate Professor Nguyen Minh Le. Professor Nguyen gave me many comments, much advice, and many discussions throughout my three-year journey, from the starting point, when I approached this topic without any prior knowledge of machine translation, to the final tasks of completing my dissertation and research. Doing a PhD is one of the most interesting things in studying, but it is also one of the most challenging for anyone in an academic career. Thanks to the useful and interesting discussions with Professor Nguyen, I overcame the most difficult periods of this research.

Besides teaching me my first lessons and skills in doing research, Professor Nguyen also had interesting and useful discussions with me that helped a lot in both my studies and my life. I would like to thank the committee: Professor Satoshi Tojo, Professor Hiroyuki Iida, Associate Professor Ittoo Ashwin, and Associate Professor Kiyoaki Shirai for their comments. This is one of the first works in my academic career, and it cannot avoid mistakes and weaknesses. The discussions with the professors on the committee and their valuable comments helped me a lot in improving this dissertation.

I would also like to thank my collaborators: Associate Professor Nguyen Phuong Thai for his comments, advice, and experience in sentence alignment and machine translation. I would like to thank Vu Tran, Tin Pham, and Viet-Anh Phan for their interesting discussions and collaboration on some topics in this research. Thanks so much to Vu Tran and Chien Tran for their technical support. I would like to thank my colleagues and friends, Truong Nguyen and Huy Nguyen, for their support and encouragement.

I would also like to give special thanks to Professor Jean-Christophe Terrillon Georges for his advice and comments on the writing and the English manuscripts of my papers, and to Professor Ho Tu Bao for valuable advice on research. Thanks so much to Danilo S. Carvalho and Tien Nguyen for their comments. Last but not least, I would like to thank my parents, Thi Trieu and Phuong Hoang, my sister Ly Trieu, and my wife Xuan Dam for their support and encouragement at all times, not only in this work but in my life.


List of Figures

2.1 Pivot alignment induction
.2 Recurrent architecture in neural machine translation
.1 Word similarity for sentence alignment
.2 Experimental results on the development and test sets
.3 SMT vs NMT in using the Wikipedia corpus
.1 Semantic similarity for pivot translation
.2 Pivoting using syntactic information
.3 Pivoting using morphological information
.1 A combined model for SMT on low-resource languages

List of Tables

3.1 English-Vietnamese sentence alignment test data set
.2 IWSLT15 corpus for training word alignment
.3 English-Vietnamese alignment results
.4 Sample English word similarity
.5 Sample Vietnamese word similarity
.6 OOV ratio in sentence alignment
.7 Sample English-Vietnamese alignment
.8 English word similarity
.9 Sample IBM Model 1
.10 Induced word alignment
.11 Wikipedia database dumps' resources used to extract parallel titles
.12 Extracted and processed data from parallel titles
.13 Sentence alignment output
.14 Extracted Southeast Asian multilingual parallel corpus
.15 Monolingual data sets
.16 Experimental results on the development and test sets
.17 Data sets on the IWSLT 2015 experiments
.18 Experimental results using phrase-based statistical machine translation
.19 Experimental results on neural machine translation
.20 Comparison with other systems participated in the IWSLT 2015 shared task
4.1 Bilingual corpora for Japanese-Vietnamese pivot translation
.2 Japanese-Vietnamese development and test sets
.3 Monolingual data sets of Japanese, English, Vietnamese
.4 Japanese-Vietnamese pivot translation results
.5 Bilingual corpora of Southeast Asian language pairs
.6 Bilingual corpora for pivot translation of Southeast Asian language pairs
.7 Monolingual data sets of Indonesian, Malay, and Filipino
.8 Pivot translation results of Southeast Asian language pairs
.9 Examples of grammatical information for pivot translation
.10 Southeast Asian bilingual corpora for training factored models
.11 Results of using POS and lemma forms
.12 Indonesian-Vietnamese results
.13 Filipino-Vietnamese results
4.14 Input factored phrase tables
.15 Extracted phrase pairs by triangulation
.16 Out-Of-Vocabulary ratio
.17 Results of statistical significance tests
.18 Experimental results on different metrics: BLEU, TER, METEOR
.19 Ranks on different metrics
.20 Spearman rank correlation between metrics
.21 Wilcoxon on Malay-Vietnamese
.22 Wilcoxon on Indonesian-Vietnamese
.23 Wilcoxon on Filipino-Vietnamese
.24 Wilcoxon on Malay-Vietnamese
.25 Wilcoxon on Indonesian-Vietnamese
.26 Wilcoxon on Filipino-Vietnamese
.27 Sample translations: POS and lemma factors for pivot translation
.28 Sample translation: Indonesian-Vietnamese
.29 Sample translation: Filipino-Vietnamese
.30 Using other languages for pivot
.31 Using rectangulation for phrase pivot translation
.1 Japanese-Vietnamese results on the direct model
.2 Japanese-Vietnamese results on the combined models
.3 Results of Japanese-Vietnamese on the big test set
.4 Results of statistical significance tests on Japanese-Vietnamese
.5 Southeast Asian results on the direct models
.6 Southeast Asian results on the combined model
.7 Bilingual corpora for Turkish-English pivot translation
.8 Experimental results on the Turkish-English
.9 Experimental results on the English-Turkish translation
.10 Building a bilingual corpus of Turkish-English from Wikipedia
.11 Dealing with out of vocabulary problem using the combined model
.12 Sample translations: using the combined model (Japanese-Vietnamese)
.14 Sample translations: using the combined model (Filipino-Vietnamese)
.1 Bilingual data set of Japanese-English
.2 Experimental results in Japanese-English translation
.3 Bilingual data sets of Indonesian-Vietnamese
.4 Experimental results on Indonesian-Vietnamese translation
.5 Experimental results English-Vietnamese
.6 English-Vietnamese results using the Wikipedia corpus

Machine Translation

Translation between languages is an important demand of humanity. The advent of digital computers provided a basis for the dream of building machines to translate languages automatically.

Almost as soon as electronic computers appeared, people made efforts to build automatic systems for translation, which also opened a new field: machine translation. As defined in Hutchins and Somers, 1992 [33], machine translation (MT) is "computerized systems responsible for the production of translation from one natural language to another, with or without human assistance". Machine translation has a long history of development. Various approaches have been explored, such as direct translation (using rules to map input to output), transfer methods (analyzing syntactic and morphological information), and interlingual methods (using representations of abstract meaning).

The field has attracted a lot of interest from the community: a study of the realities of machine translation by US funding agencies in 1966 (the ALPAC report), commercial systems of the past (Systran in 1968, Météo in 1976, Logos and METAL in the 1980s), current development by large companies (IBM, Microsoft, Google), and many projects in universities and academic institutes. The dominant approaches in current machine translation are statistical machine translation (SMT) and neural machine translation (NMT), which are based on resources of translated texts, a trend of data-driven methods.


The document "Improving Machine Translation for Low-Resource Languages" focuses on raising the quality of machine translation for languages with few resources, Vietnamese in particular. It presents new methods and technologies for improving the accuracy and effectiveness of machine translation systems, giving users a better experience with translation tools.

Readers will gain a clearer understanding of the challenges of machine translation for low-resource languages and of potential solutions to them. For further reading, see related documents such as the master's thesis in computer science "Xây dựng hệ thống học sâu tự động thêm dấu cho tiếng Việt" (building a deep learning system that automatically adds diacritics to Vietnamese text), which covers automation for the Vietnamese language. The master's thesis "Enhancing the quality of machine translation system using cross lingual word embedding models" looks in depth at improving machine translation quality with cross-lingual word embedding models. Together these documents give a broader view of machine translation and its applications to low-resource languages.
