Graduation Thesis: Development of an AI System for Data Extraction from Vietnamese Printed Documents

An overview of a graduation thesis in robotics and artificial intelligence on developing an AI system that extracts data from Vietnamese printed documents.

University

Ho Chi Minh City University of Technology and Education

Major

Robotics and Artificial Intelligence

Uploaded by

Anonymous

Category

Graduation thesis

2024

111

Detailed table of contents

COMMITMENT

ACKNOWLEDGEMENT

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

LIST OF ACRONYMS

1. CHAPTER 1: Motivation

1.1. Scientific and Practical Significances

1.2. Objectives

1.3. Scope of The Thesis

1.4. Scientific Research Methods

1.5. Limitations

Summary

I. Introduction to the AI system for data extraction from Vietnamese printed documents

The AI system for extracting data from Vietnamese printed documents automates the digitization and management of information held in printed documents. It combines optical character recognition (OCR), natural language processing (NLP), and data analysis to make extraction faster and more accurate. The system focuses on Vietnamese-language documents, an area that has received relatively little research attention in Vietnam. Besides reducing dependence on manual labor, it streamlines document management within organizations.

1.1. Background and motivation

As Vietnam undergoes digital transformation, digitizing paper documents has become an urgent need. Paper documents are hard to manage, costly to maintain, and take up a large amount of storage space. The system addresses these problems by automating data extraction and management, which saves organizations time and money and improves productivity.

1.2. Objectives and scope

The main objective is to develop an automated solution for extracting information from Vietnamese printed documents. The system focuses on recognizing and extracting specific fields, such as names and phone numbers, from scanned forms. The scope covers developing OCR algorithms tuned for Vietnamese, integrating with different types of scanners, and building a user-friendly interface.
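As a rough illustration of field extraction from OCR output, the sketch below pulls a labelled value and phone-like numbers out of recognized text. It is not the thesis's method: the function names, the "label: value" line layout, and the assumption that phone numbers are 10 digits starting with 0 (optionally separated by spaces, dots, or hyphens) are all hypothetical choices made for the example.

```python
import re
from typing import List, Optional

# Assumption: a phone number is 10 digits starting with 0, with optional
# single separators (space, dot, hyphen) between digits.
PHONE_RE = re.compile(r"(?<!\d)0(?:[\s.\-]?\d){9}(?!\d)")


def extract_phone_numbers(text: str) -> List[str]:
    """Return phone-like strings found in OCR text, with separators removed."""
    return [re.sub(r"\D", "", m.group(0)) for m in PHONE_RE.finditer(text)]


def extract_field(text: str, label: str) -> Optional[str]:
    """Return the text after 'label:' on the same line, or None if absent."""
    match = re.search(rf"{re.escape(label)}\s*:\s*(.+)", text)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # Hypothetical OCR output for a two-line form.
    ocr_text = "Name: Nguyen Van A\nPhone: 0912 345 678"
    print(extract_field(ocr_text, "Name"))
    print(extract_phone_numbers(ocr_text))
```

A rule-based pass like this only works once the text has been recognized correctly, which is why the OCR and layout stages described next carry most of the difficulty.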

II. Technologies and methods

The system combines OCR, document layout analysis, and natural language processing (NLP) into a complete automated pipeline, from scanning a document to storing and managing the extracted information.

2.1. Optical character recognition (OCR)

OCR is the core technology of the system: it converts images of text into digital data. The system uses OCR models tuned for Vietnamese, covering both typed and handwritten text. Techniques such as Convolutional Neural Networks (CNN) and the Scale Invariant Feature Transform (SIFT) are applied to improve accuracy; SIFT itself is an image keypoint descriptor rather than a character recognizer.

2.2. Document layout analysis

Document layout analysis identifies the important elements of a document, such as titles, tables, and figures. Models such as LayoutLMv3 and the Document Image Transformer (DiT) are used to classify these regions so that information can be extracted from them.
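Layout models of this kind output labelled bounding boxes, and the thesis's acronym list includes Intersection over Union (IoU), the usual measure of how well a predicted box matches a ground-truth box. The following is a minimal sketch of that computation; the `(x_min, y_min, x_max, y_max)` box format and the function name are assumptions for illustration, not taken from the thesis.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes (0.0 if disjoint)."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted_title = (0.0, 0.0, 10.0, 10.0)
    true_title = (5.0, 5.0, 15.0, 15.0)
    print(iou(predicted_title, true_title))
```

A detection is typically counted as correct when its IoU with a ground-truth region of the same class exceeds a chosen threshold; metrics such as mean Average Precision (mAP) are built on top of that decision.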

III. Applications and practical value

The system has practical value for document management and workflow automation. Specific applications include digitizing documents, managing databases, and supporting decisions based on the extracted data.

3.1. Document digitization

The system converts paper documents to digital form quickly and accurately, which reduces storage costs and strengthens information security.

3.2. Database management

Extracted information is stored in a database, where it is easy to retrieve and manage. The system supports integration with existing data management platforms, giving organizations an end-to-end solution.
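The thesis's figure list mentions that a new table with the same name as the form's title is created for storing extracted data. A minimal sketch of that idea using Python's built-in `sqlite3` module might look as follows; the choice of SQLite, the function name `store_form`, the all-`TEXT` columns, and the sample field names are assumptions made for this example.

```python
import sqlite3
from typing import Dict


def _quote(identifier: str) -> str:
    """Quote an SQL identifier so form titles with spaces are usable."""
    return '"' + identifier.replace('"', '""') + '"'


def store_form(conn: sqlite3.Connection, form_title: str,
               fields: Dict[str, str]) -> int:
    """Insert one extracted form into a table named after the form title.

    The table is created on first use with one TEXT column per field.
    Returns the row id of the inserted record.
    """
    if not fields:
        raise ValueError("fields must contain at least one entry")

    table = _quote(form_title)
    columns = [_quote(name) for name in fields]
    column_defs = ", ".join(f"{col} TEXT" for col in columns)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {column_defs})"
    )

    # Values go through placeholders; only identifiers are interpolated.
    placeholders = ", ".join("?" for _ in columns)
    cursor = conn.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        list(fields.values()),
    )
    conn.commit()
    return cursor.lastrowid


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_form(connection, "Registration Form",
               {"name": "Nguyen Van A", "phone": "0912345678"})
    print(connection.execute('SELECT * FROM "Registration Form"').fetchall())
```

Keeping one table per form template makes each record's columns match the fields defined when the template was created, at the cost of needing a schema change whenever a template gains a field.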

IV. Results and discussion

The system achieved promising results in extracting information from Vietnamese documents. The OCR and layout analysis models are highly accurate on typed text and acceptable on handwritten text, although some limitations remain to be addressed in future work.

4.1. Performance evaluation

The OCR models are highly accurate on typed text but need improvement on handwritten text. Performance is measured with Character Error Rate (CER) and Word Error Rate (WER).
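CER and WER are both defined as the edit (Levenshtein) distance between the recognized text and the reference, divided by the reference length, counted in characters for CER and in words for WER. The sketch below is one way to compute them; the function names are illustrative, and normalizing to Unicode NFC before comparing is an assumption added here so that composed and decomposed Vietnamese diacritics are not counted as errors.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate relative to the reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate relative to the number of reference words."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
    print(wer("the quick brown fox", "the quick fox"))
```

Because both rates divide by the reference length, they can exceed 1.0 when the recognizer inserts many spurious characters or words.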

4.2. Limitations and future work

The current system struggles with low-quality documents and complex layouts. Further research is needed to improve accuracy and to extend the system to other types of documents.

21/02/2025

Excerpt from the document

MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY OF MECHANICAL ENGINEERING
DEPARTMENT OF MECHATRONICS

GRADUATION THESIS
MAJOR: ROBOTICS AND ARTIFICIAL INTELLIGENCE

DEVELOPMENT OF AN AI SYSTEM FOR DATA EXTRACTION FROM VIETNAMESE PRINTED DOCUMENTS

Supervisor: BUI HA DUC, PhD
Student: HUYNH VINH PHUC, Student ID: 20134005
Student: NGUYEN XUAN PHI, Student ID: 20134004
Student: CHU NHAT MINH QUAN, Student ID: 20134021
Class: 20134
Year of Admission: 2020-2024

Ho Chi Minh City, July 2024

COMMITMENT

• Project: DEVELOPMENT OF AN AI SYSTEM FOR DATA EXTRACTION FROM VIETNAMESE PRINTED DOCUMENTS
• Lecturer: Bui Ha Duc, PhD
• Student: Huynh Vinh Phuc
• Student ID: 20134005 - Class: 20134
• Address: Hiep Phu Ward, Thu Duc City, Ho Chi Minh City
• Phone number: 0357642052
• Email: phuchuynhvinh.com
• Student: Nguyen Xuan Phi
• Student ID: 20134004 - Class: 20134
• Address: Linh Chieu Ward, Thu Duc City, Ho Chi Minh City
• Phone number: 094540064
• Email: xphi.com
• Student: Chu Nhat Minh Quan
• Student ID: 20134021 - Class: 20134
• Address: Hiep Phu Ward, Thu Duc City, Ho Chi Minh City
• Phone number: 0938822147
• Email: quancnm.com
• Graduation thesis submission date: 04/07/2024
• Commitment: “I affirm that the graduation thesis presented here is the result of my research and efforts. I have not replicated any content from published articles without appropriate citations. Should any breach be identified, I acknowledge full accountability for the consequences.”

Ho Chi Minh City, July 4, 2024

ACKNOWLEDGEMENT

We would like to begin by expressing our profound gratitude, on behalf of our team, to our supervisor, Bui Ha Duc, PhD. Your unwavering commitment, expertise, and guidance have been instrumental in shaping our research and leading us to success.

Your mentorship, patience, and continuous encouragement have been vital in navigating our academic journey. Your extensive knowledge and valuable insights have helped us overcome challenges and expand our intellectual horizons. We deeply appreciate your dedication and support. Additionally, we extend our sincere thanks to Ho Chi Minh City University of Technology and Education for providing an exceptional learning environment and resources.

The university has consistently fostered growth, innovation, and academic excellence. The dedicated faculty and staff have greatly influenced our academic and personal development, and we are grateful for their commitment to nurturing future scholars and leaders. Our heartfelt thanks go to our families for their enduring love, support, and understanding. Your unwavering belief in us has been the foundation of our journey, and we are forever grateful for your sacrifices and the countless ways you have encouraged us.

Your steadfast support has empowered us to overcome obstacles and strive for excellence. Finally, we wish to express our gratitude to all those who have contributed to our growth and development, both directly and indirectly. Your encouragement, advice, and confidence in our abilities have been invaluable throughout this challenging yet rewarding journey.

Sincerely,
Huynh Vinh Phuc
Nguyen Xuan Phi
Chu Nhat Minh Quan

ABSTRACT

Document AI refers to the use of machine learning techniques to automatically analyze the structure of documents and extract relevant information.

Its purpose is to streamline the processing of large volumes of documents by identifying key elements, such as titles, images, tables, and question-and-answer fields, and converting them into structured data. This enhances efficiency and accuracy in data extraction, enabling businesses to automate workflows and improve decision-making processes. Currently, in Vietnam, paper-based documents remain the predominant medium for storing and conveying information, surpassing electronic alternatives. However, these paper documents come with inherent limitations, including inaccessibility, high maintenance costs, difficulty in managing data, and the need for extensive physical storage space.

Consequently, the digitization of documents is crucial for all organizations seeking to enhance information management efficiency. With an average processing volume of up to one thousand unstructured documents per day, typically stored in formats like PDFs or scanned images, the imperative for digitization becomes even more apparent. However, common practices in Vietnam often involve manual data entry, which is time-consuming and reliant on human labor. The objectives of this project are to research, design, and develop software that is compatible with various types of scanning devices, captures the scanned image of a document, and applies artificial intelligence to digitize and extract key information from those documents.

The extracted information is then stored in the database for display and management. By applying techniques including image processing, document layout analysis, text detection, and text recognition, we have created a system capable of extracting information from various types of forms, with high performance on typed text and acceptable performance on handwritten text.

TABLE OF CONTENTS

COMMITMENT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS
Scientific and Practical Significances
Scope of The Thesis
Scientific Research Methods
Structure of The Report
Introduction to Document Imaging Method
Introduction to Document AI
Introduction to Document Layout Analysis
Document Layout Types
Document Layout Analysis
Introduction to Optical Character Recognition
Scale Invariant Feature Transform
Metrics Evaluation in Document AI
Document Layout Analysis Metrics
Optical Character Recognition Metrics
DOCUMENT IMAGE ACQUISITION
Application Programming Interface
Create Connection to Desired Scanner
DEVELOPMENT OF DATA GENERATION
Glyph Template Creation
Conversion of Template Images to Character Images
Conversion of Character Images to SVG Format Images
Conversion of SVG Format Character Images to Handwritten Fonts
Synthetic Data Generation
Advantages and Disadvantages of Synthetic Data
Real Data Collection
DEVELOPMENT OF TEMPLATE CREATION
Template Creation Workflow
Optical Character Recognition
Document Layout Analysis
Data Storage Development
Template Creation Sequence Diagram
Template Creation GUI
DEVELOPMENT OF INFORMATION EXTRACTION
Information Extraction Workflow
Information Extraction Sequence Diagram
Information Extraction GUI
EXPERIMENTAL RESULTS AND DISCUSSION
Document Layout Analysis Model
Text Detection Model
Text Recognition Model
CONCLUSION AND FUTURE WORKS

LIST OF TABLES

Table 3.1: Baseline hardware requirements
.2: Comparison of document layout analysis models
.3: Dataset description for training text recognition model
.4: Training parameters for text recognition model
.5: Comparison of text recognition models
.6: Illustration of correctly recognized images
.7: Illustration of misidentified images

LIST OF FIGURES

Figure 2.1: Sample for image capturing (Source: Internet)
.2: Some types of scanners (Source: Internet)
.3: OCR is the process of transforming an image into text (Source: [1])
.4: Document image classification examples (Source: [1])
.5: Document layout analysis examples (Source: [1])
.6: Table detection examples (Source: [1])
.8: General document layout analysis framework (Source: [2])
.9: Document layout analysis taxonomy (Source: [2])
.10: A document is passed through a generic layout analysis model, resulting in a layout segmentation mask with the following classes: title (blue), text (red), table (green), and figure (grey)
.11: An example of energy map text line segmentation (Source: [2])
.13: The architecture and pre-training objectives of LayoutLMv3 (Source: [24])
Figure 2.14: Architecture of DiT (Source: [25])
.15: General OCR process (Source: [26])
.16: Vietnamese text recognition results of VietOCR (Source: [34])
.17: TransformerOCR architecture in VietOCR (Source: [34])
.18: AttentionOCR architecture in VietOCR (Source: [34])
.1: Design of scanner interface
.2: Block diagram of Application Programming Interface architecture
.3: Single-User to Multi-Scanner
.4: Multi-User to Multi-Scanner (Server-Based)
.5: Block diagram of scanning process
.6: Bit and byte order of image data
.1: Glyph template with characters background page 1 (Source: [36])
.2: Area containing characters
.3: QR code image is cut from the image
.4: Character image in PNG format (65
.5: Character image in BMP format (65
.6: Character image in SVG format (65
.7: A glyph after automatically designing both side bearings values (left_value, right_value) = (60, -50)
.8: A sample image of synthetic data
.1: The template creation system
.2: Image preprocessing workflow
.4: A sample of document parsing
.5: A sample of document layout analysis
.6: Template formatting workflow
.7: A new table with the same name as the form's title is created
.8: Sequence diagram of template creation system
.9: Template Creation system GUI. Each box can be clicked to adjust its size and position. The text each box carries can be rewritten in the top left input field of the interface (the field is “Enter text for selected box”)
.1: The Information extraction system
.2: Template matching with single points (blue) and matched lines (green)
.3: The information after being extracted is stored in the database with its corresponding table
.4: Sequence diagram of information extraction system
.5: Information extraction system GUI
.1: A sample result of YOLOv8 (left) and LayoutLMv3 (right)
.2: Text detection results in typed and handwritten forms

LIST OF ACRONYMS

ADF: Automatic Document Feeder
AI: Artificial Intelligence
ANN: Artificial Neural Network
API: Application Programming Interface
BMP: Bitmap
CER: Character Error Rate
CNN: Convolutional Neural Network
DiT: Document Image Transformer
DLA: Document Layout Analysis
FCNN: Fully Convolutional Neural Network
GUI: Graphical User Interface
IoU: Intersection over Union
LSTM: Long Short-Term Memory
mAP: mean Average Precision
OCR: Optical Character Recognition
PNG: Portable Network Graphics
QR: Quick Response
RLSA: Run Length Smearing Algorithm
SIFT: Scale Invariant Feature Transform
SVG: Scalable Vector Graphics
WER: Word Error Rate
YOLO: You Only Look Once

CHAPTER 1.

Motivation

Digital transformation in Vietnam is occurring at a rapid pace. The effective operation of the national data sharing and integration platform enables over 1.6 million transactions daily. The national insurance database has compared and verified information for 91 million citizens from the national population database. Additionally, the civil servant and public employee database has collected data on almost 2.1 million individuals, achieving a 95% connectivity rate with all ministries, branches, and localities.

A key component of digital transformation is digitization, which involves converting information from analog to digital formats. Examples include converting handwritten text to digital format or analog audio recordings to digital format. Digitization is more than just scanning documents; it encompasses extracting information from documents, storing these digital files in repositories or the cloud, and performing ongoing maintenance and management. Digitization allows for effective information storage in databases, automating the registration and retrieval process.

This enhances document security by mitigating the risks of data loss, theft, or damage associated with paper formats, thereby preventing serious data breaches. Another benefit of digitization is cost savings. Paper documents are expensive to produce, store, and access. Digitizing these files reduces costs and minimizes paper waste, making it environmentally friendly.

Furthermore, digitization frees employees from manual data entry processes, reducing time spent on repetitive tasks and enhancing organizational productivity. Minimizing human errors improves data accuracy. Automating data entry, retrieval, classification, and sorting reduces the need for labor-intensive tasks, replacing time-consuming work with swift and precise automated algorithms. Managing vast amounts of paper documents poses significant challenges in the contemporary era.

Effective solutions necessitate support from advanced computer technologies. However, computers lack the inherent ability to read, understand, and analyze paper documents autonomously to extract and digitize data. This presents a substantial difficulty, as it often requires considerable human effort and time. Consequently, researchers and companies are striving to automate this process or, at the very least, minimize human involvement.

This drive has led to the development of systems designed to digitize and distill information from documents efficiently.

This thesis presents an AI system that automates information extraction from printed Vietnamese documents. The system aims to speed up data processing and reduce errors, which is useful in areas such as document management, research, and data analysis. For related work on AI for Vietnamese language processing, see the computer science master's thesis "Ứng dụng học sâu vào xây dựng mô hình rút trích thông tin" (applying deep learning to build an information extraction model). On extracting entities and relations from text, see the master's thesis "Trích xuất thông tin thực thể và quan hệ trong văn bản tiếng Việt bằng mô hình đồ thị động" (entity and relation extraction from Vietnamese text with a dynamic graph model). On more complex text processing, see the master's thesis "Phân loại văn bản dựa trên mô hình tiền xử lý transformer" (text classification based on a transformer model).

DETAILED INFORMATION

Authors: Huynh Vinh Phuc, Nguyen Xuan Phi, Chu Nhat Minh Quan

Supervisor: Bui Ha Duc, PhD

University: Ho Chi Minh City University of Technology and Education

Major: Robotics and Artificial Intelligence

Title: Development of an AI System for Data Extraction from Vietnamese Printed Documents

Document type: Graduation thesis

Year of publication: 2024

Location: Ho Chi Minh City
