DISSECTING EMOTION: CHALLENGES IN CLASSIFICATION AND ANALYSIS

A thesis on emotion analysis on social media: it explores the challenges in classification and data processing, and the application of machine learning and deep learning models to address them.

University

Vietnam National University, Hanoi International School

Major

Business Data Analytics

Uploaded by

Anonymous

Category

graduation project

2024


Detailed table of contents

Acknowledgements

Declaration of Authorship

List of Abbreviations

List of Tables

List of Figures

1. CHAPTER 1: INTRODUCTION

1.1. Problem Statement

1.2. Motivation

1.3. Related Works

2. CHAPTER 2: THEORETICAL FOUNDATIONS

2.1. Text Classification

2.1.1. What is Text Classification?

2.1.2. Why is text classification important?

2.2. Importance and Challenges in dataset annotation

2.3. Contextual Factors in Emotion Analysis

2.4. Loss Function

2.4.1. Cross Entropy

2.4.2. Focal Loss

2.4.3. Self-adjusting Dice Loss

2.5. Evaluation Metrics

2.5.1. Accuracy

2.5.2. Precision

2.5.3. Recall

2.5.4. F1 Score (Weighted)

4. CHAPTER 4: EXPERIMENTS AND EVALUATION

4.1. Machine Learning Model

4.2. Transformation Model with Augmentation

Abstract

Summary

I. Overview of Emotion Analysis in Social Media Data

Emotion analysis of social media data is becoming an important research area. With the rapid development of information technology, understanding and analyzing the emotions users express on social media platforms has become essential. This research not only gives deeper insight into user psychology but also provides valuable information for education administrators and businesses.

1.1. Definition and importance of emotion analysis

Emotion analysis is the process of identifying and classifying emotions in text. It helps recognize users' moods, which in turn supports appropriate decisions in education and business.

1.2. History of emotion analysis

The field has grown strongly since the 2000s, alongside an increase in research on emotion in text. Technologies such as Machine Learning and NLP have played an important role in improving the accuracy of emotion analysis.

II. Challenges in Emotion Analysis of Social Media Data

Despite many technological advances, emotion analysis still faces many challenges. These include the diversity and complexity of language as well as a shortage of high-quality data. Cultural and contextual factors also strongly affect emotion classification.

2.1. The complexity of language and emotion

Human language is rich and varied, with many different ways of expressing the same emotion. This makes emotion classification more difficult.

2.2. The problem of heterogeneous data

Heterogeneous data can lead to emotions being misclassified. Datasets that lack some emotional expressions pose another major challenge.

III. Methods for Addressing the Challenges in Emotion Analysis

Many methods have been developed to overcome the challenges in emotion analysis. Technologies such as Machine Learning and Deep Learning have been applied to improve the accuracy of emotion classification.

3.1. Using Machine Learning in emotion analysis

Machine Learning provides powerful models for classifying emotions from text. Algorithms such as Logistic Regression and Decision Trees have been widely used.
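As a rough illustration of the kind of model named in this section, the sketch below trains a binary logistic regression classifier on bag-of-words counts using only the Python standard library. The toy sentences, the two labels (1 for a positive emotion, 0 for a negative one), the hyperparameters, and all function names are assumptions made for this example. They are not taken from the thesis, which works with seven emotion classes and its own dataset.

```python
import math
from collections import Counter

# Toy training data (hypothetical, NOT from the thesis dataset).
# Label 1 = positive emotion, label 0 = negative emotion.
TRAIN = [
    ("i love this class so much", 1),
    ("great teacher and fun lesson", 1),
    ("so happy with my exam result", 1),
    ("i hate this boring lecture", 0),
    ("sad and tired of homework", 0),
    ("terrible exam i feel awful", 0),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# Vocabulary built from the training sentences only.
VOCAB = sorted({tok for text, _ in TRAIN for tok in tokenize(text)})
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

def featurize(text):
    """Bag-of-words counts over the training vocabulary; unknown words are ignored."""
    vec = [0.0] * len(VOCAB)
    for tok, count in Counter(tokenize(text)).items():
        if tok in INDEX:
            vec[INDEX[tok]] = float(count)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    """Fit weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            x = featurize(text)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - label  # gradient of the cross-entropy loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(text, weights, bias):
    """Return 1 if the predicted probability of the positive class is at least 0.5."""
    x = featurize(text)
    p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return 1 if p >= 0.5 else 0

WEIGHTS, BIAS = train(TRAIN)
print(predict("great fun lesson", WEIGHTS, BIAS))
print(predict("sad boring homework", WEIGHTS, BIAS))
```

In practice a library implementation with TF-IDF features and a multi-class setup would be the more likely choice; the from-scratch version is used here only to keep the example self-contained.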

3.2. Applying Deep Learning in emotion analysis

Deep Learning, particularly models such as BERT, has proven highly effective for emotion analysis. These models can learn from large amounts of data and improve accuracy.
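The thesis abstract reports experiments with several loss functions for such models, including cross entropy and focal loss. The sketch below compares the two on a single prediction, assuming the commonly used focal loss form FL = -(1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. The function names, the example probabilities, and the default gamma = 2.0 are illustrative assumptions, not values from the thesis.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability of the true class."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0):
    """Cross entropy scaled by (1 - p_t) ** gamma, which shrinks the loss of easy examples."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical predicted distributions over three classes; the true class is index 0.
easy = [0.9, 0.05, 0.05]  # confident and correct
hard = [0.2, 0.5, 0.3]    # the true class receives low probability

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, round(cross_entropy(probs, 0), 4), round(focal_loss(probs, 0), 4))
```

With gamma = 0 the focal loss reduces to plain cross entropy. Larger gamma values down-weight well-classified examples more strongly, which is why this loss is often tried on imbalanced label distributions.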

IV. Practical Applications of Emotion Analysis in Education

Emotion analysis can be applied in many fields, especially education. Understanding students' emotions can help improve teaching quality and provide psychological support for students.

4.1. Improving teaching quality

By analyzing emotions, teachers can adjust their teaching methods to better suit students' state of mind.

4.2. Psychological support for students

Emotion analysis helps detect students' psychological problems early, so that timely support can be provided.

V. Conclusion and the Future of Emotion Analysis

Emotion analysis of social media data is a field full of potential. As technology develops, emotion analysis will become increasingly accurate. Research in this field is valuable not only in education but in many other areas as well.

5.1. The future of emotion analysis technology

Emotion analysis technology will continue to develop, with many new applications in different fields.

5.2. The importance of continued research

Continued research in this field is needed to improve the accuracy and applicability of emotion analysis.

13/05/2025


Excerpt from the document

VIETNAM NATIONAL UNIVERSITY, HANOI INTERNATIONAL SCHOOL

GRADUATION PROJECT

DISSECTING EMOTION: CHALLENGES IN CLASSIFICATION AND ANALYSIS

SUPERVISOR: Assoc. Trần Thị Oanh
STUDENT: Ngô Phương Minh
STUDENT ID: 20070958
COHORT: BDA
SUBJECT CODE: INS401101
MAJOR: Business Data Analytics

Hanoi, June 17th, 2024

Acknowledgements

This dissertation would not have been possible without the support of many people. First and foremost, I would like to express my sincerest gratitude to my supervisor, Assoc. Tran Thi Oanh, for providing me with comprehensive and constructive guidance throughout my thesis journey.

Furthermore, I am very grateful to Ms. Linh for always following my work closely and giving me detailed instructions. Last but not least, I would like to thank my family and friends; I would not have made it without you. Once again, I am deeply grateful to Assoc. Tran Thi Oanh for her invaluable guidance and support.

Sincerely,
Phuong Minh Ngo

Declaration of Authorship

I declare that this thesis and the work presented within it are my own original work. I have acknowledged all sources and materials used, and no part of this thesis has been copied from other works except where properly referenced. This work has not been submitted for any other degree or qualification at any other institution.

To allow comparison of my work with existing sources, I agree that it shall be entered in a database, where it shall remain after examination to enable comparison with future thesis submissions. Further rights of reproduction and usage, however, are not granted.

List of Abbreviations

AI: Artificial intelligence
BERT: Bidirectional Encoder Representations from Transformers
DL: Deep learning
DT: Decision tree
LR: Logistic regression
ML: Machine learning
NLP: Natural Language Processing
RF: Random forest
SMOTE: Synthetic Minority Over-sampling Technique
SVM: Support Vector Machine

List of Tables

Table 1. Data description
Table 2. Emotion count
Table 3. Detailed text cleaning process
Table 4. Machine learning model evaluation
Table 5. Transformer model evaluation
Table 6. Experiments on different loss functions
Table 7. Evaluation on augmented data

List of Figures

Figure 1. BERT architecture (Seminar Information Systems (WS19/20), 2020)
Figure 2. Pipeline for Emotion Classification
Figure 3. Emotion distribution (%)
Figure 4. Text preprocessing flow
Figure 5. Special Characters Frequency
Figure 6. Special Characters Bi-gram Frequency
Figure 7. Special Characters Tri-gram Frequency
Figure 8. Emoji distribution
Figure 9. Top 5 emojis by label
Figure 10. Token length distribution for each label
Figure 11. Word cloud for each label
Figure 12. Top 10 word bigrams for each label
Figure 13. Visualization of text tokenizing (Rastogi, 2022)
Figure 14. Word bi-gram (cleaned)

Table of Contents

List of Abbreviations
List of Tables
List of Figures
Chapter 2 Theoretical Foundations
Importance and Challenges in dataset annotation
Contextual Factors in Emotion Analysis
Self-adjusting Dice Loss
Machine Learning model
Support Vector Machine
Deep Learning model
Chapter 4 Experiments and Evaluation
Machine Learning Model
Transformation Model with Augmentation

Abstract

This thesis delves into the complexity of detecting students' emotions on social media within an educational context. The study begins by addressing the fundamental challenges in dataset annotation and the contextual factors influencing emotion analysis.

Detailed attention is given to dataset preparation, noise identification, exploratory visualization, data cleaning, and tokenization, ensuring the integrity and relevance of the data used. Various machine learning and deep learning models, including logistic regression, decision trees, random forest, support vector machines, transformers, and a BERT model pretrained on Vietnamese text, are explored for their efficacy in emotion classification. Several loss functions, including cross entropy, focal loss, and self-adjusting dice loss, are investigated. The thesis further examines data augmentation techniques to mitigate dataset imbalance and enhance model robustness.

Experimental results from these methodologies are presented and analyzed comprehensively to assess model performance. This thesis contributes extensive experimentation on emotion detection in a low-resource language and emphasizes a rigorous sequential data cleaning process.

Chapter 1 Introduction

Emotion analysis is arguably among the most important areas of study in computer science and artificial intelligence. Understanding human feelings through data and computation is highly applicable, but it also poses various challenges to researchers.

The task becomes more complicated given the variety and complexity of the emotions people can express. Research on emotion has gained popularity in recent years across a wide range of fields, including the social sciences, humanities, and psychology, at least since the 2008 study by Strapparava and Mihalcea. With the ever-increasing volume of digital data and the growth of processing technology, classifying and analyzing emotions from text is more urgent and important than ever. Emotion in education has been a subject of concern since at least 2007, when the book “Emotion in Education” appeared.

Nowadays, with the explosion of social networking, students readily voice their thoughts and feelings on the internet. In Vietnam, students express their emotions, feelings, and opinions through so-called 'confessions' on Facebook pages and groups, each consisting of a post with a handful of comments under it. Analyzing these 'confessions' is not only interesting as a way to understand student psychology in depth; it can also yield valuable data for optimizing educational programs and psychological support at school. By applying natural language processing (NLP) with data science methods such as machine learning, deep learning, and artificial intelligence (AI), 'confessions' can be classified automatically and then analyzed, which in turn supports the judgments and decisions of educators and school administrators.

This may help improve educational processes and strengthen support for the social and psychological problems students face. The aim of this research is to develop methods for automatically identifying emotions in text. To that end, different text processing methods are used in this dissertation. The study covers machine learning methods for NLP, ranging from simple classification models such as decision trees to advanced ones such as Bidirectional Encoder Representations from Transformers (BERT), in order to better evaluate and understand students' emotions on online social platforms.

The thesis contributes extensive experimentation on emotion detection, spanning models, loss functions, and data augmentation for a low-resource language, together with a thorough sequential data cleaning process.

1.1. Problem Statement

Despite recent progress in NLP and machine learning, analyzing emotions from text remains very challenging for several reasons. It is inherently harder in educational settings, where understanding or failing to understand emotional cues can have a dramatic impact on both students and the school. One major reason is the ambiguity and variability of human language. Emotions come in innumerable forms and are affected by cultural, contextual, and individual factors. Sarcasm, idiomatic expressions, and nuanced language use make it even harder to assign an emotion to the correct class.

Most NLP models fail to capture such subtleties, so they either misclassify the text or show little nuanced comprehension. Another challenge is the availability and quality of data. Effective emotion classification requires large datasets with well-labeled emotional content. Datasets may not cover all expressions of emotion, or they may be biased, which affects model performance.

The dynamic way in which language evolves and develops new forms, such as emojis and internet slang, makes this even more challenging. Models trained on traditional text may not generalize well to these new forms of expression, so they require ongoing updates and adaptation. Continuous research and experimentation are therefore necessary to achieve the highest possible accuracy.

1.2. Motivation

Several factors motivate this research beyond purely theoretical interest in emotion classification and analysis within education. Firstly, in the modern period of digitization, the intensive development of information technology has led to the extensive growth of the internet and social networking sites, creating an overwhelming amount of text that carries information about human emotions.

Research into, and application of, methods for automatically classifying and analyzing emotions from these data are needed to understand social interactions and student cognition in digital environments. Secondly, emotions play an equally important role in students' learning and development. Only by addressing students' feelings and opinions can a school adjust its teaching and, at the same time, create favorable conditions for students to develop comprehensively. This also benefits the school by improving the quality of education and student satisfaction.

Thirdly, tools and techniques for emotion analysis from text have a wide range of applications in other fields. Experimenting with many methods will not only benefit the education sector but also offer considerable potential for diverse applications in technology. Thus, studies on classifying and analyzing emotions, both in education specifically and in general, are not only an urgent need but also an important contribution.

1.3. Related Works

There is a sizable body of work on emotion recognition, especially on social media, owing to its useful applications. Recently, Luis Romero Gomez et al. (2023) explored the performance of BERT, DistilBERT, and RoBERTa for emotion recognition on social media. They found that while all models achieved F1-scores over 92%, DistilBERT was recommended due to its superior results, reduced training times, and lower resource consumption. Koufakou et al. (2023) underscore the potential of advanced data augmentation strategies to mitigate data imbalance and improve the robustness of emotion detection models.

There are various benchmark datasets for Vietnamese text, for example the Vietnamese Students' Feedback Corpus (UIT-VSFC) by Kiet Van Nguyen et al. (2018) and the Vietnamese Social Media Emotion Corpus (UIT-VSMEC) by Ho et al. The UIT-VSMEC study, from 2020, achieved an annotation agreement of over 82% and applied both machine learning and deep neural network models to classify emotions. The authors reported a best overall weighted F1-score of 59.74% on the original UIT-VSMEC corpus using a CNN with word2vec embeddings, highlighting the challenges and opportunities in emotion detection for Vietnamese text.
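Because the weighted F1-score is the headline metric here, a minimal sketch of how it is typically computed may help: per-class F1 values are averaged with weights proportional to each class's share of the true labels. The function name and the toy labels below are assumptions made for illustration and do not come from any of the cited corpora.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1
    return score

# Hypothetical labels: three "joy" posts and one "sad" post.
y_true = ["joy", "joy", "joy", "sad"]
y_pred = ["joy", "joy", "sad", "sad"]
print(round(weighted_f1(y_true, y_pred), 4))
```

On imbalanced data the weighted average is dominated by the frequent classes, so a model can score well while still performing poorly on rare emotions.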

Despite a fair amount of NLP research on Vietnamese text, no paper is available on emotion classification of students on social media. This paper therefore aims to fill that gap by exploring emotion classification among students on social media using advanced NLP techniques.

Chapter 2 Theoretical Foundations

This chapter covers the theoretical foundations of Natural Language Processing (NLP), specifically for the emotion classification task (Wilhelm, 2019). It proceeds from definitions to all the methods used, following the standard NLP workflow.

2.1. Text Classification

2.1.1. What is Text Classification?

According to Elastic, text classification is a type of machine learning that categorizes text documents or sentences into predefined classes or categories.

It analyzes the content and meaning of the text and then assigns it the most appropriate label. Real-world applications of text classification include sentiment analysis and topic categorization; in this paper, the application is emotion detection (classifying seven emotions). Text classification plays a major role in natural language processing (NLP) by enabling computers to understand and organize large amounts of unstructured text. This simplifies tasks such as content filtering, recommendation systems, and customer feedback analysis.

In short, text classification is an umbrella term for categorizing text into predefined classes.

2.1.2. Why is text classification important?


The document titled "Emotion Analysis in Social Media Data: Challenges and Solutions" offers an in-depth look at analyzing emotions in social media data, a field that is becoming increasingly important in the digital age. It sets out the challenges that researchers and businesses face, such as the diversity of language and of the ways users express emotion. It also proposes solutions for improving the accuracy of emotion analysis models.

Readers will benefit from this document in several ways, including a better understanding of how emotion analysis tools work and how to apply them in practice. To broaden your knowledge, you can refer to the document "Khóa luận tốt nghiệp khoa học dữ liệu nghiên cứu mô hình phân tích cảm xúc dựa trên khía cạnh đa thể thức cho tiếng việt" (a data science graduation thesis on a multimodal aspect-based sentiment analysis model for Vietnamese), which looks more closely at sentiment analysis models in the Vietnamese context. Together, these documents give a more complete view of emotion analysis and its practical applications.
