VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY

TRAN TIEN HUNG - 19521587

GRADUATION THESIS

UNSUPERVISED DOMAIN ADAPTATION IN SCENE TEXT RECOGNITION USING ENTROPY MINIMIZATION

BACHELOR'S DEGREE OF COMPUTER SCIENCE

SUPERVISOR
PhD. NGO DUC THANH

HO CHI MINH CITY, 2023

THESIS COMMITTEE MEMBERS

The graduation thesis committee, established under decision number ...... dated ...... by the principal of the University of Information Technology, VNUHCM.
...... - Chairperson
...... - Secretary
...... - Commissioner

ACKNOWLEDGEMENT

We would like to convey our sincere gratitude to our advisor, Dr. Ngo Duc Thanh, for his insightful guidance and direction throughout the creation of this thesis.
Dr. Thanh has always been a fantastic mentor who understands how to communicate with his students and offers excellent advice. We are also thankful to the UIT AI Club and the Faculty of Computer Science for their invaluable resources. Most of the project builds on their resources and references, which allowed us to develop the project and run numerous experiments. We are also thankful for our research group's support and encouragement. In particular, we would like to thank Khanh, Phuong, and Tan for their outstanding assistance and technical advice during the project's development. Finally, we would like to express our gratitude to our family and friends, as well as Tu, for their support and encouragement during the difficult phases. This project could not have been completed without the support of everyone mentioned here. Thank you for your encouragement and support.
Contents

Abstract
1 Introduction
    Scene Text Recognition
        Replacement for the disabled community
    Challenges and recent research
        Challenges in Scene Text Recognition
        Unsupervised Domain Adaptation
        Unsupervised Domain Adaptation in Scene Text Recognition
    Requirement and Contribution
2 Prior Research
    Related works
        Feature extraction stage
        Sequence modeling stage
    Unsupervised domain adaptation in Scene Text Recognition
        Scene Text Recognition with Attention-based Model
        Sequence-to-sequence UDA with Minimizing Latent Entropy
    Class-balanced Self-paced Learning
3 Experiment implementations
4 Thesis conclusion

List of Figures

Scene Text Recognition visualization.
Scene Text Recognition examples.
Scene Text Recognition input.
Scene Text Recognition output.
Example of irregular dataset. a) Perspective text, samples from SVTP [33]. b) Curved text, taken from CUTE80 [35].
Document information extraction.
Imperfect imaging conditions.
Blurry, distortion and geometric transformation.
Perspective, shear and small text images.
Visualization of [2] STR frameworks.
SU-FOCALID overall architecture. F is the decoder with the focal mechanism, with green and red arrows indicating weight control. The brown dotted line indicates that each block shares weights with the other. c is the predicted character in a sequence. The two losses shown are the classification loss and the entropy minimization loss.
Class distributions in synthetic dataset including MJ and ST.
Class distributions in real-world dataset including SVT, IC03, IC13, IC15, CUTE, IIIT5k.
MJSynth samples generation process.
Bar plot for accuracy comparison.
Flowchart of hyper-parameters optimization. Pink boxes are α values, blue boxes are γ values, green boxes are official benchmarks.
Model loss comparisons with synthetic as source dataset and real-world as target dataset.
Entropy minimization loss with focal entropy and cross-entropy.
Classification loss comparisons with synthetic as source dataset and real-world as target dataset with multiple trials.
Model loss for optimized (α, γ) = (0.592, 4) as focal entropy and γ = 2 for focal loss, compared to cross-entropy loss and cross-entropy, as decoder loss and entropy minimization loss respectively.
Entropy minimization loss for optimized focal-based and cross-entropy based model.
Classification loss for focal-based and cross-entropy based model.
Comparison between SU-FOCALID and SMILE.

List of Tables

3.2 Comparison with UDA methods on regular benchmarks. Bold is the highest value and underline is the second-highest value.
List of Abbreviations

CNN          Convolutional Neural Networks
OCR          Optical Character Recognition
STR          Scene Text Recognition
EM           Expectation Maximization
SOTA         State Of The Art
STN          Spatial Transformer Networks
CTC          Connectionist Temporal Classification
UDA          Unsupervised Domain Adaptation
SU-FOCALID   Sequence-to-sequence Unsupervised domain adaptation with FOCAL on Imbalance Distribution
LSTM         Long Short Term Memory
BiLSTM       Bi-directional Long Short Term Memory
RNN          Recurrent Neural Network

Abstract

Scene Text Recognition is a subproblem of Optical Character Recognition and is the focus of this thesis. In recent years, numerous Scene Text Recognition approaches have been developed. Because the amount of real labeled data is small and the labeling process is time-consuming, the most common training method is to train a model on synthetic data and then predict on real data. Yet this can result in domain shift between synthetic and real images. In addition, each real-world benchmark has its own characteristics, including perspective, curved text, contrast, and brightness, which leads to poor performance and inaccurate predictions. To address these limitations, Unsupervised Domain Adaptation (UDA) has been proposed, which can reduce the disparity between source-domain and target-domain datasets.
Using pseudo-labels produced by a model trained on the labeled source dataset, we can combine Expectation Maximization methods with Entropy Minimization to produce high-confidence predictions, thereby reducing the discrepancy with the unlabeled target-domain dataset. In addition, with class-wise self-paced balancing, Unsupervised Domain Adaptation can select a high-confidence portion of the pseudo-labels for the training set, which improves the adaptation process. To implement this concept, we modified and validated multiple Unsupervised Domain Adaptation models and propose sequence-to-sequence unsupervised domain adaptation with focal on imbalance distribution (SU-FOCALID), built on a Scene Text Recognition framework. It adapts by minimizing the latent entropy of the pseudo-labels generated by its decoder in order to strengthen predictions on the unlabeled target dataset. SU-FOCALID is evaluated on official scene text recognition benchmarks against prior UDA methods.
Our main contributions include:

• Presenting the main concepts of Unsupervised Domain Adaptation for Scene Text Recognition using Entropy Minimization.
• Presenting the official Scene Text Recognition benchmarks and training datasets.
• Proposing sequence-to-sequence unsupervised domain adaptation against imbalance distribution (SU-FOCALID).

Keywords: scene text recognition, unsupervised domain adaptation, entropy minimization, deep learning, domain shift.
Chapter 1

Introduction

Text has always been an essential part of human life, and its use has benefited humanity throughout history. Text is a system of symbols used for recording information and for communicating across cultures. The rich and precise semantic information carried by text is used in many applications, such as image search [47], intelligent inspection [6], industrial automation [52], robot navigation [9], and instant translation [24]. Consequently, recognizing text in real-world settings is a crucial task for computer vision.
Scene Text Recognition, also known as text recognition in natural scenes, is a major branch of Optical Character Recognition in the field of Computer Vision. Although text recognition in scanned documents is well developed, Scene Text Recognition remains difficult because of numerous real-world factors, such as complicated backgrounds, diverse typefaces, varied text positions, and poor imaging conditions. Early studies [52] were mainly based on hand-crafted features, which were inefficient and performed poorly. Recently, deep learning has demonstrated promising results on numerous benchmarks, and new proposals with competitive State Of The Art results have been introduced over the years.
Since then, many approaches have been built on neural networks, with additional techniques that extend their benefits. Training strategies and learning methods have also contributed to improved model performance.

However, Scene Text Recognition models require a large amount of data to perform well, while the labeling process is time-consuming, so human-labeled real-world data is scarce. To counteract this shortage, synthetic data [18] has been introduced, establishing the main training pattern for subsequent STR research.
This pattern consists of training the model on synthetic data, followed by validation or fine-tuning on real-world data. Prior research focused mostly on modifying the architecture rather than on training with diverse datasets. [2] proposed an STR framework with four stages: transformation, feature extraction, sequence modeling, and prediction. The majority of modern scene text recognition models adhere to this scheme.
Another approach is data-centric. It starts from the observation that a model trained on one domain may perform badly on data from a different domain, a phenomenon known as domain shift. A model trained on all domains can perform well across domains and can reduce the need for human-labeled data. Domains of recent interest include handwritten, real-world, printed-document, and synthetic text. Given these domains, we can either train on a single domain and validate across domains, or train on a union of cross-domain data; this gives rise to a learning strategy called Unsupervised Domain Adaptation. In this thesis, we focus on training from a source domain to a target domain and validating across domains using Unsupervised Domain Adaptation for Scene Text Recognition.
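As a rough illustration of the entropy signal that entropy minimization relies on, the sketch below scores predictions by their Shannon entropy and keeps only the low-entropy ones as pseudo-labels. It is a minimal sketch under stated assumptions, not the SMILE or SU-FOCALID procedure: the function names, the three-symbol alphabet, the example scores, and the fixed threshold `max_entropy=0.5` are all hypothetical, and the selection is per sample with no class balancing.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_pseudo_labels(batch_logits, max_entropy):
    """Return (sample index, predicted class) for predictions at or below the entropy threshold."""
    kept = []
    for i, logits in enumerate(batch_logits):
        probs = softmax(logits)
        if entropy(probs) <= max_entropy:
            kept.append((i, probs.index(max(probs))))
    return kept

# Hypothetical scores over a 3-symbol alphabet (assumed values, not thesis data).
batch = [
    [8.0, 0.5, 0.2],  # peaked distribution: low entropy, confident prediction
    [1.0, 1.1, 0.9],  # nearly flat distribution: high entropy, uncertain prediction
]
selected = select_pseudo_labels(batch, max_entropy=0.5)
```

With these assumed inputs, only the first sample passes the threshold, so it alone would be reused as a pseudo-label. A fixed threshold is the simplest possible choice; a class-balanced, self-paced scheme would instead adapt the cut-off per class.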
Using Expectation Maximization techniques and clustering methods to obtain an entropy distribution, [13] proposed semi-supervised learning that yields high-confidence predictions in deep learning models: low entropy indicates that the model is confident about a sample, and high entropy indicates the opposite. A low-entropy prediction can therefore be used as a pseudo-label in the training process. We reproduce the results of a method called sequence-to-sequence UDA with minimizing latent entropy (SMILE) to observe the adaptation process, and we propose a new method called Sequence-to-sequence Unsupervised domain adaptation with FOCAL on Imbalance Distribution (SU-FOCALID).

1.1 Scene Text Recognition

1.1.1 Definition

Text has been used by humanity as a way to communicate and to document culture, knowledge, history, and accomplishments.
Imaging technology has advanced steadily in the twenty-first century, with more sophisticated equipment (cameras, smartphones) for capturing high-quality photographs. As a result, text in images has become an increasingly popular topic in Computer Vision, since the precise and rich information conveyed by text is crucial in many vision-based application scenarios. Recognizing text in natural scenes has become an active research subject in computer vision and pattern recognition. Yet extracting text from natural scenes and using it in downstream processes is a hard task with several fundamental challenges.
FIGURE 1.1: Scene Text Recognition visualization.

FIGURE 1.2: Optical Character Recognition visualization.

Text recognition is divided into two scenarios: Optical Character Recognition on scanned documents, and Scene Text Recognition on natural images. The two can be distinguished by several aspects suggested by [7], such as background, font, form, noise, and access.

FIGURE 1.3: Scene Text Recognition examples.

The rest of this thesis focuses on Scene Text Recognition. The differences between the two scenarios are as follows:
• Background: Scanned documents typically have a white background and little noise; whether marks are present depends on the document's content. Text in natural scenes, on the other hand, may be surrounded by many objects and by background noise, such as signs, boards, people, animals, and vehicles, which makes the image more complex and the text harder to detect. In addition, the background may visually resemble text, which makes recognition much more difficult.
• Font: Scanned documents typically use a single font and a uniform font size throughout, which makes them easy to recognize. In natural scenes, by contrast, the font varies from image to image.