MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------
LÊ TUẤN DŨNG (Tuan Dung LE)

IMPROVING MULTI-VIEW HUMAN ACTION RECOGNITION WITH SPATIAL-TEMPORAL POOLING AND VIEW SHIFTING TECHNIQUES

Speciality: Information System

MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM
2017-2018

SUPERVISOR: 1. Thi Oanh NGUYEN

Hanoi – 2018

Master student: Tuan Dung LE – CBC17016

ACKNOWLEDGEMENT

First of all, I sincerely thank the teachers of the School of Information and Communication Technology, as well as all the teachers at the Hanoi University of Technology, who have taught me valuable knowledge and experience over the past 5 years.

I would like to thank my two supervisors, Dr. Nguyen Thi Oanh, lecturer in Information Systems and Communication, Institute of Information and Communication Technology, Hanoi University of Technology, and Dr. Tran Thi Thanh Hai, MICA Research Institute, who guided me in completing this master thesis. I have learned a lot from them, not only knowledge in the field of computer vision but also working and studying skills such as writing papers, preparing slides, and presenting in front of an audience.

Finally, I would like to thank my family, my friends, and everyone who has supported me while studying and researching for this thesis.

Hanoi, March 2018
Master student
Tuan Dung LE

TABLE OF CONTENT

ACKNOWLEDGEMENT
TABLE OF CONTENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS
HUMAN ACTION RECOGNITION APPROACHES
Baseline method: combination of multiple 2D views in the Bag-of-Words model
Combination of spatial/temporal information and Bag-of-Words model
Combination of spatial information and Bag-of-Words model (S-BoW)
Combination of temporal information and Bag-of-Words model (T-BoW)
View shifting technique
West Virginia University Multi-view Action Recognition Dataset (WVU)
Northwestern-UCLA Multiview Action 3D (N-UCLA)
CONCLUSION & FUTURE WORK
APPENDIX 1

LIST OF FIGURES

Figure 1.1 a) Human body in frame, b) binary silhouettes, c) 3D human pose (visual hull), d) motion history volume, e) Motion Context, f) Gaussian blob human body model, g) cylindrical/ellipsoid human body model [1].
Figure 1.2 Construction of the HOG-HOF descriptor vector based on the SSM [6].
Figure 1.3 a) Original video of a walking action from two viewpoints, with their volumes and silhouettes, b) epipolar geometry in the case of extracted actor body silhouettes, c) epipolar geometry in the case of a dynamic scene with a dynamic actor and static background, without extracting silhouettes [9].
Figure 1.5 Illustration of spatio-temporal interest points detected in a video of people clapping [16].
Figure 1.6 Three ways to combine information from multiple 2D views in the BoW model [11].
Figure 2.2 Dividing the space domain based on bounding box and centroid.
Figure 2.3 Illustration of the T-BoW model.
Figure 2.4 Illustration of view shifting in the testing phase.
Figure 3.1 Illustration of the 12 action classes in the WVU Multi-view actions dataset.
Figure 3.2 Camera setup for capturing the WVU dataset.
Figure 3.3 Illustration of the 10 action classes in the N-UCLA Multi-view Actions 3D dataset.
Figure 3.4 Camera setup for capturing the N-UCLA dataset.
Figure 3.5 Illustration of a confusion matrix.
Figure 3.6 Confusion matrices: a) basic BoW model with codebook D3, accuracy 70.83%; b) S-BoW model with 4 spatial parts, codebook D3, accuracy 82.41%.
Figure 3.7 Confusion matrices: a) S-BoW model with 6 spatial parts, codebook D3, accuracy 78.24%; b) S-BoW model with 6 spatial parts and view shifting, codebook D3, accuracy 96.67%.
Figure 3.8 Confusion matrices: a) basic BoW model, codebook D3, accuracy 59.57%; b) S-BoW model with 6 spatial parts, codebook D3, accuracy 63.40%.
Figure 3.9 Illustration of view shifting on the N-UCLA dataset.

LIST OF TABLES

Table 3.1 Accuracy (%) of the basic BoW model on the WVU dataset.
Table 3.2 Accuracy (%) of the T-BoW model on the WVU dataset.
Table 3.3 Accuracy (%) of the S-BoW model on the WVU dataset.
Table 3.4 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the WVU dataset.
Table 3.5 Comparison with other methods on the WVU dataset.
Table 3.6 Accuracy (%) of the basic model on the N-UCLA dataset.
Table 3.7 Accuracy (%) of the T-BoW model on the N-UCLA dataset.
Table 3.8 Accuracy (%) of the combination of the S-BoW model and view shifting on the N-UCLA dataset.
Table 3.9 Accuracy (%) of the S-BoW model with (w) and without (w/o) the view shifting technique on the N-UCLA dataset.

LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS

Index  Abbreviation  Full name
1      MHI           Motion History Image
2      MEI           Motion Energy Image
3      LMEI          Localized Motion Energy Image
4      STIP          Spatio-Temporal Interest Point
5      SSM           Self-Similarity Matrix
6      HOG           Histogram of Oriented Gradients
7      HOF           Histogram of Optical Flow
8      IXMAS         INRIA Xmas Motion Acquisition Sequences
9      BoW           Bag-of-Words
10     ROIs          Regions of Interest

INTRODUCTION

As society moves from the 3.0 era (automation through information technology and electronic production) to the new 4.0 era (a convergence of technologies such as the Internet of Things, collaborative robots, 3D printing, and cloud computing, together with the emergence of new business models), the automatic collection and processing of information by computers has become essential. This places higher demands on human-machine interaction in both precision and speed.
Thus, problems such as object recognition, motion recognition, and speech recognition are now attracting a lot of interest from scientists and companies around the world.

Nowadays, video data is easily generated by devices such as digital cameras, laptops, and mobile phones, and shared through video-sharing websites. Human action recognition in video contributes to the automated exploitation of this rich data source.

Applications of human action recognition include the following.

In security, traditional monitoring systems consist of networks of cameras watched by human operators. As the number of cameras grows and these systems are deployed in multiple locations, it becomes difficult for a supervisor to cover the entire system efficiently and accurately. The task of computer vision is to find a solution that can replace or assist the supervisor. Automatic recognition of abnormal events in surveillance systems is a problem that attracts a lot of research.

Enhancing interaction between humans and machines is still challenging, and visual cues are the most important means of non-verbal communication. Effectively exploiting gesture-based communication will create more accurate and natural human-computer interaction. A typical application in this field is the "smart home", which responds intelligently to the gestures and actions of the user. However, these applications are still immature and continue to attract research.
In addition, human action recognition is also applied in a number of other areas, such as robotics, content-based video analysis, content-based and recovery-based video compression, video indexing, and virtual reality games.

With the aim of studying the problem of human action recognition using a combination of multiple views, we explored some of the recent approaches and chose to experiment with a method that combines local features with the Bag-of-Words model. After analyzing the weaknesses of this method, we proposed improvements and evaluated them experimentally.

The thesis is organized as follows:

Chapter 1: This chapter surveys existing approaches to give readers an overview of human action recognition in general and of multi-view recognition in particular. The last part of the chapter introduces a method that combines local features with the Bag-of-Words model, evaluates its advantages and disadvantages, and then introduces the proposed improvements.

Chapter 2: This chapter presents an improved framework that combines spatial/temporal information with the view shifting technique.

Chapter 3: This chapter reports experiments with the proposed method and evaluates the results.

Conclusion and Future works: This section reviews what has and has not been achieved in the thesis, highlights the pros and cons of the approach, and outlines future development.
References

CHAPTER 1. HUMAN ACTION RECOGNITION APPROACHES

1.1 Overview

The recognition and analysis of human actions has attracted much interest over the past three decades and is currently an active research topic in computer vision. It underpins a large number of potential applications in intelligent monitoring, video retrieval, video analysis, and human-machine interaction. Recent research has highlighted the difficulty of this problem, which stems from the large variability in human action data: differences in the way individuals perform actions, in movement and clothing; camera angles and motion effects; lighting changes; occlusion by objects in the environment or by parts of the human body; and clutter in the surroundings. Because so many factors can affect the outcome, current methods are often restricted to simple scenarios with simple backgrounds, simple action classes, and stationary cameras, or with limited variation in viewing angle.

Many different approaches have been proposed over the years for human action recognition. These approaches may be categorized by the visual information used to describe the action. Single-view methods use one camera to record the human body during the execution of the action. However, the appearance of an action differs considerably when it is viewed from an arbitrary angle. Thus, single-view methods usually rely on the basic assumption that the action is observed from the same angle in both the training data and the testing data. The performance of single-view methods drops significantly if this assumption does not hold. The obvious way to improve the accuracy of human action recognition is to increase the number of views per action by increasing the number of cameras, which makes it possible to exploit a larger amount of visual information to describe an action.
The multi-view approach has been studied for only about a decade, because the limited capabilities of devices and tools in earlier decades could not meet its computational demands. Recent technological advances have brought powerful tools that make the multi-view approach feasible in a variety of application contexts.

Action recognition methods can be divided into two approaches: the traditional approach based on handcrafted features, and the neural network approach. Neural network approaches typically require large training sets; otherwise they are ineffective. In practical applications, datasets are usually small or medium in size. Therefore, in this study, we are interested in the traditional approach that uses handcrafted features. In this approach, the action representation can be constructed from 2D data (2D approach) or from 3D data (3D approach) [1].

3D approaches

The general trend in 3D methods is to integrate the visual information captured from various viewing angles and then represent actions with a 3D model. This is usually achieved by combining 2D human body poses, in the form of binary silhouettes marking the video frame pixels that belong to the human body in each camera (Fig 1.1b). After obtaining the corresponding 3D human body representation, actions are described as sequences of successive 3D human body poses. Human body representations adopted by 3D methods include visual hulls (Fig 1.1c), motion history volumes (Fig 1.1d) [2], optical flow corresponding to the human body (Fig 1.1e) [3], Gaussian blobs (Fig 1.1f) [4], and cylindrical/ellipsoid body models (Fig 1.1g) [5].

Figure 1.