MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Tien Nam NGUYEN

SKELETON-BASED HUMAN ACTIVITY REPRESENTATION AND RECOGNITION

Speciality: Information System

MASTER OF SCIENCE THESIS IN INFORMATION SYSTEM

SUPERVISOR: Thi Lan LE

Hanoi - 2019

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISION

Full name of thesis author: Nguyễn
Thesis title: Research and development of a skeleton-based method for human activity representation and recognition
Speciality: Information System
Student ID: CBC18019

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis in accordance with the minutes of the committee meeting held on 26/10/2019, with the following content:

No. 1
Committee's request: Merge Chapters 4 and 5.
Revision made: Chapters 4 and 5 have been merged into a single chapter titled "Experimental results".

No. 2
Committee's request: Explain the reasons for choosing the recognition methods used in the thesis.
Revision made: The student has added details on the reasons for choosing the methods in Chapter 1, Section 3.

No. 3
Committee's request: Add the evaluation metrics Precision, Recall, and F1.
Revision made: The student has added information on how the evaluation metrics are computed; this is presented in Chapter 4, Section 2 (Evaluation metric).
Precision, Recall, and F1 score can all be used to evaluate a recognition system. In the thesis, however, different metrics are used depending on the dataset, so that the results can be compared with previously proposed methods. The MSRAction3D dataset uses accuracy, whereas the CMDFall dataset uses the F1 score. In the revised thesis, in addition to the metrics used for each individual dataset, the student has added Table 4.7 in Chapter 4, which reports the recognition results on all metrics for the two experimental datasets.

Acknowledgements

I would first like to thank my thesis advisor, Associate Professor Le Thi Lan, head of the Computer Vision Department at MICA Institute. The door to Assoc. Prof. Lan's office was always open whenever I ran into a trouble spot or had a question about my research or writing. She consistently allowed this thesis to be my own work, but steered me in the right direction whenever she thought I needed it.

I would also like to thank the experts who were involved in the validation survey for this thesis: Dr. Vu Hai, Assoc. Prof. Tran Thi Thanh Hai, and PhD student Pham Dinh Tan, who participated and gave me useful information. Without their passionate participation and input, the validation survey could not have been successfully conducted.

I would also like to acknowledge the School of Information and Communication Technology, which created the best conditions for me to complete this master's thesis, and I am gratefully indebted to the teachers at SOICT for their very valuable comments on this thesis.

Finally, I must express my very profound gratitude to my parents, my sister, and my colleagues at Toshiba Software Development VietNam (Nha Dink Duc, Pham Van Thanh, and many others) for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you!

Abstract

Human action recognition (HAR), which aims to predict the action a person is performing, is currently receiving increasing attention from computer vision researchers due to its wide range of potential applications in fields such as human-computer interaction, surveillance, robotics, and health care. Recently, the release of cost-effective depth cameras such as Microsoft Kinect and Asus Xtion PRO LIVE has opened new opportunities for HAR, as they provide richer information about the scene. Thanks to these sensors, depth and skeleton information are available in addition to color images. Moreover, the latest research results on human pose estimation in RGB video show that the human pose and skeleton can be accurately estimated even in complex scenes. Using skeleton information for human action recognition has several advantages over using color and depth information. As a result, a wide range of methods for HAR using skeleton information have been introduced [1].

The methods proposed for skeleton-based HAR can be categorized into two groups: hand-crafted features and deep learning. Each has its own advantages and disadvantages. Deep learning based techniques obtain impressive results on several benchmark datasets. However, they usually require large datasets and high-performance computing hardware. Among hand-crafted descriptors for action representation, Cov3DJ, which uses the covariance matrix of 3D joint positions, has proved its effectiveness and computational efficiency [2]. To take into account the variation in action duration, a temporal hierarchy representation with multiple layers is introduced. However, the disadvantage of Cov3DJ is that it uses all joints in the skeleton, which causes a computational burden and may become ineffective, as each joint has a certain level of engagement in an action. Moreover, the authors employ only joint positions as joint features, which does not seem sufficient to represent an action. Therefore, other features for action representation (joint velocities) are investigated and combined with joint positions to create more discriminative features for each action.

This thesis improves the Cov3DJ method presented in [2] in two ways: (1) proposing two different schemes to select the most informative joints for action representation and (2) combining velocity information with the positions of the joints for action representation. To evaluate the effectiveness of the proposed method, extensive experiments have been performed on two public datasets (MSRAction3D [3] and CMDFall [4]). On MSRAction3D, the experimental results show that the proposed method obtains an improvement of 6.17% over the original method and outperforms many state-of-the-art methods. On the CMDFall dataset, the proposed method, with an F1 score of 0.64, outperforms the deep learning networks ResTCN (F1 score: 0.39) [4] and LSTM (F1 score: 0. The contributions of the thesis have been published in an international conference.
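As a rough illustration of the kind of descriptor summarized in the abstract, the following Python sketch computes a covariance descriptor over a subset of joints, using joint positions together with joint velocities. It is a minimal sketch under stated assumptions, not the implementation evaluated in this thesis: the function name covariance_descriptor, the (frames, joints, 3) array layout, the frame-difference approximation of velocity, the concatenation of positions and velocities into one per-frame feature vector, and the joint indices in the usage note are all hypothetical choices made for illustration. The temporal hierarchy and the joint-selection schemes are not shown.

```python
import numpy as np


def covariance_descriptor(positions, joint_ids=None, use_velocity=True):
    """Covariance descriptor of a skeleton sequence (illustrative sketch).

    positions: array of shape (T, J, 3) holding the 3D coordinates of
               J joints over T frames (assumed layout).
    joint_ids: optional list of joint indices to keep, standing in for
               a "most informative joints" selection (assumed input).
    use_velocity: if True, append frame-to-frame differences as a
               simple velocity estimate (an assumption of this sketch).
    Returns the upper triangle of the covariance matrix as a 1D vector.
    """
    positions = np.asarray(positions, dtype=float)
    if joint_ids is not None:
        positions = positions[:, joint_ids, :]

    num_frames = positions.shape[0]
    # One row per frame, one column per coordinate of each kept joint.
    features = positions.reshape(num_frames, -1)

    if use_velocity:
        if num_frames < 3:
            raise ValueError("need at least 3 frames to use velocities")
        # Velocity approximated by the difference of consecutive frames.
        velocity = np.diff(positions, axis=0).reshape(num_frames - 1, -1)
        # Drop the first frame so positions and velocities align.
        features = np.hstack([features[1:], velocity])

    # Columns are variables, rows are observations (frames).
    cov = np.cov(features, rowvar=False)

    # The matrix is symmetric, so the upper triangle carries all values.
    upper = np.triu_indices(cov.shape[0])
    return cov[upper]
```

For a sequence of 30 frames with 20 joints, calling covariance_descriptor(sequence, joint_ids=[0, 3, 7, 11]) gives 24 features per frame (12 position and 12 velocity values), so the descriptor has 24 x 25 / 2 = 300 entries. Restricting the computation to a few joints keeps the matrix small, which is one motivation for selecting the most informative joints.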
Thesis revision confirmation dated 7 November 2019, signed by: Supervisor; Thesis author; CHAIR OF THE COMMITTEE.

Contents

  3.2 Strategy 2 (AM) for most informative joints detection
  Action representation by covariance descriptor
    Temporal covariance descriptor with position information
    Temporal covariance descriptor with velocity information
    Temporal hierarchy covariance descriptor
  Classification with support vector machine
    Linear separable training
    Non-linear separable training
4 Experimental results
    4.1.2 CMDFall
  Evaluation metric
  Experiment environments
  Evaluation of features used for joint representation
    4.4.1 Results on MSRAction3D dataset
    4.4.2 Results on CMDFall dataset
  4.5 Evaluation of the most informative joints selection
    The effect of the number of most informative joints
    Comparison between two strategies
  Comparison with state-of-the-art methods
  Time computation
5 Conclusions
  Conclusions
Publications
References