MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY

THÂN QUANG KHOÁT

TOPIC MODELING AND ITS APPLICATIONS

MAJOR: INFORMATION TECHNOLOGY
THESIS FOR THE DEGREE OF MASTER OF SCIENCE
SUPERVISOR: Prof. HO TU BAO

HANOI, 2009

LIST OF PHRASES

Abbreviation    Full name
AI              Artificial Intelligence
ART             Author-Recipient-Topic Model
AT              Author-Topic Model
BTM             Bigram Topic Model
cDTM            Continuous Dynamic Topic Model
CTM             Correlated Topic Model
dDTM            Discrete Dynamic Topic Model
DELSA           Dirichlet Enhanced LSA
DiscLDA         Discriminative LDA
EM              Expectation Maximization
HDP             Hierarchical Dirichlet Processes
HDP-RE          Hierarchical Dirichlet Processes with random effects
hLDA            Hierarchical Latent Dirichlet Allocation
HMM-LDA         Hidden Markov Model LDA
HTMM            Hidden Topic Markov Model
IG-LDA          Incremental Gibbs LDA
IR              Information Retrieval
LDA             Latent Dirichlet Allocation
LSA             Latent Semantic Analysis
MBTM            Memory Bounded Topic Model
MCMC            Markov Chain Monte Carlo
nCRP            Nested Chinese Restaurant Process
NetSTM          Network Regularized Statistical Topic Model
PF-LDA          Particle Filter LDA
pLSA            Probabilistic Latent Semantic Analysis
PLSV            Probabilistic Latent Semantic Visualization
sLDA            Supervised Latent Dirichlet Allocation
Spatial LDA     Spatial Latent Dirichlet Allocation
STM             Syntactic Topic Model
SVD             Singular Value Decomposition
TEM             Tempered EM algorithm

ACKNOWLEDGEMENT

First and foremost, I would like to express my gratitude to my supervisor, Professor Ho Tu Bao, for introducing me to this attractive research area, for his willingness to promptly support me in completing the thesis, and for much invaluable advice from the very beginning of my thesis.

I would like to sincerely thank Nguyen Phuong Thai and Nguyen Cam Tu for sharing some data sets and for pointing me to online sources where I could find implementations of some topic models. Thanks also go to Phung Trung Nghia for spending his valuable days helping me load the data for my experiments.

Finally, I would like to thank David Blei and Thomas Griffiths for their insightful discussions on Topic Modeling and for providing the C implementation of one of their topic models.

LIST OF TABLES

Table 2. Some selected probabilistic topic models.
DiscLDA for Classification.
Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM), and retrieval with the LDA-based document models (LBDM).
The most probable topics from the NIPS and VnExpress collections.
Finding the topics of a document.
Finding topics of a report.
Selected topics found by HMM-LDA.
Table 5. Classes of function words found by HMM-LDA.

PLEDGE

I promise that the content of this thesis was written solely by me. All of the content is based on reliable references such as papers published in distinguished international conferences and journals, and books published by widely known publishers. Many parts and discussions of the thesis are new, not previously published by any other authors.

Chapter 1 INTRODUCTION

Information Retrieval (IR) has a long history and has long been a very active research area. IR is often associated with increasingly huge corpora, such as collections of Web pages or collections of scientific papers gathered over many years. It therefore poses many hard questions that have received much attention from researchers. One of the most famous questions, which seems never to be fully settled, is how to automatically index the documents of a given corpus or database. Another substantial question is how to find, on the Internet or in a given corpus, the documents that are semantically most relevant to a given user's query.

Finding and ranking are usually important tasks in IR. Many tools supporting these tasks are available now, for example, Google and Yahoo. However, most of these tools are only able to search for documents via word matching instead of semantic matching. Semantics is well known to be complicated, so finding and ranking documents in the presence of semantics are extremely hard. Despite this fact, these tasks potentially have many important applications, which in my opinion are future web service technologies, for instance, semantic searching, semantic advertising, academic recommending, and intelligent controlling.

Semantics is a hot topic not only in the IR community but also in the Artificial Intelligence (AI) community. In particular, in the field of knowledge representation it is crucial to know how to effectively represent natural knowledge gathered from the surrounding environment so that reusing it or integrating new knowledge is easy and efficient. To obtain a good knowledge database, semantics cannot be absent, since any word has its own meanings and has semantic relations to some other words.
As we know, a word may have multiple senses and play different roles in

LIST OF FIGURES

Figure 1.1 Some approaches to representing knowledge.
Figure 2.1 A general view on Topic Modeling.
Figure 2.2 Probabilistic topic models in view of the bag-of-words assumption.
Figure 2.3 Viewing generative models in terms of Topics.
Figure 2.4 A parametric view on generative models.
Figure 3.1 A corpus consisting of § documents.
Figure 3.2 An illustration of finding topics by LSA using cosine.
Figure 3.3 A geometric illustration of representing items in 2-dimensional space.
Figure 3.4 Finding relevant documents using QR-based method.
Figure 4.1 Graphical model representation of pLSA.
Figure 4.2 A geometric interpretation of pLSA.
Figure 4.3 Graphical model representation of LDA.
Figure 4.4 A geometric interpretation of LDA.
Figure 4.5 A variational inference algorithm for LDA.
Figure 4.6 A geometric illustration of document generation process.
Figure 4.7 An example of hierarchy of topics [#].
Figure 4.8 A graphical model representation of BTM.
Figure 5.1 LDA for Classification.
Figure 5.2 The dynamics of the three hottest and three coldest topics.
Figure 5.3 Evolution of topics through decades.

TABLE OF CONTENTS

List of Phrases
List of Tables
List of Figures
Chapter 1 INTRODUCTION
Chapter 2 MODERN PROGRESS IN TOPIC MODELING
    2. Linear algebra based models
    2.3 Discussion and notes
Chapter 3 LINEAR ALGEBRA BASED TOPIC MODELS
    3.2 Latent Semantic Analysis
    3.4 Discussion
Chapter 4 PROBABILISTIC TOPIC MODELS
    4.1 An overview
    4.2 Probabilistic Latent Semantic Analysis
    4.3 Latent Dirichlet Allocation
    4.4 Hierarchical Latent Dirichlet Allocation
    4.5 Bigram Topic Model
Chapter 5 SOME APPLICATIONS OF TOPIC MODELS
    5.1 Classification
    5.2 Analyzing research trends over time