Luận văn về chỉ số modular và bước đi ngẫu nhiên trong bài toán tìm kiếm cộng đồng

Nghiên cứu chỉ số modular và bước đi ngẫu nhiên trong phát hiện cộng đồng giúp hiểu rõ cấu trúc mạng và tối ưu hóa phân tích dữ liệu.

Chuyên ngành

Applied Mathematics

Người đăng

Ẩn danh

Thể loại

master’s thesis

2022

79
2
0

Phí lưu trữ

30 Point

Mục lục chi tiết

Declaration

Acknowledgements

Contents

List of Figures

Introduction

Notations and conventions

1. CHƯƠNG 1: NETWORKS AND COMMUNITIES

1.1. On network science

1.2. Community structure

1.3. The topics of this thesis

3. CHƯƠNG 3: RANDOM WALKS IN COMMUNITY DETECTION

3.1. Random walks and stochastic matrices

3.2. The Walktrap algorithm

4. CHƯƠNG 4: BIBLIOGRAPHY

Appendix A: A Python implementation of Walktrap

Tài liệu "Nghiên cứu chỉ số modular và bước đi ngẫu nhiên trong phát hiện cộng đồng" khám phá các phương pháp và chỉ số quan trọng trong việc phát hiện cộng đồng trong mạng lưới. Nghiên cứu này không chỉ cung cấp cái nhìn sâu sắc về cách thức hoạt động của các chỉ số modular mà còn phân tích vai trò của bước đi ngẫu nhiên trong việc xác định cấu trúc cộng đồng. Những điểm chính trong tài liệu bao gồm cách áp dụng các phương pháp này để tối ưu hóa việc phát hiện cộng đồng, từ đó giúp người đọc hiểu rõ hơn về các kỹ thuật phân tích mạng.

Để mở rộng kiến thức của bạn về các chủ đề liên quan, bạn có thể tham khảo tài liệu Xây dựng mô hình phân lớp với tập dữ liệu nhỏ dựa vào học tự giám sát và cải thiện biểu diễn đặc trưng sâu, nơi bạn sẽ tìm thấy các phương pháp học máy có thể áp dụng trong phân tích mạng. Ngoài ra, tài liệu Ứng dụng quan hệ thứ tự và bậc tôpô trong nghiên cứu một số lớp bao hàm thức cũng cung cấp cái nhìn sâu sắc về các mối quan hệ trong mạng lưới. Cuối cùng, tài liệu Vận dụng tư tưởng hồ chí minh về đoàn kết quốc tế trong việc kết hợp sức mạnh dân tộc và sức mạnh thời đại để phục hồi và phát triển nền kinh tế ở việt nam từ sau đại dịch covid 19 đến nay có thể giúp bạn hiểu thêm về các khía cạnh xã hội và kinh tế liên quan đến mạng lưới cộng đồng. Những tài liệu này sẽ giúp bạn mở rộng kiến thức và khám phá sâu hơn về các khía cạnh của phát hiện cộng đồng và phân tích mạng.

Trích đoạn nội dung tài liệu

MINISTRY OF EDUCATION VIETNAM ACADEMY OF AND TRAINING SCIENCE AND TECHNOLOGY GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY Hoang Duc Anh MODULARITY AND RANDOM WALKS IN COMMUNITY DETECTION Major : Applied Mathematics Code: 8 46 01 12 MASTER’S THESIS IN MATHEMATICS ADVISOR: Assoc. Phan Thi Ha Duong Hanoi – 2022 1 Declaration I hereby declare that this thesis and the work presented in it are the result of my own study. Whenever the works of others are involved, every effort has been made to give credit clearly, with due references to the literature. I confirm that this thesis has not been previously included in a thesis or dissertation submitted for a degree or any other qualification at this graduate university or any other institution. I take full responsibility for the above declaration. Student Hoang Duc Anh 2 Acknowledgements I would like to express a deep gratitude to my advisor, Assoc. Phan Thi Ha Duong, who introduced me to network science and has provided me with support and guidance throughout my study. Her encouragement and enthusiasm for research have been a constant source of inspiration for me throughout the project. I would like to thank the researchers at the Institute of Mathematics and the group Mathematical Foundation for Computer Science for having created a wonderful envi- ronment for young students like me. I also thank the lecturers and administrative staff at the Institute as well as the Graduate University of Science and Technology for their valuable lessons and dedicated help during my degree. I would like to acknowledge the generous support of Vingroup JSC, who has funded my study for the last two years. I was supported by the Master, PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, codes VINIF. 3 Contents Declaration 1 Acknowledgements 2 Contents 3 List of Figures 5 Introduction 6 Notations and conventions 8 1 NETWORKS AND COMMUNITIES 9 1.1 On network science .3 The topics of this thesis .1 Definition of modularity .3 Modularity in community detection . 33 3 RANDOM WALKS IN COMMUNITY DETECTION 38 3.1 Random walks and stochastic matrices .3 The Walktrap algorithm .1 Summary of the thesis .2 Some further directions . 63 4 Bibliography 65 A A Python implementation of Walktrap 71 5 List of Figures 1.1 A graph with two communities.2 Some common network structures.1 Effect of the resolution parameter.2 Significance of modularity on a graph with two balanced groups.3 Significance of modularity on a graph with two unbalanced groups.4 Significance of modularity on a cycle.1 Spectral properties of a graph with two balanced groups.2 Spectral properties of a graph with three balanced groups.3 Spectral properties of a graph with two unbalanced groups.4 Spectral properties of a graph with three unbalanced groups.5 Spectral properties of a near-bipartite graph.6 Illustration of Ward and single linkage agglomerative clustering on a Walktrap matrix.7 Testing Walktrap on graphs with two balanced groups.8 Testing Walktrap on graphs with three balanced groups.9 Testing Walktrap on graphs with two unbalanced groups.10 Testing Walktrap on graphs with two unbalanced groups with different densities.11 Testing Walktrap on graphs with three unbalanced groups.12 Testing Walktrap on graphs with three unbalanced groups with different densities.13 Testing Walktrap on near-bipartite graphs.14 Testing the consistency of Walktrap. 61 6 Introduction Due to the rise of Big Data phenomenon and interdisciplinary research, network science emerged and has drawn enormous interest from both academia and industry. Dividing a network into smaller groups of similar nodes - a task called community detection - is one direction that has yielded valuable insights about complex network data. In this master’s thesis, we study two topics in the field of community detection: a quality function called modularity, and clustering properties of the random walk eigenvectors of a graph. This thesis contains four chapters and one appendix. The main content is in Chap- ter 2 and Chapter 3. ˆ Chapter 1 briefly discusses some notable features of network science and commu- nity structure in order to situate the main topics of the thesis. ˆ Chapter 2 is a detailed exposition of modularity - a popular clustering quality function.1 defines modularity and gives the standard interpretation based on a random graph model.2 presents basic properties of modu- larity, including modularity of some special graphs (cycles, complete multipartite graphs, .3 explains several shortcomings of modularity when used in the practical context of community detection. ˆ Chapter 3 studies the spectral properties of the random walk matrix and a cluster- ing algorithm based on those properties.1 introduces the random walk matrix and its spectrum.2 explains why the top eigenvectors of that ma- trix inherit the clustering structure of the graph and illustrates the phenomenon visually.3 presents the Walktrap algorithm and performs experiments on some random graphs to investigate the effect of step size and linkage method in the algorithm. ˆ Chapter 4 summarizes the main content of the thesis and introduces some further directions. ˆ Appendix A provides a simple Python implementation of the Walktrap algorithm 7 introduced in Chapter 3. This is an expository thesis. Our main contribution lies in collecting and organizing several results scattered in the literature; we try to provide more detail in theoretical explanations and proofs, and illustrate various ideas using our own experiments imple- mented in the Python programming language (more detail can be found in Chapter 4). We hope this document could be a useful starting point for people studying the two main topics mentioned above. 8 Notations and conventions In this thesis, ‘graph’ and ‘network’ are used interchangeably. Unless stated otherwise, we work with simple undirected graphs, i. undirected graphs with no parallel edges and no self-loops. For a graph G, let V (G) and E(G) be the vertex set and edge set of G; sometimes we simply use V and E if the underlying graph G is clear from context. For a vertex subset P ⊆ V (G), let E(P ) be the set of edges lying inside P and let e(P ) := |E(P )|. We also define the volume of P to be the sum of the degrees of the vertices inside P : X vol(P ) := deg(v). v∈P In case there are many graphs under consideration, we put G in the subscripts, like eG (P ), volG (P ), . , Pk } of a set V is a collection of disjoint non-empty subsets whose union is V , that is Pi ∩ Pj = ∅ for all i ̸= j and ⊔ki=1 Pi = V . All vectors are column vectors. The transpose of matrix M is denoted by M ⊤ , and similarly the transpose of vector x is x⊤ (which is a row vector). We use 1 to denote a vector with all entries equal to 1, whose dimension should be clear from context. In many places we use subscripts to index vectors, so round brackets are used for vector entries: xi (u) is the u-th entry of vector xi . 9 Chapter 1 NETWORKS AND COMMUNITIES This short chapter introduces some notable features of network science and community structure in order to set the background for the main topics of the thesis.1 On network science Network science has grown to an enormous discipline, and it is certainly outside of this chapter’s scope to even attempt a small survey. Instead, we only explain a few features that can be confusing for beginners. There are currently several good textbooks on network science; among them, we mention [1] with a broad coverage, and [2] with a unique focus on modeling, interpretation, and data quality. One attempt at defining network science can be found in the editorial [3]: network science is the study of network models. A network model is a network representation of something, comprising two main components: abstraction from real phenomena to network concepts, and representation of those concepts by network data. What distin- guishes network data from traditional tabular data is that there is some dependency (or relationship) built in, most easily visualized as links (or edges) in a graph. Whether a relationship should be represented by a network, and then how it can be represented, depend a lot on the problem being studied; see Chapters 5 and 11 of [2] for more detailed introduction. There are several reasons, both commercial and scientific, for the increased interest in network science in recent decades. A popular reason, which is also the one most easily capturing the public imagination, is the rise of the Internet and big social media 10 networks, whose links are given concrete names like ‘tag’, ‘friend’, ‘follower’, . Another big spur to the study of networks is how they can be used to tackle complexity in various scientific disciplines. This approach introduced a new paradigm in science, called topological explanations by philosophers [4], complementing existing kinds of explanations like mechanistic, causal, probabilistic, . See the surveys [5, 6] for more details on how networks can be used to model complexity. One notable feature of network science is how scattered the literature is (as can be shown by a brief look at the bibliography of this thesis). Outside from a few recent network-specific journals, network science articles appear in journals and conferences of physics, computer science, mathematics, statistics, as well as sociology. Inevitably, there are different cultures and methods. The traditional divide is between social scien- tists coming from social network analysis, and natural scientists coming from physics. Social scientists study small, carefully curated networks in very specific contexts. They have very rich notions of links and care about the motivation of actors in the networks. In contrast, physicists are inspired by statistical physics and complexity, hence they search for ‘universal laws’ in large collections of large networks, abstracted from those networks’ context. This divide is discussed in [7, 8] [2, Chapter 2]. A slightly dif- ferent but related contrast is between those searching for universality independent of particular objects, and statisticians who focus on testable properties in real data. The division leads to the controversy of power-law degree distribution, carefully recounted in [9]. Finally, there are also computer scientists and mathematicians, each with their own approaches [10]. All of this make network science a ‘trading zone’ [9], where cross-fertilization of ideas as well as cultural clashes happen.2 Community structure Given a network, it is natural to find groups of similar nodes, and we say those groups form a community structure. That description is certainly vague, because we do not (and probably should not) have precise conditions for when nodes form a community. The task of discerning those groups in a network is called community detection or graph clustering; those two terms are used interchangeably in this thesis. Graph clustering is closely related to tabular data clustering. Indeed, one popular way of clustering tabular data is spectral clustering: we create a graph where nodes represent data points, connect two nodes if they are ‘close’ enough, then use spectral properties of the graph to cluster data (see the surveys mentioned in Section 3.2 of 11 this thesis). Conversely, graph embedding is a method of handling very large graphs by embedding vertices in low dimensional euclidean spaces before applying standard techniques of tabular data (see [11] for a recent survey of this big field). Defining communities There is no single, unified concept of a community; see [12, III.B] and [13, II] for many definitions. Some define communities using numerical characteristics like edge density or a quality function. Other take a procedural approach and define communities as results of community detection algorithms; in other words, the algorithms become implicit models of communities. There are also the issues of whether communities can be overlapped, and difference between global (discovering all communities) and local (finding communities in a small region only) methods.

Nội dung được bảo vệ bản quyền — Tải xuống đầy đủ