Doctoral dissertation: Mining, Indexing and Similarity Search in Large Graph Data Sets

This doctoral dissertation explores methods for mining, indexing, and similarity search in large graph data sets, with applications in data science.

University

University of Illinois at Urbana-Champaign

Major

Computer Science

Uploaded by

Anonymous

Category

thesis

2006

172


Detailed table of contents

CERTIFICATE OF COMMITTEE APPROVAL

Abstract

Acknowledgments

1. CHAPTER 1: INTRODUCTION

1.1. Motivation

2. CHAPTER 2: GRAPH PATTERN MINING

2.1. Apriori-based Mining

2.2. Pattern Growth-based Mining

2.3. Right-Most Extension

2.4. DFS Lexicographic Order

2.5. Closed Graph Pattern

2.6. Failure of Early Termination

2.7. Detecting the Failure of Early Termination

2.8. Variant Graph Patterns

2.8.1. Contrast Graph Pattern

2.8.2. Coherent Graph Pattern

2.8.3. Discriminative Graph Pattern

2.8.4. Dense Graph Pattern

2.8.5. Approximate Graph Pattern

2.9. Relevance-Aware Top-K

2.10. Pattern-Based Classification

2.11. Automated Software Bug Isolation

2.11.1. Uncover “Backtrace” for Noncrashing Bugs

2.12. Graph Patterns with Constraints

2.12.1. Highly Connected Graph Patterns

2.12.2. CloseCut: A Pattern Growth Approach

2.12.3. SPLAT: A Pattern Reduction Approach

2.12.4. Pruning Patterns

2.12.5. Gene Relevance Network Analysis

3. CHAPTER 3: GRAPH INDEXING

3.1. Graph Query Processing

3.2. Path-based Graph Indexing

3.3. Discriminative Fragment Selection

3.4. Insert/Delete Maintenance

4. CHAPTER 4: GRAPH SIMILARITY SEARCH

4.1. Substructure Similarity Search

4.2. Feature-Graph Matrix

4.3. Feature Miss Estimation

4.4. Feature Set Selection

4.4.1. Complexity of Optimal Feature Set Selection

4.4.2. Clustering based Feature Set Selection

4.5. Substructure Search with Superimposed Distance

4.6. Framework of Partition-Based Index and Search

4.7. Fragment-based Index

4.8. Partition-based Search

List of Figures

List of Tables

Glossary of Notation

Summary

I. Introduction

Xifeng Yan's doctoral dissertation focuses on mining, indexing, and similarity search in large graph data sets. It stresses the need for scalable analytical algorithms for large graph collections, particularly in fields such as computational biology and software engineering. Graphs are widely used to represent complex structures such as proteins, chemical compounds, and program flows. Manually analyzing any reasonably large collection of graphs is infeasible, so automated tools are required.

1.1. Research motivation

The research is motivated by practical needs in graph analysis and similarity search over large graph data sets. Applications such as protein network analysis, chemical compound search, and program flow analysis demand efficient solutions. Graph pattern mining and index construction are the two core problems the dissertation addresses.

II. Graph pattern mining

The dissertation develops gSpan, a graph canonical labeling system, to address pattern mining in large graph data sets. gSpan shows that joining two graphs to form larger candidates is unnecessary, which removes the overhead that join-based methods incur. The dissertation also notes that graph pattern mining is computationally expensive because subgraph isomorphism is NP-complete.

2.1. Apriori-based mining

This approach applies the Apriori principle to mine frequent graph patterns. However, it generates many unnecessary candidates, which leads to high computational cost.

2.2. Pattern growth-based mining

This approach extends existing graph patterns instead of generating new candidates through joins, which reduces computational cost and improves performance.
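The growth idea can be sketched on a deliberately restricted pattern class. The following is a minimal illustration, not gSpan itself: it assumes patterns are label sequences of simple paths, and it uses "the smaller of a sequence and its reverse" as a stand-in for a canonical label so that the same path is never counted twice. All names (`make_graph`, `mine_frequent_paths`, the toy database) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Graph = Tuple[Dict[int, str], Dict[int, Set[int]]]  # (vertex labels, adjacency)
Pattern = Tuple[str, ...]                            # label sequence of a simple path


def make_graph(labels: Dict[int, str], edges: List[Tuple[int, int]]) -> Graph:
    """Build an undirected labeled graph from vertex labels and an edge list."""
    adj: Dict[int, Set[int]] = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return labels, adj


def contains_path(graph: Graph, seq: Pattern) -> bool:
    """True if the graph has a simple path whose vertex labels spell seq."""
    labels, adj = graph

    def extend(v: int, i: int, used: frozenset) -> bool:
        if i == len(seq) - 1:
            return True
        return any(
            labels[w] == seq[i + 1] and w not in used and extend(w, i + 1, used | {w})
            for w in adj[v]
        )

    return any(labels[v] == seq[0] and extend(v, 0, frozenset({v})) for v in labels)


def support(db: List[Graph], seq: Pattern) -> int:
    """Number of graphs in the database that contain the path pattern."""
    return sum(contains_path(g, seq) for g in db)


def canonical(seq: Pattern) -> Pattern:
    """A path and its reverse are the same pattern; keep the smaller spelling."""
    return min(seq, seq[::-1])


def mine_frequent_paths(db: List[Graph], min_sup: int) -> Dict[Pattern, int]:
    """Grow frequent path patterns one vertex at a time, with no joins."""
    alphabet = sorted({l for labels, _ in db for l in labels.values()})
    frontier = [(a,) for a in alphabet if support(db, (a,)) >= min_sup]
    result = {p: support(db, p) for p in frontier}
    while frontier:
        next_frontier = []
        for p in frontier:
            for base in {p, p[::-1]}:          # grow from either end of the path
                for a in alphabet:
                    cand = canonical(base + (a,))
                    if cand in result:         # already generated: skip duplicate
                        continue
                    s = support(db, cand)
                    if s >= min_sup:
                        result[cand] = s
                        next_frontier.append(cand)
        frontier = next_frontier
    return result


# Toy database (hypothetical): three small labeled paths.
db = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "O", 2: "N"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "C", 2: "O", 3: "N"}, [(0, 1), (1, 2), (2, 3)]),
]
patterns = mine_frequent_paths(db, min_sup=2)
```

Only frequent patterns are ever extended, so infrequent branches are cut off early. General graph patterns need a real canonical labeling and a subgraph isomorphism test, which this sketch sidesteps by limiting itself to paths.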

III. Graph indexing

The dissertation introduces gIndex, an indexing method for large graph data sets. gIndex builds its index from frequent and discriminative graph patterns, which avoids the exponential number of entries that indexing every substructure would incur. According to the abstract, the resulting index is orders of magnitude smaller than traditional approaches and an order of magnitude faster.

3.1. Discriminative fragment selection

This step selects highly discriminative graph fragments for the index, which reduces the number of index entries and improves search performance.
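The summary does not state the selection criterion, so the sketch below assumes one plausible reading: a fragment is worth indexing only if the set of graphs containing it is much smaller than what the already indexed subfragments would return together. The threshold `gamma_min`, the fragment names, and the id sets are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (fragment name, names of its subfragments, ids of graphs containing it)
Fragment = Tuple[str, List[str], Set[int]]


def select_discriminative(
    fragments: List[Fragment], all_ids: Set[int], gamma_min: float
) -> Dict[str, Set[int]]:
    """Pick fragments in size-increasing order (assumed input order).

    A fragment is kept when intersecting the id sets of its already
    selected subfragments still leaves at least gamma_min times as many
    graphs as actually contain the fragment.
    """
    selected: Dict[str, Set[int]] = {}
    for name, subfragments, ids in fragments:
        candidates = set(all_ids)
        for sub in subfragments:
            if sub in selected:
                candidates &= selected[sub]
        if ids and len(candidates) / len(ids) >= gamma_min:
            selected[name] = ids
    return selected


# Toy data (hypothetical): eight graphs, five candidate fragments.
all_ids = set(range(1, 9))
fragments = [
    ("C-C", [], {1, 2, 3, 4, 5, 6, 7, 8}),       # in every graph: no pruning power
    ("C-O", [], {1, 2, 3, 4}),
    ("C-N", [], {1, 2, 5, 6}),
    ("O-C-N", ["C-O", "C-N"], {1, 2}),           # implied by its subfragments
    ("C-C-O", ["C-C", "C-O"], {1}),              # much rarer than its subfragments
]
index = select_discriminative(fragments, all_ids, gamma_min=2.0)
```

In this toy run, "O-C-N" is skipped because intersecting the id sets of "C-O" and "C-N" already narrows the candidates to the same two graphs, which is the sense in which a redundant fragment adds index entries without adding pruning power.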

3.2. Index maintenance under graph insertion and deletion

The dissertation also proposes methods for maintaining the index as the graph data set changes, preserving the index's consistency and performance.

IV. Similarity search in graphs

The dissertation studies similarity search in large graph data sets, covering both exact and approximate search. These methods are applied in fields such as computational biology and chemistry, where they speed up analysis and knowledge discovery.

4.1. Substructure similarity search

This method searches for similar substructures in graphs, which helps identify meaningful graph patterns in real applications.
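The outline lists a "Feature-Graph Matrix" and "Feature Miss Estimation" under this chapter, but the summary gives no details. As a rough sketch of how feature-based filtering could work under those headings, assume every graph and the query are described by feature occurrence counts, and that an upper bound `d_max` on tolerable feature misses is supplied from outside. The function names, the data, and `d_max` are hypothetical.

```python
from typing import Dict, List

FeatureCounts = Dict[str, int]


def feature_misses(query: FeatureCounts, graph: FeatureCounts) -> int:
    """Count how many query feature occurrences the graph fails to cover."""
    return sum(max(0, q - graph.get(f, 0)) for f, q in query.items())


def filter_candidates(
    query: FeatureCounts, matrix: Dict[str, FeatureCounts], d_max: int
) -> List[str]:
    """Keep graphs whose feature misses stay within the allowed bound.

    Survivors still need a structural check; this step only prunes.
    """
    return [gid for gid, counts in matrix.items() if feature_misses(query, counts) <= d_max]


# Toy feature-graph matrix (hypothetical): rows are graphs, columns are features.
matrix = {
    "g1": {"f1": 2, "f2": 1, "f3": 1},
    "g2": {"f1": 1, "f2": 1},
    "g3": {"f3": 1},
}
query = {"f1": 2, "f2": 1, "f3": 1}
candidates = filter_candidates(query, matrix, d_max=2)
```

The filter never inspects graph structure, so it is cheap, but how tight it is depends entirely on how `d_max` is derived from the allowed query relaxation, which this sketch leaves open.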

4.2. Substructure search with superimposed distance

This method uses a superimposed distance to measure the similarity between graphs, which improves the accuracy of search results.

V. Practical applications

The dissertation provides in-depth studies of graph analysis, pattern-based classification, and constraint-based pattern mining. Practical applications include gene network analysis, program flow analysis for automated software bug detection, and chemical compound analysis in pharmaceutical research.

5.1. Gene network analysis

This study applies graph mining methods to gene networks, helping to identify important subnetworks in biological processes.

5.2. Program flow analysis

The dissertation proposes program flow analysis methods for automatically detecting software bugs, which helps improve software quality.

21/02/2025


Excerpt from the document

MINING, INDEXING AND SIMILARITY SEARCH IN LARGE GRAPH DATA SETS

BY XIFENG YAN

B., State University of New York at Stony Brook, 2001

DISSERTATION

Submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2006

Urbana, Illinois

UMI Number: 3243031

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3243031. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P. Box 1346, Ann Arbor, MI 48106-1346

© by Xifeng Yan, 2006. All rights reserved.

CERTIFICATE OF COMMITTEE APPROVAL

University of Illinois at Urbana-Champaign Graduate College

July 31, 2006

We hereby recommend that the thesis by XIFENG YAN, entitled MINING, INDEXING AND SIMILARITY SEARCH IN LARGE GRAPH DATA SETS, be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Director of Research: JIAWEI HAN

Head of Department:

Committee Member:

Committee Member:

* Required for doctoral degree but not for master's degree

Abstract

Scalable analytical algorithms and tools for large graph data sets are in great demand across domains from software engineering to computational biology as it is very difficult, if not impossible, for human beings to manually analyze any reasonably large collection of graphs due to their high complexity. In this dissertation, we investigate two long standing fundamental problems: Given a graph data set, what are the hidden structural patterns and how can we find them? and how can we index graphs and perform similarity search in large graph data sets?

Graph pattern mining is an expensive computational problem since subgraph isomorphism is NP-complete. Previous solutions generate inevitable overheads since they rely on joining two graphs to form larger candidates. We develop a graph canonical labeling system, gSpan, showing both theoretically and empirically that this kind of join operation is unnecessary. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. The solution, gIndex, proposes a novel, frequent and discriminative graph mining approach that leads to the development of a compact but effective graph index structure that is orders of magnitude smaller in size but an order of magnitude faster in performance than traditional approaches.

Besides graph mining and search, this dissertation provides thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity searching, which could leverage the usage of graph patterns. It also explores several critical applications in bioinformatics, computer systems and software engineering, including gene relevance network analysis for functional annotation, and program flow analysis for automated software bug isolation.

The developed concepts, theories, and systems may significantly deepen the understanding of data mining principles in structural pattern discovery, interpretation and search. The formulation of a general graph information system through this study could provide fundamental supports to graph-intensive applications in multiple domains.

To my parents and sister

Acknowledgments

There are no words to express my gratitude to my adviser, Prof. The research presented in this dissertation would not have happened without his support, guidance, and encouragement. Nearly every aspect of my research has been improved due to his mentoring. I was fortunate to spend two summers with Dr. Yu at IBM Research, who helped me define an important part of my doctoral work. Thanks also to Dr. Jasmine Xianghong Zhou who brought me into the fantastic field of computational biology. It was always inspiring and exciting to work with her.

I also felt honored to be a member in the Database and Information System Lab, where I found many dedicated collaborators: Chao Liu for automated software bug isolation, Hong Cheng and Dong Xin for pattern summarization and interpretation, and Feida Zhu for complexity analysis. It was my honor to have Dr. Christos Faloutsos, Dr. Marianne Winslett, and Dr. Chengxiang Zhai as my Ph. I am very grateful to them for providing insightful comments regarding this dissertation. I am also greatly indebted to many teachers in the past who educated me and got me interested in scientific research. A special thank goes to my primary school teacher Jingzhi Sun and my middle school mathematics teacher Shanshan Wu.

I would like to thank my parents and sister for their love, trust, and encouragement through hard times and for their unconditional support which enables me pursue my interests overseas. This research is funded in part by the U. National Science Foundation grants NSF IIS-0209199, IIS-0308215, CCR-0325603, and DBI-0515813.

List of Figures

Program Flow, Protein and Chemical Compound

Protein-Protein Interaction Network

Program Caller/Callee Graphs

Frequent Graph Patterns

Right-Most Extension

Lexicographic Search Tree

Extended Subgraph Isomorphism

Failure of Early Termination

Detect the Failure of Early Termination

Pattern Generation Order

Mining Performance in Class CA Compounds

Discovered Patterns in Class CA Compounds

Pattern Summarization: Top-k, Clustering, and Relevance-aware Top-k

Software Behavior Graphs

Classification Accuracy Boost

Entrance Precision and Exit Precision

Precision Boost of Functions

Mining Relational Graphs

Search Space: Record or Discard

Splat: A Pattern Reduction Approach

Pruning Properties of Graph Constraints

Number of Highly Connected Patterns

Size of the Largest Patterns

Genes Related with Subtelomerically Encoded Proteins

Genes Having Helicase Activity

Genes Involved in Ribosomal Biogenesis

Genes Involved in rRNA Processing

Size-increasing Support Functions

gIndex: Index Size

gIndex: Sensitivity and Scalability

Index Incremental Maintenance

Sampling-based Index Construction

gIndex: Performance on Synthetic Datasets

A Sample Set of Features

Feature-Graph Matrix Index

Edge-Feature Matrix

Weighted Set System

A Query Graph

Hierarchical Agglomerative Clustering

Grafil: Performance on Chemical Datasets

EDGE: Performance on Synthetic Datasets

PIS: Index Construction

The Index Components of PIS

Overlapping-Relation Graph

Greedy Partition Selection

PIS: Performance on Chemical Datasets

PIS: Parameter Sensitivity

List of Tables

DFS code for Figures 2.

Parameters of Synthetic Graph Generator

Bug-Relevant Functions with 6 = 20%

Parameters of Synthetic Relational Graph Generator

Sufficient Sample Size Given é, d, and p

Glossary of Notation

set of graphs

set of patterns

set of real numbers

empty set

set minus

vertex set of graph G

edge set of graph G

vertex label set

edge label set

data set

supporting data set of pattern a

support(a): support of a

min_support: minimum support

P(a): subpattern set of a

edge extension

Chapter 1

Introduction

Data mining, as well as database systems research, is facing a new challenge raised by the emergence of large volumes of network and graph data, which are pervasive in bioinformatics, chem-informatics, the Web, and many other applications. Due to their adaptive capability of modeling complicated structures, such as proteins, images, documents, and other schemaless data, graph representation of data is well accepted in domains ranging from software engineering to computational biology.

In computer vision, graphs are used to represent the organization of features in images, where the interlinks between features are critical in recognition of scenes and objects. In chemical informatics and bio-informatics, scientists use graphs to represent compounds and proteins. Systems for searching and registering chemical compounds have already been developed.


Topics

Data mining in graphs

Efficient indexing methods

Similarity search and analysis

Applications of big data in research


DETAILED INFORMATION

Author: Xifeng Yan

School: University of Illinois at Urbana-Champaign

Major: Computer Science

Title: Mining, Indexing and Similarity Search in Large Graph Data Sets

Document type: thesis

Year of publication: 2006

Location: Urbana
