A Thesis entitled ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce by Mahalakshmi Lakshminarayanan Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Masters of Science Degree in Engineering Dr. Vijay Devabhaktuni, Committee Chair Dr. Acosta, Committee Member Dr. Green II, Committee Member Dr.
Mansoor Alam, Committee Member Dr. Komuniecki, Dean College of Graduate Studies The University of Toledo December 2013 Copyright 2013, Mahalakshmi Lakshminarayanan This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author. An Abstract of ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce by Mahalakshmi Lakshminarayanan Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Masters of Science Degree in Engineering The University of Toledo December 2013 Similarity Join is an important operation for data mining, with a diverse range of real world applications.
Three efficient MapReduce Algorithms for performing Sim- ilarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence, they are vital for improving the efficiency of the algorithm. Multisets represent real world data better, by considering the frequency of its elements. Prior serial algorithms incorporate filtering techniques only for sets, but not multisets, while prior MapRe- duce algorithms do not incorporate any filtering technique or inefficiently incorporate prefix filtering with poor scalability.
This work extends the filtering techniques, namely the prefix, size, positional and suffix filters to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency. In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce Framework. The pairs that thrive filtering are joined suavely in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage.
We also developed a rational and creative technique to enhance the scalability of the algorithm as a contingency need. iii In the ESSJ algorithm, all the filtering techniques, namely, prefix, size, positional as well as suffix filtering are incorporated in the MapReduce Framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency to a file. In the EASE algorithm, all the filtering techniques, namely, prefix, size, positional and suffix are incorporated in the MapReduce Framework.
However it is tailored as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File similar to SSS, and some multisets are joined without utilizing it similar to ESSJ. The algorithm harvests the benefits of both the strategies. SSS and ESSJ algorithms were developed using Hadoop and tested using real- world Twitter data.
For both SSS and ESSJ, experimental results demonstrate phe- nomenal performance gains of over 70% in comparison to the competing state-of-the- art algorithm. iv I dedicate this work to the Almighty! Acknowledgments It is a pleasure beyond measure to acknowledge the people who have helped and supported me to complete the Master’s program. Acosta for his kind, patient, intelligent, meticulous and thorough guidance and support. He made my study at the University of Toledo, a very pleasant and memorable one! I thank Dr.
Rob for his timely, creative, elegant, prudent and thorough guidance and support. It was wonderful and comfortable working under him! Without the guidance of Dr. Acosta and Dr. Rob, this work would not have been possible.
Alam for his wise, kind and gracious support throughout my Master’s program! Special thanks to Dr. Vijay for his benevolent, erudite and gracious guidance and support! I thank Dr. Alam, EECS and the ET Departments for the financial support. I thank the EECS and ET faculty members and staff members who have helped me.
I thank my parents, grand parents, brothers, relatives and friends for their support, with special thanks to my mom! Ultimately, I thank God for showering His grace on us! vi Contents Abstract iii Acknowledgments vi Contents vii List of Tables x List of Figures xii 1 Introduction 1 2 Background 6 2.3 Multisets and Similarity Measures. 11 3 Strategic and Suave Processing for Similarity Joins Using MapRe- duce 14 3.1 Stage I - Map Phase .2 Stage I - Reduce Phase .3 Stage II - Map Phase .4 Stage II- Reduce Phase .2 Positional Filtering in Stage II-Reduce Phase .5 Stage III - Map Phase .6 Stage III - Reduce Phase .8 Comparison of SSS with SSJ-2R .10 Enhancing the Scalability of the Algorithm. 44 4 Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce 46 4.1 Stage I - Map Phase .2 Stage I - Reduce Phase .3 Stage II - Map Phase .4 Stage II - Reduce Phase .2 Optimizing the minimum Prefix Hamming Distance, Hpmin : .3 Suffix Filtering in Stage II- Reduce Phase .5 Stage III - Map Phase .6 Stage III - Reduce Phase .7 Comparison of ESSJ with SSJ-2R. 73 5 Efficient, Adaptable and Scalable MapReduce Algorithm For Simi- larity Joins Using Hybrid Strategies 74 viii 5.1 Stage II - Reduce Phase .2 Stage III - Map Phase .3 Stage III - Reduce Phase.
79 6 Conclusion 81 References 83 ix List of Tables 2.1 Multiset Similarity Measures and their formulae .1 The number of pairs for which similarity joins are performed in SSJ-2R and SSS algorithms.2 Running times of the Stages of SSS algorithm, for 16,000 records and a similarity threshold of 0.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.4 Running times of SSS and SSJ-2R algorithms, for varying number of in- put records and corresponding Performance Improvement for a similarity threshold of 0.5 Running times of SSS and SSJ-2R algorithms, for varying number of in- put records and corresponding Performance Improvement for a similarity threshold of 0.6 Running times of the Waves of Stage III of SSS-SE algorithm, for 16,000 records and a similarity threshold of 0.1 The number of pairs for which similarity joins are performed in SSJ-2R and ESSJ algorithms.2 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.4 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.5 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.6 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.7 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0. 72 xi List of Figures 2-1 MapReduce Model. 7 3-1 Mapper and Reducer Instances of Stage I. 19 3-2 Type I and Type II Mapper instances of Stage II.
20 3-3 Reducer Instance of Stage II. 24 3-4 Partitioning a Multiset Mi for Positional Filtering. 25 3-5 Type I and Type II Mapper Instances of Stage III. 29 3-6 Reducer Instance of Stage III.
30 3-7 Running times of the algorithms vs Number of Records, for a similarity threshold of 0. 37 3-8 Running times of the algorithms vs Number of Records, for a similarity threshold of 0. 38 3-9 Running times of the algorithms vs Number of Records, for a similarity threshold of 0. 38 3-10 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.
42 4-1 Mapper and Reducer Instances of Stage I. 48 4-2 Type I and Type II Mapper Instances of Stage II. 54 4-3 Reducer Instance of Stage II. 56 4-4 Partitioning a Multiset Mi for Suffix Filtering.
59 4-5 Type I and Type II Mapper Instance of Stage III. 63 xii 4-6 Reducer Instance of Stage III. 65 4-7 Running times of the algorithms vs Number of Records, for a similarity threshold of 0. 71 4-8 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.
72 5-1 Mapper Instances of Stage III. 76 5-2 Reducer Instance of Stage III. 78 xiii Chapter 1 Introduction This era has visualized massive growth of online applications and their users, which has resulted in an enormous increase in the volume of data that needs to be processed. Besides, there are numerous applications that require big data processing by their nature.
This includes processing large corpora, environmental and medical data that are gathered over a period, data from Smart Grids, and so on. Simple, yet effective and essential operations are always the need of the moment for any application. Similarity Joins are vital operations of that nature, which are essential for a diverse range of applications. Similarity joins are all time handy, and in the current scenario, big data is omnipresent.
Some interesting applications of similarity joins include – Duplicate detection [1–5], Plagiarism Detection [6, 7], Data Cleaning [8, 9], Record Linkage [10–13], String Searching [14–19], Community Discovery [20, 21], Internet Traffic Anomalies Detection and Advertisement Targeting [22, 23] and Collaborative Filtering for Recommendation Systems [24]. As the size of data in such applications is typically very large, distributed processing is generally a necessity. The MapReduce framework [25] and Hadoop [26] are very popular tools that are used in this study for accomplishing these purposes. In this thesis, the focus is on the creation of ACE (Agile, Contingent and Efficient) MapReduce algorithms that effectively handle similarity joins between multisets.
1 Stated concisely, the issue addressed through this study is as follows: Given a collec- tion of multisets, S = {Mi ,. , M|S| }, where Mi represents a multiset, and a similarity threshold, t, all pairs of multisets (< Mi , Mj >), whose similarity Sim(Mi , Mj ) ex- ceeds t must be discovered. In addressing this issue, the entirety of the presented work focuses on a trilogy of challenges involved in efficiently performing similarity joins in the MapReduce paradigm including: 1. In a naive implementation, all of the possible pairs of entities must be joined.
In an efficient implementation, the initial application of filtering techniques to filter out the possible pairs that must be joined is preferred. Real world data can be better represented using multisets because the frequency of an entity is taken into account. Thus, filtering techniques must be developed for multisets, though existing work have designed filtering techniques only for sets; 2. These filtering techniques must be designed in a distributed way suitable to the MapReduce framework; and 3.
Similarity Joins must be performed for the pairs that survive filtering. The challenge is to bring together the data corresponding to the surviving entity pairs in the MapReduce style work flow. This thesis proposes three algorithms to address the above mentioned trilogy of concerns and names them as SSS (Strategic and Suave Processing for Similarity Joins Using MapReduce), ESSJ (Adept and Agile Processing for Efficient and Scalable Similarity Joins using MapReduce) and EASE (Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies). The second and third problems listed above are particularly challenging as it requires designing the algorithm to suit the shared-nothing MapReduce model.
2 The prior MapReduce similarity join algorithms have incorporated no filtering techniques or have attempted to incorporate the filtering techniques, but that has resulted in inefficiency and poor scalability, due to a large quantity of data generated causing I/O and network bottlenecks. However, the algorithms in this study achieve this task by adeptly applying the prefix, size, positional and suffix filtering techniques in a strategic sequence, which minimizes the candidate pairs generated, resulting in I/O and network efficiency.