Thesis: ACE (Agile, Contingent and Efficient) Similarity Joins Using MapReduce

The thesis "ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce" focuses on improving large-scale data processing performance through efficient similarity joins.

University

The University of Toledo

Major

Engineering

Type

Thesis

Year

2013

Pages

103

Detailed table of contents

Abstract

Acknowledgments

1. Introduction

2. Background

2.3. Multisets and Similarity Measures

3. Strategic and Suave Processing for Similarity Joins Using MapReduce

3.1. Stage I - Map Phase

3.2. Stage I - Reduce Phase

3.3. Stage II - Map Phase

3.4. Stage II - Reduce Phase

3.4.2. Positional Filtering in Stage II-Reduce Phase

3.5. Stage III - Map Phase

3.6. Stage III - Reduce Phase

3.8. Comparison of SSS with SSJ-2R

3.10. Enhancing the Scalability of the Algorithm

4. Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce

4.1. Stage I - Map Phase

4.2. Stage I - Reduce Phase

4.3. Stage II - Map Phase

4.4. Stage II - Reduce Phase

4.4.2. Optimizing the minimum Prefix Hamming Distance, Hpmin

4.4.3. Suffix Filtering in Stage II-Reduce Phase

4.5. Stage III - Map Phase

4.6. Stage III - Reduce Phase

4.7. Comparison of ESSJ with SSJ-2R

5. Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies

5.1. Stage II - Reduce Phase

5.2. Stage III - Map Phase

5.3. Stage III - Reduce Phase

6. Conclusion

References

Summary

I. The ACE approach

ACE (Agile, Contingent and Efficient) is the focus of the thesis: a family of MapReduce algorithms that optimize similarity joins for large-scale data processing. The thesis proposes three algorithms, SSS, ESSJ and EASE, to address the challenges of performing similarity joins between multisets. The algorithms apply not only to sets but also to vectors and multisets, which makes them flexible.

1.1. The SSS algorithm

The SSS algorithm (Strategic and Suave Processing for Similarity Joins Using MapReduce) incorporates prefix, size and positional filtering into the MapReduce model. Pairs that survive filtering are joined in the third stage using a Multiset File generated in the second stage. The thesis also proposes a technique for enhancing the algorithm's scalability on large data sets.

1.2. The ESSJ algorithm

The ESSJ algorithm (Adept and Agile Processing for Efficient and Scalable Similarity Joins using MapReduce) incorporates all four filtering techniques: prefix, size, positional and suffix filtering. Its distinguishing feature is that similarity joins are performed without depending on a Multiset File, which reduces I/O and network load. The algorithm is designed for efficiency and scalability.

1.3. The EASE algorithm

The EASE algorithm (Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies) is a hybrid of SSS and ESSJ. It uses both strategies: some multiset pairs are joined using the Multiset File, as in SSS, and others are joined without it, as in ESSJ. This lets the algorithm draw on the advantages of both approaches.

II. Similarity joins

The similarity join is an important data mining operation with many practical applications, such as duplicate detection, data cleaning and recommendation systems. The thesis improves the efficiency of similarity joins by applying filtering techniques within the MapReduce model. These techniques reduce the number of pairs that must be joined, which lowers computation cost and network load.

2.1. Filtering techniques in MapReduce

Prefix, size, positional and suffix filtering are applied to reduce the number of pairs that must be joined. Applying them within the MapReduce model reduces I/O and network load. In the thesis's experiments, SSS and ESSJ outperform the competing state-of-the-art algorithm, SSJ-2R.

2.2. Practical applications

Similarity joins have many practical applications, from duplicate detection to recommendation systems. The thesis tested SSS and ESSJ on real-world Twitter data and reports performance gains of over 70% compared with the competing state-of-the-art algorithm.

III. Efficient use of MapReduce

Efficient use of MapReduce is one of the main goals of the thesis. The proposed algorithms optimize the similarity join itself and also make efficient use of computing and network resources. Filtering techniques and flexible join strategies reduce I/O and network load, which improves overall performance.

3.1. Resource optimization

The algorithms are designed to use computing and network resources efficiently. Because filtering reduces the number of pairs to be joined, less data is written, read and shuffled across the network.

3.2. Scalability

The proposed algorithms scale well and maintain performance on large data sets. Flexible join strategies help them adapt as the data size and the cluster size grow.

01/03/2025

Main content

Research overview

With the rapid growth of online applications and their user bases, the volume of data to be processed has increased substantially. Applications such as duplicate detection, plagiarism detection, data cleaning, record linkage, string searching, Internet traffic anomaly detection and recommendation systems all require efficient big data processing. One important data mining operation is the Similarity Join, which finds the pairs of records whose similarity exceeds a given threshold. For large data, distributed processing is necessary, and the MapReduce framework and the Hadoop platform are widely used for it.

The thesis develops three efficient MapReduce algorithms for Similarity Joins between multisets, a data structure that records how often each element occurs and therefore represents real-world data better than ordinary sets. The algorithms are SSS (Strategic and Suave Processing), ESSJ (Adept and Agile Processing) and EASE (Efficient, Adaptable and Scalable Hybrid Algorithm). The main goal is to minimize the number of pairs to be joined through prefix, size, positional and suffix filtering, while designing the algorithms to fit the distributed MapReduce model and to remain efficient in I/O, network use and computation.

The study uses real-world Twitter data, about 60 GB of tweets, and focuses on finding users whose tweets are similar. Experimental results for SSS and ESSJ show performance gains of more than 70% over the previous state-of-the-art algorithm, while also addressing scalability and processing efficiency in a distributed environment.

Theoretical basis and research method

Theoretical framework

The thesis builds on the following theories and models:

MapReduce framework: A parallel, distributed processing model in which data is processed by Map and Reduce functions, with the framework handling load balancing, fault tolerance and data transfer across the network.
Multisets and similarity measures: A multiset is a collection whose elements may occur more than once and is described by the frequency of each element. Common similarity measures include Ruzicka, Cosine, Dice and Overlap, defined in terms of multiset intersection and union.
Filtering techniques for Similarity Joins: These include prefix filtering (based on the leading part of each set), size filtering (based on set size), positional filtering (based on the positions of elements within the set) and suffix filtering (based on the trailing part of each set). They reduce the number of pairs to be compared and so improve computational efficiency.
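To make the multiset similarity measures named above concrete, here is a minimal Python sketch using `collections.Counter` as the multiset. The formulas are the commonly used textbook forms (intersection as element-wise minimum, union as element-wise maximum); the thesis's own Table 2.1 is not reproduced on this page, so treat the exact definitions and all function names as assumptions.

```python
from collections import Counter
from math import sqrt


def intersection_size(a: Counter, b: Counter) -> int:
    """Multiset intersection size: sum of element-wise minimum counts."""
    return sum((a & b).values())


def union_size(a: Counter, b: Counter) -> int:
    """Multiset union size: sum of element-wise maximum counts."""
    return sum((a | b).values())


def ruzicka(a: Counter, b: Counter) -> float:
    """|A ∩ B| / |A ∪ B| (the multiset generalization of Jaccard)."""
    union = union_size(a, b)
    return intersection_size(a, b) / union if union else 0.0


def dice(a: Counter, b: Counter) -> float:
    """2 |A ∩ B| / (|A| + |B|)."""
    total = sum(a.values()) + sum(b.values())
    return 2 * intersection_size(a, b) / total if total else 0.0


def overlap(a: Counter, b: Counter) -> float:
    """|A ∩ B| / min(|A|, |B|)."""
    smaller = min(sum(a.values()), sum(b.values()))
    return intersection_size(a, b) / smaller if smaller else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Dot product of the frequency vectors divided by the product of their norms."""
    dot = sum(count * b[elem] for elem, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Two users' word-frequency multisets (made-up data for illustration).
user_a = Counter({"big": 2, "data": 1})
user_b = Counter({"big": 1, "data": 2, "join": 1})
print(ruzicka(user_a, user_b))  # 0.4
```

Each measure returns 0.0 when both multisets are empty, which is a convention chosen here rather than something stated in the source.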

Research method

Data source: Real-world data collected from Twitter, about 60 GB of tweets, preprocessed by removing stop words and converting each user's tweets into a multiset of word frequencies.
Analysis method: Design of three MapReduce algorithms (SSS, ESSJ, EASE), with SSS and ESSJ implemented on Hadoop. Each algorithm chains Map and Reduce steps to perform filtering and the similarity join, combining several filtering techniques to minimize the number of pairs processed.
Research timeline: Data preprocessing, algorithm development, implementation on Hadoop, experiments on Twitter data, and comparison of the results with the state-of-the-art algorithm SSJ-2R.
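The idea of chaining Map and Reduce steps to generate candidate pairs can be sketched in a few lines of Python. This is an in-memory stand-in for a single Hadoop job, not the thesis's actual stage design: the mapper emits one (token, id) pair per prefix token, and the reducer pairs up the ids that share a token. The names `run_mapreduce`, `prefix_mapper` and `pair_reducer`, and the sample records, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round in memory (stand-in for a Hadoop job)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group emitted values by key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def prefix_mapper(record):
    """Emit (token, multiset id) for every token in the multiset's prefix."""
    multiset_id, prefix_tokens = record
    for token in prefix_tokens:
        yield token, multiset_id


def pair_reducer(token, multiset_ids):
    """Emit every pair of multisets that share this prefix token."""
    for left, right in combinations(sorted(set(multiset_ids)), 2):
        yield (left, right)


# Each record is (id, prefix tokens); the prefixes are assumed to be precomputed.
records = [
    ("u1", ["hadoop", "join"]),
    ("u2", ["join", "tweet"]),
    ("u3", ["spark"]),
    ("u4", ["hadoop", "join"]),
]
# A pair can be emitted under several tokens, so duplicates are removed afterwards.
candidates = sorted(set(run_mapreduce(records, prefix_mapper, pair_reducer)))
print(candidates)
```

In this toy input, u3 shares no prefix token with any other record, so it never appears in a candidate pair and is never joined.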

Research results and discussion

Key findings

Fewer joined pairs: SSS performs similarity joins on far fewer pairs than SSJ-2R. For example, at a similarity threshold of 0.3, the number of pairs drops from about 7,543 to 320, a reduction of more than 95%.
Shorter processing time: The stages of SSS run much faster than those of SSJ-2R. For example, with 16,000 records the join stage (Reduce Phase) of SSS takes about 81 seconds, whereas that of SSJ-2R takes 1,404 seconds.
Better scaling with data size: When the number of records grows from 7,543 to 16,244, the running time of SSS rises from 320 seconds to 594 seconds, while that of SSJ-2R rises from 479 seconds to 1,102 seconds, indicating that SSS scales better.
Scalability and network efficiency: Applying the filtering techniques in a strategic sequence reduces I/O and network traffic. SSS also distributes data in waves to avoid exhausting memory, which overcomes a limitation of SSJ-2R.

Discussion

The main reason for the performance gain is that SSS combines several filtering techniques (prefix, size, positional) in a well-chosen sequence, which eliminates unpromising pairs early. SSJ-2R uses only prefix filtering and does not apply size filtering because it assumes normalized vectors, whereas SSS makes full use of the size and element-position information in multisets. This result is consistent with earlier research on the effectiveness of filtering techniques in Similarity Joins.

Distributing data in waves reduces memory pressure in SSS and improves scalability on large clusters, something SSJ-2R does not fully address. The charts comparing running time against the number of records illustrate the advantage of SSS, especially as the data size grows.
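The positional information mentioned above can be illustrated with a small Python sketch. It assumes the Ruzicka measure and represents each multiset as a sorted list of distinct (element, occurrence) tokens, so that multiset intersection becomes ordinary set intersection. The bound is the standard positional-filter argument for sorted token lists; whether the thesis formulates it this way is an assumption, and all names are hypothetical.

```python
from collections import Counter
from math import ceil


def expand(multiset: Counter) -> list:
    """Turn a multiset into a sorted list of distinct (element, occurrence) tokens."""
    return sorted((elem, k) for elem, count in multiset.items()
                  for k in range(1, count + 1))


def overlap_upper_bound(x: list, y: list, i: int, j: int) -> int:
    """Upper bound on |x ∩ y| given a match x[i] == y[j] in two sorted token lists.

    Tokens after position i in x can only match tokens after position j in y,
    so at most min(remaining in x, remaining in y) further matches are possible.
    """
    seen = len(set(x[: i + 1]) & set(y[: j + 1]))
    remaining = min(len(x) - i - 1, len(y) - j - 1)
    return seen + remaining


def required_overlap(size_x: int, size_y: int, t: float) -> int:
    """Smallest intersection size compatible with Ruzicka >= t."""
    # The small epsilon guards against floating-point noise just above an integer.
    return ceil(t / (1 + t) * (size_x + size_y) - 1e-9)


def passes_positional_filter(x: list, y: list, i: int, j: int, t: float) -> bool:
    """Keep the pair only if the best achievable overlap can still reach the threshold."""
    return overlap_upper_bound(x, y, i, j) >= required_overlap(len(x), len(y), t)


x = expand(Counter({"big": 2, "data": 1}))
y = expand(Counter({"big": 1, "data": 2, "join": 1}))
print(passes_positional_filter(x, y, 0, 0, 0.4))
print(passes_positional_filter(x, y, 0, 0, 0.8))
```

A pair rejected by this check cannot reach the threshold, so discarding it loses no result; a pair that passes still has to be verified by computing the actual similarity.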

Proposals and recommendations

Apply multiple filtering techniques in big data processing: Organizations and developers should combine prefix, size, positional and suffix filtering to make Similarity Joins more efficient and to reduce computing resources and network bandwidth.
Deploy MapReduce algorithms with wave-based data distribution: To preserve scalability and avoid exhausting memory, distribute data in waves when processing large data on a distributed cluster.
Order data by global frequency during preprocessing: Sorting the elements of each multiset by their global frequency makes prefix filtering more effective, so an efficient preprocessing pipeline is needed, possibly running on Hadoop.
Develop hybrid algorithms that combine the strengths of several methods: EASE combines the file-based and file-free join strategies in MapReduce and benefits from both, so it merits further study and wider use in similar problems.

These recommendations should be carried out within 6-12 months, in coordination among software development teams, data managers and analysts, to ensure effectiveness and practical applicability.
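The global-frequency ordering recommended above can be sketched as follows. This is a single-machine illustration with made-up data, not the thesis's Hadoop preprocessing job; "global frequency" is taken here to mean the total number of occurrences across the whole collection, which is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def global_frequencies(multisets: dict) -> Counter:
    """Total number of occurrences of each element across the whole collection."""
    totals = Counter()
    for multiset in multisets.values():
        totals.update(multiset)  # adds the per-element counts
    return totals


def order_rare_first(multiset: Counter, totals: Counter) -> list:
    """Elements of one multiset, rarest first; ties are broken by the element itself."""
    return sorted(multiset, key=lambda elem: (totals[elem], elem))


tweets = {
    "u1": Counter({"data": 2, "hadoop": 1}),
    "u2": Counter({"data": 1, "join": 1}),
    "u3": Counter({"data": 3, "join": 2, "tweet": 1}),
}
totals = global_frequencies(tweets)
ordered = {user: order_rare_first(words, totals) for user, words in tweets.items()}
print(ordered["u3"])
```

Putting rare elements first means that prefixes consist of rare elements, so fewer multisets share a prefix element and fewer candidate pairs are generated.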

Who should read the thesis

Researchers and students in Computer Science and Software Engineering: The thesis covers distributed algorithms, data filtering techniques and MapReduce in depth, and supports further research on big data processing.
Big Data and Hadoop system developers: The algorithms and techniques presented help improve big data processing performance, reduce infrastructure cost and increase scalability.
Data analysts and data engineers in industry: These methods can improve the quality and speed of data processing, especially in applications such as duplicate detection, product recommendation and social network analysis.
AI and Machine Learning research and development organizations: Efficient processing of input data is an important step in building machine learning models, and the thesis offers a practical approach to it.

Frequently asked questions

What is a Similarity Join and why is it important?
A Similarity Join finds the pairs of records whose similarity exceeds a given threshold. It is important in many applications, such as duplicate detection, product recommendation and social network analysis. For example, finding Twitter users whose tweets are similar helps with community discovery and recommendation systems.
Why does this study use multisets instead of sets?
Multisets record how often each element occurs, so they represent real-world data, such as the number of times a word appears in tweets, more accurately. This makes both the filtering and the similarity computation more precise.
How do prefix, size, positional and suffix filtering work?

Prefix filtering: Considers only the leading part of each set when looking for candidate pairs.
Size filtering: Discards pairs whose sizes are incompatible with the similarity threshold.
Positional filtering: Uses the positions of shared elements to discard pairs that cannot qualify.
Suffix filtering: Uses the Hamming distance of the trailing parts for additional pruning.
Combined, these filters greatly reduce the number of pairs that must be compared.
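The size and prefix filters listed above can be sketched in Python for the Ruzicka measure. The sketch expands each multiset into distinct (element, occurrence) tokens, so that the well-known set-based prefix filter applies unchanged; this expansion is one common way to handle multisets and is not necessarily the formulation used in the thesis. Plain sorted order stands in for the global token order, and all names are hypothetical.

```python
from collections import Counter
from math import ceil

EPS = 1e-9  # guards ceil() and comparisons against floating-point noise


def expand(multiset: Counter) -> list:
    """Turn a multiset into a sorted list of distinct (element, occurrence) tokens."""
    return sorted((elem, k) for elem, count in multiset.items()
                  for k in range(1, count + 1))


def size_filter(size_x: int, size_y: int, t: float) -> bool:
    """Ruzicka >= t is only possible if the smaller size is at least t times the larger."""
    small, large = sorted((size_x, size_y))
    return small + EPS >= t * large


def prefix(tokens: list, t: float) -> list:
    """First |x| - ceil(t * |x|) + 1 tokens under the shared token order."""
    keep = len(tokens) - ceil(t * len(tokens) - EPS) + 1
    return tokens[:keep]


def is_candidate(a: Counter, b: Counter, t: float) -> bool:
    """True if the pair survives both filters and still needs its similarity computed."""
    x, y = expand(a), expand(b)
    if not size_filter(len(x), len(y), t):
        return False
    return bool(set(prefix(x, t)) & set(prefix(y, t)))


a = Counter({"big": 2, "data": 1})
b = Counter({"big": 1, "data": 2, "join": 1})
c = Counter({"spark": 5})
d = Counter({"big": 10})
for other in (b, c, d):
    print(is_candidate(a, other, 0.4))
```

Both filters are conservative: they may let through a pair that later fails verification, but they never discard a pair whose Ruzicka similarity reaches the threshold t (assumed to lie in (0, 1]).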

How does the algorithm remain scalable on large clusters?
The algorithm distributes data in waves, loading only part of the data into memory at any one time. This avoids exhausting memory and allows more parallel processing across nodes.
Can these algorithms be applied to data other than Twitter data?
Yes. The algorithms are designed generally for multisets and vectors, so they suit many kinds of large data, such as text, system logs and sensor data, and can detect similarity in many domains.
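The wave-based distribution described in the FAQ can be illustrated with a small single-process Python sketch. The page does not describe the thesis's actual wave mechanism, so everything here is an assumption: the file of multisets is modeled as a dict, each wave keeps only a slice of it "resident", and a candidate pair is joined in the wave that holds its right-hand multiset. All names and data are hypothetical.

```python
from collections import Counter


def ruzicka(a: Counter, b: Counter) -> float:
    """Ruzicka similarity: sum of minimum counts over sum of maximum counts."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def waves(ids: list, wave_size: int):
    """Split the list of multiset ids into consecutive waves of at most wave_size ids."""
    for start in range(0, len(ids), wave_size):
        yield ids[start:start + wave_size]


def join_in_waves(candidates, multiset_file, wave_size, threshold):
    """Join candidate pairs while keeping only one wave of the file resident at a time.

    wave_size must be at least 1.
    """
    results = {}
    for wave in waves(sorted(multiset_file), wave_size):
        # Stand-in for loading one slice of the file of multisets into memory.
        resident = {mid: multiset_file[mid] for mid in wave}
        for left, right in candidates:
            if right in resident:
                # Assumption: the left multiset travels with the pair through
                # the shuffle; here it is simply read from the same dict.
                score = ruzicka(multiset_file[left], resident[right])
                if score > threshold:
                    results[(left, right)] = score
    return results


multiset_file = {
    "u1": Counter({"a": 2, "b": 1}),
    "u2": Counter({"a": 2, "b": 2}),
    "u3": Counter({"a": 1, "c": 3}),
}
candidates = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
print(join_in_waves(candidates, multiset_file, wave_size=2, threshold=0.5))
```

The result does not depend on the wave size; a smaller wave trades more passes over the candidate list for a smaller resident slice.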

Conclusion

The thesis develops three MapReduce algorithms (SSS, ESSJ, EASE) for Similarity Joins between multisets; in the experiments, SSS and ESSJ significantly outperform the state-of-the-art algorithm SSJ-2R.
It extends prefix, size, positional and suffix filtering from sets to multisets in a form suited to the distributed MapReduce model.
It designs a wave-based data distribution strategy that improves scalability and reduces memory pressure when processing large data.
Experiments on real-world Twitter data show performance gains of over 70%.
It proposes applications and directions for further work on optimizing big data processing in distributed systems.

Next steps: Apply these algorithms in real-world big data projects, and extend the research to other data types and similarity models.

Excerpt from the document

A Thesis entitled

ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce

by Mahalakshmi Lakshminarayanan

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Masters of Science Degree in Engineering

Dr. Vijay Devabhaktuni, Committee Chair
Dr. Acosta, Committee Member
Dr. Green II, Committee Member
Dr. Mansoor Alam, Committee Member
Dr. Komuniecki, Dean, College of Graduate Studies

The University of Toledo
December 2013

Copyright 2013, Mahalakshmi Lakshminarayanan. This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.

Abstract

Similarity Join is an important operation for data mining, with a diverse range of real world applications. Three efficient MapReduce Algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence, they are vital for improving the efficiency of the algorithm. Multisets represent real world data better, by considering the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, but not multisets, while prior MapReduce algorithms do not incorporate any filtering technique or inefficiently incorporate prefix filtering with poor scalability.

This work extends the filtering techniques, namely the prefix, size, positional and suffix filters to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency.

In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce Framework. The pairs that survive filtering are joined suavely in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a rational and creative technique to enhance the scalability of the algorithm as a contingency need.

In the ESSJ algorithm, all the filtering techniques, namely, prefix, size, positional as well as suffix filtering are incorporated in the MapReduce Framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency on a file.

In the EASE algorithm, all the filtering techniques, namely, prefix, size, positional and suffix are incorporated in the MapReduce Framework. However, it is tailored as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File similar to SSS, and some multisets are joined without utilizing it similar to ESSJ. The algorithm harvests the benefits of both the strategies.

SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate phenomenal performance gains of over 70% in comparison to the competing state-of-the-art algorithm.

I dedicate this work to the Almighty!

Acknowledgments

It is a pleasure beyond measure to acknowledge the people who have helped and supported me to complete the Master’s program. [...] Acosta for his kind, patient, intelligent, meticulous and thorough guidance and support. He made my study at the University of Toledo a very pleasant and memorable one! I thank Dr. Rob for his timely, creative, elegant, prudent and thorough guidance and support. It was wonderful and comfortable working under him! Without the guidance of Dr. Acosta and Dr. Rob, this work would not have been possible.

[...] Alam for his wise, kind and gracious support throughout my Master’s program! Special thanks to Dr. Vijay for his benevolent, erudite and gracious guidance and support! I thank Dr. Alam, EECS and the ET Departments for the financial support. I thank the EECS and ET faculty members and staff members who have helped me.

I thank my parents, grandparents, brothers, relatives and friends for their support, with special thanks to my mom! Ultimately, I thank God for showering His grace on us!

List of Tables

2.1 Multiset Similarity Measures and their formulae
3.1 The number of pairs for which similarity joins are performed in SSJ-2R and SSS algorithms
3.2 Running times of the Stages of SSS algorithm, for 16,000 records and a similarity threshold of 0.[...]
3.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.[...]
3.4 Running times of SSS and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.[...]
3.5 Running times of SSS and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.[...]
3.6 Running times of the Waves of Stage III of SSS-SE algorithm, for 16,000 records and a similarity threshold of 0.[...]
4.1 The number of pairs for which similarity joins are performed in SSJ-2R and ESSJ algorithms
4.2 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.[...]
4.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.[...]
4.4 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.[...]
4.5 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.[...]
4.6 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.[...]
4.7 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.[...]

List of Figures

2-1 MapReduce Model
3-1 Mapper and Reducer Instances of Stage I
3-2 Type I and Type II Mapper Instances of Stage II
3-3 Reducer Instance of Stage II
3-4 Partitioning a Multiset Mi for Positional Filtering
3-5 Type I and Type II Mapper Instances of Stage III
3-6 Reducer Instance of Stage III
3-7 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.[...]
3-8 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.[...]
3-9 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.[...]
3-10 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.[...]
4-1 Mapper and Reducer Instances of Stage I
4-2 Type I and Type II Mapper Instances of Stage II
4-3 Reducer Instance of Stage II
4-4 Partitioning a Multiset Mi for Suffix Filtering
4-5 Type I and Type II Mapper Instances of Stage III
4-6 Reducer Instance of Stage III
4-7 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.[...]
4-8 Running times of the algorithms vs Number of Records, for a similarity threshold of 0.[...]
5-1 Mapper Instances of Stage III
5-2 Reducer Instance of Stage III

Chapter 1

Introduction

This era has witnessed massive growth of online applications and their users, which has resulted in an enormous increase in the volume of data that needs to be processed. Besides, there are numerous applications that require big data processing by their nature.

This includes processing large corpora, environmental and medical data that are gathered over a period, data from Smart Grids, and so on. Simple, yet effective and essential operations are always the need of the moment for any application. Similarity Joins are vital operations of that nature, which are essential for a diverse range of applications. Similarity joins are always handy, and in the current scenario, big data is omnipresent.

Some interesting applications of similarity joins include Duplicate Detection [1–5], Plagiarism Detection [6, 7], Data Cleaning [8, 9], Record Linkage [10–13], String Searching [14–19], Community Discovery [20, 21], Internet Traffic Anomalies Detection and Advertisement Targeting [22, 23] and Collaborative Filtering for Recommendation Systems [24]. As the size of data in such applications is typically very large, distributed processing is generally a necessity. The MapReduce framework [25] and Hadoop [26] are very popular tools that are used in this study for accomplishing these purposes. In this thesis, the focus is on the creation of ACE (Agile, Contingent and Efficient) MapReduce algorithms that effectively handle similarity joins between multisets.

Stated concisely, the issue addressed through this study is as follows: Given a collection of multisets, S = {M1, ..., M|S|}, where Mi represents a multiset, and a similarity threshold, t, all pairs of multisets (<Mi, Mj>) whose similarity Sim(Mi, Mj) exceeds t must be discovered. In addressing this issue, the entirety of the presented work focuses on a trilogy of challenges involved in efficiently performing similarity joins in the MapReduce paradigm, including:

1. In a naive implementation, all of the possible pairs of entities must be joined. In an efficient implementation, the initial application of filtering techniques to filter out the possible pairs that must be joined is preferred. Real world data can be better represented using multisets because the frequency of an entity is taken into account. Thus, filtering techniques must be developed for multisets, though existing work has designed filtering techniques only for sets;
2. These filtering techniques must be designed in a distributed way suitable to the MapReduce framework; and
3. Similarity Joins must be performed for the pairs that survive filtering. The challenge is to bring together the data corresponding to the surviving entity pairs in the MapReduce style work flow.

This thesis proposes three algorithms to address the above mentioned trilogy of concerns and names them as SSS (Strategic and Suave Processing for Similarity Joins Using MapReduce), ESSJ (Adept and Agile Processing for Efficient and Scalable Similarity Joins using MapReduce) and EASE (Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies). The second and third problems listed above are particularly challenging as they require designing the algorithm to suit the shared-nothing MapReduce model.

The prior MapReduce similarity join algorithms have incorporated no filtering techniques or have attempted to incorporate the filtering techniques, but that has resulted in inefficiency and poor scalability, due to a large quantity of data generated causing I/O and network bottlenecks. However, the algorithms in this study achieve this task by adeptly applying the prefix, size, positional and suffix filtering techniques in a strategic sequence, which minimizes the candidate pairs generated, resulting in I/O and network efficiency.
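As a point of reference for the problem statement in this excerpt, the naive similarity join that the filtering techniques are meant to avoid can be written in a few lines of Python. The sketch assumes the Ruzicka measure as Sim and uses `collections.Counter` as the multiset; the collection and the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def ruzicka(a: Counter, b: Counter) -> float:
    """Ruzicka similarity: sum of minimum counts over sum of maximum counts."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def naive_similarity_join(collection: dict, t: float) -> list:
    """Compare every pair of multisets and keep those whose similarity exceeds t."""
    return [
        (i, j)
        for i, j in combinations(sorted(collection), 2)
        if ruzicka(collection[i], collection[j]) > t
    ]


S = {
    "M1": Counter({"a": 2, "b": 1}),
    "M2": Counter({"a": 2, "b": 2}),
    "M3": Counter({"c": 1}),
}
print(naive_similarity_join(S, 0.5))
```

With |S| multisets this performs |S|(|S| - 1)/2 similarity computations, which is the quadratic cost that candidate filtering is designed to cut down.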

The thesis ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce is an in-depth study of similarity joins on distributed data using MapReduce. It focuses on improving large-scale data processing and proposes the ACE algorithms as a way to address the challenges of joining and processing distributed data. Readers will learn how filtering techniques are incorporated into MapReduce and how they reduce processing time and I/O and network load.
