VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

LẠI TRUNG MINH ĐỨC

PRESERVING PRIVACY FOR PUBLISHING TIME-SERIES DATA WITH DIFFERENTIAL PRIVACY

Major: COMPUTER SCIENCE
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023

THIS THESIS IS COMPLETED AT HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor 1: Assoc. Prof. DANG TRAN KHANH
Supervisor 2: PhD. LE LAM SON
Examiner 1: Assoc. Prof. TRAN TUAN DANG
Examiner 2: PhD. DANG TRAN TRI

This master’s thesis is defended at HCM City University of Technology, VNU-HCM City on 10 July 2023.

Master’s Thesis Committee:
1. TRAN MINH QUANG
2. NGUYEN THI AI THAO
3. TRAN TUAN DANG
4. DANG TRAN TRI
5. TRUONG THI AI MINH

Approval of the Chairman of the Master’s Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE
HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: LẠI TRUNG MINH ĐỨC
Student ID: 2070686
Date of birth: 24 May 1996
Place of birth: Ho Chi Minh City
Major: Computer Science
Major ID: 8480101

I.
THESIS TITLE: Preserving Privacy for Publishing Time-series Data with Differential Privacy (Duy trì quyền riêng tư cho thời gian xuất bản dữ liệu chuỗi với quyền riêng tư khác biệt)

II. TASKS AND CONTENTS:

Week         Time        Task
W1 to W2     2 weeks     - Conduct the literature review and define the methodology for the study
                         - Define the scope of work for the main research of the thesis
                         - Write up the report
W3 to W4     2 weeks     - Research related works/projects on Differential Privacy and time-series privacy
                         - Write up the report (cont.)
W5 to W14    10 weeks    - Implement the Differential Privacy algorithms on time-series data
                         - Compare those algorithms using data utility metrics
                         - Find the data characteristics that determine the best algorithm
                         - Write up the report (cont.)
W15 to W16   2 weeks     - Finalize the solution package
                         - Finalize the document
                         - Prepare the presentation

Thesis Submission – June 2023

III. THESIS START DAY: (According to the decision on assignment of Master’s thesis) 05 September 2022

IV. THESIS COMPLETION DAY: (According to the decision on the assignment of the Master’s thesis) 12 June 2023

V. SUPERVISORS: Assoc. Prof. DANG TRAN KHANH (PGS. TS ĐẶNG TRẦN KHÁNH) and PhD. LE LAM SON (TS. LÊ LAM SƠN)

Ho Chi Minh City, 09 June 2023

SUPERVISOR 1                       SUPERVISOR 2
(Full name and signature)          (Full name and signature)
Assoc. Prof. DANG TRAN KHANH       PhD. LE LAM SON

CHAIR OF PROGRAM COMMITTEE
(Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)

Note: The student must attach this task sheet as the first page of the Master’s Thesis booklet.

ACKNOWLEDGEMENT

I would like to express my profound gratitude to all those who have supported me throughout the journey of completing this master's thesis.

First and foremost, I extend my heartfelt thanks to my supervisors, Assoc. Prof.
DANG TRAN KHANH and PhD. LE LAM SON, for your invaluable guidance, patience, and expertise. Working under your supervision has been an honor, and your insightful feedback and encouragement have been instrumental in shaping the outcome of this research.

I would also like to extend my sincere appreciation to the team at Unilever Vietnam, particularly Mr. ERIC FRANCIS CHEN (Head of UniOps and Data & Analytics SEA&I), Mr. IAN LOW (my line manager), and the awesome Unilever Data & Analytics Vietnam team. Your willingness to assist and your insightful discussions have contributed significantly to the successful completion of this thesis.

Furthermore, I want to acknowledge my friends, classmates, doctors, and psychologists for your mental support and contributions. Your motivation, discussions, and treatment, offered from diverse perspectives, have helped me to live positively. Special appreciation goes to Ms. HOA NINH, Ms. LINH PHAM, Ms. XUAN NGUYEN, and Mr. RYAN NGUYEN for your huge support and encouragement throughout this journey.

Lastly, I am indebted to my family, my mom THAI LINH PHAM and my sister MARY PHUC LAI, for your unconditional love and constant encouragement. Your sacrifices and understanding have been the bedrock of my achievements. I am profoundly grateful for your presence and support.

To all those mentioned above, as well as those who have contributed in immeasurable ways, I offer my sincerest thanks. Your efforts, belief, and contributions have made this thesis possible.

ABSTRACT

This thesis explores the crucial domain of data privacy, encompassing the rights of individuals to maintain control over the collection, usage, sharing, and storage of their personal data.
Within the realm of personal data, time-series data poses distinct challenges and sensitivities when it comes to privacy protection. Time-series data comprises information with temporal attributes that can unveil patterns, trends, and behaviors of individuals or groups over time, and carries inherent risks in terms of privacy breaches. The primary objectives of this thesis are as follows: first, to review traditional methods of privacy-preserving data publishing, with a specific focus on efforts made for protecting time-series data; second, to gain a comprehensive understanding of the theories and principles of Differential Privacy, a promising approach for privacy preservation; third, to explore notable mechanisms within Differential Privacy that are applicable to time-series data; fourth, to investigate and address privacy challenges in data partnerships through the integration of Differential Privacy and other relevant techniques; and finally, to develop a process for the application of privacy techniques within the context of business collaborations. The contribution of this thesis is twofold.
Firstly, it aims to make the concept of Differential Privacy more accessible and comprehensible, particularly for non-academic and corporate audiences who may not have a deep technical background. By presenting Differential Privacy in a clear and straightforward manner, this research facilitates its adoption and implementation in real-world scenarios. Secondly, this thesis proposes a guideline for the practical application and evaluation of Differential Privacy on time-series data, specifically within the context of data collaboration among multiple parties. The guideline serves as a valuable resource for organizations seeking to protect privacy while engaging in collaborative data initiatives.

THESIS SUMMARY (TÓM TẮT LUẬN VĂN)

This thesis studies the important field of data privacy, which covers the right of individuals to keep control over the collection, use, sharing, and storage of their personal data. Within personal data, time-series data poses its own challenges and sensitivities for privacy protection. Time-series data contains information with temporal attributes that can reveal the patterns, trends, and behaviors of individuals or groups over time, and it carries risks of privacy violations.

The main objectives of this thesis are: first, to review traditional methods of privacy-preserving data publishing, with a particular focus on efforts to protect time-series data; second, to understand the theories and principles of Differential Privacy, a promising approach to privacy protection; third, to explore notable Differential Privacy mechanisms that can be applied to time-series data; fourth, to investigate and address privacy challenges in data partnerships by integrating Differential Privacy with other related techniques; and finally, to develop a process for applying privacy-protection techniques in the context of business collaboration.

The contribution of this thesis is twofold. First, it aims to make the concept of Differential Privacy accessible and easy to understand, especially for non-academic and corporate audiences who may not have a deep technical background. By presenting Differential Privacy clearly and simply, this research supports its adoption and implementation in real-world situations. Second, this thesis proposes a guideline for the practical application and evaluation of Differential Privacy on time-series data, particularly in the context of data collaboration among multiple parties.

THE COMMITMENT OF THESIS’ AUTHOR

I hereby declare that this master's thesis is my own original work and has not been submitted before to any institution for assessment purposes. Further, I have acknowledged all sources used and have cited these in the reference section.

………………………
LAI TRUNG MINH DUC
Date

TABLE OF CONTENTS

CHAPTER 1: OVERVIEW OF THE THESIS
    Background and Context
    Data Publishing and Privacy Preserving Data Publishing
    Challenges of Privacy Preserving Data Publishing (PPDP) for Time-series data
    4. Differential Privacy as a powerful player

CHAPTER 2: PRIVACY MODELS RESEARCH
    Attack models and notable privacy models
    Record linkage attack and k-Anonymity privacy model
    Attribute linkage attack and l-diversity and t-closeness privacy model
    Table linkage and δ-presence privacy model
    Probabilistic linkage and Differential Privacy model

CHAPTER 3: THE INVESTIGATION ON DIFFERENTIAL PRIVACY
    The need for Differential Privacy principle
    No need to model the attack model in detail
    Quantifiable privacy loss
    Multiple mechanisms composition
    The promise (and not promised) of Differential Privacy
    The not promise
    Formal definition of Differential Privacy
    Terms and notations
    Important concepts of Differential Privacy
    Foundation mechanisms of Differential Privacy
    Local Differential Privacy and Global Differential Privacy
    Notable mechanisms for Time-series data
    Discrete Fourier Transform (DFT) with Laplace mechanism (FPA – Fourier Perturbation Algorithm)
    Temporal perturbation mechanism
    STL-DP – Perturbed time-series by applying DFT with Laplace mechanism on trends and seasonality

CHAPTER 4: EXPERIMENT DESIGNS
    Data structure aligns with data provider
    Concerns and constraints
    Revisit the GDPR related terms for data sharing
    Potential attack models and countermeasures
    Define scope of work
    Privacy protection proposal

CHAPTER 5: EXPERIMENT IMPLEMENTATIONS
    Data exploration analysis (EDA)
    Maximum data domain estimation
    Differential Privacy mechanisms implementation
    Data perturbed evaluation
    RFM Analysis for dataset
    Forecasting trendline at categories, consumer-groups, and store level
    Recommendation for using Differential Privacy in data partnership use-cases

CHAPTER 6: CONCLUSION AND FUTURE WORKS

Table: Descriptive Statistics table of the data perturbation output
Table: Accuracy result of RFM analysis between data perturbations and original data
Table: Accuracy (RMSE) result from the forecast of data perturbations and original data
Table: Accuracy (RMSE) result of the adjusted forecast version

TABLE OF TABLES

Table 1. Original Shopping data table structure
Shopping Data table structure for the use-case
Data structure of the synthesis dataset
Maximum domain value estimation for each Category
Descriptive Statistics table of the data perturbation output (detail table in the Appendix)
Accuracy result of RFM analysis between data perturbations and original data (detail in Appendix)
Data Utility and Data Privacy improving ratio

TABLE OF FIGURES

Figure 1. A fictitious database with hospital data, taken from [14]
Visualize how Sequential and Parallel Composition works, taken from [25]
Visualize how Local Privacy and Global Privacy works, taken from [25]
Figure 4. The Laplace mechanism with multiple scales
The visualization process of LPA and DFT (or FPA), taken from [19]