Khung Tạo Dữ Liệu Tổng Hợp: Lý Thuyết, Kỹ Thuật và Ứng Dụng của Joseph E. Hoag

Luận án tiến sĩ của Joe Hoag năm 2007 về lý thuyết, kỹ thuật và ứng dụng của việc sinh dữ liệu tổng hợp. Khám phá các phương pháp tiên tiến.

Trường đại học

University of Arkansas

Chuyên ngành

Computer Science

Người đăng

Ẩn danh

Thể loại

dissertation

2007

361

Phí lưu trữ

75 Point

Mục lục chi tiết

1. Introduction

1.1. Commercial Data Generators

1.2. Synthetic Data Generation Research

2. SDDL: A Language for Describing Synthetic Data Sets

2.1. Plug-in Functions

2.2. Field/Variable Elements

2.3. Min/Max Generation Constraint

2.4. Dist (Distribution) Generation Constraint

3. SDG Software Architecture

3.1. Top-Level Classes

3.2. Tables and Fields

3.3. Pools and PoolChoices

4. Parallel Data Generation

4.1. Parallel Generation Algorithms

5. Descriptive Adequacy of SDDL

5.1. Single-Attribute Keys

5.2. Multi-Attribute Keys

5.3. Entity Integrity Constraints

6. Random but Deterministic

6.1. Definitions of Terms

6.2. A Discussion of Streams

7. Generating Obfuscated Data

7.1. A Spectrum for Data Obfuscation

7.2. Other Types of Obfuscation

7.3. SDDL Support for Schema and Cardinality Obfuscation

8. Application: Generating Ten Years of Store/Item/Sales Data

8.1. Schema for Sales Table

8.2. Enforcing Constraints with SDDL

9. Application: Generating The Semantically Complex Hallux Application

9.1. Description of Hallux Database

9.2. Logical Challenges Encountered

9.3. Generating the Data

10. Application: Generating Legal Strings for Context-Free Languages

10.1. Definition of a Context-Free Grammar

11. Application: Generating a Mailing List

11.1. Auxiliary Pool Files

11.2. Generating a Realistic Population

11.3. Generating a Master Population List

11.4. Generating a Realistic Mailing List

A SDDL Document Type Definition (DTD)

B Formula Grammar and Semantics

B.1. Formula Expression Grammar

B.2. Formula Expression Semantics

C Details of Hallux Problem and Solution

LIST OF FIGURES

Tóm tắt

I. Tổng quan về Khung Tạo Dữ Liệu Tổng Hợp Lý Thuyết và Ứng Dụng

Khung tạo dữ liệu tổng hợp đang trở thành một công cụ quan trọng trong lĩnh vực công nghệ thông tin. Với sự gia tăng nhanh chóng của các ứng dụng doanh nghiệp, nhu cầu về dữ liệu tổng hợp ngày càng cao. Khung này không chỉ giúp tạo ra dữ liệu mà còn hỗ trợ trong việc kiểm tra và phát triển ứng dụng. Việc sử dụng dữ liệu tổng hợp cho phép các nhà phát triển có thể thử nghiệm mà không cần đến dữ liệu thực tế, từ đó giảm thiểu rủi ro về bảo mật.

1.1. Khung Tạo Dữ Liệu Tổng Hợp Định Nghĩa và Vai Trò

Khung tạo dữ liệu tổng hợp là một hệ thống cho phép mô tả và tạo ra các tập dữ liệu phức tạp. Nó đóng vai trò quan trọng trong việc cung cấp dữ liệu cho các ứng dụng mà không cần đến thông tin nhạy cảm.

1.2. Lý Do Cần Thiết Khung Tạo Dữ Liệu Tổng Hợp

Nhu cầu về dữ liệu tổng hợp gia tăng do các ứng dụng yêu cầu dữ liệu lớn và phức tạp. Khung này giúp tiết kiệm thời gian và chi phí trong việc phát triển ứng dụng.

II. Vấn đề và Thách thức trong Tạo Dữ Liệu Tổng Hợp

Mặc dù khung tạo dữ liệu tổng hợp mang lại nhiều lợi ích, nhưng vẫn tồn tại một số thách thức. Việc tạo ra dữ liệu có tính chính xác và thực tế là một trong những vấn đề lớn nhất. Ngoài ra, việc đảm bảo tính bảo mật và quyền riêng tư cũng là một thách thức không nhỏ.

2.1. Thách Thức Về Tính Chính Xác của Dữ Liệu

Dữ liệu tổng hợp cần phải phản ánh đúng các đặc điểm của dữ liệu thực tế để có thể sử dụng hiệu quả trong các ứng dụng. Việc này đòi hỏi các kỹ thuật phức tạp để đảm bảo tính chính xác.

2.2. Vấn Đề Bảo Mật và Quyền Riêng Tư

Khi tạo ra dữ liệu tổng hợp, cần phải đảm bảo rằng không có thông tin nhạy cảm nào bị lộ ra. Điều này đòi hỏi các biện pháp bảo mật nghiêm ngặt trong quá trình tạo dữ liệu.

III. Phương Pháp Tạo Dữ Liệu Tổng Hợp Hiệu Quả

Có nhiều phương pháp để tạo dữ liệu tổng hợp, bao gồm việc sử dụng ngôn ngữ mô tả dữ liệu và các thuật toán tạo dữ liệu. Những phương pháp này giúp tối ưu hóa quy trình tạo dữ liệu và đảm bảo tính chính xác.

3.1. Ngôn Ngữ Mô Tả Dữ Liệu Tổng Hợp SDDL

Ngôn ngữ SDDL cho phép mô tả các tập dữ liệu phức tạp một cách dễ dàng. Nó cung cấp các công cụ để xác định các ràng buộc và đặc điểm của dữ liệu.

3.2. Thuật Toán Tạo Dữ Liệu Song Song

Sử dụng các thuật toán song song giúp tăng tốc độ tạo dữ liệu, cho phép tạo ra các tập dữ liệu lớn trong thời gian ngắn hơn.

IV. Ứng Dụng Thực Tiễn của Khung Tạo Dữ Liệu Tổng Hợp

Khung tạo dữ liệu tổng hợp có nhiều ứng dụng thực tiễn trong các lĩnh vực như kiểm thử phần mềm, phát triển ứng dụng và nghiên cứu dữ liệu. Việc sử dụng dữ liệu tổng hợp giúp tiết kiệm thời gian và chi phí cho các doanh nghiệp.

4.1. Kiểm Thử Phần Mềm và Dữ Liệu Tổng Hợp

Dữ liệu tổng hợp cho phép các nhà phát triển kiểm thử phần mềm mà không cần đến dữ liệu thực tế, từ đó giảm thiểu rủi ro và chi phí.

4.2. Phát Triển Ứng Dụng với Dữ Liệu Tổng Hợp

Các ứng dụng có thể được phát triển và thử nghiệm trên dữ liệu tổng hợp, giúp đảm bảo tính khả thi trước khi triển khai thực tế.

V. Kết Luận và Tương Lai của Khung Tạo Dữ Liệu Tổng Hợp

Khung tạo dữ liệu tổng hợp đang mở ra nhiều cơ hội mới trong lĩnh vực công nghệ thông tin. Với sự phát triển không ngừng của công nghệ, tương lai của khung này hứa hẹn sẽ mang lại nhiều cải tiến và ứng dụng mới.

5.1. Tương Lai của Công Nghệ Tạo Dữ Liệu Tổng Hợp

Công nghệ tạo dữ liệu tổng hợp sẽ tiếp tục phát triển, với nhiều cải tiến về tốc độ và tính chính xác, đáp ứng nhu cầu ngày càng cao của thị trường.

5.2. Ảnh Hưởng Đến Ngành Công Nghiệp

Khung tạo dữ liệu tổng hợp có thể thay đổi cách thức mà các doanh nghiệp phát triển và kiểm thử ứng dụng, từ đó tạo ra những bước tiến mới trong ngành công nghiệp.

25/07/2025

Bạn đang xem trước tài liệu:

2007 12 phd joe hoag synthetic data generation theory techniques and applications dissertation

Tải đầy đủ

Trích đoạn nội dung tài liệu

SYNTHETIC D ATA GENERATION: THEORY, TECHNIQUES AND APPLICATIONS SYNTHETIC D ATA GENERATION: THEORY, TECHNIQUES AND APPLICATIONS A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosphy in Computer Science By Joseph E. in Computer Science University of California at Davis, 1991 December 2007 University of Arkansas ABSTRACT The need for synthetically generated data is growing rapidly as the size of enterprise applications increases. Situations requiring this technology include regression testing of database applications, data mining applications, and the need to supply “realistic but not real” data for third party application development. The common approach today to supplying this need involves the manual creation of special-purpose data generators for specific data sets.

This dissertation describes a general purpose synthetic data generation framework. Such a framework significantly speeds up the process of describing and generating synthetic data. The framework includes a language called SDDL that is capable of describing complex data sets and a generation engine called SDG which supports parallel data generation. Related theory in the areas of the relational model, E-R diagrams, randomness and data obfuscation is explored.

Finally, the power and flexibility of the SDG/SDDL framework are demonstrated by applying the framework to a collection of applications. This dissertation is approved for recommendation to the Graduate Council. Dissertation Director: ____________________________________ Craig W. Thompson Dissertation Committee: ____________________________________ Wing-Ning Li ____________________________________ Brajendra Panda ____________________________________ Bill C.

Hardgrave ©2007 by Joseph E. Hoag All Rights Reserved DISSERTATION DUPLICATION RELEASE I hereby authorize the University of Arkansas Libraries to duplicate this dissertation when needed for research and/or scholarship. Agreed ________________________________________ Refused ________________________________________ ACKNOWLEDGEMENTS I thank my advisor, Dr. Craig Thompson, for his guidance and for his efforts to help me finish this dissertation.

I also thank the remainder of my dissertation committee, Dr. Wing-Ning Li, Dr. Brajendra Panda and Dr. Bill Hardgrave, for their advice and guidance.

I also thank my lovely wife, Solveig, for the sacrifices that she made so that I could return to school and complete this degree. vi TABLE OF CONTENTS 1 Introduction .1 Commercial Data Generators .2 Synthetic Data Generation Research.3 Data Inference and Privacy Issues .6 Organization of this Dissertation. 10 2 SDDL: A Language for Describing Synthetic Data Sets .2 Plug-in Functions .5 Field/Variable Elements .2 Min/Max Generation Constraint .3 Dist (Distribution) Generation Constraint .4 Cardinality in Iteration Elements. 41 3 SDG Software Architecture .1 Top-Level Classes .2 Tables and Fields .3 Pools and PoolChoices .1 Simple Generation Classes .2 Complex Generation Classes .1 Distributions and Tiers.3 Parser-related Classes .3 Pool Reference Processing .4 Random Number Generation Classes.6 Plugin Function Interface .7 The SaxDBUnmarshaller Class .8 The Globals Class .9 The Main Class .10 Selected Sequence Diagrams .2 Parsing a Pool Element .3 Parsing a Table Element .4 Parsing a Formula Field.

89 4 Parallel Data Generation .1 Parallel Generation Algorithms .2 Iteration Variables and Parallelism .103 5 Descriptive Adequacy of SDDL .1 Single-Attribute Keys .2 Multi-Attribute Keys .3 Entity Integrity Constraints .3 Referential Integrity Constraints.1 One-to-One Relationships .1 LHS partially participating, RHS partially participating .2 LHS fully participating, RHS partially participating .3 LHS partially participating, RHS fully participating .4 LHS fully participating, RHS fully participating .2 One-to-Many Relationships .1 LHS partially participating, RHS partially participating .2 LHS fully participating, RHS partially participating .3 LHS partially participating, RHS fully participating .4 LHS fully participating, RHS fully participating .3 Many-to-Many Relationships .1 LHS partially participating, RHS partially participating .2 Guaranteeing Full Participation in a Many-to-Many Relationship .5 Weak Entity Support .6 A Word on Circular Data Dependencies .147 6 Random but Deterministic .1 Definitions of Terms .2 A Discussion of Streams .1 The Linear Streaming Mechanism (LSM).2 CMRG Streaming Mechanism (CSM) .2 A Test for Randomness .3 Randomness Test Results.1 Producing the Random Number Streams .4 The Test Code .2 Performance of Linear Streaming Mechanism .3 Performance of CMRG Streaming Mechanism .164 7 Generating Obfuscated Data.1 A Spectrum for Data Obfuscation .1 Partial Knowledge Reversible Obfuscation (Level 1) .2 Process Reversible Obfuscation (Level 2).3 Combination Reversible Obfuscation (Level 3) .1 Statistically Accurate or Dependency Preserving Irreversible Obfuscation (Level 4) .2 Domain Preserving Irreversible Obfuscation (Level 5).3 Opaque Irreversible Obfuscation (Level 6) .2 Other Types of Obfuscation .2 Generic Attribute Naming .3 SDDL Support for Schema and Cardinality Obfuscation.3 A Warning about Data Inference .192 8 Application: Generating Ten Years of Store/Item/Sales Data.1 Schema for Sales Table .3 Enforcing Constraints with SDDL .1 The item_dist Pool .2 The Sales Table .1 HP Enterprise Technology Center (ETC), Atlanta, GA .2 HP Performance Center, Cupertino, CA .205 9 Application: Generating The Semantically Complex Hallux Application .1 Description of Hallux Database .2 Logical Challenges Encountered .1 Order of Table Generation .2 Band/Album/Song Naming .4 Differences in SQL Interpreters .5 Using Temporary Attributes .6 Ensuring Uniqueness in Key Values .7 Using Scalable SQL Operations .3 Generating the Data .225 10 Application: Generating Legal Strings for Context-Free Languages .1 Definition of a Context-Free Grammar .3 Representing CFGs as Pools .4 SDDL Generation Code for CFGs.5 Example: Strings with an Even Number of 0s and 1s .7 Example: Mathematical Expressions .8 An Alternative Method for Generating Strings .247 11 Application: Generating a Mailing List .1 Auxiliary Pool Files .2 Generating a Realistic Population .3 Generating a Master Population List .4 Generating a Realistic Mailing List .1 Veracity of Thesis Statement .2 Real-World Impact of Synthetic Data Generation Technology .279 A SDDL Document Type Definition (DTD) .284 xiv B Formula Grammar and Semantics .1 Formula Expression Grammar .2 Formula Expression Semantics .291 C Details of Hallux Problem and Solution .337 xix LIST OF FIGURES Figure 2-4: SDDL Element Hierarchy. 15 Figure 2-5: Sample Directed Graph. 39 Figure 3-1: Database Class. 44 Figure 3-2: Table and Field Classes.

47 Figure 3-3: Pool and PoolChoice Classes. 49 Figure 3-4: Constant Class. 50 Figure 3-5: Simple Generation Classes. 52 Figure 3-6: Distribution, Formula and QueryPool Classes.

58 Figure 3-7: Iteration Classes. 63 Figure 3-10: Parser-related Classes. 66 Figure 3-11: Parse Tree for Formula F1*10+4. 68 Figure 3-12: Parse Tree for Query "SELECT name, salary FROM employees WHERE age = [F2] ORDER BY salary".

69 Figure 3-13: Parse Tree for Formula 0+states[state]. 70 Figure 3-14: Parse tree for formula states[“New”+” “+”York”]+5. 71 Figure 3-15: Random Number Generation Classes. 73 Figure 3-16: Environment Classes.

75 Figure 3-18: SaxDBUnmarshaller Class. 77 xx Figure 3-19: Globals Class. 78 Figure 3-20: Main Class. 80 Figure 3-21: Program Execution Sequence Diagram.

83 Figure 3-22: Sequence Diagram for Parsing a Pool. 84 Figure 3-23: Sequence Diagram for Parsing a Table. 86 Figure 3-24: Sequence Diagram for Parsing a Formula. 89 Figure 3-25: Sequence Diagram for Generating Data.

90 Figure 4-1: Parallel Data Generation. 93 Figure 4-2: Map modeled by "map" Pool .105 Figure 5-1: E-R Diagram for "Sells In" Relation .121 Figure 5-2: Weak Entity Example.143 Figure 5-3: Weak Entity Specified Through Cardinality Constraints .143 Figure 5-4: Circular Dependencies in TPC-H Tables .145 Figure 6-1: Badly chosen stream starting points .152 Figure 6-2: Optimally chosen starting points for 4 streams .153 Figure 6-3: Hyperplaning in 2-dimensional space .155 Figure 9-1: Hallux Database E-R Diagram .208 Figure 10-1: "Even Number of Zeroes and Even Number of Ones" DFA .1 Problem There is an increasing need for synthetic data in the IT industry. Applications that benefit from synthetic data include (but are not limited to):  Regression testing. Repeatedly generate the same large data set for testing enterprise applications.

A synthetic data generator allows one to “grow” exactly the same large data set repeatedly when needed, rather than storing it between tests. This is especially important for data sets in the giga- or terabyte range.  Secure application development. Develop enterprise applications using generated data that is realistic but not real.

A synthetic data generator can produce a “sanitized” copy of a table, which retains many of the original table’s data characteristics but does not contain personally identifiable information. Third parties can develop software and test it on realistic data sets without exposing real data.  Try-before-you-buy. A vendor wants to demonstrate an enterprise application to a potential customer.

For instance, Oracle wants to demonstrate Oracle 10g (grid-based Oracle) to a customer who has not yet moved to a grid. The customer does not want to commit to buying a grid and 10g until they can see how their applications would run in that environment.  Testing of data mining applications. Generate data sets with known characteristics to gauge whether or not data mining tools can discover those characteristics.

Of course, one can always use a general-purpose programming language like C++ or Java as a generator to construct a special-purpose data set or alternately can use SQL to extract scrubbed subsets from a database management system. But, can a general- purpose description language be developed that contains powerful description and constraint mechanisms? And can a generation engine be developed that could read descriptions in such a language and produce data sets from them? Finally, can this generation engine produce data in a timely fashion? 1.2 Thesis Statement A synthetic data generator can be constructed to represent and generate a wide variety of simple-to-complex structured data sets including not only traditional tabular but also grammar and graph-based data sets. Data generation constraints can be captured in a well-defined description language. The synthetic data generator can be designed to execute efficiently and run in parallel to generate very large data sets.

Synthetic data generation has theoretical underpinnings in the areas of random number theory, database representational concepts and data obfuscation.3 Approach An overview of the research methodology follows:  Identify related work in synthetic data generation along with its limitations. 2  Develop a language for describing and constraining synthetic data. This language will be called Synthetic Data Description Language (SDDL).  Develop an engine that reads SDDL files as input and produces the described data sets as output.

This engine needs to be able to run in parallel, to take advantage of the parallelism available in modern cluster computing technology and produce data sets in a timely fashion. Also, the engine needs to be deterministic in that it will produce the same output (given the same input) every time regardless of the degree of parallelism employed during generation.  Investigate theoretical foundations of synthetic data generation.  Show the flexibility and power of the synthetic data generation framework by applying with clear benefit to several application problems.4 Potential Impact High-level programming languages like C/C++, C# and Java provide useful abstractions for software developers, and result in a significant time savings over programming in assembly language.

Likewise, a synthetic data description language could provide nice abstractions for those responsible for generating synthetic data sets, saving them the time associated with writing an application-specific data generator “from scratch.” To achieve maximal impact, a synthetic data generation framework should have the following properties:  Descriptiveness and Realism. The description language should provide mechanisms sufficient to model a wide range of data sets. Also, a 3 description language should allow for various levels of obfuscation of an existing table. The degree of descriptiveness associated with the description language directly affects the degree of realism possible in the generated data.

The data set generated from an input file should be the same from run to run, regardless of the platform or the degree of parallelism used during data generation. This property is necessary for supporting regression testing applications. Speed is always associated with convenience. To achieve maximum speed, a generation engine should support parallel data generation.

The user should be able to extend existing builtin or pre- existing libraries with new data dictionaries, operations, and table descriptions. The user should understand how to use the constructs of the generator to specify potentially complex, large datasets faster using the generator than if they accomplish the same thing with some other manual method.

Nội dung được bảo vệ bản quyền — Tải xuống đầy đủ