SYNTHETIC D ATA GENERATION: THEORY, TECHNIQUES AND APPLICATIONS SYNTHETIC D ATA GENERATION: THEORY, TECHNIQUES AND APPLICATIONS A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosphy in Computer Science By Joseph E. in Computer Science University of California at Davis, 1991 December 2007 University of Arkansas ABSTRACT The need for synthetically generated data is growing rapidly as the size of enterprise applications increases. Situations requiring this technology include regression testing of database applications, data mining applications, and the need to supply “realistic but not real” data for third party application development. The common approach today to supplying this need involves the manual creation of special-purpose data generators for specific data sets.
This dissertation describes a general purpose synthetic data generation framework. Such a framework significantly speeds up the process of describing and generating synthetic data. The framework includes a language called SDDL that is capable of describing complex data sets and a generation engine called SDG which supports parallel data generation. Related theory in the areas of the relational model, E-R diagrams, randomness and data obfuscation is explored.
Finally, the power and flexibility of the SDG/SDDL framework are demonstrated by applying the framework to a collection of applications. This dissertation is approved for recommendation to the Graduate Council. Dissertation Director: ____________________________________ Craig W. Thompson Dissertation Committee: ____________________________________ Wing-Ning Li ____________________________________ Brajendra Panda ____________________________________ Bill C.
Hardgrave ©2007 by Joseph E. Hoag All Rights Reserved DISSERTATION DUPLICATION RELEASE I hereby authorize the University of Arkansas Libraries to duplicate this dissertation when needed for research and/or scholarship. Agreed ________________________________________ Refused ________________________________________ ACKNOWLEDGEMENTS I thank my advisor, Dr. Craig Thompson, for his guidance and for his efforts to help me finish this dissertation.
I also thank the remainder of my dissertation committee, Dr. Wing-Ning Li, Dr. Brajendra Panda and Dr. Bill Hardgrave, for their advice and guidance.
I also thank my lovely wife, Solveig, for the sacrifices that she made so that I could return to school and complete this degree. vi TABLE OF CONTENTS 1 Introduction .1 Commercial Data Generators .2 Synthetic Data Generation Research.3 Data Inference and Privacy Issues .6 Organization of this Dissertation. 10 2 SDDL: A Language for Describing Synthetic Data Sets .2 Plug-in Functions .5 Field/Variable Elements .2 Min/Max Generation Constraint .3 Dist (Distribution) Generation Constraint .4 Cardinality in Iteration Elements. 41 3 SDG Software Architecture .1 Top-Level Classes .2 Tables and Fields .3 Pools and PoolChoices .1 Simple Generation Classes .2 Complex Generation Classes .1 Distributions and Tiers.3 Parser-related Classes .3 Pool Reference Processing .4 Random Number Generation Classes.6 Plugin Function Interface .7 The SaxDBUnmarshaller Class .8 The Globals Class .9 The Main Class .10 Selected Sequence Diagrams .2 Parsing a Pool Element .3 Parsing a Table Element .4 Parsing a Formula Field.
89 4 Parallel Data Generation .1 Parallel Generation Algorithms .2 Iteration Variables and Parallelism .103 5 Descriptive Adequacy of SDDL .1 Single-Attribute Keys .2 Multi-Attribute Keys .3 Entity Integrity Constraints .3 Referential Integrity Constraints.1 One-to-One Relationships .1 LHS partially participating, RHS partially participating .2 LHS fully participating, RHS partially participating .3 LHS partially participating, RHS fully participating .4 LHS fully participating, RHS fully participating .2 One-to-Many Relationships .1 LHS partially participating, RHS partially participating .2 LHS fully participating, RHS partially participating .3 LHS partially participating, RHS fully participating .4 LHS fully participating, RHS fully participating .3 Many-to-Many Relationships .1 LHS partially participating, RHS partially participating .2 Guaranteeing Full Participation in a Many-to-Many Relationship .5 Weak Entity Support .6 A Word on Circular Data Dependencies .147 6 Random but Deterministic .1 Definitions of Terms .2 A Discussion of Streams .1 The Linear Streaming Mechanism (LSM).2 CMRG Streaming Mechanism (CSM) .2 A Test for Randomness .3 Randomness Test Results.1 Producing the Random Number Streams .4 The Test Code .2 Performance of Linear Streaming Mechanism .3 Performance of CMRG Streaming Mechanism .164 7 Generating Obfuscated Data.1 A Spectrum for Data Obfuscation .1 Partial Knowledge Reversible Obfuscation (Level 1) .2 Process Reversible Obfuscation (Level 2).3 Combination Reversible Obfuscation (Level 3) .1 Statistically Accurate or Dependency Preserving Irreversible Obfuscation (Level 4) .2 Domain Preserving Irreversible Obfuscation (Level 5).3 Opaque Irreversible Obfuscation (Level 6) .2 Other Types of Obfuscation .2 Generic Attribute Naming .3 SDDL Support for Schema and Cardinality Obfuscation.3 A Warning about Data Inference .192 8 Application: Generating Ten Years of Store/Item/Sales Data.1 Schema for Sales Table .3 Enforcing Constraints with SDDL .1 The item_dist Pool .2 The Sales Table .1 HP Enterprise Technology Center (ETC), Atlanta, GA .2 HP Performance Center, Cupertino, CA .205 9 Application: Generating The Semantically Complex Hallux Application .1 Description of Hallux Database .2 Logical Challenges Encountered .1 Order of Table Generation .2 Band/Album/Song Naming .4 Differences in SQL Interpreters .5 Using Temporary Attributes .6 Ensuring Uniqueness in Key Values .7 Using Scalable SQL Operations .3 Generating the Data .225 10 Application: Generating Legal Strings for Context-Free Languages .1 Definition of a Context-Free Grammar .3 Representing CFGs as Pools .4 SDDL Generation Code for CFGs.5 Example: Strings with an Even Number of 0s and 1s .7 Example: Mathematical Expressions .8 An Alternative Method for Generating Strings .247 11 Application: Generating a Mailing List .1 Auxiliary Pool Files .2 Generating a Realistic Population .3 Generating a Master Population List .4 Generating a Realistic Mailing List .1 Veracity of Thesis Statement .2 Real-World Impact of Synthetic Data Generation Technology .279 A SDDL Document Type Definition (DTD) .284 xiv B Formula Grammar and Semantics .1 Formula Expression Grammar .2 Formula Expression Semantics .291 C Details of Hallux Problem and Solution .337 xix LIST OF FIGURES Figure 2-4: SDDL Element Hierarchy. 15 Figure 2-5: Sample Directed Graph. 39 Figure 3-1: Database Class. 44 Figure 3-2: Table and Field Classes.
47 Figure 3-3: Pool and PoolChoice Classes. 49 Figure 3-4: Constant Class. 50 Figure 3-5: Simple Generation Classes. 52 Figure 3-6: Distribution, Formula and QueryPool Classes.
58 Figure 3-7: Iteration Classes. 63 Figure 3-10: Parser-related Classes. 66 Figure 3-11: Parse Tree for Formula F1*10+4. 68 Figure 3-12: Parse Tree for Query "SELECT name, salary FROM employees WHERE age = [F2] ORDER BY salary".
69 Figure 3-13: Parse Tree for Formula 0+states[state]. 70 Figure 3-14: Parse tree for formula states[“New”+” “+”York”]+5. 71 Figure 3-15: Random Number Generation Classes. 73 Figure 3-16: Environment Classes.
75 Figure 3-18: SaxDBUnmarshaller Class. 77 xx Figure 3-19: Globals Class. 78 Figure 3-20: Main Class. 80 Figure 3-21: Program Execution Sequence Diagram.
83 Figure 3-22: Sequence Diagram for Parsing a Pool. 84 Figure 3-23: Sequence Diagram for Parsing a Table. 86 Figure 3-24: Sequence Diagram for Parsing a Formula. 89 Figure 3-25: Sequence Diagram for Generating Data.
90 Figure 4-1: Parallel Data Generation. 93 Figure 4-2: Map modeled by "map" Pool .105 Figure 5-1: E-R Diagram for "Sells In" Relation .121 Figure 5-2: Weak Entity Example.143 Figure 5-3: Weak Entity Specified Through Cardinality Constraints .143 Figure 5-4: Circular Dependencies in TPC-H Tables .145 Figure 6-1: Badly chosen stream starting points .152 Figure 6-2: Optimally chosen starting points for 4 streams .153 Figure 6-3: Hyperplaning in 2-dimensional space .155 Figure 9-1: Hallux Database E-R Diagram .208 Figure 10-1: "Even Number of Zeroes and Even Number of Ones" DFA .1 Problem There is an increasing need for synthetic data in the IT industry. Applications that benefit from synthetic data include (but are not limited to): Regression testing. Repeatedly generate the same large data set for testing enterprise applications.
A synthetic data generator allows one to “grow” exactly the same large data set repeatedly when needed, rather than storing it between tests. This is especially important for data sets in the giga- or terabyte range. Secure application development. Develop enterprise applications using generated data that is realistic but not real.
A synthetic data generator can produce a “sanitized” copy of a table, which retains many of the original table’s data characteristics but does not contain personally identifiable information. Third parties can develop software and test it on realistic data sets without exposing real data. Try-before-you-buy. A vendor wants to demonstrate an enterprise application to a potential customer.
For instance, Oracle wants to demonstrate Oracle 10g (grid-based Oracle) to a customer who has not yet moved to a grid. The customer does not want to commit to buying a grid and 10g until they can see how their applications would run in that environment. Testing of data mining applications. Generate data sets with known characteristics to gauge whether or not data mining tools can discover those characteristics.
Of course, one can always use a general-purpose programming language like C++ or Java as a generator to construct a special-purpose data set or alternately can use SQL to extract scrubbed subsets from a database management system. But, can a general- purpose description language be developed that contains powerful description and constraint mechanisms? And can a generation engine be developed that could read descriptions in such a language and produce data sets from them? Finally, can this generation engine produce data in a timely fashion? 1.2 Thesis Statement A synthetic data generator can be constructed to represent and generate a wide variety of simple-to-complex structured data sets including not only traditional tabular but also grammar and graph-based data sets. Data generation constraints can be captured in a well-defined description language. The synthetic data generator can be designed to execute efficiently and run in parallel to generate very large data sets.
Synthetic data generation has theoretical underpinnings in the areas of random number theory, database representational concepts and data obfuscation.3 Approach An overview of the research methodology follows: Identify related work in synthetic data generation along with its limitations. 2 Develop a language for describing and constraining synthetic data. This language will be called Synthetic Data Description Language (SDDL). Develop an engine that reads SDDL files as input and produces the described data sets as output.
This engine needs to be able to run in parallel, to take advantage of the parallelism available in modern cluster computing technology and produce data sets in a timely fashion. Also, the engine needs to be deterministic in that it will produce the same output (given the same input) every time regardless of the degree of parallelism employed during generation. Investigate theoretical foundations of synthetic data generation. Show the flexibility and power of the synthetic data generation framework by applying with clear benefit to several application problems.4 Potential Impact High-level programming languages like C/C++, C# and Java provide useful abstractions for software developers, and result in a significant time savings over programming in assembly language.
Likewise, a synthetic data description language could provide nice abstractions for those responsible for generating synthetic data sets, saving them the time associated with writing an application-specific data generator “from scratch.” To achieve maximal impact, a synthetic data generation framework should have the following properties: Descriptiveness and Realism. The description language should provide mechanisms sufficient to model a wide range of data sets. Also, a 3 description language should allow for various levels of obfuscation of an existing table. The degree of descriptiveness associated with the description language directly affects the degree of realism possible in the generated data.
The data set generated from an input file should be the same from run to run, regardless of the platform or the degree of parallelism used during data generation. This property is necessary for supporting regression testing applications. Speed is always associated with convenience. To achieve maximum speed, a generation engine should support parallel data generation.
The user should be able to extend existing builtin or pre- existing libraries with new data dictionaries, operations, and table descriptions. The user should understand how to use the constructs of the generator to specify potentially complex, large datasets faster using the generator than if they accomplish the same thing with some other manual method.