Most organizations want to get value out of consumer data without risking customer privacy. Since synthetic data retains none of the personally identifiable information while maintaining the data's utility, it's an attractive, privacy-protecting alternative.
The team at Hazy has developed a hierarchical generative adversarial network (GAN) that overcomes the usual barriers to unlocking the benefits of synthetic data in a modular and scalable way.
The challenge of generating synthetic data from a relational database
Most organizations store their data in traditional relational database management systems (RDBMS). A key feature of these databases is referential integrity: elements in different tables are tightly related to each other through identifiers, i.e. foreign keys.
Referential integrity is a structural element that ensures that the relationships among different sets of data are accurate. In order for synthetic data — completely artificial and private production data — to be useful, it must maintain these relationships among the data, including preserving the data links that can stretch across multiple tables.
This makes generating synthetic data with referential integrity a real challenge, since the data is not presented in a standard tabular format where each row is independent of the others.
When generating synthetic data, we normally work at the finest level of granularity. For instance, order_id might represent a store managing orders, or person_id a population. However, when multiple tables are linked by foreign keys, different levels of granularity emerge and the concept of finest granularity becomes ambiguous.
Let's consider the concrete example of the Northwind database which refers to product orders. The schema, Table A, is presented below. Note that there are several one-to-many and many-to-one relations that would have to be preserved in any synthetic version.
The following Table B shows a trimmed snapshot of a partial extract from the database, flattened into CSV format using order_id as the finest granularity level. The table contains 812 unique orders, which translate into 2,155 rows and 31 columns:
Table B: trimmed snapshot of the flattened Northwind database.
| order_id | customer_name | unit_price | product_id | product_name | quantity |
|---|---|---|---|---|---|
| 10248 | Paul Henriot | 14 | 11 | Queso Cabrales | 5 |
| 10248 | Paul Henriot | 9.8 | 42 | Singaporean Hokkien Fried Mee | 20 |
| 10248 | Paul Henriot | 34.8 | 72 | Mozzarella di Giovanni | 14 |
| 10249 | Karin Josephs | 42.4 | 51 | Manjimup Dried Apples | 24 |
| 10250 | Mario Pontes | 7.7 | 41 | Jack's New England Clam Chowder | 19 |
| 10250 | Mario Pontes | 42.4 | 51 | Manjimup Dried Apples | 24 |
Note that there are several repetitions at the "parent level" of customer_name. Due to the prevalence of these one-to-many relationships, we took order_id as the finest granularity level.
Why will simply flattening the data not work?
The simple solution of left-joining the customer table with the order table, consolidating all the information, and then synthesising this big table will not work, for two reasons:
Each synthesised row would be a new customer and a new order. In reality, we have to fix the customer first and only then generate details at the order level: fix the parent, then generate the child. Otherwise we are unable to constrain the number of products in any order, which is a key feature of the data that needs to be preserved.
The assumption that rows are independent no longer holds for the flattened table. In the original customer table, each customer is independent of the others; however, the order details (the products bought) are not independent: a given product X may depend on the presence of other products in the basket, and some fields may even have temporal dependencies and should be treated as a sequence.
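To make the row-explosion concrete, here is a minimal sketch of the naive left join, using toy tables whose column names and values are illustrative stand-ins for the Northwind schema:

```python
import pandas as pd

# Toy parent table (orders with customer info) and child table (order
# details). Names and values are illustrative, not the real schema.
orders = pd.DataFrame({
    "order_id": [10248, 10249],
    "customer_name": ["Paul Henriot", "Karin Josephs"],
})
order_details = pd.DataFrame({
    "order_id": [10248, 10248, 10248, 10249],
    "product_name": ["Queso Cabrales", "Singaporean Hokkien Fried Mee",
                     "Mozzarella di Giovanni", "Manjimup Dried Apples"],
    "quantity": [12, 10, 5, 9],
})

# The left join flattens the one-to-many relation: every parent row is
# duplicated once per child row, so rows are no longer independent.
flat = orders.merge(order_details, on="order_id", how="left")
print(flat.shape)  # 2 orders explode into 4 rows of 4 columns
```

Synthesising `flat` row by row would treat every row as a fresh customer and order, which is exactly the failure mode described above.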
To address this particular case, let's consider two levels:
- Customer level (name, address, etc) - that we call the “parent level”
- Order details level (products, suppliers, etc) - that we call the “child level”
It is important to preserve this structure in the synthetic version, because otherwise there could be misalignments and information leaks, such as orders without customers or customers with unrealistic orders. This can be seen as a particular case of sequential data.
Some other examples where sequential data is common include:
- electronic health records (EHR) data - diagnostics, exams
- messages sent and received between two or many agents
- measurements of physical systems taken over time
- credit card transactions
Note that the finest grain may not be a sequence, but the key insight is that the data has a structure that must be preserved: rows are not independent.
Synthesising data at multiple granularity levels
The synthetic data that will populate the tables has to retain the properties of the original data. This is an additional challenge for the model, since we have to preserve referential integrity, meaning that the foreign key (the column or group of columns that provides the link between two tables) in Table A has to match the corresponding items in Table B (if the relation is one-to-many).
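A referential-integrity check of this kind is simple to state in code: every foreign key in a (synthetic) child table must exist as a primary key in the parent. The key values below are illustrative:

```python
# Minimal referential-integrity check: every foreign key in the child
# table must exist as a primary key in the parent table.
# The key values here are illustrative stand-ins for order_id.
parent_keys = {10248, 10249, 10250}        # primary keys in the parent table
child_fks = [10248, 10248, 10250, 10250]   # foreign keys in the child table

# Any child key with no matching parent breaks referential integrity.
orphans = [fk for fk in child_fks if fk not in parent_keys]
assert not orphans, f"child rows without a parent: {orphans}"
print("referential integrity holds")
```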
A possible solution to this problem is to synthesise data at multiple granularity levels:
- Use unsupervised machine learning to cluster data at parent level (customer).
- Synthesise this table, including the cluster identifier.
- Randomly assign a synthesised customer to a real order sequence.
- Finally synthesise the remaining variables (sequences at the child level) conditioned on the previous data.
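The first and third steps above can be sketched as follows. This is a toy illustration only: it uses a plain k-means loop over hypothetical numeric customer features, with a random stand-in where the GAN-synthesised customers would go.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric features for 100 customers (the parent level),
# e.g. order frequency and average basket value, already normalised.
customers = rng.normal(size=(100, 2))

def kmeans(X, k=3, iters=20):
    """Plain Lloyd's algorithm: returns a cluster label for each row."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Step 1: cluster the parent table; the cluster identifier would then
# be synthesised alongside the customer attributes (step 2).
cluster_id = kmeans(customers, k=3)

# Step 3 (sketch): pair each synthetic customer with a real order
# sequence drawn from the same cluster. rng.choice stands in for the
# GAN output of cluster identifiers.
occupied = np.unique(cluster_id)
synthetic_cluster = rng.choice(occupied, size=10)
pairing = [rng.choice(np.where(cluster_id == c)[0]) for c in synthetic_cluster]
print(len(pairing))  # one real order sequence per synthetic customer
```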
The problem with this solution is that it does not scale for very large databases.
Another solution is to use a technique developed by Hazy called the hierarchical GAN. GANs are well known for being highly successful at generating synthetic images, and at Hazy we've pioneered adapting the technique to tabular data with good success.
Here we present an extension of the original GAN architecture to be more modular and hierarchical. In its simplest form, it consists of two modules, each composed of a generator and a discriminator, coupled in tandem, as illustrated in the figure below.
Training of this architecture alternates between the two modules:
- Generator 1, trained against Discriminator 1, generates the common parent-level fields (customer name, location, etc).
- Generator 2 takes the output of Generator 1 plus noise and conditionally generates the order details (product, quantity, etc), using a long short-term memory network (LSTM) or convolutional neural network (CNN).
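The data flow through the two coupled modules can be sketched with a forward pass. This toy version uses single tanh layers with random weights in place of the trained generators, and a plain loop in place of the LSTM/CNN; all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(x, w, b):
    # Toy one-layer "network" standing in for a trained generator.
    return np.tanh(x @ w + b)

noise_dim, parent_dim, child_dim, seq_len = 8, 5, 4, 6

# Random weights; in the real model these are trained adversarially
# against Discriminator 1 and Discriminator 2 respectively.
W1, b1 = rng.normal(size=(noise_dim, parent_dim)), np.zeros(parent_dim)
W2, b2 = rng.normal(size=(parent_dim + noise_dim, child_dim)), np.zeros(child_dim)

# Module 1: Generator 1 maps noise to the parent-level fields
# (customer name embedding, location, etc).
z1 = rng.normal(size=(1, noise_dim))
parent = layer(z1, W1, b1)

# Module 2: Generator 2 conditions on the generated parent plus fresh
# noise to emit the child-level sequence, one step per order line.
# An LSTM or CNN would replace this loop in the real architecture.
children = []
for _ in range(seq_len):
    z2 = rng.normal(size=(1, noise_dim))
    children.append(layer(np.concatenate([parent, z2], axis=1), W2, b2))
child_seq = np.concatenate(children, axis=0)
print(parent.shape, child_seq.shape)  # (1, 5) (6, 4)
```

The important structural point is that every child row is generated conditioned on the same parent vector, so the synthetic order lines stay attached to one synthetic customer.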
The advantage of this method is that, unlike the previous one, it does not require clustering, and it scales to any number of hierarchical levels. The approach is also very flexible, since we can have one or many parent tables and as many children as necessary.
The first step in data processing is to remove all personally identifiable information — like names and addresses — and any free-text fields. We end up with the following attributes for the parent and child components:
| Attribute | Type | Range |
|---|---|---|
| freight | Numerical | [0.02 - 1007] |
| order_length | Numerical | [1 - 22] |
| unit_price | Numerical | [2 - 263.5] |
| quantity_per_unit | Numerical | [0 - 70] |
All data is converted into a one-hot encoding, including datetime and numerical types, which are first binned into 20 levels. One-hot encoding converts categorical variables (occupation, city names, etc) into a numeric format that can be ingested by machine learning algorithms.
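This preprocessing step can be sketched in a few lines of numpy. The column values below are hypothetical (loosely modelled on the freight range above), and equal-width binning is one plausible choice; the post does not specify the exact binning scheme:

```python
import numpy as np

# Hypothetical numeric column (loosely modelled on freight), binned
# into 20 levels and then one-hot encoded.
values = np.array([0.02, 3.5, 120.0, 480.9, 1007.0])
n_bins = 20

# Equal-width bins over the observed range (one possible scheme).
edges = np.linspace(values.min(), values.max(), n_bins + 1)
bin_idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)

# One-hot: one column per bin level, a single 1 per row.
one_hot = np.eye(n_bins, dtype=int)[bin_idx]
print(one_hot.shape)  # (5, 20)
```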
The final generated tabular data is converted back into tables to recreate the original database.
Training and Results
Machine learning training is performed using an optimised set of hyperparameters for the GAN architecture. Instead of alternating the training between Module 1 and 2, we opted to first train Module 1 (GAN1) and then Module 2 (GAN2). The loss function used was the Wasserstein loss.
Below are the similarity metrics for the Northwind database (values range from 0 to 1; the closer to 1, the better):
Note that these metrics were designed for strictly tabular data, so they do not fully reflect the quality of synthetic data for this kind of relational data.
Note 1: The metrics are not as high as they could be because this example is relatively small, containing only about 1,000 rows.
Note 2: The process can be made differentially private by adding an appropriate amount of noise to the gradients of the discriminator.
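A sketch of that differential-privacy step, in the style of DP-SGD: clip each per-example gradient to a fixed norm, then add Gaussian noise calibrated to the clipping bound before the update. The gradients, clip norm, and noise multiplier below are illustrative, not tuned values from the post:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative per-example discriminator gradients: 32 examples,
# 10 parameters. Clip norm and noise multiplier are arbitrary here.
per_example_grads = rng.normal(size=(32, 10))
clip_norm, noise_multiplier = 1.0, 1.1

# Clip each example's gradient so its L2 norm is at most clip_norm.
norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
clipped = per_example_grads * np.minimum(1.0, clip_norm / norms)

# Average the clipped gradients and add noise scaled to the bound.
noisy_grad = clipped.mean(axis=0) + rng.normal(
    scale=noise_multiplier * clip_norm / len(clipped), size=10)
print(noisy_grad.shape)  # (10,)
```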
Conclusion and future work
Generating synthetic data from a relational database is a challenging problem: businesses want synthetic data that preserves the relational form of the original data while ensuring consumer privacy.
There are many ways of dealing with this problem using different levels of granularity. We presented one that can be easily extended to handle temporal sequences. Its main advantages are modularity and scalability as it can be extended to multiple granularity levels.
The main drawback of this approach is that it may become difficult to implement if the database contains many one-to-many relations and deeply nested hierarchical levels.