Generating synthetic data with referential integrity using GANs

Generating synthetic data with referential integrity using GANs

By on 27 Apr 2020.

Most organizations want to get value out of consumer data, without risking customer privacy. Since synthetic data retains none of the personally identifiable information, while maintaining the data’s utility, it’s an attractive, privacy-protecting alternative.

The team at Hazy has developed a hierarchical GAN or generative adversarial network that overcomes the usual barriers to unlocking the benefits of synthetic data in a modular and scalable way.

The challenge of generating synthetic data from a relational database

Most organizations store their data in traditional relational databases or RDBM systems. A key feature of these databases is referential integrity — elements in tables are tightly related with each other though identifiers, i.e., foreign keys.

Referential integrity is a structural element that ensures that the relationships among different sets of data are accurate. In order for synthetic data — completely artificial and private production data — to be useful, it must maintain these relationships among the data, including preserving the data links that can stretch across multiple tables.

This makes generating synthetic data with referential integrity a real challenge, since the data is not presented in a standard tabular format where each row is independent of each other.

In generating synthesised data, normally we use the finest granularity. For instance, order_id would represent a store managing orders, or person_id could represent a population. However, when we have multiple tables linked by foreign keys, then different levels of granularity emerge and the concept of finest granularity becomes ambiguous.

Let’s consider the concrete example of the Northwind database which refers to product orders. The schema, Table A, is presented below. Note that there are several one-to-many and many-to-one relations that would have to be preserved in any synthetic version.

Figure 2: Mutual information matrix for the variables being generated
Figure 2: Mutual information matrix for the variables being generated

The following Table B shows a cropped snapshot of a partial extract from the database into a flattened CSV format, using the order_id as the finest granularity level. The table contains 812 unique orders that are translated into 2,155 rows and 31 columns:

Table 1: trimmed snapshot of the flatten Northwind database.

Table 1: Trimmed snapshot of the flatten Northwind database
order_id Customer_name unit_price product_id product_name supplier_id
10248 Paul Henriot 14 11 Queso Cabrales 5
10248 Paul Henriot 9.8 42 Singaporean Hokkien Fried Mee 20
10248 Paul Henriot 34.8 72 Mozzarella di Giovanni 14
10249 Karin Josephs 18.6 14 Tofu 6
10249 Karin Josephs 42.4 51 Manjimup Dried Apples 24
10250 Mario Pontes 7.7 41 Jack's New England Clam Chowder 19
10250 Mario Pontes 42.4 51 Manjimup Dried Apples 24

Note that there are several repetitions at the “parent level” of Customer_name. Due to the prevalence of these one-to-many relationships, for finest granularity we took was the order_id.

Why will simply flattening the data not work?

The simple solution of left-joining the customer table with the order table, consolidating all the information and then synthesising this big table will not work. For two reasons:

  1. Each synthesised row will be a new customer and order. But in reality we have to fix the customer and only then generate details at the order level — in other words, fix the parent and generate the child. This would leave us unable to constrain the number of products in any order, which is a key feature of the data that needs to be preserved.

  2. The assumption of independence of rows is no longer valid for the customer table — each customer is independent of each other. However, the order details — products bought — are not independent. Some of these fields may even have dependencies on time and should be treated as a sequence: A given Product X is dependent on the presence of other products in the basket.

To address this particular case, let’s consider two levels:

  1. Customer level (name, address, etc) - that we call the “parent level”
  2. Order details level (products, suppliers, etc) - that we call the “child level”

It is important to preserve this structure in the synthetic version because otherwise there could be misalignments and information leaks such as orders without customers or customers having unrealistic orders. This can be seen as a particular case of sequential data.

Some other examples where sequential data is common include:

Note that the finest grain may not be a sequence but the key insight is that data has a structure that has to be preserved — rows are not independent.

Synthesising data at multiple granularity levels

The synthetic data that will populate the tables have to retain the properties of the original data. This is an additional challenge for the model since we have to preserve referential integrity, meaning that the foreign key — the column or group of columns that provide the link between two tables —  in Table A has to match the corresponding items in Table B (if a relation is one-to-many).

A possible solution to this problem is to synthesise data at multiple granularity levels:

  1. Use unsupervised machine learning to cluster data at parent level (customer).
  2. Synthesise this table, including the cluster identifier.
  3. Randomly assign a synthesised customer to a real order sequence.
  4. Finally synthesise the remaining variables (sequences at the child level) conditioned on the previous data.

The problem with this solution is that it does not scale for very large databases.

Another solution is to use a technique developed by Hazy called hierarchical GAN. GANs — or generative adversarial networks — are well-known for being highly successful in generating synthetic images. At Hazy, we’ve pioneered adapting this technique to tabular data with good success.

Here we present an extension of the original GAN architecture to be more modular and hierarchical. In its simplest form, it consists of two modules, each composed of a generator and a discriminator, coupled in tandem, as illustrated in the figure below.

Training of this architecture is done alternating:

The advantage of this method is that, contrary to the previous method, we don’t need to perform clustering and it is scalable to any number of hierarchical levels. Also, the approach is very flexible since we can have one or many parent tables and as many children as necessary.

Data processing

The first step in data processing is to remove all personally identifiable information — like names and addresses — and any free-text fields. We end up with the following attributes for the parent and child components:

Parent attributes
Name Type Description
Employee_id Categorical 78 levels
order_date Datetime 1996-1998
required_date Datetime 1996-1998
shipped_date Datetime 1996-1998
ship_via Categorical 3 levels
freight Numerical [0.02 - 1007]
ship_region Categorical 19 levels
ship_country Categorical 21 levels
customer_city Categorical 70 levels
customer_region Categorical 19 levels
customer_country Categorical 21 levels
order_lenght Numerical [1 - 22]
Table 3: Child attributes
Name Type Description
product_id Categorical 77 levels
supplier_id Categorical 29 levels
category_id Categorical 8 levels
unit_price Numerical [2 - 263.5]
quantity_per_unit Numerical [0 - 70]

All data is converted into a one-hot encoding, including datetime and numerical types (previously binned into 20 levels). One-hot encoding converts categorical variables (occupation, city names, etc) into a numeric format to be ingested by machine learning algorithms.

The final generated tabular data is converted back into tables to recreate the original database.

Training and Results

Machine learning training is performed using an optimised set of hyperparameters for the GAN architecture. Instead of alternating the training between Module 1 and 2, we opted to first train Module 1 (GAN1) and then Module 2 (GAN2). The loss function used was the Wasserstein loss.

Below are the similarity metrics for the Northwind database (The values range from 0 to 1. The closer to 1 the better.):

Similarity Metric Parent Overall
Histogram overlap 0.78 0.73
Histogram 2D 0.74 0.69
Mutual Information 0.41 0.45

A note about these metrics is that they were designed for strict tabular data, so they will not correctly reflect the quality of synthetic data for this type of data.

Figure 1: Histogram similarity for category_id
Figure 1: Histogram similarity for category_id
Figure 2: Mutual information matrix for the variables being generated
Figure 2: Mutual information matrix for the variables being generated

Note 1: The metrics are not as high as it could be because this example is relatively small — it contains only about 1000 rows.

Note 2: The process can be differentially private by adding an appropriate amount of noise to the gradient of the discriminator.

Conclusion and future work

Generating synthetic data from a relational database is a challenging problem as businesses may want to leverage synthetic data to preserve the relational form of the original data, while ensuring consumer privacy.

There are many ways of dealing with this problem using different levels of granularity. We presented one that can be easily extended to handle temporal sequences. Its main advantages are modularity and scalability as it can be extended to multiple granularity levels.

The main drawback of this approach is that it may become difficult to implement if the database contains too many one to many relations and nested hierarchical levels.


Synthetic data newsletter

Sign up for our sporadic newsletter to keep up to date on synthetic data, privacy matters and machine learning.