Generating synthetic data with referential integrity using GANs

Most organizations want to get value out of consumer data, without risking customer privacy. Since synthetic data retains none of the personally identifiable information, while maintaining the data’s utility, it’s an attractive, privacy-protecting alternative.

The team at Hazy has developed a hierarchical GAN or generative adversarial network that overcomes the usual barriers to unlocking the benefits of synthetic data in a modular and scalable way.

The challenge of generating synthetic data from a relational database

Most organizations store their data in traditional relational databases or RDBM systems. A key feature of these databases is referential integrity — elements in tables are tightly related with each other though identifiers, i.e., foreign keys.

Referential integrity is a structural element that ensures that the relationships among different sets of data are accurate. In order for synthetic data — completely artificial and private production data — to be useful, it must maintain these relationships among the data, including preserving the data links that can stretch across multiple tables.

This makes generating synthetic data with referential integrity a real challenge, since the data is not presented in a standard tabular format where each row is independent of each other.

In generating synthesised data, normally we use the finest granularity. For instance, order_id would represent a store managing orders, or person_id could represent a population. However, when we have multiple tables linked by foreign keys, then different levels of granularity emerge and the concept of finest granularity becomes ambiguous.

Let's consider the concrete example of the Northwind database which refers to product orders. The schema, Table A, is presented below. Note that there are several one-to-many and many-to-one relations that would have to be preserved in any synthetic version.

Figure 2: Mutual information matrix for the variables being generated
Figure 2: Mutual information matrix for the variables being generated

The following Table B shows a cropped snapshot of a partial extract from the database into a flattened CSV format, using the order_id as the finest granularity level. The table contains 812 unique orders that are translated into 2,155 rows and 31 columns:

Table 1: trimmed snapshot of the flatten Northwind database.

Table 1: Trimmed snapshot of the flatten Northwind database
order_id Customer_name unit_price product_id product_name supplier_id
10248 Paul Henriot 14 11 Queso Cabrales 5
10248 Paul Henriot 9.8 42 Singaporean Hokkien Fried Mee 20
10248 Paul Henriot 34.8 72 Mozzarella di Giovanni 14
10249 Karin Josephs 18.6 14 Tofu 6
10249 Karin Josephs 42.4 51 Manjimup Dried Apples 24
10250 Mario Pontes 7.7 41 Jack's New England Clam Chowder 19
10250 Mario Pontes 42.4 51 Manjimup Dried Apples 24

Note that there are several repetitions at the “parent level” of Customer_name. Due to the prevalence of these one-to-many relationships, for finest granularity we took was the order_id.

Why will simply flattening the data not work?

The simple solution of left-joining the customer table with the order table, consolidating all the information and then synthesising this big table will not work. For two reasons:

  1. Each synthesised row will be a new customer and order. But in reality we have to fix the customer and only then generate details at the order level — in other words, fix the parent and generate the child. This would leave us unable to constrain the number of products in any order, which is a key feature of the data that needs to be preserved.

  2. The assumption of independence of rows is no longer valid for the customer table — each customer is independent of each other. However, the order details — products bought — are not independent. Some of these fields may even have dependencies on time and should be treated as a sequence: A given Product X is dependent on the presence of other products in the basket.

To address this particular case, let's consider two levels:

  1. Customer level (name, address, etc) - that we call the “parent level”
  2. Order details level (products, suppliers, etc) - that we call the “child level”

It is important to preserve this structure in the synthetic version because otherwise there could be misalignments and information leaks such as orders without customers or customers having unrealistic orders. This can be seen as a particular case of sequential data.

Some other examples where sequential data is common include:

  • electronic health records (EHR) data - diagnostics, exams
  • messages sent and received between two or many agents
  • measurements of physical systems taken over time
  • credit card transactions

Note that the finest grain may not be a sequence but the key insight is that data has a structure that has to be preserved — rows are not independent.

Synthesising data at multiple granularity levels

The synthetic data that will populate the tables have to retain the properties of the original data. This is an additional challenge for the model since we have to preserve referential integrity, meaning that the foreign key — the column or group of columns that provide the link between two tables —  in Table A has to match the corresponding items in Table B (if a relation is one-to-many).

A possible solution to this problem is to synthesise data at multiple granularity levels:

  1. Use unsupervised machine learning to cluster data at parent level (customer).
  2. Synthesise this table, including the cluster identifier.
  3. Randomly assign a synthesised customer to a real order sequence.
  4. Finally synthesise the remaining variables (sequences at the child level) conditioned on the previous data.

The problem with this solution is that it does not scale for very large databases.

Another solution is to use a technique developed by Hazy called hierarchical GAN. GANs — or generative adversarial networks — are well-known for being highly successful in generating synthetic images. At Hazy, we’ve pioneered adapting this technique to tabular data with good success.

Here we present an extension of the original GAN architecture to be more modular and hierarchical. In its simplest form, it consists of two modules, each composed of a generator and a discriminator, coupled in tandem, as illustrated in the figure below.

Training of this architecture is done alternating:

  • Use Generator 1 to generate the common fields (customer name, location, etc) + Discriminator 1.
  • Use results from Generator 1 plus noise to conditionally generate the order details (product, quantity, etc) using a long short term memory network (LSTM) or convolutional neural network (CNN).

Subscribe to our newsletter

For the latest news and insights