Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation

ICLR 2023 Paper Acceptance

I'm excited to announce the acceptance of our paper to ICLR 2023 – "Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation".

ICLR is a top-tier machine learning conference, so this is a huge achievement by the team at Hazy!

The team

Kai Xu was the lead author of the paper – he has a proven track record of generating novel research ideas and publishing papers at top machine learning conferences. He's also a lovely fellow and all-round great person to work with. Georgi Ganev provided great technical insight and was another key collaborator. Emile Joubert and Rees Davison worked on the engineering implementation still in use at Hazy today.

History

Hazy generates synthetic relational tabular data like you'd see in any SQL database.

Initially, Hazy's focus was on generating privacy-preserving, high-utility data for data analysis use cases within a single table. Over time our customers began to present us with more complex database schemas. Our initial approach was to transform the data into a single table before ingesting it into Hazy's synthesiser for training.

However, over time it became clear that customers don't always know up front exactly which transforms they want, and, organisationally, having the team that needs the synthetic data perform the transformations meant giving more employees inside the company access to the real data for training purposes.

Multi Table Version 0

Out of this desire to maintain our customers’ customers’ (not a typo) privacy, Multi Table Synthesiser Version 0 was born. It had three simple aims: maintain the intra-table metrics we'd already built into our product, ensure accurate row counts and degree distributions were maintained, and ensure referential integrity was preserved.

Degree distribution

In the study of graphs and networks, the degree of a node is the number of connections it has to other nodes. In the context of a relational database, if you have two tables, customers and accounts, you must ensure the number of accounts per customer is maintained in the synthetic data.
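To make this concrete, here's a minimal sketch of what a degree distribution check might look like, assuming pandas DataFrames for the real and synthetic accounts tables (the variable names are illustrative, not Hazy's API):

```python
import pandas as pd

# Illustrative accounts tables; customer_id is the foreign key to customers.
real_accounts = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3]})
synthetic_accounts = pd.DataFrame({"customer_id": [7, 8, 8, 9, 9, 9]})

def degree_distribution(accounts: pd.DataFrame) -> pd.Series:
    # Degree of each customer = number of accounts linked to it,
    # summarised as the fraction of customers with each degree.
    degrees = accounts.groupby("customer_id").size()
    return degrees.value_counts(normalize=True).sort_index()

# For high-utility synthetic data the two distributions should closely match.
print(degree_distribution(real_accounts))
print(degree_distribution(synthetic_accounts))
```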

Referential integrity

Maintaining referential integrity means we do not violate any foreign key links: if a table has a foreign key pointing to another table, a record with that ID must exist in the referenced table. In the example of customers and accounts, if the accounts table has a customer_id column, every customer_id must point to a real record in the customers table. If this condition is violated the data has no utility – it will break the customer's application code, their analyses and any tests they may have. So this was vital.
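As a quick illustration, a referential integrity check over synthetic output can be as simple as a set difference (again with illustrative table names):

```python
import pandas as pd

# Illustrative synthetic tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
accounts = pd.DataFrame({"account_id": [10, 11, 12],
                         "customer_id": [1, 1, 4]})  # 4 is an orphan

# Every customer_id in accounts must exist in customers.
orphans = set(accounts["customer_id"]) - set(customers["customer_id"])
if orphans:
    raise ValueError(f"Referential integrity violated for customer_ids: {orphans}")
```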

The result

This version of the product was trialled with a major financial services provider and, through perseverance from both teams, was a major success. Seven tables were generated with high statistical similarity. Both Hazy's metrics and a set of the customer's metrics were used to check the data produced was up to their standards.

Limitations

During the course of the work, a number of limitations became clear. Hazy could only handle one-to-many relationships. In the case of customers and accounts, this meant we could handle a customer having multiple accounts, but not the same account belonging to multiple customers.

Performance was also an issue: the way data between tables was joined together made the solution memory hungry.

Multi Table Version 1

Off the back of this, version 1 of the multi-table synthesiser was conceived based on the paper we're announcing today.

It naturally handles many-to-many relationships and solves the performance issues we had with version 0. I'll now go into a bit more detail on the paper and the design.

The problem

Let's start with the simple case of customers and accounts. In our new world a customer can have many accounts, and an account can belong to many customers. In a relational database this would be represented by three tables.

[Diagram: the customers, customer_accounts and accounts tables in a relational database]

In this case customer_accounts is what's referred to in relational database terminology as a join table. It has only two columns in our example – customer_id and account_id. If you imagine the rows of the customers table as one set of nodes and the rows of accounts as another, the customer_accounts table represents the edges between them. This forms a bipartite graph, as you can see below.

[Diagram: the bipartite graph formed by customers, accounts and the edges between them]
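As a sketch of that correspondence (using networkx and made-up IDs), the join table's rows map directly onto the edges of a bipartite graph:

```python
import networkx as nx

# Illustrative rows of the customer_accounts join table.
customer_accounts = [(1, 10), (1, 11), (2, 11), (3, 12)]

G = nx.Graph()
# Prefix the IDs so customer and account nodes can't collide.
for customer_id, account_id in customer_accounts:
    G.add_edge(f"customer:{customer_id}", f"account:{account_id}")

# Each customer's degree is its number of accounts, and vice versa.
print(G.degree("customer:1"))  # 2
print(G.degree("account:11"))  # 2
```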

In generative modelling we are trying to represent the joint distribution. In our case this can be written as p(C, A, E), where C is the customers, A is the accounts and E is the set of edges between them. By drawing samples from this distribution we can generate synthetic records.

Design in paper

A number of different factorisations of the joint distribution were explored, which can be viewed in section 4.1 of the paper. The factorisation that Hazy went with was p(C, A, E) = p(E)p(C | E)p(A | C, E), which corresponds to a set of sub-models.

* p(E) is the edge model and edges (i.e. the bipartite graph) are modelled unconditionally.

* p(C | E) is the edge-conditional table model and requires us to generate one of the tables given the topology of the edges. One way to achieve this conditioning is to use node embeddings.

* p(A | C, E) is the conditional table model. It requires us to generate each account based on the customer table and all the connections in E. Every account is connected to a subset of customers through the connections in E. We use an aggregator function to produce a fixed-length encoding of all the customers connected to each account, which is used as the conditioning.
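Putting the three sub-models together, generation follows the factorisation step by step. The sketch below is purely schematic – the sub-model interfaces are assumptions for illustration, not Hazy's actual implementation:

```python
def sample_database(edge_model, customer_model, account_model):
    # 1. p(E): sample the bipartite graph of edges unconditionally.
    E = edge_model.sample()
    # 2. p(C | E): sample customers conditioned on the edge topology,
    #    e.g. via node embeddings derived from E.
    C = customer_model.sample(condition=E)
    # 3. p(A | C, E): sample each account conditioned on an aggregated,
    #    fixed-length encoding of the customers it connects to.
    A = account_model.sample(condition=(C, E))
    return C, A, E
```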

Differential privacy in this framework

We rely on existing differentially private mechanisms to extend each sub-model to be differentially private.

We also rely on the sequential composition property of differential privacy to set the value of epsilon for the overall synthetic data generation process. A more extensive description of this can be found in the paper.
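Under basic sequential composition, the privacy budgets of the sub-models simply add up. A toy illustration (the budget split here is made up, not a recommendation):

```python
# Each sub-model is trained with its own differential privacy budget.
sub_model_epsilons = {
    "edge_model": 0.5,
    "customer_model": 1.0,
    "account_model": 1.0,
}

# Sequential composition: the overall process is (sum of epsilons)-DP.
total_epsilon = sum(sub_model_epsilons.values())
print(f"overall epsilon = {total_epsilon}")  # overall epsilon = 2.5
```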

Framework extension to N tables

Our enterprise customers' relational databases can contain many hundreds of tables. We build a directed acyclic graph of the schema and traverse it with our collection of sub-models, using conditional distributions to recreate the overall joint distribution for the entire database.
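A rough sketch of that traversal, assuming a hypothetical schema and using networkx for the topological ordering:

```python
import networkx as nx

# Hypothetical schema DAG: an edge X -> Y means Y holds a foreign key to X,
# so X must be generated before Y.
schema = nx.DiGraph([
    ("customers", "customer_accounts"),
    ("accounts", "customer_accounts"),
    ("accounts", "transactions"),
])

# Visit tables in dependency order, conditioning each one on its parents.
for table in nx.topological_sort(schema):
    parents = list(schema.predecessors(table))
    print(f"generate {table!r} conditioned on {parents or 'nothing'}")
```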


For more details, please refer to the full version of the paper.

If you’d like to know more about the financial services synthetic data use case mentioned above, get in touch.

