Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation
ICLR 2023 Paper Acceptance
I'm excited to announce the acceptance of our paper to ICLR 2023 – "Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation".
ICLR is a top-tier machine learning conference, so this is a huge achievement by the team at Hazy!
The team
Kai Xu was the lead author of the paper – he has a proven track record of generating novel research ideas and publishing papers at top machine learning conferences. He's also a lovely fellow and all-round great person to work with. Georgi Ganev provided great technical insight and was another key collaborator. Emile Joubert and Rees Davison worked on the engineering implementation still in use at Hazy today.
History
Hazy generates synthetic relational tabular data like you'd see in any SQL database.
Initially, Hazy's focus was on generating privacy-preserving, high-utility data for data analysis use cases within a single table. Over time our customers began to present us with more complex database schemas. Our initial approach was to transform the data into a single table before ingesting it into Hazy's synthesiser for training.
However, over time it became clear that customers don't always know up front exactly which transforms they want. Organisationally, having the team that needs the synthetic data perform the transformations also meant giving more employees inside the company access to the real data for training purposes.
Multi Table Version 0
Out of this desire to maintain our customers' customers' (not a typo) privacy, Multi Table Synthesiser Version 0 was born. It had a simple aim: preserve the intra-table metrics we'd already built into our product, reproduce accurate row counts and degree distributions, and maintain referential integrity.
Degree distribution
In the study of graphs and networks, the degree of a node is the number of connections it has to other nodes. In the context of a relational database, if you have two tables, `customers` and `accounts`, you must ensure the number of `accounts` per `customer` is maintained in the synthetic data.
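To make this concrete, here is a minimal sketch of such a check, assuming pandas DataFrames and column names that are purely illustrative rather than Hazy's actual tooling:

```python
import pandas as pd

# Hypothetical real and synthetic account tables, each with a
# customer_id foreign key column.
real_accounts = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3]})
synth_accounts = pd.DataFrame({"customer_id": [10, 10, 11, 12, 12, 12]})

def degree_distribution(accounts: pd.DataFrame) -> pd.Series:
    """Accounts per customer -> how often each count occurs."""
    degrees = accounts.groupby("customer_id").size()
    return degrees.value_counts(normalize=True).sort_index()

# The synthetic distribution should closely match the real one.
print(degree_distribution(real_accounts))
print(degree_distribution(synth_accounts))
```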
Referential integrity
Maintaining referential integrity means we do not violate any foreign key links: if a table has a foreign key pointing to another table, a record with that ID must exist in the referenced table. In the example of customers and accounts, if the `accounts` table has a `customer_id` column, every `customer_id` must point to a real record in the `customers` table. If this condition is violated the data has no utility: it will break the customer's application code, analysis and tests. So this was vital.
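Here is an equally minimal sketch of what such a check could look like, again with illustrative pandas DataFrames rather than anything from Hazy's product:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
accounts = pd.DataFrame({"account_id": [100, 101], "customer_id": [1, 2]})

# Rows of accounts whose customer_id has no matching customer row.
orphans = accounts[~accounts["customer_id"].isin(customers["customer_id"])]
assert orphans.empty, f"Referential integrity violated:\n{orphans}"
```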
The result
This version of the product was trialled with a major financial services provider and, through perseverance from both teams, was a major success. Seven tables were generated with high statistical similarity. Both Hazy's metrics and a set of the customer's metrics were used to check that the data produced was up to their standards.
Limitations
During the course of the work, a number of limitations became clear. Hazy could only handle one-to-many relationships. In the case of `customers` and `accounts`, this meant we could handle a customer having multiple accounts, but not that same account belonging to multiple customers.
Performance was also an issue: the way data between tables was joined together made the solution memory hungry.
Multi Table Version 1
Off the back of this, version 1 of the multi-table synthesiser was conceived based on the paper we're announcing today.
It naturally handles many-to-many relationships and solves the performance issues we had with version 0. I'll go a bit more into the paper and design now.
The problem
Let's start with the simple case of `customers` and `accounts`. In our new world a `customer` can have many `accounts`, and an `account` can belong to many `customers`. In a relational database this would be represented by three tables.
In this case `customer_accounts` is what's referred to in relational database terminology as a join table. It has only two columns in our example: `customer_id` and `account_id`. If you imagine the rows of the `customers` table as one set of nodes and the rows of `accounts` as another, the `customer_accounts` table represents the edges between them. This forms a bipartite graph, as you can see below.
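As a concrete (and entirely hypothetical) example of that schema, the join table is exactly the edge list of the bipartite graph:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
accounts = pd.DataFrame({"account_id": [100, 101, 102]})
customer_accounts = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],        # customers 1 and 2 share account 101
    "account_id":  [100, 101, 101, 102],
})

# Degrees on each side of the bipartite graph:
print(customer_accounts.groupby("customer_id").size())  # accounts per customer
print(customer_accounts.groupby("account_id").size())   # customers per account
```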
In generative modelling we are trying to represent the joint distribution. In our case this can be written as p(C, A, E), where C is `customers`, A is `accounts` and E is the set of edges between them. By drawing samples from this distribution we can generate synthetic records.
Design in paper
A number of different factorisations of the joint distribution were looked into; these can be viewed in section 4.1 of the paper. The factorisation that Hazy went with was p(C, A, E) = p(E) p(C | E) p(A | C, E), which corresponds to a set of sub-models (a schematic sketch follows the list below):
* p(E) is the edge model: the edges (i.e. the bipartite graph) are modelled unconditionally.
* p(C | E) is the edge-conditional table model: it requires us to generate one of the tables given the topology of the edges. One way to achieve such conditioning is to condition on a node embedding.
* p(A | C, E) is the conditional table model: it requires us to generate each `account` based on the `customer` table and all the connections in E. Every `account` is connected to a subset of `customers` through the connections in E, so we use an aggregator function to produce a fixed-length encoding of all `customers` connected to each `account`, which is used as the conditioning.
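To make the pipeline concrete, here is a schematic sketch of the three-stage sampling process. The stand-in "models" below are deliberately trivial placeholders (a random-graph stub, degree-based embeddings, a mean aggregator) rather than the models used in the paper; every name and shape is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_edges(n_customers: int, n_accounts: int) -> np.ndarray:
    """p(E): unconditional bipartite edge model (a random-graph stub here)."""
    mask = rng.random((n_customers, n_accounts)) < 0.3
    return np.argwhere(mask)  # edge list of (customer_idx, account_idx) pairs

def sample_customers(edges: np.ndarray, n_customers: int) -> np.ndarray:
    """p(C | E): generate customer rows conditioned on a per-node embedding.
    The 'embedding' here is simply each node's degree."""
    degree = np.bincount(edges[:, 0], minlength=n_customers)
    return rng.normal(loc=degree[:, None], size=(n_customers, 2))  # 2 toy columns

def sample_accounts(edges: np.ndarray, customers: np.ndarray,
                    n_accounts: int) -> np.ndarray:
    """p(A | C, E): condition each account on an aggregation (here, the mean)
    of the customer rows it is connected to."""
    out = np.zeros((n_accounts, 2))
    for a in range(n_accounts):
        linked = edges[edges[:, 1] == a][:, 0]
        context = customers[linked].mean(axis=0) if len(linked) else np.zeros(2)
        out[a] = rng.normal(loc=context)
    return out

edges = sample_edges(n_customers=5, n_accounts=4)
C = sample_customers(edges, n_customers=5)
A = sample_accounts(edges, C, n_accounts=4)
```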
Differential privacy in this framework
We rely on existing differentially private mechanisms to extend each sub-model to be differentially private.
We also rely on the sequential composition property of differential privacy to set the value of epsilon for the overall synthetic data generation process. A more extensive description of this can be found in the paper.
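As a toy illustration of what sequential composition buys us (the equal split below is purely an assumption; the actual budget allocation is a design choice discussed in the paper):

```python
# If each sub-model is trained with an eps_i-differentially-private
# mechanism, sequential composition makes the overall generation
# process (eps_1 + eps_2 + eps_3)-differentially private.
total_epsilon = 1.0
sub_models = ["edge_model", "edge_conditional_table", "conditional_table"]
per_model_epsilon = {name: total_epsilon / len(sub_models) for name in sub_models}
print(per_model_epsilon)  # each sub-model gets epsilon = 1/3
```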
Framework extension to N tables
Our enterprise customers' relational databases can contain many hundreds of tables. We build a directed acyclic graph of the schema and traverse the structure with our collection of sub-models, using conditional distributions to recreate the overall joint distribution for the entire database.
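As a minimal sketch of the ordering idea (the schema dict and table names below are hypothetical, and Python's standard-library graphlib stands in for our actual traversal code):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each table maps to the tables it references via foreign keys.
schema = {
    "customers": set(),                              # no foreign keys
    "accounts": set(),
    "customer_accounts": {"customers", "accounts"},  # join table
    "transactions": {"customer_accounts"},
}

# Parents are generated before children, so every conditional
# sub-model can condition on tables that already exist.
for table in TopologicalSorter(schema).static_order():
    print(f"generate {table} conditioned on {sorted(schema[table])}")
```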
For more details, please refer to the full version of the paper.
If you’d like to know more about the financial services synthetic data use case mentioned above, get in touch.