ICLR 2023 Paper Acceptance
I'm excited to announce the acceptance of our paper to ICLR 2023 – "Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation".
ICLR is a top-tier machine learning conference, so this is a huge achievement by the team at Hazy!
Kai Xu was the lead author of the paper – he has a proven track record of generating novel research ideas and publishing papers at top machine learning conferences. He's also a lovely fellow and all-round great person to work with. Georgi Ganev provided great technical insight and was another key collaborator. Emile Joubert and Rees Davison worked on the engineering implementation still in use at Hazy today.
Hazy generates synthetic relational tabular data like you'd see in any SQL database.
Initially, Hazy's focus was on generating privacy-preserving, high-utility data for data analysis use cases within a single table. Over time our customers began to present us with more complex database schemas. Our initial approach was to transform the data into a single table before ingesting it into Hazy's synthesiser for training.
However, it became clear over time that customers don't always know up front exactly which transforms they'll want, and organisationally, having the team that needs the synthetic data perform the transformations meant giving more employees inside the company access to the real data for training purposes.
Multi Table Version 0
Out of this desire to maintain our customers’ customers’ (not a typo) privacy, Multi Table Synthesiser Version 0 was born. It had a simple aim: preserve the intra-table metrics we'd already built into our product, reproduce accurate row counts and degree distributions, and maintain referential integrity.
In the study of graphs and networks, the degree of a node in a network is the number of connections it has to other nodes. In the context of a relational database, if you have two tables `customers` and `accounts`, you must ensure the number of `accounts` per `customer` is maintained in the synthetic data.
Maintaining referential integrity means we do not violate any foreign key links: if a table has a foreign key pointing to another table, a record with that ID must exist in the referenced table. In the example of customers and accounts, the `accounts` table has a `customer_id` column, and every `customer_id` must point to a real record in the `customers` table. If this condition is violated the data has no utility: it will break the application code, skew any analysis of the data, and fail whatever tests the customer may have. So this was vital.
This version of the product was trialled with a major financial services provider and, through perseverance from both teams, was a major success. Seven tables were generated with high statistical similarity, and both Hazy's metrics and a set of the customer’s own metrics were used to check that the data produced was up to their standards.
During the course of the work, a number of limitations became clear. Hazy could only handle one-to-many relationships: in the case of `customers` and `accounts`, we could handle a customer having multiple accounts, but couldn't handle that same account belonging to multiple `customers`.
Performance was also an issue: the way data between tables was joined together made the solution memory-hungry.
Multi Table Version 1
Off the back of this, version 1 of the multi-table synthesiser was conceived based on the paper we're announcing today.
It naturally handles many-to-many relationships and solves the performance issues we had with version 0. I'll go a bit more into the paper and design now.
Let's start with the simple case of `customers` and `accounts`. In our new world a `customer` can have many `accounts`, and an `account` can belong to many `customers`. In a relational database this would be represented by three tables.
In this case `customer_accounts` is what's referred to in relational database terminology as a join table. It has only two columns in our example, `customer_id` and `account_id`. If you imagine the rows of the `customers` table as one set of nodes and the rows of `accounts` as another set of nodes, the `customer_accounts` table represents the edges between the nodes. This forms a bipartite graph, as you can see below.
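To make the bipartite view concrete, here is a minimal sketch in Python. The table and column names follow the example above, but the rows themselves are made up:

```python
from collections import Counter

# Toy rows for the three tables in the customers/accounts example.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
accounts = [{"account_id": 10}, {"account_id": 11}]

# The join table: each row is an edge in the bipartite graph.
customer_accounts = [
    {"customer_id": 1, "account_id": 10},
    {"customer_id": 2, "account_id": 10},  # account 10 is shared: many-to-many
    {"customer_id": 2, "account_id": 11},
]

# The degree of a node is the number of edges touching it.
customer_degree = Counter(e["customer_id"] for e in customer_accounts)
account_degree = Counter(e["account_id"] for e in customer_accounts)

print(customer_degree[2])  # 2: customer 2 holds two accounts
print(account_degree[10])  # 2: account 10 belongs to two customers
```

These two degree distributions are exactly what the synthesiser must preserve in the generated data.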
In generative modelling we are trying to represent the joint distribution. In our case this can be represented by p(C, A, E), where C is the `customers` table, A is the `accounts` table, and E are the edges between them. By drawing samples from this distribution we can generate synthetic records.
Design in the paper
A number of different factorisations of the joint distribution were looked into, which can be viewed in section 4.1 of the paper. The factorisation that Hazy went with was p(C, A, E) = p(E) p(C | E) p(A | C, E), which corresponds to a set of sub-models.
p(E) is the edge model and edges (i.e. the bipartite graph) are modelled unconditionally.
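As an illustrative sketch only (not the edge model from the paper), one simple way to sample a bipartite graph unconditionally is to draw each customer's degree from an empirical degree distribution and then link it to that many distinct accounts:

```python
import random

def sample_bipartite_edges(n_customers, n_accounts, degree_dist, seed=0):
    """Sample a bipartite edge list: each customer draws a degree from an
    empirical degree distribution, then links to that many distinct accounts.

    degree_dist: mapping degree -> probability (e.g. estimated from real data).
    """
    rng = random.Random(seed)
    degrees = list(degree_dist)
    weights = [degree_dist[d] for d in degrees]
    edges = []
    for c in range(n_customers):
        d = min(rng.choices(degrees, weights=weights)[0], n_accounts)
        for a in rng.sample(range(n_accounts), d):
            edges.append((c, a))
    return edges

# Made-up distribution: 60% of customers have one account, 40% have two.
edges = sample_bipartite_edges(5, 3, {1: 0.6, 2: 0.4})
```

The function name and the degree distribution here are purely illustrative; the point is that the graph topology is sampled first, with no reference to table contents.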
p(C | E) is the edge-conditional table model and requires us to generate one of the tables given the topology of edges. One way to achieve this is to condition on a node embedding.
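As a toy stand-in for a learned node embedding (the paper's embedding will be richer than this), one could condition each row on a hand-crafted topological feature of its node, such as its degree:

```python
from collections import Counter

def degree_features(n_customers, edges):
    """A toy stand-in for a learned node embedding: each customer is
    represented by a 1-dimensional vector holding its degree in the
    bipartite graph."""
    degree = Counter(c for c, _ in edges)
    return [[float(degree[c])] for c in range(n_customers)]

# Customer 0 has two edges, customer 1 has none, customer 2 has one.
feats = degree_features(3, [(0, 0), (0, 1), (2, 0)])
print(feats)  # [[2.0], [0.0], [1.0]]
```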
p(A | C, E) is the conditional table model. It requires us to generate each `account` based on the `customer` table and all the connections in E. Each `account` is connected to a subset of `customers` through the connections in E, so we use an aggregator function to produce a fixed-length encoding of all `customers` connected to each `account`, which is used as the conditioning.
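A minimal sketch of such an aggregator, assuming a mean-pooling choice and made-up customer encodings (the paper may use a different aggregator):

```python
def aggregate_customers(account_ids, edges, customer_encoding):
    """Mean-pool the encodings of all customers linked to each account,
    producing a fixed-length conditioning vector per account. Accounts
    with no linked customers get a zero vector."""
    dim = len(next(iter(customer_encoding.values())))
    pooled = {}
    for a in account_ids:
        linked = [customer_encoding[c] for c, a2 in edges if a2 == a]
        if linked:
            pooled[a] = [sum(vals) / len(linked) for vals in zip(*linked)]
        else:
            pooled[a] = [0.0] * dim
    return pooled

# Account 10 is linked to customers 0 and 1; account 11 only to customer 1.
enc = {0: [1.0, 0.0], 1: [0.0, 1.0]}
cond = aggregate_customers([10, 11], [(0, 10), (1, 10), (1, 11)], enc)
print(cond[10])  # [0.5, 0.5]
```

Mean pooling is one convenient choice because its output length does not depend on how many customers an account is connected to.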
Differential privacy in this framework
We rely on existing differentially private mechanisms to extend each sub-model to be differentially private.
We also rely on the sequential composition property of differential privacy to set the value of `epsilon` for the overall synthetic data generation process. A more extensive description of this can be found in the paper.
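For pure differential privacy, sequential composition simply says the total privacy cost of running several mechanisms on the same data is the sum of their individual epsilons. The budget split below is hypothetical; the actual allocation is described in the paper:

```python
def total_epsilon(per_model_epsilons):
    """Sequential composition for pure differential privacy: running several
    DP mechanisms on the same data costs the sum of their epsilons."""
    return sum(per_model_epsilons)

# Hypothetical budget split across the three sub-models.
budget = {"edge_model": 0.5, "edge_conditional": 0.25, "conditional": 0.25}
print(total_epsilon(budget.values()))  # 1.0
```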
Framework extension to N tables
Our enterprise customers' relational databases can contain many hundreds of tables. We build a directed acyclic graph of the schema, and traverse the structure with our collection of sub-models, using conditional distributions to recreate the overall joint distribution for the entire database.
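The traversal order can be sketched with a topological sort: each table is generated only after the tables it references. The schema fragment below is hypothetical and the table names are illustrative:

```python
from graphlib import TopologicalSorter

# Hypothetical schema DAG: each table maps to the set of tables it
# references, and therefore must be generated after.
schema = {
    "customers": set(),
    "accounts": set(),
    "customer_accounts": {"customers", "accounts"},
    "transactions": {"accounts"},
}

# static_order() yields referenced tables before the tables that
# condition on them, giving a valid generation order.
order = list(TopologicalSorter(schema).static_order())
```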
For more details, please refer to the full version of the paper.
If you’d like to know more about the financial services synthetic data use case mentioned above, get in touch.