Incorporating additional differentially private models into the latest release

We are delighted to announce that our latest product release, 4.1.0, introduces two more differentially private (DP) generative models: Ryan McKenna’s MST [1] and AIM [2].

Both models rank among the best-performing models in academic benchmarking studies [3, 4], with MST winning the 2018 NIST Differential Privacy Synthetic Data Challenge [5]. Furthermore, MST has been used in practical settings by the ONS (Office for National Statistics) in England to publicly release census statistics [6].

How they work

Both models follow the select-measure-generate paradigm [7]: 1) select a collection of queries (typically low-dimensional marginals) to measure, 2) measure the selected queries privately through a DP mechanism, usually the Gaussian mechanism, and 3) generate a synthetic dataset that is consistent with the noisy measurements. The two models differ mainly in the first step: how they select which queries to measure.
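The three steps can be sketched on a toy categorical dataset. This is a minimal illustration, not either model's implementation: the selection step is hard-coded, the noise scale is arbitrary rather than calibrated to a privacy budget, and real implementations use graphical-model inference (e.g. Private-PGM) for the generation step.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy categorical dataset: two columns with 3 and 2 categories.
data = rng.integers(0, [3, 2], size=(1000, 2))

# 1) SELECT: hard-code a single 2-way marginal over columns (0, 1).
selected = [(0, 1)]

# 2) MEASURE: histogram each selected marginal, then add Gaussian noise.
#    (sigma = 5.0 is illustrative; in practice it is calibrated to epsilon/delta.)
noisy = {}
for i, j in selected:
    counts, _, _ = np.histogram2d(data[:, i], data[:, j], bins=[3, 2])
    noisy[(i, j)] = counts.ravel() + rng.normal(0.0, 5.0, size=counts.size)

# 3) GENERATE: sample synthetic rows consistent with the noisy marginal
#    (clip negatives, normalise, then sample cells of the 3x2 table).
probs = np.clip(noisy[(0, 1)], 0, None)
probs = probs / probs.sum()
cells = rng.choice(6, size=1000, p=probs)
synthetic = np.column_stack([cells // 2, cells % 2])
```

The synthetic table reproduces the (noisy) joint distribution of the measured pair; with more marginals, the generation step must reconcile them all, which is where graphical-model inference comes in.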

On the one hand, MST, which stands for Maximum Spanning Tree, selects a tree-structured set of column pairs by building a maximum spanning tree over the columns, weighted by their (noisily estimated) pairwise mutual information. It then privately measures all 1-way marginals and the selected 2-way marginals, and generation fits the maximum-entropy undirected graphical model consistent with those measurements. On the other hand, AIM, or Adaptive and Iterative Mechanism, as the name suggests, iteratively chooses the most useful measurements, taking into account how well the current model approximates the input data and each candidate's contribution to the workload.
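The spanning-tree selection at the heart of MST can be sketched with Kruskal's algorithm. The mutual-information weights below are hypothetical, and the actual model scores and selects edges privately; this shows only the non-private tree-building step.

```python
def maximum_spanning_tree(weights):
    """Kruskal's algorithm on edge weights {(i, j): w}; returns the tree edges.

    Greedily adds the heaviest edge that does not create a cycle, using a
    union-find structure to track connected components.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:          # edge joins two components: keep it
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical pairwise mutual-information estimates between 4 columns.
mi = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.4,
      (1, 2): 0.8, (1, 3): 0.2, (2, 3): 0.7}
print(maximum_spanning_tree(mi))  # → [(0, 1), (1, 2), (2, 3)]
```

A tree over n columns has exactly n - 1 edges, so the number of 2-way marginals MST measures grows only linearly with the number of columns.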

MST vs. PrivBayes

Next, we display how well MST captures the pairwise mutual information between columns in comparison to PrivBayes, as done in [4]. PrivBayes [8] is a DP generative model relying on a directed graphical model and measuring k-way marginals (we set k to 3); PrivBayes can also be used in the Hazy platform.
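The quantity underlying this comparison is the empirical mutual information between pairs of columns, computed on the real and synthetic datasets and then compared. A minimal version (the exact similarity score used in [4] may differ) might look like this:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    joint, _, _ = np.histogram2d(x, y, bins=[int(x.max()) + 1, int(y.max()) + 1])
    pxy = joint / joint.sum()             # joint distribution
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

x = np.array([0, 0, 1, 1, 2, 2])
print(round(mutual_information(x, x), 3))  # → 1.099 (= ln 3, fully dependent)
```

To score a generative model, one computes this for every column pair in both the real and synthetic data and summarises how closely the two sets of values agree.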

PrivBayes and MST

In Figure 1, we visualize two example graphs that PrivBayes (in blue) and MST (in yellow) construct when modeling the Census dataset. PrivBayes’ graph is denser, with more connections. This can be beneficial, as it takes more, and more complex, measurements; at the same time, the DP privacy budget must be split across these measurements, which can make each one noisier and therefore less accurate.

In Figure 2, we plot the average pairwise mutual information similarity between the real and synthetic datasets, broken down by connected nodes (left subplot) and non-connected nodes (right subplot). We also vary the number of training data points, from 1,000 up to the original 199,523, and the privacy budget. For large privacy budgets (epsilon >= 100), PrivBayes outperforms MST, provided the models are trained on at least 32,000 data points. Conversely, and more importantly, MST offers better utility for lower epsilons (<= 10), which are more common in practical settings. Finally, we also observe that MST exhibits a larger drop in scores between connected and non-connected nodes than PrivBayes does.

MST and AIM, both offering great utility-privacy tradeoffs and state-of-the-art performance, are now part of the Hazy product.


[1] Winning the NIST Contest: A scalable and general approach to differentially private synthetic data

[2] AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data

[3] Benchmarking Differentially Private Synthetic Data Generation Algorithms

[4] Understanding how Differentially Private Generative Models Spend their Privacy Budget

[5] 2018 Differential Privacy Synthetic Data Challenge

[6] Synthesising the linked 2011 Census and deaths dataset while preserving its confidentiality

[7] A simple recipe for private synthetic data generation

[8] PrivBayes: Private Data Release via Bayesian Networks
