Differential privacy

Hazy generates synthetic data that is private and safe to work with since it is:

  1. artificial data that contains no real records, customer data or personal information.
  2. differentially private.
  3. evaluated against a set of privacy metrics.

Artificial data

The starting point for all safety and privacy guarantees with synthetic data is that it is not real data. Synthetic data is entirely artificial, and Hazy’s synthetic data, although derived from the source data, does not contain any records from it.

This is the baseline for Hazy’s safety and privacy guarantees. There is no real data and no customer information in the synthetic data. Synthetic data falls outside of GDPR and can be used, processed and stored much more freely than real data.

Differential privacy

Building on this, Hazy then layers on differential privacy, a gold standard of data privacy with formal mathematical guarantees. Simply put, differential privacy limits the influence any individual’s data has on the trained Hazy Generator. The diagram below illustrates this: if two Hazy Generators with differential privacy guarantees are trained, one including Alice’s data and the other excluding it, the resulting models will differ by at most a small amount controlled by ε. In turn, Alice’s influence on the synthetic data sampled from a trained Generator with differential privacy guarantees is small, so the synthetic data cannot leak sensitive information about Alice.

Two runs of a differentially private training process result in Hazy Generator models that differ by at most a small amount.
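
This intuition corresponds to the standard definition of ε-differential privacy, stated here for reference in general notation rather than anything Hazy-specific: a randomised training mechanism M is ε-differentially private if, for any two datasets D and D' that differ in one individual's records (for example, with and without Alice) and for any set S of possible trained models,

  \Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]

so the probability of training producing any particular model changes by at most a factor of e^ε when Alice's data is added or removed.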

Differential privacy fits naturally with the way Hazy bases its synthetic data on distribution estimators rather than individual data points. Because Hazy Generators learn and extract the patterns in the source data, synthetic data is always generated from an aggregated view of the source data, never from an individual record. Note, however, that these distributions are still estimated from the individual records in the source data, so adding or removing a single record can shift them. As a consequence, Hazy applies a degree of noise, or smoothing, to the estimated distributions in order to enforce a chosen level of differential privacy.
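
As a minimal, illustrative sketch of this idea (not Hazy's actual implementation), the classic Laplace mechanism perturbs each count of an estimated histogram with noise whose scale is the sensitivity divided by ε before the counts are normalised into a distribution:

  import numpy as np

  def dp_histogram(values, bins, epsilon):
      """Release an epsilon-differentially private histogram of `values`.

      Adding or removing one record changes a single count by at most 1,
      so the L1 sensitivity is 1 and Laplace noise with scale 1/epsilon
      yields an epsilon-DP estimate of the counts.
      """
      counts, edges = np.histogram(values, bins=bins)
      noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
      noisy = np.clip(noisy, 0, None)       # counts cannot be negative
      return noisy / noisy.sum(), edges     # normalise into a distribution

  # A lower epsilon means a larger noise scale: stricter privacy, less accuracy.
  private_dist, bin_edges = dp_histogram(np.random.normal(size=10_000), bins=20, epsilon=1.0)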

Controlling privacy level with ε

In ε-differential privacy, the privacy parameter ε, also known as the "privacy budget", bounds how much any individual record can influence the outcome of training: the smaller the budget, the more noise, randomness or blurring has to be applied during the training step. Consequently, lower values of ε give stricter privacy guarantees, while larger values give more relaxed ones. ε can be specified in the training configuration.
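
For example, the privacy budget could be exposed as one field of the training configuration. The sketch below is purely illustrative; the key names are hypothetical and do not reflect Hazy's actual configuration schema.

  # Hypothetical configuration sketch: key names are illustrative, not Hazy's schema.
  training_config = {
      "differential_privacy": True,
      "epsilon": 1.0,   # privacy budget: a lower value means stricter privacy and more noise
  }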

There is always a fundamental trade-off between privacy and utility. Safer, more private data could lose utility for a given use case.

Choosing the right value of ε

Choosing the value of ε is context-specific and depends on a number of factors, such as the risk tolerance of the data controller, the complexity of the downstream task, the desired utility level, the chosen generative technique and DP mechanism, and the dimensions, trends and characteristics of the dataset.

On the one hand, the Hazy metrics can be used to determine a good trade-off between utility and privacy. On the other hand, in real-world deployments of DP systems documented in the CDEI UK repository and Ted's blog, the ε value typically varies between 0.1 and 10. However, more recent research demonstrates that higher values of ε, such as 100 and even 1,000, can be enough to protect against certain privacy attacks, such as reconstruction and multi-choice membership inference attacks.
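
As a toy, runnable illustration of how accuracy changes across that range of ε values (using a plain Laplace-noised mean, entirely independent of Hazy's generators), the expected error of an ε-differentially private estimate shrinks roughly in proportion to 1/ε:

  import numpy as np

  rng = np.random.default_rng(0)
  data = rng.uniform(0, 1, size=10_000)   # values bounded in [0, 1]
  sensitivity = 1.0 / len(data)           # one record moves the mean by at most 1/n

  for eps in [0.1, 1.0, 10.0, 100.0]:
      # Laplace noise calibrated to sensitivity / eps gives an eps-DP mean estimate.
      noise = rng.laplace(scale=sensitivity / eps, size=1_000)
      print(f"epsilon={eps:>6}: typical absolute error ~ {np.abs(noise).mean():.6f}")

At ε = 0.1 the noise dominates far more than at ε = 100, which is why the right value ultimately depends on how much error the downstream task can tolerate.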

Disclosure risk

In addition to being synthetic and differentially private, Hazy’s synthetic data is also evaluated for any remaining disclosure risk. This is measured using two metrics:

Both of these metrics calculate the risk of disclosing characteristics of the source data that could be used to infer private or sensitive information.

Trade-offs

As discussed above, there is always a fundamental trade-off between privacy and utility, where safer, more private data can lose utility for a given use case. This is reflected in the data limitations section and in the manual control provided over both the ε level and the disclosure risk thresholds.

However, in practice, for many real-world exploratory and model training use cases, Hazy’s smart synthetic data is an effective way of maximising both privacy and utility, preserving performance whilst still being safe to work with.