Hazy generates synthetic data that is private and safe to work with since:
- it is artificial data that does not contain any real data, customer or personal information
- it is differentially private
- it is evaluated against a set of privacy metrics
The starting point for all safety and privacy guarantees with synthetic data is that it is not real data. Synthetic data is entirely artificial, and Hazy's synthetic data -- despite being based on and derived from source data -- does not contain any records that are the same as, or even close to, the records in the source data.
This is the baseline for Hazy's safety and privacy guarantees: the synthetic data contains no real data and no customer information. It therefore falls outside the scope of GDPR and can be used, processed and stored far more freely than real data.
Building on this, Hazy layers on differential privacy - a gold standard of data privacy. Simply put, differential privacy bounds the influence any individual's data has on queries run against the synthetic dataset. The diagram below illustrates this: two runs of a differentially private training process, one including Alice's data and the other excluding it, result in Hazy generator models that differ by at most a small amount ε. Alice's influence on the synthetic data is therefore small, so the synthetic data cannot leak sensitive information about her.
Differential privacy fits naturally with the way Hazy bases its synthetic data on distribution estimators rather than individual data points. Because Hazy generators learn and extract the patterns in the source data, synthetic data is always generated from an aggregated view of the source data, never from an individual record. The distributions are nonetheless estimated from the individual records in the source data, so adding or removing a single record can still shift them. Hazy therefore applies a degree of noise, or smoothing, to the distribution in order to enforce a chosen level of differential privacy.
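The noise-on-an-aggregate idea can be illustrated with a minimal sketch of the classic Laplace mechanism applied to a histogram. This is a textbook example under stated assumptions (each individual contributes at most one record), not Hazy's actual implementation:

```python
import numpy as np

def dp_histogram(values, bins, epsilon, rng=None):
    """Release a histogram of `values` with epsilon-differential privacy.

    Assumes each individual contributes at most one record, so the L1
    sensitivity of the histogram is 1 and Laplace noise with scale
    1/epsilon is sufficient.
    """
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Clip negative counts introduced by the noise; this post-processing
    # step cannot weaken the differential privacy guarantee.
    return np.clip(noisy, 0, None), edges
```

Sampling synthetic points from such a noised aggregate, rather than from individual records, mirrors the aggregated-view approach described above: no single record can move the released distribution by more than the noise masks.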
The privacy level is controlled with ε, also known as the "privacy loss budget", which governs the amount of noise or blurring applied to the data: the smaller the budget, the more noise is added and the stronger the privacy guarantee. There is always a fundamental trade-off between blurring and utility - safer, more private data can lose utility for a given use case. ε can be specified in the training configuration.
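As a purely illustrative sketch of what such a configuration choice looks like - the key names below are hypothetical, not Hazy's actual configuration schema:

```python
# Hypothetical training configurations; key names are illustrative only.
strict_config = {"epsilon": 0.1}   # more noise: stronger privacy, lower utility
relaxed_config = {"epsilon": 5.0}  # less noise: weaker privacy, higher utility

def noise_scale(config):
    """Laplace noise scale implied by the budget (sensitivity-1 query)."""
    return 1.0 / config["epsilon"]
```

A smaller ε always implies a larger noise scale, which is the trade-off the preceding paragraph describes.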
In addition to being synthetic and differentially private, Hazy's synthetic data is also evaluated for additional disclosure risks. This is measured using two individual metrics:
- density disclosure risk
- presence disclosure risk
Both of these metrics calculate the risk of disclosing characteristics of the source data that could be used to infer private or sensitive information.
Density disclosure risk
Density disclosure risk is one of the measures used to estimate the risk of an adversary somehow constructing a mapping from the synthetic data points to the real data points.
Sometimes this notion is referred to as "reversibility". For clarity, it's worth recalling that Hazy does not construct synthetic data by applying a forward mapping to individual real data points (the approach taken by anonymisation techniques) as an adversary can easily exploit this by reversing the forward mapping. Despite there being no forward mapping to reverse, there is the logical possibility that a highly sophisticated adversary could construct a mapping from synthetic points to real points by other means.
Density disclosure risk quantifies this risk by counting how many real points exist in the neighbourhood of each synthetic point. The idea is that for a given synthetic point, if there are no real points in the neighbourhood, or if there are many, then it's not possible for an adversary to construct an unambiguous map from the synthetic point to a real point. This is because either there is no real point to map to, or there are many alternatives so that any attempt would be ambiguous at best.
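The neighbourhood-counting idea can be sketched as follows. This is a minimal illustration of the counting logic described above, assuming numeric records and a Euclidean neighbourhood radius; it is not Hazy's metric implementation:

```python
import numpy as np

def density_disclosure_counts(real, synthetic, radius):
    """For each synthetic point, count real points within `radius`."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Pairwise Euclidean distances, shape (n_synthetic, n_real).
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    return (dists <= radius).sum(axis=1)

def risky_fraction(real, synthetic, radius):
    """Fraction of synthetic points with exactly one real neighbour.

    A count of 0 (no real point to map to) or many (ambiguous candidates)
    both frustrate a reverse mapping; only a count of exactly 1 gives an
    adversary an unambiguous candidate.
    """
    counts = density_disclosure_counts(real, synthetic, radius)
    return float(np.mean(counts == 1))
```

The interesting quantity is the fraction of synthetic points with exactly one real neighbour, since those are the only points an adversary could map back unambiguously.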
Presence disclosure risk
Presence disclosure risk is closely related to differential privacy. Differential privacy provides a guarantee that the presence of any individual's data in a data set has a limited effect on the result of any query run against the data set. Presence disclosure risk measures the certainty with which a privacy adversary can infer whether an arbitrary test data point was present in the training set of a synthetic data generator. In other words, presence disclosure risk is high if the synthetic data pipeline transmits features that can be used to confidently distinguish data points that were present in the training set from those that were not. Low presence disclosure risk means that it is practically infeasible to deduce whether or not a given data point was in the training set.
Hazy calculates the presence disclosure risk by imagining an adversary that has access to the full synthetic data set and a set of data points that they want to classify as either present in or absent from the training data set. The classifier predicts presence if the Hamming distance from the test data point to its nearest synthetic point is below a certain threshold. The performance of this classifier, measured over a set of points drawn from the actual training data and a test data set that was held back from training, and averaged over all threshold settings, quantifies the presence disclosure risk.
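The attack described above can be sketched in a few lines. Averaging the classifier's performance over all thresholds is equivalent to a rank-based area under the ROC curve; the sketch below assumes categorical records compared attribute-by-attribute, and is an illustration of the idea rather than Hazy's metric implementation:

```python
import numpy as np

def hamming_to_nearest(points, synthetic):
    """Hamming distance from each point to its nearest synthetic record."""
    points = np.asarray(points)
    synthetic = np.asarray(synthetic)
    # Mismatched attributes for every (point, synthetic) pair.
    mismatches = (points[:, None, :] != synthetic[None, :, :]).sum(axis=-1)
    return mismatches.min(axis=1)

def presence_disclosure_score(train, holdout, synthetic):
    """Adversary advantage averaged over all distance thresholds.

    Equivalent to the probability that a random training point lies
    closer to the synthetic data than a random held-out point (ties
    count half) -- a rank-based formulation of ROC AUC. A score near
    0.5 means the adversary can do no better than guessing; a score
    near 1.0 means membership in the training set is easy to infer.
    """
    d_train = hamming_to_nearest(train, synthetic)
    d_hold = hamming_to_nearest(holdout, synthetic)
    wins = (d_train[:, None] < d_hold[None, :]).mean()
    ties = (d_train[:, None] == d_hold[None, :]).mean()
    return float(wins + 0.5 * ties)
```

A low score here corresponds to low presence disclosure risk: training-set members are no closer to the synthetic data than held-out points, so membership cannot be inferred.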
As discussed above, there is always a fundamental trade-off between privacy and utility, where safer, more private data can lose utility for a given use case. This is reflected in the data limitations section and in the manual control provided over both the ε level and the disclosure risk thresholds.
However, in practice, for many real-world exploratory and model-training use cases, Hazy's smart synthetic data is an effective way of maximising both privacy and utility: preserving performance whilst still being safe to work with.