Assessing the privacy of synthetic data is as important as measuring its similarity to the original data. There are many ways an attacker can exploit synthetic data to extract information about the original data (see ref for review). Traditional techniques to protect data privacy are based on data anonymisation. However, anonymisation on its own has been shown to be ineffective, leaving sensitive data at risk of re-identification.

Any data disclosure carries some risk: attackers who intend to misuse the information may be able to identify users (re-identification risk) or learn the values of sensitive attributes from the released synthetic data.

The two fundamental privacy metrics we use at Hazy to capture the safety of the synthetic data are presence disclosure risk and density disclosure.

Presence Disclosure

This metric quantifies the likelihood of identifying that a specific record was used to train the generator. Presence disclosure assumes that an attacker holds a complete set of records of individuals and checks whether any member of this set was used to generate the synthetic data. This is also known as a membership inference attack.

To quantify this risk, we split the original data into two sets: a training set and a test set. The original and synthetic data are binned into $N$ bins. We then trace the Receiver Operating Characteristic (ROC) curve as a function of the Hamming distance of the train and test data and evaluate the corresponding area under the curve (AUC). The metric is given by

$$\text{Presence Disclosure score} = 1 - 2 \times \text{AUC}$$

This means that if the AUC is 0.5 (a random guess), the score is 0 and there is no risk of an attacker identifying points in the original data based on the synthetic data. Note that, in general, the more complex the data (more columns and rows), the less risky the synthetic data will be.
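The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Hazy's implementation: the helper names (`to_bin`, `hamming`, `min_distance`, `auc`, `presence_disclosure`) are hypothetical, each record's distance is assumed to be the smallest Hamming distance to any synthetic record, and the ROC is assumed to be traced by sweeping a threshold on that distance with training records as the positive class. Under that orientation, training records that sit closer to the synthetic data push the AUC below 0.5 and the score above 0.

```python
def to_bin(value, lo, hi, n_bins):
    """Map a numeric value to one of n_bins equal-width bins on [lo, hi]."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return max(0, min(idx, n_bins - 1))  # clamp so value == hi lands in the last bin


def hamming(a, b):
    """Number of columns in which two binned records differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(record, synthetic):
    """Assumption: a record's distance is that to its nearest synthetic record."""
    return min(hamming(record, s) for s in synthetic)


def auc(train_dist, test_dist):
    """AUC as P(train distance > test distance) + 0.5 * P(tie), over all pairs."""
    wins = 0.0
    for t in train_dist:
        for h in test_dist:
            if t > h:
                wins += 1.0
            elif t == h:
                wins += 0.5
    return wins / (len(train_dist) * len(test_dist))


def presence_disclosure(train, test, synthetic):
    """Return (AUC, score) with score = 1 - 2 * AUC, as in the formula above."""
    train_dist = [min_distance(r, synthetic) for r in train]
    test_dist = [min_distance(r, synthetic) for r in test]
    a = auc(train_dist, test_dist)
    return a, 1 - 2 * a


# Toy, already-binned records (three columns each); values are made up.
train = [(0, 1, 2), (1, 1, 0)]
test = [(2, 0, 1), (0, 2, 2)]
synthetic = [(0, 1, 2), (1, 1, 1)]

auc_value, score = presence_disclosure(train, test, synthetic)
print(auc_value, score)  # 0.125 0.75: training records sit closer to the synthetic data
```

In this toy case one synthetic record is an exact copy of a training record, so the score is well above 0. Passing the same records as both train and test gives an AUC of 0.5 and a score of 0, matching the random-guess case.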

Density disclosure risk histogram

This shows the distribution of the density disclosure risk for the target column of synthetic records.

Disclosure risk is the risk of an individual’s presence in the source data being disclosed. Density disclosure risk is a specific way of calculating this, based on how closely a synthetic data record can be mapped to individual records in the source data. The higher the density, the lower the risk of disclosure. The density disclosure threshold is the level of disclosure risk that you’re willing to tolerate. For example, a threshold of 6% means that all records with more than a 6% risk are removed from the synthetic data. This can impact utility, as information may be lost when the sensitive data points are removed.
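As a rough illustration of how such a threshold acts, the sketch below drops every synthetic record whose risk exceeds the threshold and reports the fraction removed, which gives a first indication of the utility cost. The function name `apply_risk_threshold` and the per-record risk values are hypothetical; how the risk itself is computed is not covered here.

```python
def apply_risk_threshold(records, risks, threshold=0.06):
    """Keep records whose risk does not exceed the threshold.

    Returns the kept records and the fraction of records removed.
    """
    kept = [rec for rec, risk in zip(records, risks) if risk <= threshold]
    removed_fraction = (len(records) - len(kept)) / len(records)
    return kept, removed_fraction


# Made-up synthetic records and per-record risk values.
records = ["a", "b", "c", "d"]
risks = [0.01, 0.06, 0.07, 0.20]

kept, removed_fraction = apply_risk_threshold(records, risks, threshold=0.06)
print(kept, removed_fraction)  # ['a', 'b'] 0.5
```

A record with a risk of exactly 6% is kept, since only records with more than a 6% risk are removed.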

Density Disclosure

Density disclosure measures the likelihood of finding original data points within a certain vicinity of synthetic points. The synthetic data is projected into an $N$-dimensional space and overlaid on the original data. If the density of points is sufficiently high, we assign a higher risk probability. For the metric to be meaningful, we first perform PCA on the data and retain the first components. This metric evaluates the privacy risk of a generated dataset in terms of the likelihood of identifying individuals in the data.

Note again that the higher the dimensionality of the space, the sparser it becomes and the lower the associated risk.
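The neighbourhood-counting step can be sketched as follows. This is only an illustration under assumptions: the points are taken to be already projected (for example onto the retained PCA components), the radius and the density threshold are hypothetical parameters, and the mapping from density to a risk probability is not specified in the text, so the sketch stops at flagging synthetic points in dense regions.

```python
import math


def local_density(synthetic_point, original_points, radius):
    """Fraction of original points lying within `radius` of a synthetic point."""
    inside = sum(math.dist(synthetic_point, o) <= radius for o in original_points)
    return inside / len(original_points)


# Made-up 2D coordinates, standing in for the first two retained components.
original = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (3.0, 3.0)]
synthetic = [(0.0, 0.05), (2.9, 3.0), (10.0, 10.0)]

radius = 0.5             # hypothetical vicinity size
density_threshold = 0.5  # hypothetical cut-off for "sufficiently high"

densities = [local_density(s, original, radius) for s in synthetic]
flagged = [d >= density_threshold for d in densities]
print(densities)  # [0.5, 0.25, 0.0]
print(flagged)    # [True, False, False]
```

The third synthetic point has no original points nearby at all, which is the situation that becomes more common as the number of dimensions grows and the space becomes sparser.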

Visualising the Density Disclosure risk. The darker areas correspond to points at higher risk.