Privacy
Assessing the privacy of synthetic data is as important as measuring its similarity to the source data. There are many ways an attacker can exploit synthetic data to extract information about the original data. Traditional techniques for protecting data privacy are based on data anonymisation; however, anonymisation has been shown to be ineffective, putting sensitive data at risk of re-identification.
Any data disclosure carries some risk. Attackers who intend to misuse the information may be able to identify users (the re-identification risk) or learn the values of sensitive attributes from the released synthetic data.
The three fundamental privacy metrics we use at Hazy to capture the safety of synthetic data are presence disclosure risk, density disclosure risk, and distance to closest record. All three measure the risk of identifying a user in the database; this is only a problem if that identification is of value to an attacker. For example, identifying that a user is in the UK 2021 Census will not provide an attacker with much information. For more background on privacy, see the documentation on differential privacy.
Density disclosure risk¶
Density disclosure risk is one of the measures used to estimate the risk of an adversary somehow constructing a mapping from the synthetic data points to the real data points.
Sometimes this notion is referred to as "reversibility". For clarity, it is worth recalling that Hazy does not construct synthetic data by applying a forward mapping to individual real data points (the approach taken by anonymisation techniques), since an adversary can easily exploit such a mapping by reversing it. Even though there is no forward mapping to reverse, there remains the logical possibility that a highly sophisticated adversary could construct a mapping from synthetic points to real points by other means.
Density disclosure risk quantifies this risk by counting how many real points exist in the neighbourhood of each synthetic point. For a given synthetic point, if there are no real points in the neighbourhood, or if there are many, then an adversary cannot construct an unambiguous map from the synthetic point to a real point: either there is no real point to map to, or there are so many alternatives that any mapping is ambiguous at best and the data subject retains plausible deniability.
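As a minimal sketch of this neighbourhood-counting idea (not Hazy's implementation: the fixed radius and the 1/k mapping from neighbour count to risk score are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_disclosure_scores(real: np.ndarray, synthetic: np.ndarray,
                              radius: float = 0.1) -> np.ndarray:
    """For each synthetic point, count the real points within `radius`
    and map that count to a risk score in [0, 1]."""
    nn = NearestNeighbors(radius=radius).fit(real)
    # Indices of the real points inside each synthetic point's neighbourhood.
    neighbourhoods = nn.radius_neighbors(synthetic, return_distance=False)
    counts = np.array([len(idx) for idx in neighbourhoods])
    # Zero real neighbours: nothing to map to, so no risk. Exactly one
    # neighbour: an unambiguous mapping, so maximum risk. Many neighbours:
    # the mapping is ambiguous, so risk falls off as 1/count.
    return np.where(counts == 0, 0.0, 1.0 / np.maximum(counts, 1))
```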
Use cases¶
This is a general privacy metric that should be used whenever privacy is a concern. It is recommended whenever the data is to be shared with external third parties or any untrusted groups or individuals.
Interpretation¶
A risk score is calculated for each sampled synthetic data point and a histogram of the scores is then computed. Because it is unlikely that any given synthetic data point has a high risk score (i.e., close to 1), a log scale is applied to the y-axis to keep the graph readable. A perfect result would show a single bar on the left-hand side of the graph; the more data that appears on the right side, the higher the overall privacy risk.
Due to the random nature of synthetic data generation, it is expected that some synthetic data points will always happen to lie near real data points. The density disclosure score is the product of all the bars shown on the diagram.
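As a rough illustration of how such a histogram is read, the snippet below plots the per-point scores from the hypothetical `density_disclosure_scores` sketch above with a log-scaled y-axis; the binning and variable names are assumptions, not Hazy's reporting code.

```python
import matplotlib.pyplot as plt

# `real` and `synthetic` are assumed to be numpy arrays of encoded records.
scores = density_disclosure_scores(real, synthetic)

plt.hist(scores, bins=20)
# Most points should score near 0; a log y-axis keeps the risky tail visible.
plt.yscale("log")
plt.xlabel("Density disclosure risk score")
plt.ylabel("Number of synthetic points (log scale)")
plt.show()
```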
| Quality interpretation guide | | | | |
| --- | --- | --- | --- | --- |
| 0% – 30% | 30% – 60% | 60% – 80% | 80% – 90% | 90% – 100% |
Troubleshooting¶
If this metric shows poor results for one or several columns, you can try to improve them by doing one or more of the following:
- Decrease `n_bins`
- Decrease `max_cat`
- Decrease `n_parents`
- Decrease `sample_parents`
- Decrease `epsilon`
- Set `sort_visit` to `false`
- Select a simpler classifier or regressor: `LogisticRegression`, `LinearRegression`, `SVM`
Presence disclosure risk¶
Presence disclosure risk is closely related to differential privacy. Simply put, differential privacy provides a provable guarantee that the presence of any individual’s data in a dataset has a limited effect on the result of a query run against the dataset.
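For reference, the standard ε-differential-privacy guarantee states that for a randomised mechanism M, any two datasets D and D′ differing in a single individual's record, and any set of outputs S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
```

Smaller values of ε give stronger guarantees, which is why the troubleshooting steps in this documentation suggest decreasing `epsilon`.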
Presence disclosure risk measures the certainty with which a privacy adversary could infer whether an arbitrary data point was present in the training set of a synthetic data generator. In other words, presence disclosure risk is high if the synthetic data pipeline transmits features that can be used to confidently distinguish data points that were present in the training set from those that were not. Low presence disclosure risk means that it is practically infeasible to deduce whether or not a given data point was in the training set.
Hazy calculates the presence disclosure risk by imagining an adversary who has access to the full synthetic dataset and a set of data points that they want to classify as either present in or absent from the training dataset. For a given data point, the classifier predicts presence if the Hamming distance from the data point to its nearest synthetic point is below a certain threshold. The performance of this classifier is measured, for a range of threshold values, over a set of points drawn from the actual training data and an equal number of points from a test dataset that was held back from training. The presence disclosure metric is the average of the classifier's performance over all threshold settings.
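The following is a minimal sketch of this threshold-classifier evaluation, assuming records are encoded as equal-length integer arrays so that the Hamming distance applies; the `hamming_to_nearest` helper and the balanced-accuracy summary are illustrative assumptions, not Hazy's implementation.

```python
import numpy as np

def hamming_to_nearest(points: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Normalised Hamming distance from each point to its nearest
    reference record."""
    # Pairwise disagreement rates: an (n_points, n_reference) matrix.
    dists = (points[:, None, :] != reference[None, :, :]).mean(axis=2)
    return dists.min(axis=1)

def presence_disclosure(train, test, synthetic,
                        thresholds=np.linspace(0.0, 1.0, 21)):
    d_member = hamming_to_nearest(train, synthetic)   # points seen in training
    d_outside = hamming_to_nearest(test, synthetic)   # held-out points
    accuracies = []
    for t in thresholds:
        # The adversary predicts "was in the training set" when the
        # distance to the nearest synthetic record falls below t.
        tpr = (d_member < t).mean()    # members correctly identified
        tnr = (d_outside >= t).mean()  # non-members correctly rejected
        accuracies.append((tpr + tnr) / 2.0)
    # Average the classifier's performance over all threshold settings.
    return float(np.mean(accuracies))
```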
Use cases¶
This is a general privacy metric that should be used whenever privacy is a concern. It is recommended whenever the data is to be shared with external third parties or any untrusted groups or individuals.
Interpretation¶
| Quality interpretation guide | | | | |
| --- | --- | --- | --- | --- |
| 0% – 30% | 30% – 60% | 60% – 80% | 80% – 90% | 90% – 100% |
Troubleshooting¶
If this metric shows poor results for one or several columns, you can try to improve them by doing one or more of the following:
- Decrease `n_bins`
- Decrease `max_cat`
- Decrease `n_parents`
- Decrease `sample_parents`
- Decrease `epsilon`
- Set `sort_visit` to `false`
- Select a simpler classifier or regressor: `LogisticRegression`, `LinearRegression`, `SVM`
Distance to closest record¶
Distance to closest record seeks to capture synthetic data points that are direct copies of, or minor perturbations of, real data records. The aim of the metric is to detect whether the generator has overfitted, memorising and reproducing real data points in the synthetic data, which would be a clear privacy violation.
To measure this risk, Hazy calculates and compares two sets of pairwise distances:
- between real and synthetic data records, and
- between real and test data records (data sampled from the source data but set aside for testing purposes and not used to train the generator).
The distances are measured as follows: for every record in the synthetic data, we find its closest neighbour in the real data using the Hamming distance; we do the same for every record in the test data.
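A minimal sketch of this comparison, reusing the illustrative `hamming_to_nearest` helper from the presence-disclosure sketch above (the percentile summary at the end is an assumption for illustration, not Hazy's reported statistic):

```python
import numpy as np

# `real`, `test` and `synthetic` are assumed numpy arrays of encoded records.
# Distance from every synthetic record to its closest real record, and the
# same for the held-out test records.
d_synth = hamming_to_nearest(synthetic, real)
d_test = hamming_to_nearest(test, real)

# Synthetic records sitting systematically closer to the real data than
# genuinely unseen test records suggest memorised training points; zero
# distances are exact copies.
print("exact copies:", int((d_synth == 0).sum()))
print("5th percentile, synthetic vs test:",
      np.percentile(d_synth, 5), np.percentile(d_test, 5))
```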
Use cases¶
This is a general privacy metric that should be used whenever privacy is a concern. It is recommended whenever the data is to be shared with external third parties or any untrusted groups or individuals.
Interpretation¶
| Quality interpretation guide | | | | |
| --- | --- | --- | --- | --- |
| 0% – 30% | 30% – 60% | 60% – 80% | 80% – 90% | 90% – 100% |
Troubleshooting¶
If this metric shows poor results for one or several columns, you can try to improve them by doing one or more of the following:
- Decrease `n_bins`
- Decrease `max_cat`
- Decrease `n_parents`
- Decrease `sample_parents`
- Decrease `epsilon`
- Set `sort_visit` to `false`
- Select a simpler classifier or regressor: `LogisticRegression`, `LinearRegression`, `SVM`