What makes synthetic data privacy-preserving?

Synthetic data is artificial data that mimics the patterns and statistical properties of real data. Not every kind of synthetic data preserves the privacy of the individuals in the original dataset. In some cases it is possible to reconstruct the original data that the synthetic data generator was trained on.

In order for synthetic data to be privacy-preserving, it needs to be generated in a certain way, for example by combining it with another privacy-enhancing technology.

Privacy-enhancing technologies (PETs) are “technologies and approaches that enable the derivation of useful results from data without providing full access to the data.”[1] PETs are exciting because they allow us to set data free and obtain insight from it without putting at risk the individuals whose data is in the original dataset. There are several different PETs, such as differential privacy, homomorphic encryption, and federated learning. Each PET suits different use cases.

Differential privacy (DP) is at present the only PET that offers a quantifiable, mathematically robust way of assessing the risk of privacy loss.

DP is a mathematical definition which guarantees that when a statistic about a dataset is released, very little can be inferred about whether any particular individual was in the original data or not. DP is usually achieved by adding a carefully tuned level of random noise to the statistics computed from the original dataset, or to the process that computes them.
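To make the idea of carefully tuned noise concrete, here is a minimal sketch of the Laplace mechanism, one standard way of releasing a count with differential privacy. The function name `dp_count`, the example ages, and the choice of epsilon are illustrative assumptions, not part of any particular product. The privacy parameter epsilon sets the noise level: a smaller epsilon means more noise and stronger privacy.

```python
import numpy as np


def dp_count(values, predicate, epsilon, rng):
    """Release a noisy count of the values matching `predicate` (hypothetical helper)."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    # Laplace noise with scale sensitivity / epsilon gives epsilon-DP for this count.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


rng = np.random.default_rng(seed=0)
ages = [23, 35, 41, 52, 67, 29, 38, 74]  # made-up example data
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"Noisy count of people aged 40 or over: {noisy:.2f}")
```

The released value is close to the true count on average, but any single release is blurred enough that the presence or absence of one person cannot be read off it with confidence.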

Differential privacy, in comparison with other PETs, is particularly useful when we want to release statistics or derived information about a large dataset. DP was recently recommended by the ICO in their draft guidance encouraging organisations to implement a data protection by design approach. The ICO guidance suggests that differentially private data configured in the right way has the status of anonymous data.

Synthetic data that is combined with differential privacy (differentially private synthetic data) is therefore able to prevent disclosure about the participation of a particular datapoint in the original dataset.
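As a rough illustration of how synthetic data and differential privacy can be combined, the sketch below builds a histogram of one categorical column, adds Laplace noise to the counts, and samples synthetic values from the noisy distribution. This is a deliberately simple textbook approach and not a description of how Hazy's generators work. The function name `dp_synthetic_column`, the category list, and the example data are all assumptions made for illustration.

```python
import numpy as np


def dp_synthetic_column(values, categories, epsilon, n_synthetic, rng):
    """Sample synthetic values from a Laplace-noised histogram (hypothetical helper)."""
    # `categories` must be fixed in advance, not derived from the private data.
    counts = np.array([sum(1 for v in values if v == c) for c in categories], dtype=float)
    # Each person falls into exactly one bin, so adding or removing one person
    # changes the histogram by at most 1 in total (L1 sensitivity = 1).
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(categories))
    # Clipping, normalising and sampling are post-processing steps,
    # which do not weaken the differential privacy guarantee.
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0:
        probs = np.full(len(categories), 1.0 / len(categories))
    else:
        probs = noisy / noisy.sum()
    return rng.choice(categories, size=n_synthetic, p=probs).tolist()


rng = np.random.default_rng(seed=0)
blood_types = ["O"] * 45 + ["A"] * 40 + ["B"] * 11 + ["AB"] * 4  # made-up example data
synthetic = dp_synthetic_column(blood_types, ["O", "A", "B", "AB"], epsilon=1.0,
                                n_synthetic=100, rng=rng)
print(synthetic[:10])
```

The synthetic column follows roughly the same distribution as the original, yet every synthetic value is drawn from the noisy histogram rather than copied from a real record.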

This is the approach we have taken at Hazy to create privacy-preserving synthetic data. By combining synthetic data with differential privacy, we are creating a double layer of security which minimises risk of data leakage and meets the criteria for anonymous data as defined by the ICO guidelines. This also means GDPR does not apply to Hazy synthetic data.

As mentioned above, different PETs are appropriate for different use cases. If you want to know if our privacy-preserving synthetic data can make your data more available, you can explore our five use cases here.

[1] The Royal Society report From privacy to partnership (January 2023)
