Synthetic data & GDPR compliance
Overview
- The obligations that the UK GDPR places on the processing of personal data do not apply to anonymous data.
- Synthetic data generated with built-in differential privacy guarantees can meet the criteria for anonymous data set out in the draft guidance of the UK regulator, the Information Commissioner’s Office (ICO).
- Hazy’s Synthetic Data Platform generates synthetic data that is sufficiently anonymous that the UK GDPR does not apply to it.
The regulatory landscape
Let’s start with the GDPR. The UK GDPR makes clear that its data protection principles do not apply to personal data rendered anonymous, provided this is done in such a manner that the data subject is not, or is no longer, identifiable.
This is all well and good, but in practice:
- The actual identifiability of individuals can be highly context-specific;
- Different types of information have different levels of identifiability risk depending on the circumstances in which they are processed;
- The process of creating synthetic data using a dataset containing personal data requires the processing of personal data.
So, what constitutes a sufficient level of anonymisation in the resultant synthetic dataset? To answer this, we turn to the Information Commissioner’s Office (ICO) draft guidance:
“Effective anonymisation reduces identifiability risk to a sufficiently remote level”
ICO draft anonymisation, pseudonymisation and privacy enhancing technologies (PETs) guidance published September 2022
Based on this guidance, whether the resultant synthetic dataset constitutes personal data or anonymous information is a question to be determined based on an assessment of the identifiability risk.
Hazy’s comprehensive approach to privacy protection
The Hazy platform is a sophisticated privacy enhancing technology that combines two well-known privacy methods to produce synthetic datasets where the identifiability risk is sufficiently remote:
Generative Models
Generative models break the one-to-one mapping between real and synthetic data points, greatly reducing the risk of singling out and linkability. The technical part: Hazy relies on generative machine learning models to produce synthetic data in a two-step process:
- Fitting: the generative model training algorithm takes the real data as an input, updates its internal parameters to learn a (lower-dimensional) representation of the probability distribution of the real data, and outputs a trained model.
- Generation: the trained model is sampled to produce a synthetic dataset, breaking the one-to-one mapping from a single real data point to a single synthetic data point.
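The two steps above can be sketched with a toy model. Hazy’s actual generative models are not specified in this document, so a multivariate Gaussian stands in here purely to illustrate the fit/generate split: the model learns only distribution-level parameters, and sampled rows are not derived from any single real row.

```python
import numpy as np


def fit(real_data: np.ndarray) -> dict:
    """Fitting step: learn a compact representation of the data's
    probability distribution. Toy model: a multivariate Gaussian
    parameterised by the sample mean and covariance."""
    return {
        "mean": real_data.mean(axis=0),
        "cov": np.cov(real_data, rowvar=False),
    }


def generate(model: dict, n_rows: int, seed: int = 0) -> np.ndarray:
    """Generation step: sample fresh records from the trained model.
    No synthetic row maps back to any individual real row."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(model["mean"], model["cov"], size=n_rows)


# Illustrative "real" data: 1,000 records with two numeric columns.
real = np.random.default_rng(1).normal(
    loc=[10.0, 50.0], scale=[2.0, 5.0], size=(1000, 2)
)
model = fit(real)
synthetic = generate(model, n_rows=500)
```

Note that the synthetic dataset need not even have the same number of rows as the real one, which underlines that there is no record-level correspondence to exploit.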
Differential Privacy (DP)
Differential Privacy mechanisms are designed to protect against singling out, linkability, and other re-identification risks, even when faced with a resourceful and strategic adversary.
The technical part: We incorporate Differential Privacy in the generative model fitting step. Differentially Private mechanisms rely on carefully calibrated randomness and noise perturbation, so that the trained model’s parameters cannot depend too heavily on any single record.
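As a minimal illustration of noise perturbation (not Hazy’s actual mechanism, which is not detailed here), the Laplace mechanism releases a statistic, such as a mean used during fitting, with noise calibrated to each record’s maximum influence:

```python
import numpy as np


def dp_mean(data: np.ndarray, epsilon: float, lower: float, upper: float,
            seed: int = 0) -> float:
    """Laplace mechanism: release a differentially private mean.
    Values are first clipped to [lower, upper] to bound any single
    record's influence (the sensitivity), then Laplace noise scaled
    to sensitivity / epsilon is added."""
    rng = np.random.default_rng(seed)
    clipped = np.clip(data, lower, upper)
    sensitivity = (upper - lower) / len(data)  # max change from one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)


# Hypothetical column of ages; bounds are assumed, not learned from data.
ages = np.array([23, 35, 41, 29, 52, 38, 47, 31, 44, 36], dtype=float)
private_estimate = dp_mean(ages, epsilon=1.0, lower=0.0, upper=100.0)
```

The key property: adding or removing any one individual changes the released value’s distribution only slightly, which is what limits what an adversary can infer about that individual, including an outlier.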
Using differential privacy with synthetic data can protect any outlier records from linkage attacks with other data.
ICO draft anonymisation, pseudonymisation and privacy enhancing technologies (PETs) guidance published September 2022
By combining generative models with differential privacy, Hazy produces synthetic data with a sufficiently remote risk of re-identification, an approach consistent with ICO guidance, which takes into account the concept of identifiability in its broadest sense.
This approach does not simply focus on removing obvious information that relates to an individual. Instead, it is designed to sufficiently anonymise the data, reducing the possibility of:
- Singling out or individuation
- Linkability
- A motivated intruder re-identifying individuals
Hazy's platform generates synthetic data that is sufficiently anonymous that the UK GDPR does not apply to it.
For more information on any of the above, book some time with our team.