We are living in a data-driven world. Data-driven research, experimentation and innovation depend on highly accurate behavioural and socio-demographic data. But there are more malicious actors than ever trying to steal that data and take advantage of our customers’ identities.
As Alex Hern wrote for The Guardian, “An anonymised dataset is supposed to have had all personally identifiable information removed from it, while retaining a core of useful information for researchers to operate on without fear of invading privacy.”
Anonymised data makes sense in theory. It should help researchers, along with app developers, data scientists and architects, and QA testers, work with data without compromising privacy. And it should allow third-party partners to build onto and integrate with your data and tooling.
Two problems with this. The privacy isn’t guaranteed at all. And, when the privacy is preserved, you are left with data that has lost just about all of its utility. With traditional data anonymisation or de-identification, there’s a huge trade-off between true privacy protection and actually useful data.
In this article, we talk about the weaknesses of anonymised data and offer synthetic data as an alternative where both privacy and data utility can win.
The challenges of anonymised data
Classic data anonymisation applies techniques that deliberately destroy identifying information. If done correctly, none of the data can be connected back to an individual. We’ve already written about the complicated challenge of anonymised data techniques. You need to experiment with a combination of four different kinds of data anonymisation techniques:
- Perturbation: injecting noise into the data so it’s secure without harming its statistical significance — this often leaves the data lacking accuracy.
- Permutation: “permuting” or randomising specifically the personally identifiable attributes within the data.
- Data generalisation: replacing values with broader categories drawn from a hierarchically structured attribute of the data (for example, an exact age with an age band), which reduces information without making it inaccurate. Accuracy drops as hierarchical complexity increases.
- Data suppression: like a shredder, it’s a complete redaction, losing most utility, but the safest option.
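The four techniques above can be sketched on a toy table. This is a minimal illustration using only the Python standard library, not a production anonymisation pipeline: the records, field names, noise scale and helper functions (`perturb`, `permute`, `generalise_age`, `suppress`) are all invented for this example.

```python
import random

# Toy records; every name and value here is made up for illustration.
records = [
    {"name": "Ann", "age": 34, "zip": "94720", "salary": 52000},
    {"name": "Bob", "age": 47, "zip": "94110", "salary": 61000},
    {"name": "Cy", "age": 29, "zip": "10001", "salary": 48000},
    {"name": "Dee", "age": 61, "zip": "10027", "salary": 75000},
]

def perturb(rows, field, scale, rng):
    """Perturbation: add zero-mean Gaussian noise to a numeric field."""
    return [{**r, field: r[field] + rng.gauss(0, scale)} for r in rows]

def permute(rows, field, rng):
    """Permutation: shuffle one column so values no longer line up with their rows."""
    values = [r[field] for r in rows]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(rows, values)]

def generalise_age(rows, width=10):
    """Generalisation: replace an exact age with a band such as '30-39'."""
    out = []
    for r in rows:
        low = (r["age"] // width) * width
        out.append({**r, "age": f"{low}-{low + width - 1}"})
    return out

def suppress(rows, field):
    """Suppression: drop the field entirely."""
    return [{k: v for k, v in r.items() if k != field} for r in rows]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
step1 = perturb(records, "salary", 1000, rng)
step2 = permute(step1, "zip", rng)
step3 = generalise_age(step2)
anonymised = suppress(step3, "name")

for row in anonymised:
    print(row)
```

Each step removes a little more of what made the original rows useful: salaries are no longer exact, ZIP codes no longer belong to the right person, ages are coarse bands, and names are gone. That is the trade-off between privacy and utility in miniature.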
Modern data protection laws like the European General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA) make a legal distinction between anonymous data, in which identifying data is completely masked or obfuscated, and pseudonymous data, which doesn’t contain obvious identifiers but may be re-identifiable.
Whether CPRA and GDPR compliant or not (and some would say anonymised data could never actually be compliant), many supposedly anonymous datasets have since been used, shared and sold. Everything from medical results to Netflix users’ sexual identities to New York City taxi rides has been released as supposedly anonymous open data, only to then be re-identified.
Open data is a great way for enterprises to partner and integrate with third-party innovators, but there’s no doubt that poor data anonymisation will find your business in the headlines and fast.
‘Anonymised’ data raises privacy red flags
A paper in the peer-reviewed open access journal PLOS One brought together the research of 327 studies. Across these articles, 14 demographic attributes (gender, age, race and ethnicity, family, location, language, income, social class, education, religion, health, occupation, sexual orientation and political orientation) could be identified from our digital traces. Age and gender in particular, which are often used to bias human loan and hiring decisions and are the most widely studied demographics, were easily identifiable in 105 and 63 studies, respectively.
Combining basic, openly available data from social networks with any obscured or anonymised data can enable simple re-identification. This isn’t just the correlation of things like native language and gender, but the discovery that women are more likely to use more emotion words and emojis, while men use more assertion and curse words. So even if binary gender were redacted, open data from social network usage could allow for fairly easy re-identification.
Anonymised data could be the biggest threat to identity in the post-GDPR world.
The authors of the PLOS One paper wrote:
“The ability to use network data to infer attributes can be incredibly useful in identifying information that may not be disclosed directly by an individual. However, this has serious implications for privacy – individuals may want to keep their political beliefs, sexuality etc. private and may not realise they are inadvertently revealing them through their digital activity. Alternatively, the extent to which this is a concern is dependent on who the individual would want to conceal such information from – computer algorithms may be able to detect such information; however, it is unlikely that the average human or people within their network would be able to make such inferences accurately from looking at this type of data.”
According to a recent paper on re-identification in the journal Nature, a lot of companies are trying to combat this by only releasing a small portion of their dataset, from one to ten percent. Not only is it still possible to identify people, but the sampled and obfuscated data is also much less useful.
This research gave the following example:
“Imagine a health insurance company who decides to run a contest to predict breast cancer and publishes a de-identified dataset of 1000 people, 1% of their 100,000 insureds in California, including people’s birth date, gender, ZIP code, and breast cancer diagnosis. John Doe’s employer downloads the dataset and finds one (and only one) record matching Doe’s information: male living in Berkeley, CA (94720), born on January 2nd 1968, and diagnosed with breast cancer (self-disclosed by John Doe). This record also contains the details of his recent (failed) stage IV treatments. When contacted, the insurance company argues that matching does not equal re-identification: the record could belong to 1 of the 99,000 other people they insure or, if the employer does not know whether Doe is insured by this company or not, to anyone else of the 39.5M people living in California.”
Even when everything is anonymised, malicious actors can figure out identities. Each data footprint is quite unique. And anonymised information does little to blur it.
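To make the matching step in the insurer example concrete, here is a hedged sketch of a linkage check. The four released rows, the `QUASI_IDENTIFIERS` tuple and the helper names are hypothetical, and this is not the statistical model used in the Nature paper. It only shows how few attributes are needed to isolate a record in a sample.

```python
from collections import Counter

# Hypothetical de-identified release: no names, only quasi-identifiers
# plus one sensitive field. All rows are invented for illustration.
released = [
    {"birth_date": "1968-01-02", "gender": "M", "zip": "94720", "diagnosis": "breast cancer"},
    {"birth_date": "1975-06-30", "gender": "F", "zip": "94720", "diagnosis": "none"},
    {"birth_date": "1975-06-30", "gender": "F", "zip": "94720", "diagnosis": "breast cancer"},
    {"birth_date": "1982-11-15", "gender": "F", "zip": "90210", "diagnosis": "none"},
]

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")

def key(row):
    """The combination of quasi-identifier values for one row."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)

def unique_fraction(rows):
    """Share of rows whose quasi-identifier combination appears exactly once."""
    counts = Counter(key(r) for r in rows)
    return sum(1 for r in rows if counts[key(r)] == 1) / len(rows)

def link(rows, known):
    """Rows consistent with what an outsider already knows about a person."""
    return [r for r in rows if all(r[k] == v for k, v in known.items())]

# What an employer might already know about one employee.
known = {"birth_date": "1968-01-02", "gender": "M", "zip": "94720"}
matches = link(released, known)

print(unique_fraction(released))
print(matches)
```

A single match in the sample does not by itself prove the record belongs to that person, which is exactly the insurer's argument in the quote. The question the paper addresses is how likely such a match is to be correct across the whole population.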
This Nature article contends not only that it is often easy to re-identify individuals from anonymised data, but also that the uniqueness of our digital footprints will only increase. This uniqueness makes anonymisation nearly impossible. Add to this that machine learning models are only getting more accurate, even when trained on only one percent of a population the size of the U.S.
Effectively, anonymised data is a misnomer. It is a real threat to privacy and is on track to get even worse.
Smart synthetic data offers a useful alternative to anonymised data
Another flaw in classic data anonymisation is that it lacks utility. Organisations are still pouring billions a year into their artificial intelligence and innovation programmes without getting results. In part, this is because they spend months trying to access data across silos and sources, and then, in order to protect privacy, they black out all the information that could make it useful.
In order to secure privacy, you could redact all the info about someone, but then you’re rendering that info fairly useless. You can’t really develop tools and features to better serve customers if you are building on top of an unrecognisable blob of generic info.
With classic data anonymisation strategies, you’re facing a huge trade-off between data privacy protection and actually being able to get any value out of the data.
In fact, when Luke, James and I started this company over three years ago, we set out to solve the flaws of traditional data anonymisation and obfuscation. We soon discovered anonymised data will always be problematic. We saw an opportunity in applying machine learning to synthetic data and pivoted our focus completely.
Synthetic data has been around for a while, and simply means any applicable production data that are not obtained by direct measurement. Because it’s not real, it’s totally safe. And it’s been used for decades for things like fuzz testing. The problem is traditional synthetic data gives an incomplete view of bugs and security flaws.
But we didn’t see the opportunity in just plain synthetic data. We saw it in the recent possibility of smart synthetic data.
The last five years have seen an astonishing advancement in machine learning algorithms and technology. (This has also helped us attract some of the most brilliant talent.) AI-generated synthetic data, like Hazy synthetic data, retains almost all of the utility of a data set while making re-identification impossible. Smart synthetic data follows the same curves and patterns as the source data, retaining its utility by preserving distributions and correlations. This means it is highly useful in data science and model training without risking any privacy.
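As a rough illustration of what "preserving distributions and correlations" means, the sketch below fits a very simple statistical model (a two-variable Gaussian) to stand-in source data and samples new rows from it. This is an assumption-laden toy, not Hazy's method or any real generator: the source data is itself simulated, the functions `fit_gaussian` and `sample` are invented here, and a model this simple carries no privacy guarantee on its own.

```python
import math
import random
import statistics

def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation of two columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return mx, my, sx, sy, cov / (sx * sy)

def sample(params, n, rng):
    """Draw n new (x, y) pairs with the fitted means, spreads and correlation."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into y so that corr(x, y) equals rho.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Stand-in "source" data: age and a correlated income. Values are invented.
ages = [rng.uniform(20, 65) for _ in range(500)]
incomes = [20000 + 800 * a + rng.gauss(0, 5000) for a in ages]

params = fit_gaussian(ages, incomes)
synthetic = sample(params, 5000, rng)
syn_params = fit_gaussian([a for a, _ in synthetic], [i for _, i in synthetic])

print("source correlation:   ", round(params[4], 2))
print("synthetic correlation:", round(syn_params[4], 2))
```

No synthetic row corresponds to a source row, yet the age and income relationship a data scientist would model is still there. Real generators use far richer models to capture non-Gaussian shapes and many columns at once.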
While smart synthetic data is much more useful for testers, it also unclogs data bottlenecks across organisations. It brings useful, secure data to teams in data science, open innovation, application development, and cloud migration.
With smart synthetic data, you no longer have to choose between privacy and utility.