Anonymised Data

Published on 01 May 2019.

What is data anonymisation?

Data anonymisation is a set of processes and methods applied to a dataset to ensure that the data it contains cannot be linked back to the individual or entity it came from.

To anonymise data efficiently and appropriately, detailed knowledge of the data is critical. For example, pseudonymised information such as anonymised addresses presents a different challenge from names.

Putting anonymisation to use

There are various anonymisation techniques, but unfortunately there's no magic bullet. To achieve better anonymisation, you need to spend time finding an effective combination of techniques for each individual dataset and use case.

Weaknesses of anonymisation

There is a wide range of academic literature demonstrating weaknesses of anonymised data and the ability to re-identify individuals even in datasets that have been comprehensively anonymised.

Pseudonymised information has many weaknesses. The main attack vector is the linkage attack, in which an attacker re-identifies individuals in anonymised data by cross-referencing it with publicly available data (e.g. a voter registration list) or other sources.
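
To make the idea concrete, here is a minimal sketch of a linkage attack in Python. Everything in it is hypothetical and made up for illustration: the records, the names, the choice of `zip`, `birth_date` and `sex` as quasi-identifiers, and the `link_records` helper. It only shows the general principle of joining two datasets on the fields they share.

```python
# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
anonymised_records = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_date": "1980-03-02", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public dataset, e.g. a voter registration list.
voter_list = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1945-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1962-01-15", "sex": "M"},
    {"name": "Carol Example", "zip": "02141", "birth_date": "1990-11-20", "sex": "F"},
]

# Assumed set of fields that appear in both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(anonymised, public, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where exactly one public row matches."""
    # Index the public data by its quasi-identifier values.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for record in anonymised:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], record))
    return matches


reidentified = link_records(anonymised_records, voter_list)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data, two of the three "anonymised" records match exactly one person in the public list, so removing the name column alone did not protect them.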

Researchers such as Luc Rocher at UCLouvain have argued that, because of linkage attacks, anonymisation does not meet the requirements of the GDPR (General Data Protection Regulation).

What else can I do to enhance data privacy?

There are a number of privacy-enhancing technologies on the market. Some of them are expensive and require enterprise-grade technical capacity, but others are fairly simple and affordable for SMEs and even young startups.

Selecting the right combination can be a challenge, even though the different methods are well understood and a lot of information is available.

Techniques include:

Use cases for synthetic data

Synthetic data has been around for a long time, in the form of test data and automatically generated data for fuzz testing, etc. However, a combination of advances in algorithms and machine learning technology has led to the emergence of a new generation of smart synthetic data that's suitable for a much wider range of use cases, including data science and model training.

There are many use cases for synthetic data. These vary from increased data portability to safe sharing with external vendors and partners.

Working with synthetic data, as opposed to real data, eliminates the vast majority of attack vectors, because synthetic data is fake data and so there's no real data or customer information to leak. In addition, synthetic data can be combined with other privacy techniques for additional guarantees.

For example, synthetic data generator models can be trained to be differentially private, and data can be generalised either before or after training a generator.
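
As an illustration of the generalisation step, the sketch below coarsens two columns before the data would be handed to a generator. It is only a sketch under assumptions: the column names `age` and `postcode`, the bucket size, and the helper functions are all hypothetical, and a real dataset would need its own rules.

```python
def generalise_age(age, bucket_size=10):
    """Replace an exact age with a range, e.g. 34 -> '30-39'."""
    low = (age // bucket_size) * bucket_size
    return f"{low}-{low + bucket_size - 1}"


def generalise_postcode(postcode, keep=3):
    """Keep the first `keep` characters and mask the rest."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def generalise_row(row):
    """Return a copy of the row with the assumed columns coarsened."""
    out = dict(row)
    out["age"] = generalise_age(row["age"])
    out["postcode"] = generalise_postcode(row["postcode"])
    return out


# Hypothetical source rows.
rows = [
    {"age": 34, "postcode": "90210", "spend": 120.5},
    {"age": 61, "postcode": "10001", "spend": 48.0},
]

generalised = [generalise_row(r) for r in rows]
for row in generalised:
    print(row)
```

Coarser values mean more people share the same combination of attributes, at the cost of some detail in the data.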

The challenge is making the synthetic data smart enough to carry through the insight and information from the source data. This ranges from matching simple patterns, such as the distribution of values in a column, to complex modelling of implicit relations and sequential data.
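
The simplest end of that range, matching the distribution of values in a single column, can be sketched in a few lines of Python. This is a deliberately naive illustration, not how a production generator works: the `fit_column` and `sample_column` helpers and the `plan` values are hypothetical, and the approach assumes a categorical column.

```python
import random
from collections import Counter


def fit_column(values):
    """Learn the empirical distribution of a categorical column."""
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return categories, weights


def sample_column(model, n, seed=None):
    """Draw n synthetic values with the learned frequencies."""
    categories, weights = model
    rng = random.Random(seed)
    return rng.choices(categories, weights=weights, k=n)


# Hypothetical source column: subscription plan per customer.
source = ["basic"] * 70 + ["premium"] * 25 + ["enterprise"] * 5

model = fit_column(source)
synthetic = sample_column(model, n=1000, seed=0)
print(Counter(synthetic))
```

Sampling each column independently like this preserves per-column frequencies but discards any relationship between columns, which is exactly the harder part described above.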

