What is data anonymisation?
Data anonymisation is a set of processes and methods that can be applied to a dataset in order to ensure that the data it contains cannot be connected to an individual or entity from which the data comes.
To anonymise data efficiently and appropriately, detailed knowledge of the data is critical. For example, pseudonymised information such as anonymised addresses presents a different challenge from names.
Putting anonymisation to use
There are various anonymisation techniques, but unfortunately, there's no magic bullet. To achieve better anonymisation you need to spend time finding an effective combination of the following techniques for each individual data set and use case.
Perturbation - improves privacy by modifying records through noise injection in a way that is not statistically significant, so the dataset's statistical usefulness remains the same. The main shortcoming is that the anonymised data is no longer accurate or truthful, even if statistically useful; any mechanism that relies on the accuracy of such anonymised data is therefore in jeopardy.
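As a minimal sketch of noise injection (the field names, values and noise scale here are made up for illustration):

```python
import random

def perturb(values, scale=1.0, seed=0):
    """Add zero-mean Gaussian noise to each numeric value: aggregates
    stay close to the truth, but individual records become inaccurate."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

ages = [34, 29, 41, 38, 52]
noisy_ages = perturb(ages, scale=2.0)
# The mean of noisy_ages stays near the true mean, but no single
# noisy age can be trusted as that person's real age.
```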
Permutation - a sub-technique of perturbation, useful when the values of a given attribute cannot easily be mapped to a numerical or multi-dimensional space. In such cases, an efficient way of perturbing the data is to swap the value of the personal attribute with that of a similar record. This preserves the statistical significance of the data while increasing the anonymity of each record.
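A simple way to sketch this is to shuffle one column across all records, which breaks the link between a value and its owner while leaving the column's distribution untouched (records and values below are invented):

```python
import random

def permute_column(records, column, seed=0):
    """Shuffle one attribute across records: the set of values is
    unchanged, but each value is detached from its original record."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [dict(r, **{column: v}) for r, v in zip(records, values)]

people = [
    {"name": "A", "salary": 30000},
    {"name": "B", "salary": 45000},
    {"name": "C", "salary": 60000},
]
shuffled = permute_column(people, "salary")
```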
Generalisation - a method that relies on the existence of an underlying hierarchy, e.g. street > district > city > region > country. Generalisation uses the hierarchy to reduce the specificity of a record, and therefore the amount of information it divulges, without making it inaccurate. Depending on the granularity of the hierarchy, this can preserve both the statistical utility of the data and its accuracy.
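A hierarchy like the one above can be encoded as a simple lookup that climbs one level at a time (the hierarchy below is a toy example, not a real gazetteer):

```python
# Hypothetical hierarchy: each value maps to its parent level.
HIERARCHY = {
    "10 Downing Street": "Westminster",
    "Westminster": "London",
    "London": "Greater London",
    "Greater London": "England",
}

def generalise(value, levels=1):
    """Climb the hierarchy `levels` steps, reducing specificity
    without making the value inaccurate."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)
    return value

print(generalise("10 Downing Street", levels=2))  # -> "London"
```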
Suppression - a method that ensures anonymity by deleting personal information from the record. This can be seen as a last resort, for when neither generalisation nor perturbation can be applied while partially preserving the data's usefulness.
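Suppression is the simplest of the four to sketch: the identifying fields are dropped entirely (the record below is invented):

```python
def suppress(record, fields):
    """Remove directly identifying fields entirely; a last resort
    when noise or generalisation would destroy too much utility."""
    return {k: v for k, v in record.items() if k not in fields}

row = {"name": "Ada", "postcode": "SW1A 2AA", "diagnosis": "flu"}
safe = suppress(row, {"name", "postcode"})
# -> {"diagnosis": "flu"}
```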
Weaknesses of anonymisation
There is a wide range of academic literature demonstrating weaknesses of anonymised data and the ability to re-identify individuals even in datasets that have been comprehensively anonymised.
Pseudonymised information comes with a lot of holes. The main attack vector is the linkage attack, where attackers re-identify individuals in anonymised data by cross-referencing it with publicly available data (e.g. a voter registration list) or alternative sources.
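A toy version of a linkage attack is just a join on quasi-identifiers such as postcode and birth year (the tables and names below are fabricated for illustration):

```python
# A "de-identified" medical table: names removed, but quasi-identifiers kept.
medical = [
    {"postcode": "SW1A", "birth_year": 1980, "diagnosis": "flu"},
    {"postcode": "EC2V", "birth_year": 1975, "diagnosis": "asthma"},
]
# A public voter list with the same quasi-identifiers plus names.
voters = [
    {"name": "Ada Lovelace", "postcode": "SW1A", "birth_year": 1980},
    {"name": "Alan Turing", "postcode": "EC2V", "birth_year": 1975},
]

def link(anon, public):
    """Re-identify anonymised rows by joining on shared quasi-identifiers."""
    reidentified = []
    for row in anon:
        for person in public:
            if (row["postcode"], row["birth_year"]) == (
                    person["postcode"], person["birth_year"]):
                reidentified.append({**person, **row})
    return reidentified

matches = link(medical, voters)
# Each match re-attaches a name to a "de-identified" diagnosis.
```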
Researchers, such as Luc Rocher at UCLouvain, have argued that linkage attacks mean that anonymisation does not meet GDPR (General Data Protection Regulation) requirements.
What else can I do to enhance data privacy?
There are a number of privacy-enhancing technologies on the market. Some are expensive and require enterprise-grade technical capacity, but others are fairly simple and affordable for SMEs and even young startups.
Selecting the right combination can still be a challenge, even though the different methods are well understood and there is a lot of information available about them.
Data Masking - a method of creating structurally similar but inauthentic versions of an organisation's data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required, such as generating dummy data for software development.
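A minimal masking sketch replaces each value with a fake of the same shape, so downstream code and tests still run (the record and the shape rules here are simplified assumptions):

```python
import random
import string

def mask(record, seed=0):
    """Replace each value with a structurally similar fake:
    strings keep their length, integers keep their digit count."""
    rng = random.Random(seed)

    def fake(v):
        if isinstance(v, int):
            digits = len(str(v))
            return rng.randint(10 ** (digits - 1), 10 ** digits - 1)
        return "".join(rng.choice(string.ascii_letters) for _ in str(v))

    return {k: fake(v) for k, v in record.items()}

customer = {"name": "Ada Lovelace", "account": 12345678}
dummy = mask(customer)
```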
Differential Privacy - allows companies to learn more about their users by maximising the accuracy of queries while minimising the chances of identifying personal records. Differential privacy requires filtering data, sampling adaptively, adding noise by fuzzing certain features, and analysing or blocking intrusive queries.
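The classic building block is the Laplace mechanism: a counting query has sensitivity 1, so adding Laplace noise of scale 1/ε makes the answer ε-differentially private. A minimal sketch, assuming a simple count query and sampling Laplace noise via the inverse CDF:

```python
import math
import random

def laplace_count(true_count, epsilon, seed=0):
    """Answer a counting query with the Laplace mechanism: a count has
    sensitivity 1, so Laplace noise of scale 1/epsilon gives
    epsilon-differential privacy."""
    rng = random.Random(seed)
    u = rng.random() - 0.5          # uniform in [-0.5, 0.5)
    scale = 1.0 / epsilon
    # Inverse-CDF sampling of Laplace(0, scale).
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# The analyst only ever sees the noisy count, never the exact one.
noisy = laplace_count(true_count=100, epsilon=0.5)
```

Smaller ε means stronger privacy but noisier answers; the filtering and query-blocking mentioned above manage how much ε is spent across repeated queries.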
Homomorphic Encryption - a method for performing calculations on encrypted information without decrypting it first. Why should we care about arcane computer math? Because it could make cloud computing a lot more secure. It's not quite ready for your email though—right now it makes processes around a million times slower. Plus, its use significantly limits the operations or processing that can be done on the data whilst keeping the real data intact.
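The core idea can be demonstrated with the homomorphic property of textbook RSA: multiplying two ciphertexts yields a ciphertext of the product, so the product is computed without decrypting either input. This is only an illustration of the principle; textbook RSA is insecure and only multiplicatively homomorphic, whereas fully homomorphic schemes support arbitrary computation (the tiny keypair below is a standard toy example, far too small for real use):

```python
# Toy RSA keypair: p = 61, q = 53, n = p*q, e public, d private.
n, e, d = 3233, 17, 2753

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 12, 5
# Multiply the ciphertexts only; neither a nor b is decrypted.
cipher_product = (enc(a) * enc(b)) % n
result = dec(cipher_product)  # -> 60, the product of the plaintexts
```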
Synthetic data - this is Hazy's core capability: generating artificial data that contains no real data or customer information whilst preserving all the insight, patterns and correlations in the source data.
Use cases for synthetic data
Synthetic data has been around for a long time, in the form of test data and automatically generated data for fuzz testing, etc. However, a combination of advances in algorithms and machine learning technology has led to the emergence of a new generation of smart synthetic data that's suitable for a much wider range of use cases, including data science and model training.
There are many use cases for synthetic data. These vary from increased data portability to safe sharing with external vendors and partners.
Working with synthetic data, as opposed to real data, eliminates the vast majority of attack vectors, because synthetic data is fake data and so there's no real data or customer information to leak. In addition, synthetic data can be combined with other privacy techniques for additional guarantees.
For example, synthetic data generator models can be trained to be differentially private, and data can be generalised either before or after training a generator.
The challenge is making the synthetic data smart enough to carry through the insight and information from the source data. This starts with matching simple patterns like distributions of values in a column, through to complex modelling of implicit relations and sequential data.
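The simplest end of that spectrum can be sketched as sampling each column independently from its empirical distribution. This naive approach matches per-column patterns only and deliberately ignores the harder part described above, the correlations and sequential structure that smart generators must model (the source rows are invented):

```python
import random

def naive_synthesise(rows, n, seed=0):
    """Generate n synthetic rows by sampling each column independently
    from its empirical distribution. Marginal distributions are
    matched; cross-column correlations are not."""
    rng = random.Random(seed)
    columns = list(rows[0])
    pools = {c: [r[c] for r in rows] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n)]

source = [
    {"age": 34, "city": "London"},
    {"age": 29, "city": "Leeds"},
    {"age": 41, "city": "London"},
]
synthetic = naive_synthesise(source, n=100)
# Every synthetic value is plausible, but no synthetic row is a real one.
```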