Data anonymisation - what, why and how?
Data anonymisation is a set of processes and methods applied to a dataset in order to ensure that data cannot be attributed to an individual or entity from which the data comes.
To anonymise data efficiently and appropriately, detailed knowledge of the data is critical. For example, anonymising addresses presents a different challenge from anonymising names. This is where automation can make a tangible difference. By automatically identifying the nature of the data, it becomes possible to anonymise large datasets without manual input or human supervision.
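As a rough illustration of what "automatically identifying the nature of the data" can mean, the sketch below guesses the type of a column from its values using regular expressions. The pattern set, the `detect_column_type` name and the 80% threshold are all assumptions made for this example; a production detector would need far broader coverage (names, addresses, national identifiers and so on).

```python
import re

# Hypothetical, deliberately small pattern set. Order matters: ISO dates
# would also satisfy the loose phone pattern, so "date" is checked first.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "phone": re.compile(r"^\+?[\d\s()-]{7,20}$"),
}


def detect_column_type(values, threshold=0.8):
    """Return the first pattern name matched by at least `threshold`
    of the non-empty values, or "unknown" if none qualifies."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "unknown"
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if matches / len(non_empty) >= threshold:
            return name
    return "unknown"
```

Once a column's type is known, the appropriate anonymisation technique for that type can be selected without anyone having to inspect the data by hand.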
Good for business
In today's digital world, both individuals and companies rely heavily on the ability to access and share data. Take bank accounts: to open a new account, an individual has to share sensitive personal data with a bank, and the bank will in turn share this data with external developers, credit scoring companies and others.
In these multiple exchanges, data is at risk of being leaked, putting the individual at personal risk and presenting a serious liability for the company. Anonymisation offers a solution to both concerns, as anonymised data is much safer to share while retaining its usability.
Anonymisation or pseudonymisation?
Pseudonymisation is a technique that can improve user privacy by replacing or removing the most identifying fields in a data set. This may involve replacing names or other direct identifiers which can be easily attributed to individuals with, for example, a reference number. Pseudonymised data can reduce the risks of identification of data subjects and help companies meet some data protection obligations.
In essence, this is just a security measure and does not change the status of the data as personal data. Sharing any data that has undergone pseudonymisation still has to be done strictly in accordance with GDPR standards. For instance, if a data subject requests access to their data or asks to be removed from it, the appropriate procedures have to be followed and the request fulfilled without undue delay, regardless of any attempt to pseudonymise personally identifiable information.
However, GDPR does not apply to personal data that has been anonymised, provided the anonymisation is performed in a correct and robust way. Anonymising personal data can therefore be a more effective method for businesses seeking to limit the risks of sharing data while still protecting the individuals it describes.
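To make the pseudonymisation step described above concrete, here is a minimal sketch that replaces direct identifiers with a reference number derived from a keyed hash (HMAC-SHA256 from the Python standard library). The function names, the `ref-` prefix and the 16-character truncation are assumptions chosen for illustration, not a prescribed scheme.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Map an identifier to a stable reference number.

    The same value and key always give the same reference, so records
    can still be joined, but the original value is not stored."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return "ref-" + digest.hexdigest()[:16]


def pseudonymise_records(records, fields, secret_key):
    """Return copies of `records` with the named fields pseudonymised."""
    result = []
    for record in records:
        copy = dict(record)  # leave the caller's data untouched
        for field in fields:
            if field in copy:
                copy[field] = pseudonymise(str(copy[field]), secret_key)
        result.append(copy)
    return result
```

Anyone holding the key can recompute the reference for a known name and re-link the record, and the remaining fields may still identify a person in combination. This is why the output counts as pseudonymised rather than anonymised data.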
Putting anonymisation to use
There are various anonymisation techniques, but unfortunately, there's no magic bullet. To achieve better anonymisation you need to spend time finding an effective combination of the following techniques for each individual data set and use case.
Perturbation - improves privacy by modifying records through noise injection, in a way that is not statistically significant, so that the statistical usefulness of the data is largely preserved. The main shortcoming is that the anonymised data is no longer accurate or truthful at the level of individual records, even if it remains statistically useful; any mechanism that relies on the accuracy of such data would therefore be in jeopardy.
Permutation - a sub-technique of perturbation, useful when the values of a given attribute cannot easily be mapped to a numerical or multi-dimensional space. In such cases, an efficient way of perturbing the data is to swap the value of the personal attribute with the value from a similar record. This preserves the statistical properties of the data while increasing the anonymity of each record.
Generalisation - a method that relies on the existence of an underlying hierarchy, e.g. street > district > city > region > country. Generalisation uses the hierarchy to reduce the specificity of a record, and therefore the amount of information it divulges, without making it inaccurate. Depending on the granularity of the hierarchy, this can preserve both the statistical utility of the data and its accuracy.
Suppression - a method that ensures anonymity by deleting personal information from the record. This can be seen as a last resort where neither generalisation nor perturbation can be applied to partially preserve the usefulness.
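The four techniques above can each be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production implementation: the function names and the `HIERARCHY` list are made up for the example, the noise is plain Gaussian noise with a caller-chosen scale, and the permutation shuffles an attribute across all records rather than only between similar ones.

```python
import random

# Assumed address hierarchy, from most to least specific.
HIERARCHY = ["street", "district", "city", "region", "country"]


def perturb(values, scale, rng):
    """Perturbation: add zero-mean Gaussian noise to each numeric value."""
    return [v + rng.gauss(0.0, scale) for v in values]


def permute(records, field, rng):
    """Permutation: shuffle one attribute's values across the records.

    The set of values in the column is unchanged, but each value is
    detached from the record it originally belonged to."""
    shuffled = [record[field] for record in records]
    rng.shuffle(shuffled)
    return [{**record, field: value}
            for record, value in zip(records, shuffled)]


def generalise(address, level):
    """Generalisation: keep only the hierarchy levels from `level` upwards."""
    start = HIERARCHY.index(level)
    return {key: address[key] for key in HIERARCHY[start:] if key in address}


def suppress(record, fields):
    """Suppression: delete the named fields from the record."""
    return {key: value for key, value in record.items() if key not in fields}
```

For example, `generalise(address, "city")` drops the street and district but leaves the city, region and country untouched, so the record becomes less specific without becoming wrong. Passing a seeded `random.Random` instance as `rng` keeps the perturbation and permutation reproducible for testing.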
What else can I do to enhance data privacy?
There are a number of privacy-enhancing technologies on the market. Some of them are expensive and require enterprise-grade technical capacity, but others are fairly simple and affordable for SMEs and even young startups.
Selecting the right combination may be a challenge, even though the different methods are well understood and a lot of information is available. So, we've included a selection to help discerning businesses.
Data Masking - a method of creating structurally similar but inauthentic versions of an organization's data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required, such as generating dummy data for software development.
Differential Privacy - allows companies to learn more about their users by maximising the accuracy of queries on a dataset while minimising the chances of identifying individual records. In practice, differential privacy can involve filtering data, adaptive sampling, adding noise by fuzzing certain features, and analysing or blocking intrusive queries.
Homomorphic Encryption - a method for performing calculations on encrypted information without decrypting it first. Why should we care about arcane computer math? Because it could make cloud computing a lot more secure. It's not quite ready for your email though: at present it can make processing many orders of magnitude slower. In addition, it significantly limits the operations that can be performed on the data while the real data stays encrypted.
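Of the technologies listed above, data masking and differential privacy are the easiest to illustrate briefly. The sketch below is a simplified example under assumptions of our own: `mask` swaps every letter and digit for a random one of the same kind so that the format survives, and `dp_count` answers a counting query with Laplace noise of scale 1/epsilon, which is the standard calibration for a query whose result changes by at most one when a single record is added or removed. The function names and parameters are hypothetical.

```python
import random
import string


def mask(value, rng):
    """Data masking: replace each character with a random one of the
    same kind, keeping length, case and separators."""
    masked = []
    for ch in value:
        if ch.isdigit():
            masked.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            masked.append(rng.choice(pool))
        else:
            masked.append(ch)  # spaces, dashes and other separators
    return "".join(masked)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential
    variables that each have mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Differential privacy: noisy count of the values matching `predicate`.

    A smaller epsilon means more noise and therefore stronger privacy."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A masked account number still looks like an account number, which is what makes it usable as test data. The noisy count is close to the true count on average, but no single answer reveals whether a particular individual is in the dataset.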
Photo credits Martin Adams from Unsplash