This is a comparison post to Gretel’s blog post. We replicated the same task on the same open data set in order to compare the quality of their synthetic data with synthetic data generated by Hazy.
Synthetic data and differential privacy can be abstract and complex if you’re unfamiliar with the mathematics driving them. In this article, we hope to help you understand the value of these two concepts when applied to common geographic data. We use maps and tables to help you visualise the trade-off between noise and differential privacy to find the balance between utility and privacy.
Geographic data is any data or information that’s related to location on Earth. Hazy can take the raw data from this geographic information and turn it into accurate, up-to-date and utterly synthetic data that preserves privacy.
We do this with differential privacy, which is a strong, mathematically provable guarantee of privacy protection. Machine learning and AI algorithms identify statistical patterns and properties of those raw datasets, and we use those to generate completely artificial data that is statistically representative of the raw data. This means that Hazy synthetic data is highly useful without exposing any private or individually identifying data.
For this geographic data use case, the source data is taken from Uber ride shares in Los Angeles over the course of one day. This public data was originally published in the General Bikeshare Feed Specification (GBFS), a standardized data feed for bike share system availability. The data used here comes from an open-sourced extract.
| Index | Column | Type | Description |
|---|---|---|---|
| 0 | Hour | Integer (0-23) | Time of journey |
| 1 | Bike ID | Categorical identifier | Identifier of a bike, rotated to a random string, at minimum, after each trip to protect privacy. |
| 2 | SRC latitude | Floating point number | Journey start WGS84 latitude in decimal degrees |
| 3 | SRC longitude | Floating point number | Journey start WGS84 longitude in decimal degrees |
| 4 | DST latitude | Floating point number | Journey end WGS84 latitude in decimal degrees |
| 5 | DST longitude | Floating point number | Journey end WGS84 longitude in decimal degrees |
The extract contains the hour, bike ID and coordinates where the journey originated and terminated. Most values are numerical but the bike ID requires different handling. The bike ID is treated as a categorical value, similar to assigning a different number for each unique bike ID.
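As a minimal sketch of what treating the bike ID as a categorical value can look like, the snippet below maps each distinct ID to an integer code. The function name and the sample IDs are hypothetical and chosen for illustration; this is not necessarily how Hazy encodes the column internally.

```python
def encode_categories(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical bike IDs, invented for illustration only.
bike_ids = ["a9f3", "c07b", "a9f3", "e51d"]
encoded, codes = encode_categories(bike_ids)
print(encoded)  # [0, 1, 0, 2]
```

The integer codes carry no ordering or magnitude; they only label which rows share a bike ID.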
The numerical precision of the position data is high enough to match addresses. It may be feasible to join this data with third party data sources and infer private relations between people. This could place people at risk and expose the data owner to legal liability.
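To give a feel for why coordinate precision matters, the following rough sketch estimates the ground distance represented by the last decimal place of a coordinate in decimal degrees. The constant of about 111.32 km per degree of latitude is an approximation, and the latitude of 34.05 is an assumed value near Los Angeles; neither figure comes from the data set itself.

```python
import math

METRES_PER_DEGREE_LAT = 111_320  # approximation; varies slightly with latitude


def resolution_metres(decimal_places, latitude_deg):
    """Approximate ground distance of one unit in the last decimal place."""
    step_deg = 10 ** -decimal_places
    north_south = step_deg * METRES_PER_DEGREE_LAT
    # Lines of longitude converge towards the poles, so east-west steps shrink.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for places in (3, 4, 5):
    ns, ew = resolution_metres(places, latitude_deg=34.05)  # assumed latitude
    print(f"{places} decimal places: ~{ns:.1f} m north-south, ~{ew:.1f} m east-west")
```

At five decimal places the resolution is on the order of a metre, which is fine enough to distinguish individual buildings.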
The promise of synthetic data is to protect the privacy of data subjects while preserving the utility of the data. A possible use for this data is for forecasting periods of high demand. The synthetic data should preserve the aggregate features of the data set without revealing details of individual journeys.
Sample original sensitive data
A machine learning model is responsible for capturing the patterns, trends and relationships in the original data. The choice of model is informed by the use case and depends in part on:
- Training time and resource budget
- Synthetic data quality requirements
- Differential privacy requirements
For this article we'll be using a model based on PrivBayes, which produces good results with minimal or no tuning. Joint distributions can be approximated with this model, given a sufficiently large network degree.
Synthetic data models guarantee differential privacy by adding noise to the captured summary of the original data. The objective is to obscure the presence of any individual row in the original data. Models each have their own way of managing noise and guaranteeing differential privacy.
The amount of noise is determined by the value of the ε (epsilon) parameter, which is set prior to training. A smaller ε adds more noise, which provides better privacy but makes the synthetic data less similar to the original data. Below is an example of a typical utility / epsilon trade-off graph, used to select an ideal value for ε (epsilon).
Access to the original data is no longer required once a summary is produced. The captured summary is typically very compact, consuming only a fraction of the space. The summary can be transported and stand in as a proxy for the original data. This enables use cases that would otherwise be impossible:
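The relationship between ε and noise can be illustrated with the Laplace mechanism, a standard building block of differential privacy. The sketch below is a generic illustration and not necessarily the mechanism used inside the model in this article; the count of 120 journeys and the sensitivity of 1 are assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: the noise scale is sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, rng, trials=10_000, true_count=120):
    """Average distance between the noisy and true count over many draws."""
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


rng = random.Random(0)
for epsilon in (0.1, 1.0, 10.0):
    # Hypothetical count of 120 journeys in one hour; smaller epsilon, larger error.
    print(f"epsilon={epsilon}: mean absolute error ~ {mean_abs_error(epsilon, rng):.2f}")
```

The expected absolute error equals the noise scale, so dividing ε by ten multiplies the typical error by ten.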
Data science teams are often denied access to potentially valuable, but sensitive data on regulatory grounds. Access to synthetic data unchains the so-called elephant and permits exploration and evaluation of potential value without compromising privacy.
It is possible to produce an arbitrary amount of synthetic data from the captured summary. Large volumes of realistic data are especially useful for load testing, model testing and soak testing.
Comparison of data sets
A model summary is convenient to transport and can traverse jurisdiction boundaries with greater ease. Synthetic data sets can be compared and joined in cases where the volume or ownership of the original data precludes this.
Sample generated synthetic data
At first glance the sample synthetic journey data looks similar to the original. More extensive similarity measures will be presented below.
These maps illustrate journeys by joining the start and end coordinates. The first map shows the original data and the second shows the Hazy data mapped over that data. Journey starts are coloured violet; journey ends are cyan for the real data and pink for the synthetic data. The accuracy of the positional data clearly aligns with the road network and it is highly likely that addresses could be inferred.
Two major clusters, some smaller clusters and outliers are clearly identifiable. We expect the synthetic journeys to preserve the macro-features of the journeys, while not being precise enough to infer addresses. The maps below can be zoomed to compare features at fine scale.
The added noise has generated some obviously infeasible journeys in the ocean. Increasing the value of ε (epsilon) will reduce this, but at the cost of privacy.
The journey start and end positions are clustered in a similar way to the original data. The alignment with the road network is less precise, but the aggregate distribution of positional information is captured.
Generic quality assessment
The synthetic data is subjected to a battery of tests after model training. These tests provide confidence that the synthetic data will be fit for purpose and that ε (epsilon) was not accidentally set too small.
This battery of tests forms part of the core Hazy product. Sign up for a demo to see the full set of tests, including:
- Predictive utility for classification
- Disclosure risk
- Bi-joint distribution similarity
- Feature importance rank relative to many regression algorithms
The next two charts illustrate examples of two such tests.
Marginal distributions are compared for each column of the original and synthetic data. The overlap is 93% in the case of journey starting longitude. The peaks of the distribution correspond to the two major clusters in the map.
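One common way to quantify the overlap of two marginal distributions is histogram intersection: bin both columns on a shared grid, normalise, and sum the smaller frequency in each bin. The sketch below assumes that definition, which may differ from the exact measure used in the Hazy product, and the longitude values are invented for illustration.

```python
def normalised_histogram(values, lo, hi, bins):
    """Fraction of values falling in each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the top edge joins the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def marginal_overlap(original, synthetic, bins=20):
    """Histogram intersection between two numeric columns, from 0 to 1."""
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    if lo == hi:
        return 1.0  # every value is identical, so the distributions coincide
    p = normalised_histogram(original, lo, hi, bins)
    q = normalised_histogram(synthetic, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(p, q))


# Hypothetical start longitudes, invented for illustration only.
original = [-118.49, -118.48, -118.26, -118.25]
synthetic = [-118.49, -118.47, -118.27, -118.10]
print(f"overlap: {marginal_overlap(original, synthetic, bins=4):.2f}")
```

A value of 1 means the binned distributions are identical and 0 means they share no bins.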
The mutual information between pairs of columns is computed for the original and synthetic data. These are compared in the heatmap above, showing a minimum similarity of 0.79 in the mutual information between source latitude and hour.
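For readers unfamiliar with mutual information, here is a minimal sketch of computing it for a pair of discrete columns. Continuous columns such as latitude would first need to be binned; the "zone" column below is a hypothetical binned position, and how the two mutual information values are turned into a single similarity score is not specified here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in nats between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi


# Hypothetical hour and zone columns, invented for illustration only.
hours_orig = [8, 8, 9, 9, 17, 17, 18, 18]
zones_orig = ["west", "west", "west", "west", "east", "east", "east", "east"]
hours_synth = [8, 8, 9, 9, 17, 17, 18, 18]
zones_synth = ["west", "west", "west", "east", "east", "east", "east", "west"]

print(f"original:  {mutual_information(hours_orig, zones_orig):.3f} nats")
print(f"synthetic: {mutual_information(hours_synth, zones_synth):.3f} nats")
```

Computing this for every pair of columns in each data set gives two matrices that can then be compared cell by cell.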
Journey data specific comparison
The remaining similarities were generated specifically for the journey data.
A comparison between journey distances shows that synthetic data journeys have a similar median. Synthetic journeys shorter than the median are slightly under-represented, while those longer than the median are slightly over-represented.
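Journey distance can be derived from the start and end coordinates with the haversine formula for great-circle distance. The sketch below assumes straight-line distance rather than distance along the road network, and the journey rows are invented for illustration.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def median_distance_km(journeys):
    """Median distance of (src_lat, src_lon, dst_lat, dst_lon) rows."""
    return statistics.median(haversine_km(*row) for row in journeys)


# Hypothetical journeys, invented for illustration only.
journeys = [
    (34.050, -118.250, 34.060, -118.240),
    (34.010, -118.490, 34.020, -118.480),
    (34.040, -118.260, 34.045, -118.255),
]
print(f"median journey distance: {median_distance_km(journeys):.2f} km")
```

Running the same function over the original and synthetic rows gives the two medians being compared.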
The following histogram of journey counts by hour illustrates a similar distribution for synthetic and original data. The original data set lacks journeys at 19:00 and so does the synthetic data. The synthetic data has aggregated the original journeys at 12:00 and 13:00 into one larger bucket. A larger value of ε (epsilon) will increase the similarity of the two distributions.
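A per-hour comparison of this kind can be sketched by counting journeys for each hour from 0 to 23 in both data sets. The hour values below are hypothetical and only show the shape of the computation.

```python
from collections import Counter


def journeys_per_hour(hours):
    """Number of journeys for each hour 0..23; hours with no journeys count as 0."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


# Hypothetical hour columns, invented for illustration only.
original_hours = [8, 8, 12, 13, 17, 17, 17]
synthetic_hours = [8, 8, 12, 12, 17, 17, 18]

orig_counts = journeys_per_hour(original_hours)
synth_counts = journeys_per_hour(synthetic_hours)
for hour, (o, s) in enumerate(zip(orig_counts, synth_counts)):
    if o != s:
        print(f"{hour:02d}:00 original={o} synthetic={s}")
```

Listing only the hours where the counts differ makes gaps and merged buckets easy to spot.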
The number of journeys recorded per bike illustrates the differences between the synthetic and original data. The number of bikes making a single journey differs by a third. There are two options if a divergence compromises the utility of the data:
Adjusting the utility / privacy trade-off
Increasing the value of ε (epsilon) will reduce the amount of noise in the model and improve the similarity.
The model that was used in this illustration has tunable parameters that improve accuracy, but at the cost of longer training time, greater memory requirements and a larger model summary.