Success in the data economy is no longer about collecting information. Today’s organisations have to innovate faster than the competition, but can’t risk any data leaks. Industries like banking and healthcare have an incredible wealth of well-organised data, but most of this data is locked behind secure silos, isolated and impossible to access — even for their own employees. Strict privacy regulations make sharing data even more challenging. No innovation can come out of siloed data — and if it does, it takes months and even years to procure it.
Hazy has pioneered the use of synthetic data to solve this problem by providing a fully synthetic data twin that retains almost all of the value of the original data but removes all the personally identifiable information. This smart synthetic data is a breakthrough that allows companies to innovate, internally and externally, by leveraging the pool of resources, even in open source communities.
Smart synthetic data also enables testing, integration partnerships, cloud migration and data portability. Data in most organisations is increasingly complex and often has a time dimension, as with credit card transactions, where a single sample consists of a set of transactions that may stretch back months or even years.
This type of time-dependent data is usually called sequential data or time-series data. Up until now, Hazy has dedicated most of its efforts to creating synthetic versions of well-structured tabular data. Now, in response to client demand, we have made synthesising time-series data a priority.
Earlier this month, our Head of Data Science Armando Vieira published his research “Generating Synthetic Sequential Data using GANs” in the blog Towards Artificial Intelligence. Building on the work of the Carnegie Mellon University machine learning department, we have been able to take sequential synthetic data to the next level. In this piece, we summarise the challenges of synthetic sequential data and present Armando’s extended version of the powerful DoppelGANger generator.
At Hazy, we have stepped up our research into generating time-series synthetic data that is differentially private yet retains high utility, and we’re excited to share here how we accomplished this.
The challenge of synthetic sequential data
As mentioned before, sequential data is any data which has some form of time dependency. This can be strings of transactions, medical records, stock market movements, weather patterns or anything where order matters.
Generating safe synthetic data that preserves timelines has dramatic potential to unlock cross-organisational and cross-industry collaboration to solve some of the biggest problems at a world scale. Synthetic time-series data could be applied to allow more open, but secure sharing of information, which can lead to faster detection of cancer and identification of money-laundering patterns — without risking privacy leaks.
But generating synthetic time-series or sequential data is significantly harder than generating tabular data. Synthetic tabular data assumes that all information about a single individual is stored in a single row. Sequential data, on the other hand, has time-sensitive information spread across many rows and columns, and the length of these sequences is often variable. It’s much harder to preserve the correlations between the rows (usually called ‘events’) and the columns (usually called ‘variables’). And the longer the history, the harder it is for a machine learning algorithm to find the commonalities and then translate them into completely artificial new data.
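To make the difference in layout concrete, here is a minimal sketch in plain Python. The event log, its column names and the `group_into_sequences` helper are all hypothetical and exist only for illustration; they do not come from any Hazy dataset or library.

```python
from collections import defaultdict

# Hypothetical event log: one row per event, several rows per customer.
events = [
    {"customer_id": "A", "day": 1, "amount": 12.5, "category": "food"},
    {"customer_id": "A", "day": 3, "amount": 40.0, "category": "fuel"},
    {"customer_id": "B", "day": 2, "amount": 7.2, "category": "food"},
    {"customer_id": "A", "day": 9, "amount": 5.0, "category": "food"},
    {"customer_id": "B", "day": 5, "amount": 99.9, "category": "travel"},
]

def group_into_sequences(rows, key="customer_id", order_by="day"):
    """Group event rows by entity and sort each group in time order."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[key]].append(row)
    for seq in sequences.values():
        seq.sort(key=lambda r: r[order_by])
    return dict(sequences)

sequences = group_into_sequences(events)

# Unlike a tabular dataset, each individual now owns a variable-length history.
lengths = {cid: len(seq) for cid, seq in sequences.items()}
print(lengths)
```

A generator for this kind of data has to reproduce both the relationships between columns within an event and the relationships between events within each history.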
With time-series data, each data point is conditional on a potentially very large history of events. The synthetic data generator is no longer just manufacturing a single point, but rather that single point dependent on a thousand previous points. Small errors can propagate throughout the entire sequence and introduce major deviations. Until now, figuring out how to generate sequential data has proven a real challenge.
Because synthetic time-series data is in such demand, models have attempted this before, but they always seem to fall short. Most either can’t handle the scope and complexity of enterprise data or depend on specific domain knowledge that’s not transferable from one industry, or even one use case, to another. Markov models, bootstrapping and autoregressive models are all popular but lack the ability to capture long-term, complex dependencies. Dynamic stationary processes don’t work well with unknown correlations.
Furthermore, none of these models are differentially private, which makes them ineffective for modern organisations.
Generative adversarial networks, or GANs, have emerged as the frontrunner for generating and augmenting datasets, particularly with images and video.
GANs involve training models using a generator and discriminator. The generator tries to create samples that fool the discriminator while the discriminator learns to distinguish between the real data and the synthetic data.
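The tug-of-war between the two networks can be sketched with the losses from the original GAN formulation. This is an illustration only: the function names and the example scores are hypothetical, and DoppelGANger itself may well use a different loss.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a sample is real).
real_scores = [0.9, 0.8]
obvious_fakes = [0.1, 0.2]      # discriminator is not fooled
convincing_fakes = [0.7, 0.8]   # discriminator is fooled

confident = discriminator_loss(real_scores, obvious_fakes)
fooled = discriminator_loss(real_scores, convincing_fakes)
print(confident < fooled)  # the discriminator's loss rises when it is fooled
```

The two losses pull in opposite directions: the same convincing fakes that raise the discriminator's loss lower the generator's.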
Up until now, however, GANs have failed to achieve high fidelity when there are complex correlations, such as a mix of discrete features (which can only take certain values: a shoe size can’t be 8.75) and continuous features (which can take any value in a range and change over time, like temperature).
GAN-based time-series generation already exists, but so far it has not been able to handle heavy-tailed, highly varied data distributions. The result is poor autocorrelation scores on long sequences and susceptibility to mode collapse, a classic failure mode of GANs.
In lay terms, while GANs have been very successful in generating deep fake images, they have, up to now, been unable to capture correlations like age plus spending patterns, particularly when combined with transactional data.
Adding to these obstacles is the need to make the process differentially private without degrading the quality of the data. Differential privacy is the gold standard: a mathematically provable guarantee of privacy protection that allows information about a dataset to be shared publicly by describing the patterns of groups within it while withholding information about the individuals in it.
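For readers who want the formal statement, the standard textbook definition (not spelled out in the research itself, so the notation here is ours) says that a randomised mechanism M is (ε, δ)-differentially private if, for any two datasets D and D' that differ in one individual's records, and for any set S of possible outputs:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

The smaller the privacy parameter ε, the less any one individual can change what the mechanism outputs, and so the stronger the guarantee. Setting δ = 0 gives the stricter "pure" form of the definition.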
At Hazy, we work with several multinational financial service giants, and we often hear their desire to safely leverage time-series data. That’s why we were excited when we read the work of Zinan Lin, Alankar Jain, Giulia Fanti and Vyas Sekar from Carnegie Mellon and Chen Wang from IBM. Their DoppelGANger generator is the first we had come across that made it possible to generate synthetic data from complex sequential datasets.
DoppelGANger Generator: For generating high-fidelity, synthetic sequential datasets
The wittily named DoppelGANger generator is based on GANs. It’s designed to work for more complex time-series datasets that have both fixed discrete features and ever-changing continuous features. The DoppelGANger team decided to build on the recent advances in GANs generating deep fake images and apply it to data-driven network research.
The DoppelGANger generator is appealing for a few reasons.
First, while most approaches generate attributes and features together, DoppelGANger decouples the attribute generation from the time series generation, instead feeding attributes to the time series generator at each time step.
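The control flow of this decoupling can be sketched as follows. The two functions below are simple random stand-ins for the neural networks, and the attribute names and numbers are hypothetical; only the two-stage structure reflects the description above.

```python
import random

def generate_attributes(rng):
    """Stand-in for the attribute generator: fixed, per-sequence attributes."""
    return {
        "account_type": rng.choice(["personal", "business"]),
        "region": rng.choice(["north", "south"]),
    }

def generate_step(attributes, previous_value, rng):
    """Stand-in for one step of the time-series generator.

    It sees the same attributes at every step, alongside the previous value.
    """
    base = 100.0 if attributes["account_type"] == "business" else 20.0
    return 0.5 * previous_value + 0.5 * base + rng.gauss(0.0, 1.0)

def generate_sequence(length, seed=0):
    rng = random.Random(seed)
    attributes = generate_attributes(rng)   # stage 1: attributes only
    values, previous = [], 0.0
    for _ in range(length):                 # stage 2: series, given attributes
        previous = generate_step(attributes, previous, rng)
        values.append(previous)
    return attributes, values

attributes, values = generate_sequence(length=5)
print(attributes, [round(v, 1) for v in values])
```

Because the attributes are produced in a separate stage, they can be swapped, fixed or resampled without touching the part of the model that produces the series.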
This framework also allows for flexibility around the distribution and conditioning of attributes. Since the attributes usually contain personally identifiable information, this decision from the DoppelGANger team serves to increase privacy.
The authors have also cleverly allowed for auto-normalisation, which automatically factors in edge cases but also allows for them to be “smoothed out” by setting minimums and maximums before training.
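One way to read this step is sketched below, under our own assumptions: each series is optionally clipped to bounds chosen before training, then scaled by its own minimum and maximum, which are kept so the scaling can be undone. The function names and the example figures are hypothetical, and this is not the DoppelGANger implementation.

```python
def auto_normalise(series, lower=None, upper=None):
    """Clip a series to optional bounds, then scale it to [0, 1] by its own range.

    Returns the scaled series plus the (min, max) needed to undo the scaling.
    """
    if lower is not None:
        series = [max(v, lower) for v in series]
    if upper is not None:
        series = [min(v, upper) for v in series]
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant series
    return [(v - lo) / span for v in series], (lo, hi)

def denormalise(scaled, bounds):
    """Map a scaled series back to its original range."""
    lo, hi = bounds
    span = (hi - lo) or 1.0
    return [v * span + lo for v in scaled]

small_spender = [5.0, 7.0, 6.0]
big_spender = [500.0, 900.0, 10_000.0]  # the last value is an extreme outlier

scaled_small, bounds_small = auto_normalise(small_spender)
scaled_big, bounds_big = auto_normalise(big_spender, upper=1_000.0)
print(scaled_small, bounds_small)
print(scaled_big, bounds_big)
```

After scaling, a small spender and a big spender both occupy the full 0 to 1 range, so the generator does not have to learn wildly different magnitudes in one pass.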
The authors of DoppelGANger were most interested in its application in academic circles, so, at Hazy, we first evaluated it on a more business use case — a dataset of 10 million bank transactions. Since the Hazy data generator actually created the input data, we already knew the real distributions, making it easier to verify if this model could learn the time dependencies.
Definitely read Armando’s piece to get a copy of our synthetic starter dataset, learn more about the data and how we prepared the data and ran it through DoppelGANger. The original Towards AI article also offers advice for how to balance training neural networks in batches and how to allow your neural network to learn faster without risking instability and almost inevitable mode collapse.
We also tested the DoppelGANger generator on a much more complex dataset that reflects six years of traffic and weather.
As Armando explains: "In order to generate good quality synthetic data, the network has to predict the right daily, weekly, monthly, and even yearly patterns, so long-term correlations are important.”
To test the quality of the generated data, we looked at three metrics:
- Similarity - how closely the histogram of the synthetic data matches the histogram of the real data
- Autocorrelation - how closely the correlation of a series with its own past values matches between real and synthetic data
- Utility - the relative ratio of forecasting error when a model is trained with synthetic data rather than real data
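The three metrics above could be computed along these lines. This is a rough sketch with hypothetical function names: the exact formulas used in the evaluation are not given here, so histogram overlap, lag-based autocorrelation and the direction of the utility ratio are all our assumptions.

```python
def histogram_overlap(real, synthetic, bins=10):
    """Similarity: shared area of two normalised histograms (1.0 = identical)."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    width = (hi - lo) / bins or 1.0

    def freqs(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return sum(min(a, b) for a, b in zip(freqs(real), freqs(synthetic)))

def autocorrelation(series, lag):
    """Correlation of a (non-constant) series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

def utility_ratio(error_trained_on_real, error_trained_on_synthetic):
    """Utility: forecasting error with synthetic training data relative to real."""
    return error_trained_on_synthetic / error_trained_on_real

real = [0.0, 1.0, 0.0, -1.0] * 25          # toy series with a period of 4
synthetic = [0.9 * v for v in real]        # toy "synthetic" copy

print(histogram_overlap(real, synthetic))
print(autocorrelation(real, 4), autocorrelation(synthetic, 4))
print(utility_ratio(2.0, 2.5))
```

Comparing the autocorrelation of the real and synthetic series at the same lags shows whether daily, weekly or longer patterns survived the synthesis.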
Once we evaluated the model, it was time for our team to experiment with it and see where we could build on this excellent work.
Improving the privacy-utility trade-off for better Differential Privacy
While we think the DoppelGANger generator is great, the DoppelGANger paper highlights two drawbacks that Hazy looked to improve upon. When differential privacy is applied, even at modest levels, the quality of the data generated by the model drops drastically, as measured by the relative decay of correlations in time.
Differential privacy is a mathematical guarantee that high quality fake data cannot be reverse engineered for re-identification purposes. At Hazy, we believe this has to be the new norm for synthetic data — and the data economy as a whole — to reach its full potential.
We built our synthetic data generation on distribution estimators, not individual data points, so we are able to be differentially private.
The original DoppelGANger paper enforces differential privacy using the standard method, the Differentially Private Generative Adversarial Network or DPGAN, which involves adding noise to the discriminator and clipping its gradients. DPGAN was the first application of differential privacy to GANs, but, in the case of DoppelGANger, it has led to low fidelity.
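The two ingredients named here, gradient clipping and added noise, can be sketched in a few lines. This follows the general recipe of differentially private gradient descent rather than the exact DPGAN procedure, and the function names, gradient values and parameters are hypothetical.

```python
import math
import random

def clip_by_norm(gradient, max_norm):
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]

def noisy_clipped_gradient(per_example_gradients, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_by_norm(g, max_norm) for g in per_example_gradients]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scaled to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]

# Hypothetical per-example discriminator gradients; the first is unusually large.
gradients = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)

print(noisy_clipped_gradient(gradients, max_norm=1.0, noise_multiplier=0.0, rng=rng))
print(noisy_clipped_gradient(gradients, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping limits how much any single record can move the discriminator, and the noise hides what remains. Both apply to every update, which is one way to see why fidelity suffers.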
At Hazy, we decided to try applying the Privacy-Preserving Generative Adversarial Network, or PPGAN, because we thought it could be a better way to deal with privacy: PPGAN doesn’t add noise blindly to the discriminator. Instead, you select only the more informative or sensitive data points to add noise to.
For example, if someone is more than two metres tall, they are literally above average. The PPGAN model only injects noise to perturb these points and smooth these kinds of outliers. Those in the normal height range couldn’t be identified just based on their heights combined with other data, so this data is left alone. Basically, instead of being indiscriminate in injecting noise, we are cherry-picking.
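The height example can be turned into a toy sketch of the cherry-picking idea. To be clear about the assumptions: the function name, the z-score rule, the threshold and the heights are all hypothetical, the noise is applied directly to data values purely for illustration, and this snippet alone carries no formal privacy guarantee. It is not the PPGAN algorithm.

```python
import random
import statistics

def perturb_outliers(values, z_threshold, noise_scale, rng):
    """Add Gaussian noise only to values far from the mean; leave the rest alone."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    result = []
    for v in values:
        is_outlier = std > 0 and abs(v - mean) / std > z_threshold
        result.append(v + rng.gauss(0.0, noise_scale) if is_outlier else v)
    return result

# Hypothetical heights in centimetres; the last person is over two metres tall.
heights_cm = [165.0, 170.0, 172.0, 168.0, 175.0, 171.0, 169.0, 205.0]

noisy = perturb_outliers(heights_cm, z_threshold=2.0, noise_scale=5.0,
                         rng=random.Random(0))
print(noisy)
```

Only the identifying value is disturbed, so the bulk of the distribution keeps its original shape.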
The CMU team writes that when trying to make DoppelGANger differentially private, DPGAN destroys the autocorrelations. The perfect model is one-to-one or a 100 percent match. The DoppelGANger generator only hit a 43 percent match, while the Hazy synthetic data generator has so far resulted in an 88 percent match for a privacy epsilon of 1.
We consider this a drastic improvement — but are working to make it even better. It also became apparent that the longer the data sequence, the more PPGAN outshines DPGAN in synthesising higher quality, differentially private data.
Increasing the stability of DoppelGANger
As an added benefit, this well-designed noise in the discriminator not only made the process differentially private without degrading the quality of the data, but also improved the stability of the GAN by speeding up its convergence and avoiding mode collapse.
Training GAN models is very hard. One of the reasons is that the way they learn is very unstable. This makes it quite tricky, and there’s always some trial and error to discover which learning rate will allow each GAN to train properly.
The DoppelGANger generator follows the traditional schedule of using a constant learning rate throughout training. Some other strategies use a decaying learning rate, which gradually decreases as the network is updated over more epochs.
At Hazy, we decided to use a cyclical learning rate, where learning rates oscillate over time. This method has been successfully applied to train neural networks, but, to our knowledge, not to GANs. This oscillating behaviour is a way for the model to jump around, so that it doesn’t get stuck in local minima and avoids mode collapse.
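The three schedules mentioned here can be compared with a short sketch. The triangular shape, the exponential form of the decay, the function names and all numeric values are assumptions chosen for illustration; the exact schedule used at Hazy is not specified.

```python
def constant_lr(step, base_lr):
    """Constant schedule: the same learning rate at every step."""
    return base_lr

def decaying_lr(step, base_lr, decay_rate):
    """Exponential decay: the learning rate shrinks a little at every step."""
    return base_lr * decay_rate ** step

def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Cyclical schedule: rise linearly to max_lr, fall back to base_lr, repeat."""
    cycle_position = step % (2 * half_cycle)
    distance_from_peak = abs(cycle_position - half_cycle)
    fraction = 1.0 - distance_from_peak / half_cycle
    return base_lr + (max_lr - base_lr) * fraction

for step in (0, 50, 100, 150, 200):
    print(step,
          constant_lr(step, 1e-4),
          decaying_lr(step, 1e-4, 0.99),
          triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=100))
```

The periodic return to a high learning rate is what lets the optimiser jump out of a poor region, while the low phases let it settle again.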
At Hazy, we are really excited about how we are able to generate truly high quality synthetic sequential data, and we look forward to working with clients to improve on this work.