Artificial neural networks have been around for more than half a century. Much of the excitement around artificial intelligence and machine learning is based on rather old ideas and algorithms. Neural networks and the algorithms to train them were proposed and tested as far back as the 1980s, in the work of researchers such as Geoffrey Hinton and Jürgen Schmidhuber. However, one of the more recent breakthroughs came in 2014, when Ian Goodfellow and his colleagues proposed a new architecture named the generative adversarial network, or GAN.
We’ve written quite a bit about the applications of GANs, including synthetic sequential data, but we realised we should invest a post or two in explaining what GANs are and what makes them so powerful for synthetic data generation.
So this piece is dedicated to what I like to call the beauty of GANs.
GANs: What are generative adversarial networks?
When we train a typical machine learning algorithm, we provide a set of inputs and outputs and the algorithm tries to relate the two sets. Generative models, on the other hand, belong to a class of unsupervised learning algorithms that automatically discover patterns within a dataset without any labels. GANs are implicit generative models that learn the distribution of the data without the need for auxiliary labels. They are composed of two networks, a generator and a discriminator, that compete with each other: the generator tries to fool the discriminator into believing that a generated sample is real, while the discriminator learns to distinguish real samples from generated ones. The game continues until convergence.
Nobel Prize-winning physicist Richard Feynman wrote on his blackboard, “What I cannot create, I do not understand” to emphasize that you cannot truly understand anything until you can recreate it yourself step by step. GANs, being generative models, follow Feynman’s logic.
GANs are very powerful, unsupervised techniques for generating highly realistic, though fake, data. Just like a child learning, the generator will initially have a difficult time creating quality samples. The discriminator is tasked with telling the fake data from the real data. At first it’s easy to distinguish between the two. As the generator continues to learn, it will eventually fool the discriminator, which will no longer be able to tell the difference between a fake and a real sample.
The best-known example of GANs is certainly deepfakes. In fact, the vast majority of generative adversarial networks work with photos or videos. Images are easier to fake because neighbouring pixels are highly correlated and spatially smooth, which makes them easier to simulate with convolutional networks.
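The tug of war described above can be made concrete with the losses that each network tries to minimise. The following is only a minimal sketch: it assumes the standard binary cross-entropy formulation with the commonly used non-saturating generator loss, and the discriminator scores are made-up numbers rather than the output of a trained model.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction; label is 1 (real) or 0 (fake)."""
    eps = 1e-12  # guard against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """Real samples should score 1 and generated samples 0."""
    real_term = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating form (an assumption): the generator wants fakes scored as real."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)

# Early in training: the discriminator separates the two easily (illustrative scores).
early = discriminator_loss(d_real=[0.95, 0.9], d_fake=[0.05, 0.1])
# At the ideal end point: the discriminator can only guess, scoring 0.5 everywhere.
late = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(early, late)
```

With these illustrative scores the discriminator's loss starts low and rises to log 4 (about 1.386) once it can do no better than guess, which is the point at which the generator has learned to fool it.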
In some cases the generator can be explicitly conditional, meaning that we can generate data given specific attributes or a combination of attributes. This feature can be very useful for generating very rare cases. For instance, even if the machine has never trained on an image of, say, someone with purple hair, a mustache and sunglasses, the generator can assemble a realistic face with those improbable and funny characteristics. That’s because the generative machine learning model has learned how lower-level (latent) features come together to build higher-level entities. For example, it knows that a nose has a fairly predictable relationship with the eyes and mouth. This allows the generator to take what it knows, fill in the gaps and predict what that person would look like.
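One common way to make a generator conditional is to append a vector describing the requested attributes to the random noise it receives. The sketch below shows only that input-building step. The attribute names, the noise size and the helper functions are hypothetical, and the generator network itself is not shown.

```python
import random

# Hypothetical attribute names, in a fixed order the generator would be trained with.
ATTRIBUTES = ["purple_hair", "mustache", "sunglasses"]

def encode_condition(wanted):
    """Turn a set of requested attributes into a 0/1 vector."""
    return [1.0 if name in wanted else 0.0 for name in ATTRIBUTES]

def generator_input(wanted, noise_dim=8, rng=None):
    """Build the conditional generator's input: random noise followed by the condition."""
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + encode_condition(wanted)

z = generator_input({"purple_hair", "sunglasses"}, rng=random.Random(0))
print(len(z), z[-3:])  # 11 [1.0, 0.0, 1.0]
```

Because the condition is part of the input, the same trained generator can be asked for attribute combinations that rarely or never appeared together in the training data.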
Generative modelling for synthetic data
Generating synthetic tabular data is a whole other kettle of fish. Enterprise customer data could have thousands of columns with numerous categorical values. Each column becomes its own dimension. Working out how the categorical values in a table interrelate is much more challenging than the relatively predictable pixel dimensions of a human face.
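To get a feel for how quickly the dimensions pile up, consider one-hot encoding, a common way to present categorical columns to a neural network. This is an illustrative sketch with made-up column names and values, and not necessarily the encoding Hazy uses.

```python
def one_hot_layout(rows, categorical_columns):
    """Give every (column, category) pair its own position in the encoded vector."""
    layout = []
    for column in categorical_columns:
        categories = sorted({row[column] for row in rows})
        layout.extend((column, category) for category in categories)
    return layout

def encode_row(row, layout):
    """Encode one table row as a 0/1 vector following the layout."""
    return [1.0 if row[column] == category else 0.0 for column, category in layout]

# Made-up customer rows for illustration.
rows = [
    {"country": "UK", "account_type": "current", "segment": "retail"},
    {"country": "US", "account_type": "savings", "segment": "retail"},
    {"country": "US", "account_type": "current", "segment": "business"},
]
layout = one_hot_layout(rows, ["country", "account_type", "segment"])
print(len(layout))                  # 6: three columns become six dimensions
print(encode_row(rows[0], layout))  # [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```

Three columns with two values each already need six dimensions. With thousands of columns and many values per column, the vector the generator has to produce grows accordingly.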
At Hazy, we use a variety of models, including GANs, to learn the statistical properties of the database-backed source data. This begins with the generator learning the distribution of the values within each column.
The Hazy synthetic data generators are able to train on source data across large multi-table datasets and produce artificial data that the discriminator cannot tell apart from the real data.
Our unique models aim for statistical equivalence with the source data by learning the correlations among values in different columns.
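One simple way to check whether correlations between columns survive is to compare the pairwise correlations of the real and synthetic tables. The sketch below assumes numeric columns and uses Pearson correlation. The metric, the function names and the numbers are illustrative assumptions, not Hazy's evaluation method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_gap(real, synthetic):
    """Largest absolute difference in pairwise correlation between two tables.

    Each table is a dict mapping a column name to a list of numeric values.
    """
    columns = sorted(real)
    gaps = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            gaps.append(abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b])))
    return max(gaps)

# Made-up values for illustration.
real = {"age": [25, 35, 45, 55], "income": [20, 30, 45, 50], "balance": [5, 4, 8, 9]}
synthetic = {"age": [27, 33, 48, 52], "income": [22, 29, 44, 52], "balance": [4, 5, 9, 8]}
print(correlation_gap(real, synthetic))
```

A gap close to zero suggests the synthetic table reproduces the linear relationships between columns. It says nothing about non-linear dependencies, which would need other measures.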
Again, this isn’t some sort of pseudonymisation, obfuscation or redaction. This is entirely new artificial data that retains the correlations that are essential to machine learning engineers and data scientists.
And because the data is completely synthetic, the generator can be trained to be differentially private, which gives a mathematical bound on how much any one individual’s record can influence the output, and therefore on the risk of re-identification.
The great advantage of neural networks in GANs is that they work at multiple levels of abstraction. Each layer of the network learns features built on those of the layer before, which makes it possible to look at individual layers and get a sense of what the model is doing. This also allows GANs to transfer this learned knowledge across domains.
To me, it’s like GANs give machines the potential to progressively learn, apply and even create in a similar way to the human brain.
How GANs fulfil the potential of synthetic data use cases
What are some synthetic data use cases? Really any time you want to innovate on top of data that includes personally identifiable information. Or any data you want to share and get use out of without running the risk of leaks or attacks. And certainly any time you want to avoid the rigamarole of getting data unleashed from behind compliance barriers.
Remember, since GAN-created synthetic data is artificial, it contains no real customer records. Furthermore, we can add a differential privacy mechanism, which adds noise to the gradients of the discriminator, to ensure an extra level of safety.
Synthetic data use cases range from machine learning teams needing access to highly realistic data to QA testers wanting to test an app before it’s released and has any users.
Synthetic data unlocks faster innovation. This can mean safely testing prospective third-party integrations and partnerships. And it can also just enable cross-organisational, cross-border data portability.
It can also be used in fraud detection, where banks have many examples of non-fraudulent cases but, in order to improve their model, need just as many examples of fraudulent behaviour. Synthetic data helps here by rebalancing the dataset with additional generated examples of fraud.
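Adding noise to the discriminator's gradients usually follows the DP-SGD recipe: clip each record's gradient to a maximum norm, sum the clipped gradients, add Gaussian noise scaled to that norm, then average. The sketch below assumes that recipe. The function names, the gradients and the parameter values are illustrative, and working out the resulting privacy budget requires a separate privacy accountant that is not shown.

```python
import math
import random

def clip(gradient, max_norm):
    """Scale a per-example gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]

def private_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise is calibrated to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

# Made-up discriminator gradients for two training records.
grads = [[3.0, 4.0], [0.3, 0.4]]
print(private_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)))
```

Clipping limits how much any single record can move the discriminator, and the noise masks whatever influence remains. Since the generator only learns through the discriminator, it inherits that protection.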
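Rebalancing can be sketched as topping up the minority class with generated rows until the two classes are the same size. In the example below, `fake_fraud_row` is a hypothetical stand-in for a call to a trained generator conditioned on the fraud label, and the field names and values are made up.

```python
def rebalance(records, label_key, minority_label, generate):
    """Append generated minority-class rows until both classes are the same size."""
    minority = [r for r in records if r[label_key] == minority_label]
    majority = [r for r in records if r[label_key] != minority_label]
    shortfall = max(0, len(majority) - len(minority))
    return records + [generate() for _ in range(shortfall)]

def fake_fraud_row():
    # Hypothetical stand-in for sampling one fraudulent row from a trained generator.
    return {"amount": 999.0, "is_fraud": True, "synthetic": True}

# Five genuine transactions and one fraudulent one (made-up data).
transactions = [{"amount": 10.0 * i, "is_fraud": False} for i in range(1, 6)]
transactions.append({"amount": 750.0, "is_fraud": True})

balanced = rebalance(transactions, "is_fraud", True, fake_fraud_row)
print(len(balanced), sum(1 for r in balanced if r["is_fraud"]))  # 10 5
```

Marking generated rows with a flag such as `synthetic` keeps them easy to separate from real records later, for example when evaluating the fraud model on real data only.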
Similarly, the Hazy generative adversarial networks are able to perform what is called style transfer. Imagine a bank that has a million customers in the United States and has only opened up to 50,000 customers in the UK. When the bank is ready to expand, a GAN can take what it knows about the bank’s user behaviour in the U.S. and combine that knowledge with the income variability, age and educational level of customers in both countries.
First, the GAN generates synthetic data that protects all the personally identifiable information across the two distinct geographic regions. The model then trains on both the UK customer behaviour and the behavioural differences between the two countries. This model can then learn to translate American consumer behaviour into British consumer behaviour to predict how a British expansion might play out.
It’s not a crystal ball, but it can be a fairly solid predictor of customer behaviour.
At Hazy, we are perhaps most excited about our advances with the generative adversarial network. It has the potential to preserve and even improve on the utility of data, while still protecting privacy.
Because unlocking the promise of the information economy means you shouldn’t have to choose between speed and compliance.