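Hazy uses a range of best-in-class generator models to synthesise data, including Bayesian networks, the Synthetic Data Vault (SDV) and generative adversarial networks (GANs).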
Each of these has different properties and trade-offs, so different models work better with different data and for different use cases.
Bayesian models are one-shot learning models that use a counting approach. They discretise column distributions and build an optimal Bayesian network by maximising the mutual information score between columns.
This approach allows Bayesian models to deliver very high quality results on a variety of data sets. They can also run very fast, although in some cases their time complexity can make them unsuitable for larger data sets.
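The counting approach can be illustrated with a minimal sketch. It is not Hazy's implementation: it assumes equal-width binning and uses a Chow-Liu style maximum spanning tree as a simple stand-in for "maximising mutual information", and the names `discretise`, `mutual_information` and `chow_liu_edges` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def discretise(values, n_bins=4):
    """Map numeric values to equal-width bin indices in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Estimate mutual information (in nats) purely from co-occurrence counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def chow_liu_edges(columns):
    """Edges of a maximum spanning tree over pairwise mutual information."""
    names = list(columns)
    scored = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,
    )
    parent = {name: name for name in names}  # union-find parent pointers

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    edges = []
    for _, a, b in scored:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # this edge joins two components, so no cycle
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


# Toy data (made up): income depends on age, noise does not.
age = [20, 25, 30, 35, 40, 45, 50, 55]
columns = {
    "age": discretise(age),
    "income": discretise([a * 1000 for a in age]),
    "noise": discretise([1, 2, 1, 2, 1, 2, 1, 2]),
}
edges = chow_liu_edges(columns)  # the age-income edge is selected first
```

A tree restricts every column to at most one parent, which keeps the search cheap. A more general network structure allows several parents per column at a higher search cost, which is one way the time complexity mentioned above can grow on wide data sets.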
The Synthetic Data Vault (SDV) is a system that builds generative models of relational databases.
Like Bayesian models, SDV uses a traditional statistical approach and is a one-shot learning model. It works by fitting a statistical distribution for each column and modelling the dependency between columns by calculating the variance-covariance matrix.
SDV's main benefits are speed and simplicity: it is fast and easy to understand. However, its output quality can suffer, and it has limitations on the data values and structures it can support.
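The idea of fitting per-column distributions and a variance-covariance matrix can be sketched in a few lines. This is an illustration only and does not use the SDV library or its API: it assumes every column is numeric and normally distributed, and the helper names `fit_gaussian`, `cholesky` and `sample` are hypothetical.

```python
import math
import random
from statistics import mean


def fit_gaussian(rows):
    """Estimate column means and the sample variance-covariance matrix."""
    n, d = len(rows), len(rows[0])
    mu = [mean(row[j] for row in rows) for j in range(d)]
    cov = [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    return mu, cov


def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov must be positive definite)."""
    d = len(cov)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample(mu, cov, n_rows, seed=0):
    """Draw synthetic rows from the fitted multivariate normal."""
    rng = random.Random(seed)
    L = cholesky(cov)
    d = len(mu)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent standard normals
        rows.append(
            [mu[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        )
    return rows


# Toy source data (made up): two strongly correlated numeric columns.
source = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
mu, cov = fit_gaussian(source)
synthetic = sample(mu, cov, n_rows=1000)
```

Fitting is a single pass over the data, which is where the speed comes from. The normality assumption in this sketch also hints at the limitations: columns that are categorical, skewed or multi-modal need extra handling before a covariance matrix describes them well.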
Generative adversarial networks (GANs) are one of the most important technological breakthroughs in deep learning. Initially known for their ability to generate realistic synthetic images and videos, they've increasingly been shown to be applicable to other domains, including structured data.
GANs work by implicitly learning the probability distribution of a data set, and can then draw samples from that distribution. In many cases they can outperform conventional statistical generative models, both in capturing the correlation between columns and in scaling up to large data sets.
The task of selecting, configuring and optimising the right model for the data and use case is a core feature of the Hazy platform. We call it auto-tuning.
Auto-tuning sits between data selection and training. It’s where we automatically match the best model to the data and optimise its configuration.
Once a data set has been read from a connector and the generator use case configured, Hazy iterates through the configuration and parameter space for each candidate generator model, testing each combination to find the best-performing model and configuration.
This process allows Hazy to automatically choose, evaluate and apply the best synthetic data generation model for your data, and to re-validate this selection every time your data is read and your generator is re-trained.
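As a rough illustration of iterating through a configuration and parameter space, the sketch below runs an exhaustive grid search over candidate models. The search strategy, the `auto_tune` function, the `score` callback, the `max_trials` cap and the parameter names are all assumptions made for the example; they are not Hazy's API.

```python
from itertools import product


def auto_tune(candidates, score, max_trials=None):
    """Return (best_score, model_name, params) over the candidate configurations.

    candidates: dict mapping a model name to a dict of parameter -> list of values.
    score: callable (model_name, params) -> float, where higher is better.
    max_trials: optional cap on the number of configurations evaluated.
    """
    best = None
    trials = 0
    for model_name, grid in candidates.items():
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            if max_trials is not None and trials >= max_trials:
                return best  # budget exhausted, keep the best seen so far
            params = dict(zip(keys, values))
            result = score(model_name, params)
            trials += 1
            if best is None or result > best[0]:
                best = (result, model_name, params)
    return best


# Hypothetical search space; model and parameter names are illustrative only.
candidates = {
    "bayesian": {"n_bins": [5, 10]},
    "gan": {"epochs": [10, 100], "batch_size": [32]},
}


def toy_score(model_name, params):
    """Stand-in for training a generator and measuring synthetic data quality."""
    base = {"bayesian": 0.7, "gan": 0.6}[model_name]
    return base + 0.001 * sum(params.values())


best = auto_tune(candidates, toy_score)
```

In practice the `score` callback would train a candidate generator and evaluate the synthetic data it produces, which is why each trial is expensive. A cap such as `max_trials` trades some chance of finding the optimum for a bounded running time.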
Once a generator model is selected and configured, it is trained to learn the patterns and correlations in your source data. This trained model is then stored as a Hazy data generator that can be used to output representative synthetic data.
Hazy generators are re-trained on a regular basis, typically on a schedule (for example, once a day) or when new data becomes available. See outputs for more information on using a generator.
Auto-tuning and training are compute-intensive processes. Depending on the model and the data, it can take anywhere from minutes to weeks to train or re-train a generator.
The Hazy team can help you understand the trade-offs and likely performance impact of specific data set and model combinations, both for first-run and auto-run usage. Where performance is a blocker, the first steps are to:
- accelerate tuning and training using a GPU
- manually bound the auto-tuning / parameter exploration process
Once trained, generators are extremely fast, with synthetic data generation typically bounded by IO rather than compute.