Hazy’s synthetic data is optimised for data science, which means that it needs to be statistically valid, not just structurally valid.
Learning and carrying through the complex statistical properties of the source data is Hazy’s key task and differentiator. It achieves this using a range of algorithms and machine learning techniques, from Bayesian statistical models to deep learning algorithms including autoencoders and generative adversarial networks (GANs).
Ultimately, the aim of these techniques is to keep the output synthetic data as statistically similar to the source data as possible and to preserve as much of the source data’s utility as possible. As a result, Hazy measures the quality of its data in terms of:
- Similarity: statistical similarity to the source data.
- Utility: performance of models trained on the synthetic data.
- Privacy: risk of data disclosure.
- Sequential quality: how well sequential trends across data rows are preserved.
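To make the idea of a similarity score concrete, the sketch below computes histogram overlap for a single numeric column: 1.0 means the two binned distributions are identical, 0.0 means they share no bins. This is an illustration only and not Hazy’s own metric; the function names `histogram` and `histogram_similarity` and the toy Gaussian columns are assumptions made for the example.

```python
import random


def histogram(values, bins, lo, hi):
    """Return normalised bin frequencies for `values` over the range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def histogram_similarity(source, synthetic, bins=10):
    """Histogram overlap between two numeric columns (hypothetical helper)."""
    # Use a shared range so both columns are binned identically.
    lo = min(min(source), min(synthetic))
    hi = max(max(source), max(synthetic))
    p = histogram(source, bins, lo, hi)
    q = histogram(synthetic, bins, lo, hi)
    return sum(min(a, b) for a, b in zip(p, q))


# Toy columns standing in for real data (assumption: not produced by Hazy).
rng = random.Random(0)
source_col = [rng.gauss(50, 10) for _ in range(1000)]
close_col = [rng.gauss(50, 10) for _ in range(1000)]    # same distribution
shifted_col = [rng.gauss(80, 10) for _ in range(1000)]  # shifted mean

score_close = histogram_similarity(source_col, close_col)
score_shifted = histogram_similarity(source_col, shifted_col)
print(f"close: {score_close:.2f}, shifted: {score_shifted:.2f}")
```

A column drawn from the same distribution as the source scores higher than one with a shifted mean. A per-column score like this ignores relationships between columns, which is why it captures only one aspect of similarity.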
The metrics above are calculated using Hazy’s built-in evaluation system. It is also possible, and encouraged, to run your own bespoke evaluation. The process for this is:
- Take a pre-prepared subset of the source data.
- Ingest it into Hazy and generate equivalent synthetic data.
- Compare the performance of algorithms that use, or models trained on, the source data versus the synthetic data.
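The comparison step can be sketched as a "train on synthetic, test on real" check: fit the same model once on source data and once on synthetic data, then score both on held-out source data. The example below is a minimal sketch under stated assumptions. It uses a toy nearest-centroid classifier and randomly generated rows as stand-ins, and it does not call Hazy at all; `make_rows`, `fit_centroids`, `predict` and `accuracy` are hypothetical helpers written for this illustration.

```python
import random


def make_rows(rng, n, noise):
    """Toy labelled rows: label 0 clusters near (0, 0), label 1 near (2, 2)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 * label
        features = (rng.gauss(centre, noise), rng.gauss(centre, noise))
        rows.append((features, label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature vector for each label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to `features`."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose label is predicted correctly."""
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


rng = random.Random(42)
source_train = make_rows(rng, 500, noise=0.7)
source_test = make_rows(rng, 200, noise=0.7)      # held-out source data
synthetic_train = make_rows(rng, 500, noise=0.8)  # assumption: stand-in for generated data

acc_source = accuracy(fit_centroids(source_train), source_test)
acc_synthetic = accuracy(fit_centroids(synthetic_train), source_test)
utility_gap = acc_source - acc_synthetic
print(f"source: {acc_source:.3f}, synthetic: {acc_synthetic:.3f}, gap: {utility_gap:+.3f}")
```

Both models are scored on the same held-out source rows, so any difference in accuracy reflects the training data rather than the test set. In a real evaluation, `synthetic_train` would be replaced by data generated from the ingested source subset, and the toy classifier by the model or algorithm you actually intend to run.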
This bespoke evaluation flow is very common and can be automated and simplified by using the Python client to import generators and run tuning and testing natively within a Python environment.