What is synthetic data?
Synthetic data is structurally equivalent and statistically similar to the source data, whilst consisting entirely of artificial data points. Hazy’s synthetic data 'looks and feels' like the source data, preserving the same structure, value types, patterns, distributions and relations.
For example, imagine that you have a table with three columns:
A synthetic version of this table might look like this:
In a more realistic example, there would be many more rows, and the output data would be carefully generated to preserve the distributions and correlations between columns. Even so, this small example already highlights some key properties of synthetic data. The data:
- has the same structure / schema / value and entity types as the real data, which allows it to be used as a "drop-in" replacement for it
- looks and feels the same — to a data scientist "eyeballing" the data
- is similar to the source data but does not contain any real values or records
- offers no obvious way to reverse engineer the source data from the synthetic data
How is synthetic data generated?
Hazy uses generative models, which we call generators, to learn the properties of the source data and then generate representative synthetic data.
Imagine an SQL table with typed columns. For the table above, we might have a fixed-precision numeric column for salary, a string or enum column for gender and an integer column for age. As a starting point, you could implement generator functions that emit valid random values for those types, and then use them to generate valid rows of artificial data.
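As a rough sketch, such per-column generator functions might look like the following. The column names, value ranges and labels are illustrative assumptions, not part of any real schema:

```python
import random

# Hypothetical per-column generators for the example table: one function
# per column type, each emitting a structurally valid random value.

def random_salary():
    # fixed-precision numeric column: two decimal places
    return round(random.uniform(20_000, 120_000), 2)

def random_gender():
    # enum column: pick from a fixed set of labels
    return random.choice(["female", "male", "other"])

def random_age():
    # integer column
    return random.randint(18, 90)

def generate_row():
    # combine the column generators into one structurally valid record
    return {
        "salary": random_salary(),
        "gender": random_gender(),
        "age": random_age(),
    }

rows = [generate_row() for _ in range(5)]
```

Every row this produces is structurally valid, but the values carry none of the source data's statistics, which is exactly the limitation discussed next.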
This is how "dumb" test data is typically generated: take a schema and generate structurally valid data points and records. However, this is not enough for Hazy, as we need to generate data that is statistically valid as well as structurally valid.
So, with Hazy, our generators start by learning the distribution of values in each column. When we then generate a data point (for example, an age value in a synthetic record), we sample from a distribution estimator, so that the synthetic data points follow the same distribution as the source data.
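A minimal sketch of this idea, using a plain empirical frequency estimator on an assumed list of source ages (real generators use far richer estimators, and add smoothing and privacy protections rather than replaying observed values verbatim):

```python
import random
from collections import Counter

# Illustrative source ages; stands in for a real source column.
source_ages = [23, 25, 25, 31, 34, 34, 34, 42, 51, 60]

# Learn a simple empirical distribution: the relative frequency
# of each observed value.
counts = Counter(source_ages)
values = list(counts)
weights = [counts[v] for v in values]

# Sample synthetic ages according to those learned frequencies,
# so values that are common in the source stay common in the output.
synthetic_ages = random.choices(values, weights=weights, k=1000)
```

With enough samples, the value frequencies in `synthetic_ages` approach those in `source_ages`, which is the "statistically valid" property that dumb random test data lacks.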
This is just a starting point and we layer on a range of other statistical properties and sampling methods, for example to preserve relationships between columns and patterns in sequential data. This gives you a basic mental model of how smart synthetic data is generated.
Generator Models are serialised data files that can be used to generate synthetic data.
The serialised objects are essentially a compressed representation of the source data: they contain vector representations of the distribution estimates, patterns and relations in the source data. This allows generators to be copied from one environment to another, for example from an on-premises or production environment where the generator was trained to a lab or cloud environment where the generator can be used to provision data.
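To make the train-serialise-copy-generate workflow concrete, here is a toy version in which the "model file" holds only learned frequency parameters. This is an illustrative assumption about the shape of such a model; note that this toy stores observed values directly, whereas production generator models store distribution estimates with privacy protections:

```python
import json
import random
from collections import Counter

# Hypothetical minimal "generator model": fit learns frequency
# parameters from a source column; sample generates from them.

def fit(values):
    counts = Counter(values)
    return {"values": list(counts), "weights": [counts[v] for v in counts]}

def sample(model, n):
    return random.choices(model["values"], weights=model["weights"], k=n)

# Train in the source (e.g. production) environment and serialise
# the model; this string is what gets written to a model file.
model = fit([23, 25, 25, 31, 34, 34, 42])
blob = json.dumps(model)

# Copy the serialised model to a lab or cloud environment,
# deserialise it, and provision synthetic data there.
restored = json.loads(blob)
synthetic = sample(restored, 100)
```

Only `blob` crosses the environment boundary; generation in the target environment never touches the source records themselves.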
These generator-based workflows are very powerful. They allow you to provision data into less trusted environments without ever moving the source data.
For example, you can:
- provision data into lab or cloud environments without the source data ever leaving your production or on-premises environment
- share insight rather than sharing data by provisioning multiple generators into a shared environment and aggregating their output data
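The second workflow, sharing insight rather than data, can be sketched as follows. The two parties, their distribution parameters and the Gaussian model are all illustrative assumptions; the point is that only generators enter the shared environment, yet aggregate statistics over both datasets can still be computed there:

```python
import random
import statistics

# Hypothetical setup: two parties each trained a simple salary
# generator on their own private data. Only the learned parameters
# (not any records) are provisioned into the shared environment.
class SalaryGenerator:
    def __init__(self, mean, stdev):
        # distribution parameters learned elsewhere
        self.mean = mean
        self.stdev = stdev

    def sample(self, n):
        return [random.gauss(self.mean, self.stdev) for _ in range(n)]

party_a = SalaryGenerator(mean=45_000, stdev=8_000)
party_b = SalaryGenerator(mean=52_000, stdev=11_000)

# In the shared environment: generate from each and aggregate.
combined = party_a.sample(5_000) + party_b.sample(5_000)
overall_mean = statistics.mean(combined)
```

Neither party ever sees the other's source data, yet the combined synthetic sample supports cross-party analysis such as the overall salary mean above.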