What is synthetic data?

Hazy's synthetic data looks and feels like the source data, preserving the same structure, value types, patterns, distributions and relations, whilst actually being made up entirely of artificial data points and artificial data records.
For example, imagine that you have a table with three columns (illustrative values):

| age | gender | salary |
| --- | ------ | ------ |
| 34  | F      | 52,000 |
| 41  | M      | 48,500 |
| 28  | F      | 39,000 |
A synthetic version of this table might look like this:

| age | gender | salary |
| --- | ------ | ------ |
| 36  | F      | 51,200 |
| 39  | M      | 47,800 |
| 29  | F      | 41,500 |
Now, in a more realistic example, there would be lots more rows and the output data would be carefully generated to preserve the distributions and correlations between columns. However, we can already start to highlight some of the aspects of the synthetic data:
- it has the same structure / schema / value and entity types as the real data -- which allows it to be used as a "drop in" replacement for it
- it looks and feels the same -- to a data scientist "eyeballing" the data
- it's similar to the source data but doesn't contain any real values or records
- it's not obvious how you could reverse engineer the source data from the synthetic data
How is synthetic data generated?
Hazy uses generative models -- which we call generators -- to learn the properties of the source data and then generate representative synthetic data.
Imagine an SQL table with typed columns. If it were the table above, we might have a fixed-precision numeric column for salary, a string or enum column for gender and an integer column for age. As a starting point, you could implement generator functions that emit valid random values for those types and then use these to generate valid rows of artificial data.
This is how "dumb" test data is typically generated: take a schema and generate structurally valid data points and records. However, this isn't enough for Hazy, as we need to generate data that is statistically valid as well as structurally valid.
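This schema-driven approach can be sketched in a few lines of Python. The column names and types below mirror the example table; the value ranges are arbitrary assumptions for illustration, not Hazy's API:

```python
import random

# Hypothetical schema for the example table: column name -> a function
# that emits a structurally valid (but statistically naive) random value.
SCHEMA = {
    "age": lambda: random.randint(18, 90),                        # integer
    "gender": lambda: random.choice(["F", "M"]),                  # enum
    "salary": lambda: round(random.uniform(20_000, 120_000), 2),  # numeric
}

def generate_row(schema):
    """Emit one structurally valid row of artificial data."""
    return {column: make_value() for column, make_value in schema.items()}

def generate_table(schema, n_rows):
    """Emit n_rows of "dumb" test data from the schema."""
    return [generate_row(schema) for _ in range(n_rows)]

rows = generate_table(SCHEMA, 5)
```

Every row here is valid against the schema, but the values carry none of the source data's distributions or correlations.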
So, with Hazy, our generators start by learning the distribution of values in each column. When we then generate a data point -- for example, an age value in a synthetic data record -- we sample from a distribution estimator, which ensures that the synthetic data points follow the same distribution as the source data.
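A minimal sketch of this idea, using an empirical frequency estimate for a single hypothetical age column (a real generator would use richer, smoothed estimators):

```python
import random
from collections import Counter

# Hypothetical source column: ages observed in the real data.
source_ages = [28, 29, 34, 34, 36, 39, 41, 41, 41, 52]

# "Learn" the column's distribution as empirical frequencies.
counts = Counter(source_ages)
values = list(counts)
weights = [counts[v] for v in values]

def sample_ages(k):
    """Sample synthetic ages that follow the source distribution."""
    return random.choices(values, weights=weights, k=k)

synthetic_ages = sample_ages(1000)
```

Because this toy estimator only replays observed frequencies, every sampled value appears in the source column; a production generator would use a smoothed or parametric estimator so that synthetic values generalise beyond, rather than copy, the real ones.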
This is just a starting point and we layer on a range of other statistical properties and sampling methods, for example to preserve relationships between columns and patterns in sequential data. However, hopefully this gives you a basic mental model of how smart synthetic data is generated.
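To make the column-relationship idea concrete, here is a toy sketch that preserves a dependency between two columns by sampling one conditionally on the other (hypothetical data; not Hazy's actual method):

```python
import random
from collections import defaultdict

# Hypothetical source rows with a gender/salary dependency.
source = [("F", 52_000), ("F", 39_000), ("F", 41_500),
          ("M", 48_500), ("M", 61_000), ("M", 47_800)]

# Learn the marginal distribution of gender, and the salary
# distribution conditional on gender (empirical, for brevity).
genders = [g for g, _ in source]
salaries_by_gender = defaultdict(list)
for g, s in source:
    salaries_by_gender[g].append(s)

def sample_row():
    """Sample a synthetic row that preserves the column relationship."""
    gender = random.choice(genders)                     # marginal
    salary = random.choice(salaries_by_gender[gender])  # conditional
    return {"gender": gender, "salary": salary}
```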
Generator models are serialised data files that can be used to generate synthetic data.
The serialised objects are essentially a compressed representation of the source data. They contain vector representations of the distribution estimates, patterns and relations in the source data. This allows generators to be copied from one environment to another. For example, from an on-premise or production environment where the generator was trained to a lab or cloud environment where the generator can be used to provision data.
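A toy illustration of the train-serialise-restore workflow, using Python's pickle for the serialised file (the ColumnGenerator class is invented for this sketch; Hazy's actual serialisation format will differ):

```python
import pickle
import random
from collections import Counter

class ColumnGenerator:
    """Toy single-column generator: stores an empirical distribution
    estimate (a stand-in for Hazy's vector representations)."""

    def __init__(self, values):
        counts = Counter(values)
        self.support = list(counts)
        self.weights = [counts[v] for v in self.support]

    def sample(self, k):
        return random.choices(self.support, weights=self.weights, k=k)

# Train in the source (e.g. on-premise) environment...
generator = ColumnGenerator([28, 34, 34, 41, 52])

# ...serialise to a portable blob -- it holds only the aggregated
# distribution estimate, not row-level source records...
blob = pickle.dumps(generator)

# ...then, in the lab or cloud environment, restore and provision data.
restored = pickle.loads(blob)
ages = restored.sample(100)
```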
These generator-based workflows are very powerful. They allow you to provision data into less trusted environments without ever moving the source data.
- you can provision data into lab or cloud environments without the source data ever leaving your production or on-premise environment
- you can share insight rather than sharing data by provisioning multiple generators into a shared environment and aggregating their output data
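The insight-sharing workflow above can be sketched as follows (the two organisations and their values are invented for illustration):

```python
import random

# Hypothetical: two organisations each train a generator on their own
# data and share only the generators into a common environment.
def make_generator(source_values):
    """Return a sampler over the source values' empirical distribution."""
    values = list(source_values)
    return lambda k: random.choices(values, k=k)

org_a = make_generator([30_000, 45_000, 52_000])
org_b = make_generator([48_000, 61_000, 75_000])

# In the shared environment, aggregate synthetic output from both
# generators; no source record ever left either organisation.
pooled = org_a(500) + org_b(500)
```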