Glossary - Hazy

Auto-tuning

The task of selecting, configuring and optimising the right model for the data and use case.

Dataset

A particular set of records output from the source data or a Generator model. The exact records contained within a dataset may change over time, while the source schema remains constant.

Differential privacy

A technique that limits the influence any individual’s data has on the results of queries run on a synthetic dataset.

DMZ

Demilitarized Zone. A subnet that adds an extra layer of protection from external attacks.

Epsilon

A differential privacy setting of Generator model files. The Epsilon value (ε) controls the amount of noise applied to the data: in standard differential privacy, a smaller ε means more noise and a stronger privacy guarantee.
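
To make the relationship between ε and noise concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a count query. It is an illustration only and is not a description of how Generator models apply differential privacy; the names `laplace_noise`, `noisy_count` and `mean_abs_error` are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw zero-mean Laplace noise as the difference of two exponentials."""
    # expovariate takes a rate, so rate = 1 / scale gives mean = scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0  # adding or removing one person changes a count by 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average distance between the noisy and the true count (assumed 100)."""
    rng = random.Random(seed)
    total = sum(abs(noisy_count(100, epsilon, rng) - 100) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.3f}")
```

Running the script shows the error shrinking as ε grows, which is the trade-off between privacy and accuracy that the setting controls.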

Generator model

Data files that can be used to generate synthetic data.

They contain the serialised set of statistical properties of the source data, sufficient to re-create a synthetic version of the source.

Handler

A normalisation tool for managing rules, IDs, PII and entities such as business names, addresses and social security codes.

Multi-table synthesiser

Creates synthetic versions of tables related by foreign keys.
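
One property a multi-table result is usually expected to keep is referential integrity: every foreign key in a child table should point at a row in the parent table. The sketch below checks that property on plain lists of dictionaries. The function name `orphaned_rows` and the example tables are hypothetical, and this is not part of any synthesiser API.

```python
def orphaned_rows(parent_rows, child_rows, parent_key, foreign_key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Hypothetical synthetic tables: orders reference customers by customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # refers to a customer that does not exist
]

broken = orphaned_rows(customers, orders, "customer_id", "customer_id")
print(broken)  # [{'order_id': 12, 'customer_id': 3}]
```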

Privacy

A metric that measures the risk of the synthetic data being exploited to extract information about the source data.

Sequential data

Any data with an interdependence between records, for example data involving a time component, where the previous event affects the likelihood of the next event.
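
The idea that the previous event affects the likelihood of the next one can be illustrated by estimating first-order transition probabilities from an ordered event log. This is a minimal sketch with made-up event names, not a description of how sequential data is modelled by a synthesiser; `transition_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def transition_probabilities(events):
    """Estimate P(next event | previous event) from one ordered sequence."""
    counts = defaultdict(Counter)
    for previous, current in zip(events, events[1:]):
        counts[previous][current] += 1
    return {
        previous: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for previous, following in counts.items()
    }


# Hypothetical event log for a single user, in time order.
events = ["login", "browse", "browse", "purchase", "login", "browse", "logout"]
probs = transition_probabilities(events)

print(probs["login"])   # {'browse': 1.0}
print(probs["browse"])  # browse, purchase and logout each follow with probability 1/3
```

Shuffling the records would destroy these conditional probabilities, which is why sequential data cannot be treated as a set of independent rows.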

Similarity

A measure of how well the synthetic data preserves the statistical properties of the source data, such as distributions of values.
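
As an illustration of comparing distributions of values, the sketch below computes a histogram intersection between one categorical column in the source data and the same column in the synthetic data. This is one simple choice among many and is not claimed to be the metric the product reports; `histogram_similarity` and the example columns are hypothetical.

```python
from collections import Counter


def histogram_similarity(source, synthetic):
    """Histogram intersection of two categorical columns, between 0 and 1."""
    if not source or not synthetic:
        raise ValueError("both columns must be non-empty")
    src = Counter(source)
    syn = Counter(synthetic)
    categories = set(src) | set(syn)
    # For each category, keep the smaller of the two relative frequencies.
    return sum(
        min(src[c] / len(source), syn[c] / len(synthetic)) for c in categories
    )


source_column = ["a", "a", "b", "c"]
synthetic_column = ["a", "b", "b", "c"]
print(histogram_similarity(source_column, synthetic_column))  # 0.75
```

A score of 1 means the two columns have identical value frequencies, and a score of 0 means they share no values at all.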

Source data

The original data, which may contain sensitive or private information, used to train the Generator model.

Synthesiser

A model training pipeline that ingests the source data and uses it to train a Generator model. Packaged as a Docker/OCI container image.

Synthetic data

Data that is structurally equivalent and statistically similar to the source data, whilst being made entirely out of artificial data points.

Time-series data

See Sequential data.

Utility

A metric that compares the performance of a downstream model or algorithm when run on the synthetic data with its performance when run on the source data.
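
One common way to make this comparison concrete is to train the same model twice, once on source data and once on synthetic data, and score both on the same held-out real records. The sketch below does this with a deliberately tiny one-feature threshold classifier. All function names and data are hypothetical, and the ratio shown is an assumption for illustration, not the product's utility formula.

```python
def fit_threshold(rows):
    """Fit a 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, label in rows if label == 1]
    neg = [x for x, label in rows if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where (x > threshold) agrees with the label."""
    return sum((x > threshold) == bool(label) for x, label in rows) / len(rows)


def relative_utility(source_train, synthetic_train, holdout):
    """Accuracy of the synthetic-trained model divided by the source-trained one."""
    source_acc = accuracy(fit_threshold(source_train), holdout)
    synthetic_acc = accuracy(fit_threshold(synthetic_train), holdout)
    if source_acc == 0:
        raise ValueError("source-trained model scored 0; ratio is undefined")
    return synthetic_acc / source_acc


# Hypothetical (feature, label) rows; label 1 tends to have larger features.
source_train = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
good_synthetic = [(1.5, 0), (2.5, 0), (7.5, 1), (9.5, 1)]
poor_synthetic = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
holdout = [(0.5, 0), (3.0, 0), (7.0, 1), (10.0, 1)]

print(relative_utility(source_train, good_synthetic, holdout))  # 1.0
print(relative_utility(source_train, poor_synthetic, holdout))  # 0.75
```

A ratio close to 1 suggests the synthetic data supports the downstream task about as well as the source data does.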