Glossary
- Auto-tuning
-
The task of selecting, configuring and optimising the right model for the data and use case.
- Dataset
-
A specific set of records produced from the source data or by a Generator model. For example, the exact data contained within a dataset may change over time, while the source schema remains constant.
- Differential privacy
-
Limits the influence that any individual’s data has on the results of queries run on a synthetic dataset.
- DMZ
-
Demilitarized Zone. A subnet that sits between the internal network and external networks, adding an extra layer of protection against external attacks.
- Epsilon
-
A differential privacy setting of Generator model files. The Epsilon value (ε) controls the amount of noise applied to the data: a smaller ε means more noise and stronger privacy, while a larger ε means less noise and weaker privacy.
- Generator model
-
Data files that can be used to generate synthetic data.
They contain the serialised set of statistical properties of the source data, sufficient to re-create a synthetic version of the source.
- Handler
-
A normalisation tool for managing rules, IDs, personally identifiable information (PII) and entities such as business names, addresses and social security codes.
- Multi-table synthesiser
-
Creates synthetic versions of tables related by foreign keys.
- Privacy
-
Measures the risk of exploiting the synthetic data to extract information about the source data.
- Sequential data
-
Any data with interdependence between records. For example, in data involving a time component, the previous event affects the likelihood of the next event.
- Similarity
-
A measure of how well the synthetic data matches the statistical properties of the source data, such as distributions of values.
- Source data
-
The original data, which may contain sensitive or private information, used to train the Generator model.
- Synthesiser
-
A model training pipeline that ingests the source data and uses it to train a Generator model. Packaged as a Docker/OCI container image.
- Synthetic data
-
Data that is structurally equivalent and statistically similar to the source data, whilst being made entirely out of artificial data points.
- Time-series data
-
See Sequential data.
- Utility
-
Measures how well a downstream model or algorithm performs when run on the synthetic data compared with the source data.
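
To make the Epsilon entry above more concrete, the following minimal sketch shows how ε relates to noise in the textbook Laplace mechanism applied to a simple count query. This is a generic illustration of differential privacy, not a description of how Generator models are trained; the function names (`laplace_noise`, `noisy_count`, `mean_abs_error`) and the example count of 100 are hypothetical and chosen only for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Return a count with Laplace noise of scale sensitivity / epsilon added."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Smaller epsilon gives a larger scale, i.e. more noise.
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


def mean_abs_error(epsilon: float, trials: int = 20000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over many trials."""
    rng = random.Random(seed)
    true_count = 100  # hypothetical query result
    total = sum(abs(noisy_count(true_count, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

In this sketch the average error shrinks roughly in proportion to 1/ε, which is the trade-off the Epsilon setting expresses: lower values add more noise and give stronger privacy, at the cost of accuracy.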