The task of selecting, configuring and optimising the right model for the data and use case.
A specific set of records produced from the source data or a Generator model. For example, the exact data contained within a dataset may change over time, while the source schema remains constant.
- Differential privacy
Minimises the influence any individual’s data has on queries run on a synthetic dataset.
- DMZ
Demilitarized Zone. A subnet that adds an extra layer of protection from external attacks.
- Epsilon
A differential privacy setting of Generator model files. The Epsilon value (ε) controls the amount of noise applied to the data: in standard differential privacy, a smaller ε means more noise and stronger privacy.
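To illustrate how ε relates to noise, here is a minimal sketch of the textbook Laplace mechanism for a counting query. It is an illustration only and not necessarily the mechanism used when training a Generator model; the function names `laplace_scale` and `noisy_count` are hypothetical.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale of the Laplace noise: smaller epsilon gives a larger scale (more noise)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def noisy_count(true_count: float, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Return the count plus Laplace noise with scale sensitivity / epsilon."""
    scale = laplace_scale(sensitivity, epsilon)
    # The difference of two independent exponential draws with rate 1/scale
    # follows a Laplace distribution centred on 0 with the given scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed so the example is repeatable
strong_privacy = noisy_count(100, epsilon=0.5, rng=rng)   # noise scale 2.0
weak_privacy = noisy_count(100, epsilon=10.0, rng=rng)    # noise scale 0.1
```

With a sensitivity of 1, ε = 0.5 gives a noise scale of 2.0, while ε = 10 gives a scale of 0.1, so results drawn with the larger ε tend to stay closer to the true count.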
- Generator model
Data files that can be used to generate synthetic data.
They contain the serialised set of statistical properties of the source data, sufficient to re-create a synthetic version of the source.
A normalisation tool for managing rules, IDs, PII and entities such as business names, addresses and social security codes.
- Multi-table synthesiser
Creates synthetic versions of tables related by foreign keys.
Measures the risk of exploiting the synthetic data to extract information about the source data.
- Sequential data
Any data with interdependence between records, for example data involving a time component, where the previous event affects the likelihood of the next event.
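As a rough illustration of this dependence, the sketch below estimates, from a toy event log, how likely each event is given the event before it. The event names and the helper `transition_probabilities` are made up for the example; this is a simple first-order view and does not describe how any particular synthesiser models sequences.

```python
from collections import Counter, defaultdict


def transition_probabilities(events):
    """Estimate P(next event | previous event) from one ordered sequence."""
    counts = defaultdict(Counter)
    # Count each (previous, next) pair of consecutive events.
    for prev, nxt in zip(events, events[1:]):
        counts[prev][nxt] += 1
    # Normalise the counts for each previous event into probabilities.
    return {
        prev: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for prev, c in counts.items()
    }


# Hypothetical event log, ordered by time.
events = ["login", "browse", "browse", "purchase", "login", "browse", "purchase"]
probs = transition_probabilities(events)
```

In this toy log, `probs["browse"]` assigns a higher probability to `"purchase"` than to `"browse"`, which is information that would be lost if the records were treated as independent rows.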
The measurement of how well the synthetic data preserves the statistical properties of the source data, such as distributions of values.
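One simple way to compare the distribution of values in a single categorical column is the total variation distance, sketched below. This is an assumed example metric chosen for illustration, not necessarily the one used to compute this measurement; the function name and sample values are hypothetical.

```python
from collections import Counter


def total_variation_distance(source_values, synthetic_values):
    """Return a distance in [0, 1]: 0 means identical category frequencies."""
    if not source_values or not synthetic_values:
        raise ValueError("both columns must be non-empty")
    p = Counter(source_values)
    q = Counter(synthetic_values)
    n_p, n_q = len(source_values), len(synthetic_values)
    categories = set(p) | set(q)
    # Half the sum of absolute differences between relative frequencies.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


source_column = ["a", "a", "b", "b"]
synthetic_column = ["a", "b", "b", "b"]
distance = total_variation_distance(source_column, synthetic_column)
```

A value close to 0 suggests the synthetic column reproduces the source frequencies well, while a value of 1 means the two columns share no categories.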
- Source data
The original data, which may contain sensitive or private information, used to train the Generator model.
A model training pipeline that ingests the source data and uses it to train a Generator model. Packaged as a Docker/OCI container image.
- Synthetic data
Data that is structurally equivalent and statistically similar to the source data, whilst being made entirely out of artificial data points.
- Time-series data
See Sequential data.
Measures the performance of a downstream model or algorithm when run on the source data compared to the synthetic data.