Glossary

Auto-tuning

The task of selecting, configuring and optimising the right model for the data and use case.

Bridge

A communication channel that allows for data to be transferred between different environments, without a direct network connection.

This allows the Trainer, which requires access to production data and usually runs within a DMZ environment, to move Generator models into the more open network hosting the Hub.

Dataset

A particular instance of the records output from the source data or a Generator model. For example, the exact data contained within a dataset may change over time, while the source schema remains constant.

Differential privacy

Minimises the influence any individual’s data has on queries run on a synthetic dataset.

DMZ

Demilitarized Zone. A subnet that adds an extra layer of protection from external attacks.

Epsilon

A differential privacy setting of Generator model files. The Epsilon value (ε) controls the amount of noise applied to the data: the smaller the value, the more noise is added and the stronger the privacy guarantee.
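To make the relationship between ε and noise concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a simple count. It is an illustration only, not a description of how Hazy synthesisers apply differential privacy; the function names `noisy_count` and `mean_abs_noise` and the `sensitivity` parameter are hypothetical.

```python
import random


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return true_count plus Laplace noise of scale sensitivity / epsilon.

    Illustrative only: a textbook Laplace mechanism, not Hazy's implementation.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


def mean_abs_noise(epsilon, n=2000, seed=0):
    """Average absolute noise over n draws, to compare epsilon values."""
    rng = random.Random(seed)
    return sum(abs(noisy_count(0.0, epsilon, rng=rng)) for _ in range(n)) / n


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute noise = {mean_abs_noise(eps):.3f}")
```

Running the script shows the average noise shrinking as ε grows, which is why a small ε gives more privacy at the cost of accuracy.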

Generator model

Data files that can be used to generate synthetic data.

They contain the serialised set of statistical properties of the source data, sufficient to re-create a synthetic version of the source.

Generator version

Each Generator has one or more versions, each version being one or more Generator models trained on a particular snapshot of the source data.

Generators

Generators encapsulate a collection of Generator models produced from a particular source dataset.

A Generator is organised into versions. Each version is a set of models learned from the same dataset but with different training parameters, for example epsilon.

Generators belong to an organisation and are accessed through the API by Users belonging to Teams within that organisation.

Handler

A normalisation tool for managing rules, IDs, PII and entities such as business names, addresses and social security numbers.

Hub

An access control system for Generator models. It provides you with abstractions and user interfaces for creating and organising Generators, uploading Generator models and controlling access permissions for Users.

The Hub is the centre of your Hazy installation. At its core, it is a web application providing a user interface (UI) for configuration and an HTTP API that provides access to the trained Generator models.

The main role of the Hub is to host and serve Generator models. It acts as a single point of configuration, allowing for a network of other Hazy components to communicate and coordinate between different network zones over bridges.

Multi-table synthesiser

Creates synthetic versions of tables related by foreign keys.

Organisation

Represents a company or other discrete entity in the Hub. Each organisation can have its own set of Generators, users and teams.

Privacy

Measures the risk of exploiting the synthetic data to extract information about the source data.

Sequential data

Any data that has an interdependence between records, for example data involving a time component, where the previous event affects the likelihood of the next event.

Similarity

A measure of how well the synthetic data preserves the statistical properties of the source data, such as distributions of values.
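As a rough illustration of comparing distributions of values, the sketch below computes the histogram overlap between one categorical column in the source data and the same column in the synthetic data. The metric and the function name `histogram_similarity` are assumptions chosen for the example; the definition above does not state which similarity metrics Hazy actually reports.

```python
from collections import Counter


def histogram_similarity(source_values, synthetic_values):
    """Histogram overlap of two categorical columns, between 0.0 and 1.0.

    1.0 means identical value distributions; 0.0 means no shared values.
    Illustrative metric only, not necessarily the one used by Hazy.
    """
    if not source_values or not synthetic_values:
        raise ValueError("both columns must contain at least one value")
    source_counts = Counter(source_values)
    synthetic_counts = Counter(synthetic_values)
    n_source = len(source_values)
    n_synthetic = len(synthetic_values)
    categories = set(source_counts) | set(synthetic_counts)
    # For each category, keep the smaller of the two relative frequencies.
    return sum(
        min(source_counts[c] / n_source, synthetic_counts[c] / n_synthetic)
        for c in categories
    )


if __name__ == "__main__":
    source = ["a", "a", "b", "b"]
    synthetic = ["a", "b", "b", "b"]
    print(histogram_similarity(source, synthetic))
```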

Source data

The original data, which may contain sensitive or private information, used to train the Generator model.

Synthesiser

A model training pipeline that ingests the source data and uses it to train a Generator model. Packaged as a Docker/OCI container image.

Synthetic data

Data that is structurally equivalent and statistically similar to the source data, whilst being made entirely out of artificial data points.

Teams

The Hub’s mechanism for granting fine-grained permissions to access Generator models via the API and web interface.

Trainer

The server that runs the synthesiser in order to train Generator models from some source data.

Utility

Measures the performance of a downstream model or algorithm when run on the source data compared to the synthetic data.
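One common way to make this comparison concrete is to train the same model once on source data and once on synthetic data, score both on held-out real records, and take the ratio. The sketch below does this with a toy one-feature threshold classifier. The setup, the ratio and the names `fit_threshold`, `accuracy` and `utility` are assumptions for illustration, not Hazy's actual utility metric.

```python
def fit_threshold(rows):
    """Fit a toy classifier: the midpoint between the two class means.

    rows is a list of (x, label) pairs with label 0 or 1. The toy model
    assumes class 1 tends to have larger x values than class 0.
    """
    zeros = [x for x, label in rows if label == 0]
    ones = [x for x, label in rows if label == 1]
    if not zeros or not ones:
        raise ValueError("training rows must contain both classes")
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where predicting 1 for x > threshold is correct."""
    correct = sum(1 for x, label in rows if (x > threshold) == (label == 1))
    return correct / len(rows)


def utility(source_train, synthetic_train, holdout):
    """Accuracy when trained on synthetic data, relative to source data."""
    source_score = accuracy(fit_threshold(source_train), holdout)
    synthetic_score = accuracy(fit_threshold(synthetic_train), holdout)
    if source_score == 0:
        raise ValueError("source-trained model scored 0; ratio is undefined")
    return synthetic_score / source_score


if __name__ == "__main__":
    source = [(1, 0), (2, 0), (8, 1), (9, 1)]
    synthetic = [(1.5, 0), (2.5, 0), (7.5, 1), (8.5, 1)]
    holdout = [(0, 0), (3, 0), (7, 1), (10, 1)]
    print(utility(source, synthetic, holdout))
```

A ratio close to 1 suggests the synthetic data supports the downstream task about as well as the source data does.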