Source data persistence

Hazy uses generative models, which we call generator models, to learn the properties of the source data and then generate representative synthetic data.

Generator models are serialised data files that can be used to generate synthetic data. The serialised data files contain vector representations of the distribution estimates, patterns and relations in the source data.

Below we detail all the cases where information about source data gets persisted in the generator models.

1. Metrics visualisations

The distributions of the source data (labelled as “real”) are shown to users who have access to the Hazy UI.

For the similarity metrics: the generator model in the Hazy UI persists the column names, the minimum and maximum values of all numerical columns, and the probabilities of the processed (discretised) data.
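To make the shape of such a summary concrete, here is a minimal sketch of a per-column record holding a name, a minimum, a maximum and the probabilities of discretised values. It is an illustration only: the function name `similarity_summary`, the equal-width binning and the dictionary layout are assumptions, not Hazy's actual implementation or file format.

```python
def similarity_summary(column_name, values, n_bins=4):
    """Hypothetical per-column summary: name, min, max and bin probabilities.

    Equal-width binning is an assumption made for illustration; the real
    discretisation used by the product may differ.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # The maximum value is clamped into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    return {
        "column": column_name,
        "min": lo,
        "max": hi,
        "bin_probabilities": [c / total for c in counts],
    }


ages = [18, 22, 25, 31, 40, 40, 58, 58]
summary = similarity_summary("age", ages)
print(summary)
```

In this sketch the exact minimum and maximum of the column are kept, while the remaining values are reduced to bin probabilities rather than stored row by row.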

For the utility metrics: no source data is persisted in the model.

For the privacy metrics: a summary of the source data joint probability is persisted in the model.

All this information is accessible to the users in the organisation with the right access levels.

2. Categorical data

All distinct values for columns marked as categorical from the source data will be stored by the model (the model needs to save all distinct categories so it can sample from them).
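The sketch below shows why sampling a categorical column requires the distinct values to be kept. The functions `fit_categorical` and `sample_categorical` are hypothetical, and storing frequency weights alongside the categories is an assumption made for illustration; the text above only states that the distinct values themselves are stored.

```python
import random


def fit_categorical(values):
    """Hypothetical fit step: keep every distinct value and its count."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    categories = sorted(counts)
    return {
        "categories": categories,                    # stored verbatim
        "weights": [counts[c] for c in categories],  # assumption: frequency weights
    }


def sample_categorical(model, n, seed=None):
    """Draw n synthetic values from the stored categories."""
    rng = random.Random(seed)
    return rng.choices(model["categories"], weights=model["weights"], k=n)


model = fit_categorical(["UK", "FR", "UK", "DE"])
print(model["categories"])
print(sample_categorical(model, 5, seed=0))
```

Because the categories are stored verbatim, any column marked as categorical has each of its distinct source values written into the model.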

3. Reference tables

If the configuration includes reference tables then the model will store those reference tables from the source data.

4. Training/generation run failure

If a training run fails, no model is produced, so nothing is uploaded to the UI. Consequently, no source data is saved to the UI either.

5. Temporary files stored on the disk during and after training

Temporary files persist differentially private distributions of the source data. These distributions preserve the privacy of all individual data points. Please refer to our Differential Privacy section.

Caching

As noted in Database Subsetting, it is possible for the subsetting component to read directly from a local cache. In this case, source data is stored in a configurable disk location but is encrypted with a password known only to the hub.

Swapping

To ensure that source data is not swapped to disk during training, swapping should be disabled. This is typically available as an option in the chosen container runtime. At the time of writing, swap is not an option in Kubernetes, but this is an area of active development, so care should be taken in future to avoid swapping to disk if that is undesirable. For Docker, swap can be disabled using the --memory-swap option (see the Docker documentation for details).
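As a sketch of the Docker case: when `--memory-swap` is set to the same value as `--memory`, the container has no access to swap. The helper below only assembles such a command line. The function name `docker_run_without_swap`, the image name `example/trainer:latest` and the `8g` limit are placeholders chosen for illustration, not values taken from the Hazy documentation.

```python
import shlex


def docker_run_without_swap(image, memory="8g", extra_args=()):
    """Build a `docker run` argument list that gives the container no swap.

    Setting --memory-swap equal to --memory leaves the container with
    no swap on top of its memory limit.
    """
    return [
        "docker", "run", "--rm",
        f"--memory={memory}",
        f"--memory-swap={memory}",
        *extra_args,
        image,
    ]


# Placeholder image name, for illustration only.
cmd = docker_run_without_swap("example/trainer:latest")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` or copied into a shell. The memory limit should be sized to the training workload, since a container that exceeds it is killed rather than swapped out.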

6. Source data stored in the hub

The hub stores summary statistics from the source data (min value, max value, mean, median, null count and cardinality) and the first five values from each column, in order to help the user configure the data types correctly.
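The statistics listed above can be illustrated with a short sketch. The function `column_summary` and its output layout are hypothetical, and the treatment of nulls (excluded from the statistics and the cardinality, but kept in the sample) is an assumption rather than documented hub behaviour.

```python
from statistics import mean, median


def column_summary(values, sample_size=5):
    """Hypothetical per-column summary for a numerical column."""
    present = [v for v in values if v is not None]
    summary = {
        "null_count": len(values) - len(present),
        "cardinality": len(set(present)),
        # The first few raw values are kept as-is from the source.
        "sample": values[:sample_size],
    }
    if present:
        summary.update(
            min=min(present),
            max=max(present),
            mean=mean(present),
            median=median(present),
        )
    return summary


print(column_summary([3, None, 7, 7, 10, 1, 4]))
```

Unlike the aggregate statistics, the sample consists of raw source values, which is why access to it is controlled separately.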

User access to this may be configured via the hazy/SampleDataViewer scope. Please refer to Access Control for more information.

Storage of sample data can be switched off with the configuration setting ANALYSER_INCLUDE_SAMPLE_DATA. If this is set to false, sample data will not be stored in the hub. This makes some configuration decisions slightly harder if the range and format of the columns are not known a priori.