Source data persistence
Hazy uses generative models, which we call generator models, to learn the properties of the source data and then generate representative synthetic data.
Generator models are serialised data files that can be used to generate synthetic data. The serialised data files contain vector representations of the distribution estimates, patterns and relations in the source data.
Below we detail all the cases where information about source data gets persisted in the generator models.
1. Metrics visualisations
For the similarity metrics, the following information about the source data is persisted in the generator model in the Hazy UI: column names, the min and max values of all numerical columns, and the probabilities of the processed (discretised) data.
For the utility metrics, no source data is persisted in the model.
For the privacy metrics: a summary of the source data joint probability is persisted in the model.
All this information is accessible to the users in the organisation with the right access levels.
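To make the similarity-metrics case concrete, the following is a minimal sketch of the kind of per-column summary described above: a column name, its min and max, and the probabilities of the discretised values. It is an illustration only, not Hazy's implementation; the function name `summarise_numeric_column`, the equal-width binning and the `n_bins` parameter are all assumptions.

```python
from collections import Counter


def summarise_numeric_column(name, values, n_bins=4):
    """Hypothetical summary of one numerical column: name, min, max, bin probabilities."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant columns
    # Map each value to an equal-width bin; the maximum falls into the last bin.
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return {
        "column": name,
        "min": lo,  # a real value taken from the source data
        "max": hi,  # a real value taken from the source data
        "bin_probabilities": [bins.get(i, 0) / total for i in range(n_bins)],
    }


summary = summarise_numeric_column("age", [18, 22, 25, 31, 40, 47, 52, 58])
print(summary)
```

Note that even in a sketch like this, the min and max are actual source values, which is why they are listed among the persisted information.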
2. Categorical data
All distinct values for columns marked as categorical from the source data will be stored by the model (the model needs to save all distinct categories so it can sample from them).
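The reason distinct categories must be stored can be seen in a small sketch. This is not Hazy's generator code; the class name `CategoricalColumnSketch` and the frequency-weighted sampling are assumptions chosen only to show that sampling a categorical column requires keeping the original category values.

```python
import random
from collections import Counter


class CategoricalColumnSketch:
    """Hypothetical model of one categorical column."""

    def __init__(self, values):
        counts = Counter(values)
        # Every distinct source value is kept verbatim so it can be sampled later.
        self.categories = sorted(counts)
        self.weights = [counts[c] / len(values) for c in self.categories]

    def sample(self, n, seed=None):
        """Draw n synthetic values from the stored categories."""
        rng = random.Random(seed)
        return rng.choices(self.categories, weights=self.weights, k=n)


column = CategoricalColumnSketch(["GB", "FR", "GB", "DE", "GB"])
synthetic = column.sample(10, seed=0)
print(column.categories, synthetic)
```

Every synthetic value is one of the stored categories, so a column should only be marked as categorical if its distinct values are acceptable to persist.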
3. Reference tables
If the configuration includes reference tables then the model will store those reference tables from the source data.
4. Training/generation run failure
If a training run fails, no model is produced, so no model is uploaded to the UI. Therefore, none of the source data is saved to the UI either.
5. Temporary files stored on the disk during and after training
Temporary files persist differentially private distributions of the source data. These distributions preserve the privacy of all individual data points. Please refer to our Differential Privacy section.
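As a rough illustration of what a differentially private distribution can look like, the sketch below applies the textbook Laplace mechanism to histogram counts. This is a generic technique, not necessarily the mechanism Hazy uses; the function name `dp_histogram`, the `epsilon` value and the clip-and-normalise post-processing are assumptions.

```python
import random


def dp_histogram(counts, epsilon, seed=None):
    """Hypothetical Laplace mechanism: noisy bin counts normalised to probabilities."""
    rng = random.Random(seed)
    noisy = []
    for count in counts:
        # The difference of two exponentials with rate epsilon is Laplace noise
        # with scale 1/epsilon (histogram sensitivity is 1 per added/removed row).
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy.append(max(count + noise, 0.0))  # clip negative counts to zero
    total = sum(noisy)
    if total == 0:
        # Degenerate case: fall back to a uniform distribution.
        return [1 / len(counts)] * len(counts)
    return [c / total for c in noisy]


probabilities = dp_histogram([30, 10, 20, 20], epsilon=1.0, seed=0)
print(probabilities)
```

The stored values are noisy proportions rather than exact counts, which is what limits what can be inferred about any single row.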
Caching
As noted in Database Subsetting, it is possible for the subsetting component to read directly from a local cache. In this case, source data is stored in a configurable disk location but is encrypted with a password known only to the hub.
Swapping
To ensure that source data is not swapped to disk during training, swapping should be disabled. This is typically available as an option in the chosen container runtime. At the time of writing, swap is not an option in Kubernetes, but that is an area of active development, so care should be taken in future to avoid swapping to disk if that is undesirable. For Docker, swap can be disabled using the --memory-swap option (see the Docker documentation for details).
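For Docker, one way to apply this is to set --memory-swap to the same value as --memory, which, per the Docker documentation, leaves the container with no access to swap. The sketch below only builds the command line; the helper name `docker_run_args`, the image name `example/trainer:latest` and the 8g limit are hypothetical placeholders, not values from a Hazy deployment.

```python
def docker_run_args(image, memory="8g"):
    """Build `docker run` arguments that give the container no swap."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,
        # Equal to --memory: the container cannot use any swap.
        "--memory-swap", memory,
        image,
    ]


# Hypothetical image name, used only for illustration.
args = docker_run_args("example/trainer:latest", memory="8g")
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run` or copied into a shell; the important part is that both limits carry the same value.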
6. Source data stored in the hub
The hub stores summary statistics from the source data (min value, max value, mean, median, null count and cardinality) and the first five values from each column, in order to help the user configure the data types correctly.
User access to this may be configured via the hazy/SampleDataViewer scope. Please refer to Access Control for more information.
This can be switched off with the configuration setting ANALYSER_INCLUDE_SAMPLE_DATA. If this is set to false, then sample data will not be stored in the hub. This has the effect of making some configuration decisions slightly harder if the range and format of the columns are not known a priori.
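The per-column information described in this section can be pictured with a short sketch. It is not the hub's analyser code: the function name `summarise_column` and the `include_sample_data` parameter (standing in for ANALYSER_INCLUDE_SAMPLE_DATA) are hypothetical, and the sketch assumes that the summary statistics are still kept when sample data is switched off.

```python
import statistics


def summarise_column(values, include_sample_data=True):
    """Hypothetical per-column summary: statistics plus optional first five values."""
    present = [v for v in values if v is not None]
    summary = {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "null_count": len(values) - len(present),
        "cardinality": len(set(present)),
    }
    if include_sample_data:
        # The first five raw values, shown to help choose data types.
        summary["sample"] = values[:5]
    return summary


ages = [34, 51, None, 29, 34, 62, 45]
with_sample = summarise_column(ages)
without_sample = summarise_column(ages, include_sample_data=False)
print(with_sample)
print(without_sample)
```

With the flag off, no raw rows appear in the summary, but the min and max still reveal two real values from the column.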