Use case recommendations

Although the configuration of data types and table relationships in Hazy is use-case agnostic, the purpose, audience, and eventual location of the synthetic data do influence the decisions you make when adjusting your model parameters: for example, to increase or decrease the level of utility, similarity, or privacy, to reduce training time, or to reduce RAM usage. In this section, we give model parameter recommendations for three use cases and describe how we expect the scores to differ under these parameters.

Testing

A testing use case focuses on ensuring that business logic is preserved, along with some top-level statistics about the data. Generally, the user prioritises performance over statistical accuracy.

Model setting    Recommended range
n_parents        1 – 2
n_bins           10 – 50
epsilon          1e-6 – 1e-3
max_cat          10 – 50
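
To make the ranges concrete, the sketch below collects them into a plain Python dictionary, using one representative value from inside each range. This is illustrative only: the `testing_params` name and the dictionary structure are assumptions and not Hazy's configuration API; only the setting names (`n_parents`, `n_bins`, `epsilon`, `max_cat`) come from the table above.

```python
# Minimal sketch: example starting values for a testing use case,
# chosen from within the recommended ranges in the table above.
# The dictionary structure is illustrative, not Hazy's actual API.

testing_params = {
    "n_parents": 2,      # recommended range: 1 to 2
    "n_bins": 25,        # recommended range: 10 to 50
    "epsilon": 1e-4,     # recommended range: 1e-6 to 1e-3
    "max_cat": 30,       # recommended range: 10 to 50
}
```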

We recommend you enable and assess the following metrics:

Data Analysis

A data analysis use case focuses on ensuring that statistics are preserved, including cross-column statistics, so that a user can produce a reasonably accurate report or dashboard from the generated synthetic data.

Model setting    Recommended range
n_parents        2 – 3
n_bins           50 – 100
epsilon          1e-4 – 1
max_cat          100 – 250

We recommend you enable and assess the following metrics:

ML Modelling

A machine learning modelling use case focuses on ensuring that the overall utility found in the source data for a specific machine learning task is preserved when that task is attempted after training on the generated synthetic dataset. While this does not necessarily guarantee that a model trained on synthetic data will be viable in production, it can speed up the fine-tuning of a modelling strategy without needing to access or analyse the source dataset.

Model setting    Recommended range
n_parents        3 – 4
n_bins           100 – 150
epsilon          1e-5 – 1
max_cat          150 – 250
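
Because the three use cases differ only in the values of the same four settings, it can be convenient to keep the recommended ranges side by side and look them up by use case. The sketch below is illustrative only: the `USE_CASE_PRESETS` and `recommended_range` names are assumptions introduced here, not part of Hazy's API, and the (low, high) tuples simply mirror the tables in this section.

```python
# Illustrative summary of the recommended ranges from this section.
# Each entry is a (low, high) tuple; choose a value inside the range
# that suits your data volume, performance, and privacy requirements.

USE_CASE_PRESETS = {
    "testing": {
        "n_parents": (1, 2),
        "n_bins": (10, 50),
        "epsilon": (1e-6, 1e-3),
        "max_cat": (10, 50),
    },
    "data_analysis": {
        "n_parents": (2, 3),
        "n_bins": (50, 100),
        "epsilon": (1e-4, 1),
        "max_cat": (100, 250),
    },
    "ml_modelling": {
        "n_parents": (3, 4),
        "n_bins": (100, 150),
        "epsilon": (1e-5, 1),
        "max_cat": (150, 250),
    },
}


def recommended_range(use_case: str, setting: str) -> tuple:
    """Return the recommended (low, high) range for a model setting."""
    return USE_CASE_PRESETS[use_case][setting]


# Example: look up the epsilon range for the data analysis use case.
print(recommended_range("data_analysis", "epsilon"))  # (0.0001, 1)
```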

We recommend you enable and assess the following metrics: