Configuration
A Configuration defines the parameters for the model training process. The Hub bootstraps the creation of a configuration by first running an analysis step. The Configuration is broken down into steps and covers both the data schema configuration, where the columns in the data are mapped to Hazy's defined data types, and the model parameters, which define how the generative model is trained.
You will need to complete all the following stages in order to produce a finished configuration that is ready to be used to train a model.
Choose source files¶
The Hub can connect to your source data stored as flat files formatted as `csv`, `csv.gz`, `parquet` and `avro`, from one of the following locations:
- S3 bucket
- Google Cloud Storage
- Azure Blob Storage
- Locally mounted file system (bespoke installation only)
Or from a database connection to a supported database type:
- Microsoft SQL Server
- Snowflake
- Db2
Each file will be treated as a table in the configuration. Choose as many files or database tables as you wish to synthesise. You can also choose to include all files or tables from a given location. More information can be found here.
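As an illustration, here is a minimal sketch of enumerating candidate flat files in an S3 bucket before selecting them in the Hub. The bucket name and prefix are hypothetical, and it assumes boto3 credentials are already configured; the Hub performs this discovery for you.

```python
import boto3

# Hypothetical bucket and prefix; substitute your own source location.
BUCKET = "my-source-data"
PREFIX = "exports/"
SUPPORTED = (".csv", ".csv.gz", ".parquet", ".avro")

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Each matching file would become one table in the configuration.
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(SUPPORTED):
            print(obj["Key"], obj["Size"])
```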
Analyse source data¶
Once you have chosen your source files, you can run our preliminary analysis on them. This will give you an overview of the statistical properties of the data, and help you decide which properties you wish to model in the synthetic data.
The analysis step will also identify any potential issues with the data, such as missing values, outliers, or data types that are not supported by the Hub.
Our automatic analysis connects to the chosen source and determines:
- Primary, composite and foreign key constraints (for multi-table datasets)
- Statistical properties of all columns, including data types, cardinality, and distribution of values
- An appropriate Hazy data type for each column, where one can be inferred. See Data types.
- Common date formats such as `%d%m%Y`. More unusual formats may not be detected and will require manual configuration.
The results of data type detection should always be reviewed by someone with an understanding of the data itself.
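The Hub runs this analysis for you, but a rough pandas equivalent (illustrative only; the file and column names are hypothetical) shows the kind of properties being inspected:

```python
import pandas as pd

# Hypothetical source file; the Hub analyses your chosen source automatically.
df = pd.read_csv("customers.csv")

for col in df.columns:
    series = df[col]
    print(
        col,
        series.dtype,         # raw data type
        series.nunique(),     # cardinality
        series.isna().mean(), # fraction of missing values
    )

# Detecting a common date format such as %d%m%Y:
parsed = pd.to_datetime(df["signup_date"], format="%d%m%Y", errors="coerce")
print("unparseable dates:", parsed.isna().sum())
```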
Structure¶
Confirm the structure of your data. This step allows you to review or edit constraints between tables. You may need an external reference for the data structure to verify these relationships if they are not obvious from the column names. We show a map of the relationships as a DAG (Directed Acyclic Graph).
You can also choose how each table will be processed. The options are:
- Tabular: The default option. Each row is treated as an independent record.
- Sequential: The rows are treated as a sequence of events.
- Reference: The table is treated as a set of references for other tables, and passed through as is, rather than being synthesised.
See Table types for more information.
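If you want to verify a suspected relationship outside the Hub, here is a small sketch (with hypothetical table and column names) of checking a foreign key constraint and confirming that the relationship map stays acyclic:

```python
import pandas as pd
import networkx as nx

orders = pd.read_csv("orders.csv")        # hypothetical child table
customers = pd.read_csv("customers.csv")  # hypothetical parent table

# Every foreign key value in the child table should exist in the parent.
dangling = ~orders["customer_id"].isin(customers["id"])
print("dangling foreign keys:", dangling.sum())

# The table relationships should form a DAG, as shown in the Hub's map.
graph = nx.DiGraph([("orders", "customers")])  # edge: child -> parent
print("is a DAG:", nx.is_directed_acyclic_graph(graph))
```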
Columns¶
This step allows you to review or edit the data types of each column, and to choose which columns to include in the synthetic data. Each Hazy data type may have additional options that need to be set. Note that although the Hub analysis usually suggests an appropriate data type, there are many other possible ways to configure columns that may be more appropriate, especially if you want to capture relationships with other columns.
For a complete guide to the available data types and their options, see Data types. A number of example configurations for handling common scenarios are shown in Examples.
To help you decide the best data type and settings, we show a summary of the statistical properties of each column, along with some sample values.
You can also choose to drop a column to exclude it from being synthesised. This is useful for columns that are not relevant to the synthetic data, or cannot be easily synthesised.
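For example, here is a sketch (over hypothetical data) of the kind of per-column summary that can guide a keep/drop decision, such as spotting a near-unique free-text column that is hard to synthesise:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source table

for col in df.columns:
    series = df[col]
    unique_ratio = series.nunique() / len(series)
    # A ratio near 1.0 on a text column often signals free text or an ID,
    # which may be a candidate to drop or to configure as an ID type.
    print(col, f"unique ratio {unique_ratio:.2f}",
          "sample:", series.dropna().head(3).tolist())
```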
The table view of the columns can be sorted by any of the headings, for instance to sort columns with invalid or incomplete configurations first. You can also search for a column or table name to filter the list, or use our advanced search operators:
- `is:error` – Show only columns with errors
- `is:drop` – Show only columns that have been dropped
- `is:pii` – Show only columns that have been marked as containing PII
- `is:{datatype}` – Show only columns of the specific data type, e.g. `is:category`, `is:id`, `is:date`, etc.
- `is:{entity}` – Show only columns of the specific entity type, e.g. `is:person`, `is:location`, `is:combination`, etc. You can then filter further to the unique entity ID, e.g. `is:person#1`.
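The operators compose with plain text search. As a sketch of the semantics only (not the Hub's implementation), a query splits into `is:` filters and free-text terms:

```python
def parse_query(query: str):
    """Split a search query into is: filters and free-text terms."""
    filters, terms = [], []
    for token in query.split():
        if token.startswith("is:"):
            filters.append(token[3:])  # e.g. "error", "person#1"
        else:
            terms.append(token)
    return filters, terms

print(parse_query("is:category is:pii customer"))
# (['category', 'pii'], ['customer'])
```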
General settings¶
Here, you can adjust other parameters that impact the model training process. In most cases, the default values are sufficient.
See also: full technical documentation on model parameters.
Evaluation¶
This step allows you to use our suite of metrics to compare synthetic data to source data, and to determine ratings for statistical qualities and privacy risks. Four metrics are enabled by default: Marginal distribution, Mutual information, Cross-table mutual information, and Degree distribution.
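To illustrate the first two metrics, here is a sketch over hypothetical source and synthetic tables (one common way to score a marginal distribution is total variation distance; this is not necessarily Hazy's implementation):

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

source = pd.read_csv("customers.csv")       # hypothetical source table
synth = pd.read_csv("customers_synth.csv")  # hypothetical synthetic output

# Marginal distribution: total variation distance between value frequencies.
p = source["segment"].value_counts(normalize=True)
q = synth["segment"].value_counts(normalize=True)
tvd = 0.5 * p.subtract(q, fill_value=0).abs().sum()
print("total variation distance:", tvd)

# Mutual information between two columns, computed in each dataset:
# similar values suggest the column relationship was preserved.
print("source MI:", mutual_info_score(source["segment"], source["region"]))
print("synth  MI:", mutual_info_score(synth["segment"], synth["region"]))
```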
Read more about metrics.
Compliance¶
We evaluate your configuration against our compliance checklist to help you ensure that your synthetic data will comply with GDPR and other regulations. We also provide a summary of the compliance risks of your configuration, along with suggestions for how to mitigate them. At this point, having reviewed all the previous steps, you should be able to make an informed decision about the compliance risks of your configuration; you can record this decision, which will also be noted in the model and generation tasks.
Train¶
The configuration is exported, and you can start the training process once all configuration validation rules pass. We also provide the full configuration as a Python or JSON file, which you can modify and use to train a model via our API.
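For example, here is a sketch of loading the exported JSON configuration, adjusting a parameter, and submitting it for training. The endpoint, token, and configuration keys here are hypothetical stand-ins; see the API documentation for the real interface.

```python
import json
import requests

# Load and tweak the exported configuration (keys are hypothetical).
with open("config.json") as f:
    config = json.load(f)
config["model_parameters"]["epochs"] = 200  # hypothetical parameter name

# Hypothetical endpoint; consult the Hazy API documentation for the real one.
resp = requests.post(
    "https://hub.example.com/api/models/train",
    json=config,
    headers={"Authorization": "Bearer <API_TOKEN>"},
)
resp.raise_for_status()
print(resp.json())
```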
Live logs are displayed during training, but you can leave this screen without affecting the background task, and the training progress is also indicated on the models list page.