After generation, the integrity of the synthetic data is checked against the config used to train the model and against metadata from the training data. This step can be enabled or disabled via the validation field of GenerationConfig. If enabled, the results of these checks are displayed in the Hub for each generation job. The result of each validator is one of:
- PASS: No issues were found.
- WARN: An issue was found, and the user must judge whether it needs addressing.
- FAIL: A more serious issue was found that requires the user's attention.
The following checks are performed:
A check that compares several characteristics of the source data with those of the generated synthetic data. These are:
- The cardinality of each column.
- The number of missing values in each column.
- The average record length of each column.
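The three characteristics above can be sketched in a few lines of plain Python. This is only an illustration of what is being compared, not the product's implementation: the function names `column_profile` and `compare_profiles`, the representation of a table as a list of dicts, and the use of `None` for a missing value are all assumptions made for the example.

```python
from statistics import mean


def column_profile(rows, column):
    """Summarise one column: cardinality, missing count, average value length.

    Assumption: a table is a list of dicts and a missing value is None.
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "cardinality": len(set(present)),
        "missing": len(values) - len(present),
        "avg_length": mean(len(str(v)) for v in present) if present else 0.0,
    }


def compare_profiles(source_rows, synthetic_rows, columns):
    """Return {column: (source_profile, synthetic_profile)} for side-by-side review."""
    return {
        col: (column_profile(source_rows, col), column_profile(synthetic_rows, col))
        for col in columns
    }


source = [{"city": "Oslo"}, {"city": "Rome"}, {"city": None}]
synthetic = [{"city": "Oslo"}, {"city": "Oslo"}, {"city": "Rome"}]
report = compare_profiles(source, synthetic, ["city"])
```

Here the source column has one missing value and the synthetic column has none, which is the kind of difference such a check would surface for the user to judge.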
Column Counts and Names
A check that the number of columns in the generated synthetic data matches that of the source data, and that the column names are the same.
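A minimal sketch of such a comparison is shown below. The function name `check_columns` and the mapping of a mismatch to FAIL are assumptions for illustration; the documentation does not state which result a mismatch produces.

```python
def check_columns(source_columns, synthetic_columns):
    """Compare column counts and names between source and synthetic data.

    Returns (result, missing, unexpected). Treating any mismatch as FAIL
    is an assumption made for this example.
    """
    missing = [c for c in source_columns if c not in synthetic_columns]
    unexpected = [c for c in synthetic_columns if c not in source_columns]
    same_count = len(source_columns) == len(synthetic_columns)
    ok = same_count and not missing and not unexpected
    return ("PASS" if ok else "FAIL"), missing, unexpected


result, missing, unexpected = check_columns(["id", "name"], ["id", "nmae"])
# result is "FAIL"; missing is ["name"]; unexpected is ["nmae"]
```

Reporting the missing and unexpected names separately makes it easier to spot a renamed or misspelled column than a bare count mismatch would.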
Data Controller Conformance Validation
A check that, before training, the Data Controller signed off that configuration was conducted with care and in accordance with our guidance to mitigate compliance and privacy concerns. By signing off, the Data Controller also assumes accountability for any product misuse that leads to such issues. Read more about compliance.
Differential Privacy Validation
PII in Category Columns Validation
A check that lists all category columns that were selected as containing PII during configuration. This presents a privacy risk because the generated data will contain samples from the source column values. Read more about PII and compliance.
A check that all foreign keys in a child table link to a primary key in a parent table. We do not support generating orphaned records in child tables, so the result of this validator is binary: any orphaned record produces a FAIL. Note that this validator is only run if the generated data contains at least one table with keys pointing to another table.
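The idea of an orphaned record can be illustrated with a short sketch. The helper `find_orphans`, its parameters, and the list-of-dicts table representation are hypothetical and chosen only for this example.

```python
def find_orphans(parent_rows, child_rows, primary_key, foreign_key):
    """Return child rows whose foreign key matches no primary key in the parent."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


parents = [{"id": 1}, {"id": 2}]
children = [{"parent_id": 1}, {"parent_id": 3}]

orphans = find_orphans(parents, children, "id", "parent_id")
# A single orphan is enough for a binary FAIL.
result = "FAIL" if orphans else "PASS"
```

In this example the child row with `parent_id` 3 has no matching parent, so the sketch yields FAIL.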
A check that the number of generated records matches that of the source data, after adjusting for train-test-split and magnitude settings. Note that several factors may mean the synthetic record count does not exactly match that of the source data. For example, in the multi-table case, the number of times a parent key appears in a child table is modelled on the source data, and is therefore a statistical property. For this reason, the validator returns a PASS result as long as the count falls within a certain range.
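The range-based comparison can be sketched as follows. Everything specific here is an assumption made for illustration: the function name `check_record_count`, the formula for the expected count (source count multiplied by the training fraction and the magnitude), and the 10% default tolerance. The documentation does not state the actual adjustment or the width of the accepted range.

```python
def check_record_count(source_count, synthetic_count,
                       train_fraction=1.0, magnitude=1.0, tolerance=0.1):
    """Return (expected, within_range) for a synthetic record count.

    Assumptions for this sketch: the expected count is
    source_count * train_fraction * magnitude, and the accepted range
    is expected +/- tolerance (as a fraction of expected).
    """
    expected = source_count * train_fraction * magnitude
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    return expected, lower <= synthetic_count <= upper


# 1000 source rows, 80% used for training, magnitude 2: about 1600 rows expected.
expected, within_range = check_record_count(
    1000, 1580, train_fraction=0.8, magnitude=2.0
)
```

With these hypothetical settings, 1580 generated records fall inside the accepted range even though the count is not exactly 1600.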
A check that all regex patterns supplied by the user during training are respected in the generated synthetic data.
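A minimal sketch of a pattern check using Python's standard `re` module is shown below. The helper `find_pattern_violations`, the column-to-pattern mapping, and the choice to require a full match of each value are assumptions for the example, not a description of the product's behaviour.

```python
import re


def find_pattern_violations(rows, patterns):
    """Return (row_index, column, value) for every value that breaks its pattern.

    `patterns` maps a column name to a regex string. Requiring the whole
    value to match (re.fullmatch) is an assumption of this sketch.
    Missing values (None) are skipped.
    """
    compiled = {col: re.compile(pattern) for col, pattern in patterns.items()}
    violations = []
    for index, row in enumerate(rows):
        for col, regex in compiled.items():
            value = row.get(col)
            if value is not None and not regex.fullmatch(str(value)):
                violations.append((index, col, value))
    return violations


rows = [{"postcode": "AB123"}, {"postcode": "ab123"}]
violations = find_pattern_violations(rows, {"postcode": r"[A-Z]{2}\d{3}"})
# violations == [(1, "postcode", "ab123")]
```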