New sequential synthesiser¶
We have now added a new model to our list of Hazy models - the sequential DGAN synthesiser (based on the DoppelGANger workflow). This model is specialised for synthesising sequential data, such as banking transactions and IoT data. Unlike the existing standard models from Hazy, the new sequential model generates each new data point while taking into account the current state of the data set, allowing seasonality and other trends to be mirrored in the synthetic data.
Use cases for generating synthetic data using our sequential synthesiser include:
- Modelling seasonality
- Capturing trends over time
- Detecting fraudulent activities spread across multiple time periods
Known limitations of the sequential synthesiser:
- Recurring entries, such as regular payments in the banking transactions use case, are not included.
- Forecasting beyond the training data window is not supported.
- While we are working on a differentially private version of our sequential synthesiser, the current version is not yet differentially private. All of the generated data is still synthetic, however, and your source data remains safe.
- We are currently unable to add unexpected or out-of-the-ordinary entries.
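The core idea of sequential generation, producing each new point conditioned on the state of the series so far, can be illustrated with a toy autoregressive sketch. This is illustrative only: the `generate_sequence` function below is hypothetical and is not the Hazy API or the DGAN model itself.

```python
import math
import random

def generate_sequence(n, season_period=7, noise=0.1, seed=0):
    """Toy autoregressive generator: each new point depends on the
    previous value (trend persistence) plus a seasonal term, mimicking
    how a sequential synthesiser conditions on the current state."""
    rng = random.Random(seed)
    values = [0.0]
    for t in range(1, n):
        seasonal = math.sin(2 * math.pi * t / season_period)
        # Next point = 80% of the previous state + seasonality + noise
        values.append(0.8 * values[-1] + seasonal + rng.gauss(0, noise))
    return values

series = generate_sequence(28)
```

Because each value feeds into the next, seasonality and trends persist through the generated series, which is exactly what a row-independent tabular model cannot do.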
New sequential discriminator metric¶
In order to measure the performance of our new sequential synthesiser, we have built a new specialised metric: the discriminator similarity metric. This method uses an RNN classifier model that looks at each sequence of data points and tries to determine whether it comes from the real data or is synthetic. We then group the results into False Positives, False Negatives, True Positives and True Negatives. If the sequential synthesiser has done a good job, the classifier will not be able to determine which sequences are synthetic and will end up with an even split across the four categories.
Statistical metric indicators¶
All metrics in Hazy will now also have a star rating between 1 and 5 to represent how well a model is performing with respect to that metric. This is not a pass or fail rating but is instead intended to normalise the metric scores as the scale of “bad” to “good” is not linear and is not the same between metrics. This improved visualisation is aimed to aid non-technical users when assessing model performance against our metrics.
Note: Whether synthetic data is suitable is determined by your use case and, as such, models with a low star rating on some metrics may still be satisfactory. Furthermore, models with a low star rating on “privacy” metrics are no less secure than a 5-star rating.
- The boundaries that indicate what rating is attributed to each metric have been determined by Hazy.
- Star ratings have been averaged across scores for metrics with multiple scores (e.g. metrics with one score per table use the average score across tables to determine the star rating).
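The averaging-then-bucketing described above can be sketched as follows. The boundary values here are made up for illustration; the real boundaries are determined by Hazy per metric:

```python
def star_rating(scores, boundaries=(0.2, 0.4, 0.6, 0.8)):
    """Average a metric's per-table scores (assumed 0..1) and map the
    mean onto a 1-5 star scale using rating boundaries. The boundaries
    here are illustrative placeholders, not Hazy's actual thresholds."""
    mean = sum(scores) / len(scores)
    stars = 1
    for bound in boundaries:
        if mean >= bound:
            stars += 1
    return stars
```

Because the boundaries need not be evenly spaced, this mapping can normalise metrics whose "bad" to "good" scale is non-linear.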
User management¶
We have now incorporated user management into our configurator, allowing users to have individual sign-in access. This feature moves Hazy one step closer to deprecating our old Hub and unifying our two systems. A system admin can create, remove and edit users, and manage the mappings between Hazy and your LDAP server (if one is used).
- No "private" projects: Users will still see all projects created by different users on Hazy.
- System administrator can still operate as a regular user.
- User management section has a different UI theme.
Disable sample view¶
We now have a feature flag that allows us to globally remove samples from the Hub. This stops Hazy from storing samples of the source data. Please contact Hazy support on installation if you would like to enable this feature and remove samples from Hazy.
- Not applied by table/schema.
- Currently unable to apply this feature on a per-user basis.
DB2 write via UI¶
This feature extends the E2E UI to support writing data to a local or remote IBM Db2 database. The user provides the credentials of the target database in the E2E UI.
- Connections must be configured on setup.
- Credentials must be provided for a database user with write access to tables storing the synthetic data.
- Only one connection can be used – no schemas across different databases.
- Only one schema can be used – no tables across different schemas in a database.
- Cannot specify new table names on generation.
Generated synthetic sample¶
With this feature, users can now check the quality and contents of the output model before committing time and computational resources to generate the entire synthetic dataset. Should there be an issue, a user need not run the entire process to validate.
The user can enable or disable this feature and can configure the size (magnitude) of the generated sample.
Functional validation log¶
With the functional validation log, a user can check the list of functional configurations and the logic applied during synthetic data generation in one go. Consequently, this improves the user's confidence in sharing the data post-generation.
For the initial release, the report is provided in the generation log. This will eventually be migrated to a permanent UI feature.
- Referential integrity checks between selected tables of synthetic data.
- Comparing rows and columns of generated synth and source tables.
- Characteristics validation (unique count, missing value count, average string length).
- Regex ID validation (validates that generated IDs match the regex pattern specified in the configuration).
- Available in log output for this version.
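Two of the checks above, regex ID validation and characteristics validation, can be sketched in a few lines. These helpers are hypothetical illustrations of the checks, not the actual validation-log code:

```python
import re

def validate_ids(ids, pattern):
    """Regex ID validation: return the generated IDs that fail to match
    the pattern specified in the configuration."""
    compiled = re.compile(pattern)
    return [i for i in ids if not compiled.fullmatch(i)]

def characteristics(values):
    """Characteristics validation: unique count, missing-value count and
    average string length, comparable between synthetic and source data."""
    present = [v for v in values if v is not None]
    return {
        "unique": len(set(present)),
        "missing": len(values) - len(present),
        "avg_len": sum(len(str(v)) for v in present) / len(present),
    }
```

Running both over the synthetic output and the source tables, then comparing the results, is the essence of the one-go functional check.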
Payment datatype detection¶
Enhanced datatype detection for payment information has been extended beyond IBAN/BBAN to support payment Sort Codes, Swift codes, Credit Card numbers and Phone numbers.
- Unable to automatically detect phone number locales.
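A minimal sketch of how such detection can work, using regular expressions plus a Luhn checksum for card candidates. The patterns below are simplified assumptions for illustration; production detection handles far more edge cases:

```python
import re

# Illustrative patterns only; real detection is more thorough.
PATTERNS = {
    "sort_code": re.compile(r"^\d{2}-\d{2}-\d{2}$"),  # UK sort code, e.g. 12-34-56
    "swift": re.compile(r"^[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}([A-Z0-9]{3})?$"),  # BIC/SWIFT
    "card": re.compile(r"^\d{13,19}$"),  # credit-card number candidate
}

def luhn_ok(number):
    """Luhn checksum used to confirm credit-card candidates."""
    digits = [int(d) for d in number][::-1]
    total = 0
    for i, d in enumerate(digits):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect_payment_type(value):
    """Return the first matching payment datatype, or None."""
    if PATTERNS["sort_code"].match(value):
        return "sort_code"
    if PATTERNS["swift"].match(value):
        return "swift"
    if PATTERNS["card"].match(value) and luhn_ok(value):
        return "credit_card"
    return None
```

The Luhn step shows why card detection is more than a digit-count check: a 16-digit string is only treated as a card number if its checksum is valid.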
Configurator UI parity¶
Several changes have been made to bring parity between what can be configured in the UI and what can be configured through the Python SDK:
- Compound ID
- Conditioned ID
- ID Mixture
These ID types were previously only configurable through the Python SDK; now they can also be configured through the UI via JSON text entry boxes. This form of input is validated on the fly, like other UI inputs, during configuration.
An additional update has been made to allow cross-table dependencies for most data types. For example, when configuring an AgeType, if the target age column exists in TableA and the date of birth column exists in TableB, this is now configurable through the UI (previously only configurable through the Python SDK).
The following differences remain between what can be configured in the UI and what can be configured through the Python SDK:
- Sequential tables.
- Reference tables.
- Denormal items.
- Cross-table dependencies for PersonType and Formulas.
- Name samplers cannot be used as IdTypes.
- Custom handlers not superseded by data types (these can however be configured via a JSON text box on the general settings page).
Datetime as unique primary key¶
A datetime column can now be set as the sole primary key of a table; it no longer has to be part of a composite key to work.
- Datetimes can only be sampled from the original range of the data; the model cannot extend datetime ranges into the past or future beyond the source data.
- The table should contain more rows than the n_bins parameter set in the model parameters.
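The two constraints above can be made concrete with a toy sampler that draws unique datetime keys only from within the source range, bin by bin. This is a hypothetical sketch of the behaviour, not the model's actual sampling code:

```python
import random
from datetime import datetime, timedelta

def sample_datetime_keys(start, end, n_rows, n_bins=10, seed=0):
    """Sample n_rows unique datetimes strictly inside [start, end];
    like the model, this cannot extrapolate beyond the source range,
    and it requires n_rows > n_bins."""
    if n_rows <= n_bins:
        raise ValueError("need more rows than n_bins")
    rng = random.Random(seed)
    span = (end - start).total_seconds()
    seen = set()
    while len(seen) < n_rows:
        # Pick a bin, then a uniform offset inside it, to respect a
        # binned distribution while keeping keys unique.
        b = rng.randrange(n_bins)
        offset = (b + rng.random()) * span / n_bins
        seen.add(start + timedelta(seconds=offset))
    return sorted(seen)
```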
Preview Support for Google Cloud Storage¶
Google Cloud Storage can now be used for input and output in the same way that S3 is available.
- Currently the GOOGLE_APPLICATION_CREDENTIALS environment variable must be set explicitly and point to a valid service account key file.
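A fail-fast check for that requirement can look like the following sketch; the `check_gcs_credentials` helper is hypothetical, but the GOOGLE_APPLICATION_CREDENTIALS variable and the JSON service-account key format are standard Google Cloud conventions:

```python
import json
import os

def check_gcs_credentials():
    """Fail fast if GOOGLE_APPLICATION_CREDENTIALS is unset or does not
    point to a parseable service-account key file."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.isfile(path):
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS must point to a key file"
        )
    with open(path) as f:
        key = json.load(f)  # a service-account key file is JSON
    return key.get("type") == "service_account"
```

Running a check like this before a job starts surfaces credential problems immediately, rather than partway through generation.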
Fixes¶
- Parent/child key column ordering for ReferentialIntegrityValidator log.
- CPR in Synth training.
- Fkey col settings.
- Warning in Analyser._analyse_fkeys_df if column contains all-Nones.
- Bug listfilesdispatchtaskmodel error col.
- Docs for dispatch CLI installation.
- Reference to BankLocale in IBAN/BBAN (now IbanLocale).
- SWIFT codes are now generated in the correct format that corresponds to the locales provided.
Hazy Synthesiser / Configurator¶
- Project navigation.
- Alphabetically sorted tables.
- Speed up primary key analysis by getting non-none unique values once.
- SWIFT code formats (swift type and locales) can now be inferred from the data using the