Hub - Configurator workflow integration¶
This release marks the merger of the previous Configurator and Hazy Hub into a single end-to-end Hub. The new Hazy Hub acts as a configuration tool, job manager, and model repository in one: there is no longer any need to move models or configurations between the Configurator and the Hub, as everything is controlled from the same UI. The old versions of the Configurator and the Hub are no longer being updated by the engineering team, and support for them will be phased out over the coming months.
- Fine-grained access control is being replaced to make it simpler and better suited to its purpose. We have implemented RBAC in this release (see below) as the foundation of a new security model.
Role Based Access Control¶
In the previous release of Hazy, we introduced basic IAM features that gave users their own login and password credentials. In this release we have built on this system by incorporating role-based access control (RBAC). Admins can now grant a range of CRUD permissions for each object type in Hazy, including:
- Jobs (analysis, training, and generation).
Users can also control other functionality such as:
- Viewing samples.
- Viewing metrics.
- Viewing functional QA reports.
Full details of what can be controlled and how to do it can be found in our documentation.
- Resource-level permissions are not yet supported.
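The permission model can be pictured as a mapping from roles to per-object-type CRUD grants. The sketch below is a generic illustration of that idea in Python; the role names, object types, and function are hypothetical, not Hazy's actual RBAC implementation:

```python
from enum import Flag, auto

class Perm(Flag):
    """CRUD permissions an admin can grant per object type."""
    NONE = 0
    CREATE = auto()
    READ = auto()
    UPDATE = auto()
    DELETE = auto()

# Hypothetical roles and grants, purely for illustration.
ROLES = {
    "viewer": {"jobs": Perm.READ, "metrics": Perm.READ},
    "operator": {"jobs": Perm.CREATE | Perm.READ | Perm.UPDATE},
    "admin": {"jobs": Perm.CREATE | Perm.READ | Perm.UPDATE | Perm.DELETE},
}

def is_allowed(role: str, object_type: str, perm: Perm) -> bool:
    """Return True if `role` grants `perm` on `object_type`."""
    granted = ROLES.get(role, {}).get(object_type, Perm.NONE)
    return perm in granted
```

With this shape, a viewer can read jobs but not delete them, while an admin holds all four CRUD permissions on jobs.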
Functional validation in UI¶
In this release, we have promoted the functional validation log to a feature in the UI. Users can now check the synthetic data against the full list of functional configurations and applied generation logic in one go, and see which items pass, fail, or raise a warning. The checks include:
- Referential integrity checks between selected tables of synthetic data.
- Row and column comparison between generated synthetic and source tables.
- Characteristics validation (unique count, missing value count, avg string length).
- Regex ID validation.
- Datatypes applied per column.
- Regex ID validation is not supported when: 1) a mix of regex and non-regex IDs is used in a single column (e.g. when using an ID mixture); 2) mismatched data is preserved (e.g. when using a conditioned ID); 3) a regex is used in a compound ID.
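As a rough illustration of the characteristics checks listed above (unique count, missing value count, average string length), here is a minimal pass/warning/fail comparison in Python. The 10% warning tolerance and the function name are assumptions for the sketch, not Hazy's actual thresholds or code:

```python
import statistics

def validate_characteristics(source, synthetic):
    """Compare simple per-column characteristics between source and
    synthetic values; return 'pass', 'warning', or 'fail' per check."""
    def stats(values):
        non_null = [v for v in values if v is not None]
        return {
            "unique_count": len(set(non_null)),
            "missing_count": len(values) - len(non_null),
            "avg_str_length": statistics.mean(len(str(v)) for v in non_null)
                              if non_null else 0.0,
        }
    src, syn = stats(source), stats(synthetic)
    results = {}
    for key in src:
        if src[key] == syn[key]:
            results[key] = "pass"
        elif abs(src[key] - syn[key]) <= 0.1 * max(src[key], 1):
            results[key] = "warning"  # assumed 10% tolerance
        else:
            results[key] = "fail"
    return results
```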
Database debug write¶
In this release we have introduced a debug mode for writing to databases. This is useful when a database write fails because a constraint is violated. With this feature, if the initial attempt to write the entire synthetic dataset fails, Hazy automatically falls back to writing the data in batches. When a batch fails, each of its records is written individually until the specific record that caused the failure is reached; that record is then saved, alongside the corresponding error message, to a log file so the information can be used to debug the issue. This feature applies to both SQL Server and IBM DB2 databases.
- Writing data in batches/individually for each record may impact performance.
- At present, the only available option to store the failed records is in CSV format. If there is a substantial number of failed records this may take up a large amount of storage space.
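The fallback behaviour described above can be sketched as follows. `write_batch` and `write_one` stand in for the real database writers, and the CSV log mirrors the described failure output; this is an illustrative sketch, not Hazy's implementation:

```python
import csv

def debug_write(records, write_batch, write_one, batch_size=1000,
                log_path="failed_records.csv"):
    """Write records in batches; when a batch fails, retry its records
    one by one and log each failing record with its error to a CSV."""
    failed = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        try:
            write_batch(batch)
        except Exception:
            # Fall back to per-record writes to isolate the bad record.
            for record in batch:
                try:
                    write_one(record)
                except Exception as exc:
                    failed.append({**record, "error": str(exc)})
    if failed:
        with open(log_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(failed[0]))
            writer.writeheader()
            writer.writerows(failed)
    return failed
```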
Database constraint checks¶
Hazy is able to coerce generated synthetic data to meet nullability, maximum length and uniqueness constraints. Previously, these constraints were simply inferred from the source data that is read in during training. In this release, we have improved our ability to enforce constraints on the generated data by querying the constraints directly from the database, meaning our understanding of the constraints is far more accurate.
Additionally, to prevent errors caused by uniqueness constraints when writing to a database (whether enforced by a primary key constraint or otherwise), we have implemented a Drop Unique Violation generation parameter, which defaults to False. When True, any record that violates the uniqueness constraint for a particular column is dropped.
This feature applies to both SQL Server and IBM DB2 databases.
- Maximum length constraints inferred from a database are based on whether the column datatype is CHAR or VARCHAR. Any length limitations imposed by other SQL data types will not be detected.
- When reading from file formats (csv, parquet and avro) the constraints are still inferred from the underlying source data.
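A simplified sketch of the coercion step: given constraints (here passed in as a plain dict rather than queried from the database), truncate over-length strings and optionally drop rows that repeat a value in a unique column, mirroring the Drop Unique Violation parameter. Names and structure are illustrative, not Hazy's actual code:

```python
def enforce_constraints(rows, constraints, drop_unique_violation=False):
    """Coerce rows to match max-length and uniqueness constraints."""
    seen = {col: set() for col, c in constraints.items() if c.get("unique")}
    out = []
    for row in rows:
        keep = True
        row = dict(row)
        for col, c in constraints.items():
            value = row.get(col)
            max_len = c.get("max_length")
            if max_len is not None and isinstance(value, str):
                row[col] = value = value[:max_len]  # CHAR/VARCHAR length limit
            if c.get("unique"):
                if value in seen[col]:
                    if drop_unique_violation:
                        keep = False  # drop the violating record
                    # otherwise leave it; the database write will raise
                else:
                    seen[col].add(value)
        if keep:
            out.append(row)
    return out
```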
Several changes have been made to bring parity between what can be configured in the Hub and what can be configured through the Python SDK:
- Name ID sampler is now configurable.
- PersonType now supports cross table dependencies for gender and title (see the Breaking Changes section for information on how to update existing configurations).
- Formulas now support cross table dependencies, and static values can be configured through the UI.
SymbolType, CurrencyType & PercentageType¶
Three new HazyDataTypes have been introduced. A summary of when each type should be used is as follows:
- SymbolType: Adds support for numerical columns with a symbol leading or trailing the numerical value, as well as thousands separators and decimal points.
- CurrencyType: Adds support for currency value columns with a currency unit leading or trailing the numerical value, as well as thousands separators and decimal points, such as £250,000. Currency values and units may be separate (one column for each) or together, and will be parsed when needed. Comes with a CurrencyHandler which underpins the type.
- PercentageType: Adds support for numerical columns with a leading or trailing % symbol, comma separators and decimal points, such as 12.5%.
- SymbolType can only handle a single symbol being present in the column.
- All three types require that symbols are all present on either the left or the right hand side of the numerical values.
- Currency processing relies on historical exchange rates data. We’re using data from the European Central Bank, which might not have all currencies.
- It is expected that all numerical values in a column have the same floating-point precision.
- While we can parse data with a space between the symbol or currency unit and the numerical value, the generated data currently won't contain one.
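The parsing these types perform can be approximated with a single regular expression that captures an optional leading or trailing symbol around a number with thousands separators and a decimal point. This is an illustrative approximation, not the actual implementation:

```python
import re

# Optional leading symbol, a number with thousands separators and an
# optional decimal part, then an optional trailing symbol/unit.
_PATTERN = re.compile(
    r"^(?P<lead>[^\d\s.,-]+)?\s?"
    r"(?P<number>-?\d{1,3}(?:,\d{3})*(?:\.\d+)?)"
    r"\s?(?P<trail>[^\d\s.,-]+)?$"
)

def parse_symbol_value(text):
    """Split a string like '£250,000', '12.5%', or '135 USD' into
    (symbol, float value); return None if it doesn't match."""
    m = _PATTERN.match(text.strip())
    if not m:
        return None
    symbol = m.group("lead") or m.group("trail")
    value = float(m.group("number").replace(",", ""))
    return symbol, value
```

Note that, as in the limitations above, this only handles a single symbol on one side of the number.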
Users can now subset each table in the database with its own filter, applied before the sampling process. These filters are written as a SQL-style WHERE clause condition (or multiple conditions) using SQLite syntax.
By applying the filter before sampling, Hazy maintains the referential integrity of the generated synthetic data.
- Only works in database subsetting's in-memory mode (currently the only mode available).
- The WHERE clause condition must use SQLite syntax. If this feature later supports in-database mode, the condition will need to use the syntax of the underlying SQL database.
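Because the filters use SQLite syntax, their behaviour can be reproduced directly with Python's built-in sqlite3 module. The table and column names below are illustrative:

```python
import sqlite3

def subset_table(rows, columns, where_clause):
    """Apply a SQLite-syntax WHERE clause to an in-memory table, the
    way a subsetting filter is applied before sampling."""
    conn = sqlite3.connect(":memory:")
    col_list = ", ".join(columns)
    conn.execute(f"CREATE TABLE t ({col_list})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
    cur = conn.execute(f"SELECT {col_list} FROM t WHERE {where_clause}")
    result = cur.fetchall()
    conn.close()
    return result
```

For example, the filter `country = 'GB' AND id > 1` keeps only the matching rows of the table before anything is sampled.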
UI Integration with Kubernetes backend¶
In this release we have connected our new dispatcher-based Kubernetes backend to our new end-to-end UI, the Hazy Hub. This allows you to run all the same jobs end to end via the UI, as you can with the standalone Docker image, but in the cloud with our native Kubernetes backend. The integration lets you take advantage of horizontal node autoscaling, cloud IAM role integration, and RDS Postgres integration, among other cloud features, to save time and money while increasing scalability and performance.
- Generate from old models in the container architecture.
- Dataset benchmark pipeline.
- Some Company ID locales were previously broken - these have now been removed from the configuration and CompanySettings now takes PersonLocales.
- IbanSampler & BbanSampler now sample from multiple locales.
Hazy Synthesiser / Configurator¶
- Sequential modelling improvements.
- Users can now incorporate recurring events such as a recurring payment in a bank account.
- Update of the generation page.
- Standardised install with test data.
- Currency Handler Improvements
- Allows users to convert currencies into a standardised amount. Useful to normalise the amount you are modelling and avoid issues with correlations in exchange rates.
- Ability to model patterns in the format "amount currency" (value, space, unit), e.g. 135 DOLLARS or 123 USD.
- Document links in UI
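The currency standardisation described above can be sketched as a conversion through a reference currency. The rates below are made-up placeholders; the real Currency Handler uses historical European Central Bank data:

```python
# Placeholder exchange rates relative to EUR (illustrative values only).
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.92, "GBP": 1.17}

def normalise_amount(amount, currency, target="EUR"):
    """Convert an amount into a single target currency so that the
    model sees comparable magnitudes across currencies."""
    in_eur = amount * RATES_TO_EUR[currency]
    return round(in_eur / RATES_TO_EUR[target], 2)
```

Normalising amounts this way avoids the model learning spurious correlations driven by exchange-rate differences between currencies.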
Some of the updates in this release will result in previous configurations failing validation. These are outlined below along with the necessary corrective steps.
The PersonType has been updated to now take gender and title columns as dependencies, rather than PersonPartTypes. If you have a configuration where a gender or title column is defined as a PersonPartType a validation error will be raised. Please carry out the following actions to resolve this:
- Set the gender and/or title column as a CategoryType.
- In the person entity, set the gender_column and/or title_column parameter(s) to their respective columns.
The following settings existed for defining various types of names as part of an ID:
These have now been consolidated into a single setting, NameSettings, which takes a type parameter to specify the particular type of name that you would like generated. Please see the documentation for more details.
None of the above name settings were surfaced in the UI until this release; however, users may have them defined in existing Python configurations and will need to update those configurations where necessary.