2.2

Features

End-to-end UI

An interface that lets data analysts configure data, train the synthesiser and generate synthetic data in one place. It requires no programming skills to operate. Users can view the metrics directly in the UI and check the training performance by task before proceeding to the generation step.

Known Limitations

  • No support for uploading a file directly from the UI. The alternative is to add the desired file to the volume mounted into the Docker container specified during Configurator installation (see note 1 of the Configurator installation section).
  • No download data button in the UI. Instead, the UI shows the path the synthetic data was written to, which is defined during Configurator installation.
  • As in the previous release, the UI uses the default metrics (4 similarity metrics). Other metrics must be configured through the code interface.

Cloud Deployment

Hazy is now supported on AWS Kubernetes clusters with EKS. This is our first iteration of a cloud deployment of Hazy. It allows customers to run synthetic data pipelines on cloud platforms and benefit from elastic compute resources and managed services. Hazy can be installed into a Kubernetes cluster using the Helm charts provided.

Known Limitations

  • No UI. Synthesisers can only be reached directly or through the Hazy CLI.
  • Everything must be deployed in the same environment as the data - no zoning of data.

Introducing MappedType

A new Hazy data type, MappedType, has been introduced for categorical columns that contain sensitive information that must be replaced with samples generated by one of Hazy’s available ID generators.

For example, a company name column can be replaced with other values that look like company names (generated by the CompanySampler) and then treated as categorical, so that the overall distribution of the column is modelled.
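The idea behind MappedType can be illustrated with a minimal sketch. This is not Hazy's implementation or API: `build_mapping` and `fake_company_name` are hypothetical names, and the sketch assumes each distinct original value is mapped to exactly one generated value, so that category frequencies are preserved while the real names disappear.

```python
import random
from collections import Counter


def fake_company_name(rng):
    # Hypothetical stand-in for a company-name ID generator.
    prefixes = ["Northwind", "Bluepeak", "Redfern", "Oakline", "Silvergate"]
    suffixes = ["Ltd", "Group", "Holdings", "Partners"]
    return f"{rng.choice(prefixes)} {rng.choice(suffixes)}"


def build_mapping(values, sampler, seed=0):
    """Map each distinct original category to one distinct generated value."""
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for value in sorted(set(values)):
        replacement = sampler(rng)
        # Redraw until the replacement is unused, keeping the mapping one-to-one.
        while replacement in used:
            replacement = sampler(rng)
        used.add(replacement)
        mapping[value] = replacement
    return mapping


original = ["Acme", "Acme", "Globex", "Acme", "Initech", "Globex"]
mapping = build_mapping(original, fake_company_name)
mapped = [mapping[value] for value in original]

# The mapped column has the same frequency profile as the original one.
print(Counter(original))
print(Counter(mapped))
```

Because the mapping is one-to-one, the mapped column can be treated as an ordinary categorical column with the same distribution as the source.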

Column Sampler - Random Sample

The regular ColumnSampler allows the user to specify that the target series should be an exact copy of another column. This update includes a new setting for the ColumnSampler called random_sample. When set to True, the target column is created by taking a random sample from the dependency column.
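The difference between the two modes can be sketched in plain Python. The function below is a hypothetical illustration, not the ColumnSampler API; in particular, it assumes that the random sample is drawn with replacement, which the release notes do not specify.

```python
import random


def column_sampler(dependency, random_sample=False, seed=None):
    """Build a target column from a dependency column (illustrative only)."""
    if not random_sample:
        # Default behaviour: the target is an exact copy of the dependency.
        return list(dependency)
    # Assumption: sample with replacement, one draw per target row.
    rng = random.Random(seed)
    return [rng.choice(dependency) for _ in dependency]


cities = ["London", "Paris", "Berlin", "Madrid"]
copied = column_sampler(cities)
sampled = column_sampler(cities, random_sample=True, seed=42)
```

In both modes the target column has the same length as the dependency column and contains only values that occur in it; with `random_sample=True` the row-by-row correspondence is broken.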

S3 integration into E2E UI (read & write)

This feature extends the E2E UI to support data files from an external source, specifically S3. The user provides the S3 bucket name in a configuration file for data sources and the AWS access key information in another configuration file. The E2E UI then retrieves and displays the connection; bucket names cannot be configured through the UI. Once the connection is selected from the drop-down menu, the user can select the table(s) for which they wish to generate synthetic data.

Known Limitations

  • S3 connections must be configured on setup.
  • The AWS user must be able to perform the GetObject, ListBucket and HeadBucket S3 actions.
  • Supported file formats: CSV, Avro, Parquet.
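As a rough guide to the permissions listed above, the sketch below builds an IAM policy document for a single bucket. It is an assumption-laden example, not an official Hazy policy: the helper name and the bucket name are hypothetical, and it relies on the fact that AWS authorises the HeadBucket operation through the `s3:ListBucket` permission. It covers only the actions named in the list above, so any permissions needed for writing are not included.

```python
import json


def build_s3_read_policy(bucket_name):
    """Return an IAM policy dict granting the read actions listed above."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket (and HeadBucket, which it authorises) apply to the bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # GetObject applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "example-hazy-source-data" is a placeholder bucket name.
policy = build_s3_read_policy("example-hazy-source-data")
print(json.dumps(policy, indent=2))
```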

Database integration into E2E UI (read)

This feature extends the E2E UI to read data stored in a local or remote IBM Db2 or SQL Server database. The user provides the database credentials in a configuration file for data sources, and the E2E UI retrieves and displays the connection; credentials cannot be configured through the UI. Once the database connection is selected from the drop-down menu in the UI, the user can choose any of the available schemas in the database and then select the tables from that schema for which they wish to generate synthetic data.

Known Limitations

  • Read-only integration.
  • Connections must be configured on setup.
  • Credentials must be provided for a database user with read access to:
    • Tables storing source data to be read in by the UI.
    • Metadata tables or views such as the Information Schema for SQL Server, which store information about database schemas and tables.
  • Only one connection can be used – no schemas across different databases.
  • Only one schema can be used – no tables across different schemas in a database.

Generation schema versioning

Adds support for generating data from old model files with the latest software by versioning all parameters in the schema.
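One common way to support old files with new software is to store a version number alongside the parameters and upgrade them step by step on load. The sketch below illustrates that general pattern only; the parameter names, version numbers and migration functions are hypothetical and do not describe Hazy's actual schema.

```python
CURRENT_VERSION = 3


def _v1_to_v2(params):
    # Hypothetical change: "rows" was renamed to "num_rows" in version 2.
    upgraded = dict(params)
    upgraded["num_rows"] = upgraded.pop("rows")
    upgraded["version"] = 2
    return upgraded


def _v2_to_v3(params):
    # Hypothetical change: version 3 introduced an optional "seed" parameter.
    upgraded = dict(params)
    upgraded.setdefault("seed", None)
    upgraded["version"] = 3
    return upgraded


# Each entry upgrades parameters from the keyed version to the next one.
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(params):
    """Apply migrations in order until the parameters match CURRENT_VERSION."""
    current = dict(params)
    while current["version"] < CURRENT_VERSION:
        current = MIGRATIONS[current["version"]](current)
    if current["version"] != CURRENT_VERSION:
        raise ValueError(f"unsupported schema version: {current['version']}")
    return current


old_params = {"version": 1, "rows": 100}
new_params = upgrade(old_params)
```

Chaining single-step migrations means each release only needs to describe what changed since the previous version.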

SynthAPI in hazy_client2

Begin training by sending jobs to the server. If the exported configuration needs to be altered outside of the UI, SynthAPI can be used to submit the altered job so that its status and progress are visible in the UI and its metrics are displayed at the end of training. Models will be saved to the preconfigured storage folder.

Known Limitations

  • Only training is available; generation must be conducted through the UI.

Python export options and fully runnable scripts

On export, the user can optionally include all default values. They can choose to export just the training config or fully runnable scripts, with a choice of SynthDocker, for running the job locally with Docker, or SynthAPI (as described above).

Fixes

Configurator UI

  • Fix internal issues with multi table key.
  • Fix internal issues with Analysis task.

Hazy Synthesisers

  • Add a dataset for the Czech locale (cs_CZ).
  • Update passport regex patterns for all locales.
  • Fix bug with handlers and column logic.

Improvements

Hazy Synthesisers

  • Surface sampling arguments for presence disclosure metric (allow fractions for sampling).
  • Allow conditioned/determined values to be NoneType.
  • Add new locales to location handler.
  • Add K-means strategy to binning.
  • Support limiting rows for SQL analysis.
  • Support Avro and Parquet input formats for Configurator.
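To illustrate what a K-means binning strategy does in general, the sketch below clusters a numeric column with one-dimensional K-means and places bin edges halfway between neighbouring cluster centres. It is a generic, simplified illustration written for these notes, not the Hazy implementation; the function names and the quantile-based initialisation are assumptions.

```python
import bisect


def kmeans_bin_edges(values, n_bins, n_iter=50):
    """Return n_bins - 1 bin edges found by 1-D K-means (Lloyd's algorithm)."""
    data = sorted(values)
    # Assumption: initialise centres at evenly spaced quantiles of the data.
    centres = [data[(2 * i + 1) * len(data) // (2 * n_bins)] for i in range(n_bins)]
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(n_bins), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        new_centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    centres = sorted(centres)
    # Bin edges sit halfway between neighbouring centres.
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]


def assign_bin(x, edges):
    """Return the index of the bin that x falls into."""
    return bisect.bisect_right(edges, x)


amounts = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5, 20.0, 21.0, 19.0]
edges = kmeans_bin_edges(amounts, n_bins=3)
bins = [assign_bin(x, edges) for x in amounts]
```

Unlike equal-width binning, the edges follow the natural groupings in the data, so each of the three clusters in `amounts` ends up in its own bin.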

2.2.1

  • Enable configuration of synth init container in Kubernetes

2.2.2

  • Avoid filtering schemas with names starting with "SYS"
  • Support for large schemas in UI

2.2.3

  • Revert to filtering schemas with names starting with "SYS"
  • Use text input for schema name

2.2.4

  • Update SQL read usage