Post-generation options (prepend, append and encryption)¶
Selected fields in the synthetic data can optionally be treated post-generation. Three options can currently be applied:
- Prepend: for example, adding “Synth:” to the data, e.g. Synth:John Doe
- Append: for example, adding ":Synth" to the data, e.g. John Doe:Synth
- Encrypt: encrypting a field (e.g. email), with the option to decrypt it using a customer-provided key
Decryption is carried out outside of the Hazy product and requires the customer to manage the encryption keys. The platform used to perform the decryption may need significant resources in order to process production workloads.
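The three treatments can be sketched as simple field transformations. This is an illustrative sketch only: the XOR cipher below is a toy stand-in to show the encrypt/decrypt round trip with a customer-held key, not the encryption scheme Hazy actually uses, and the function names are invented.

```python
def prepend(value: str, marker: str = "Synth:") -> str:
    """Prepend a marker, e.g. 'John Doe' -> 'Synth:John Doe'."""
    return marker + value

def append(value: str, marker: str = ":Synth") -> str:
    """Append a marker, e.g. 'John Doe' -> 'John Doe:Synth'."""
    return value + marker

def toy_encrypt(value: str, key: bytes) -> bytes:
    """Reversible XOR transform -- a placeholder, NOT real encryption."""
    data = value.encode("utf-8")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def toy_decrypt(blob: bytes, key: bytes) -> str:
    """Decryption happens outside Hazy, using the customer-held key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(blob)).decode("utf-8")

key = b"customer-managed-key"
token = toy_encrypt("jane@example.com", key)
print(prepend("John Doe"))      # Synth:John Doe
print(append("John Doe"))       # John Doe:Synth
print(toy_decrypt(token, key))  # jane@example.com
```

Note that only the ciphertext leaves the generation pipeline; the key never does, which is why key management falls to the customer.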
Sequential data similarity metrics¶
Metrics to measure the similarity between source sequential (time series) data and synthetic sequential data.
- This feature only works if a sampling frequency is specified or can be detected (detection will fail if there are unknown or missing datetime values), since the metric results are not meaningful for irregularly sampled time series.
- This feature won't capture spectral features of the source time series: the metric is not sensitive to features in the frequency domain.
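As one hypothetical example of this kind of metric, two regularly sampled series can be compared on a time-domain statistic such as lag-1 autocorrelation. This sketch is not Hazy's actual metric; it simply illustrates why the comparison only makes sense at a fixed sampling frequency.

```python
def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a regularly sampled series."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

def autocorr_distance(source, synthetic):
    """Smaller is more similar; only meaningful when both series
    share the same sampling frequency."""
    return abs(lag1_autocorr(source) - lag1_autocorr(synthetic))

source = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
print(autocorr_distance(source, source))  # 0.0
```

A statistic like this says nothing about frequency-domain structure, which is exactly the limitation noted above.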
SQL Server integration using SQL Authentication (as a training data source)¶
SQL Server can now be used as a data source, allowing tables to be selected as inputs to the training process.
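SQL Authentication means connecting with a username and password rather than Windows credentials. A minimal sketch of an ODBC-style connection string for this mode is below; the server, database, and credentials are placeholders, and this is not a Hazy configuration format.

```python
def sqlserver_conn_str(server: str, database: str, user: str, password: str) -> str:
    """Build an ODBC connection string using SQL Authentication
    (UID/PWD) rather than Trusted_Connection (Windows auth)."""
    return (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={server};DATABASE={database};"
        f"UID={user};PWD={password}"
    )

conn_str = sqlserver_conn_str("sql.internal", "training_db", "hazy_reader", "s3cret")
print(conn_str)

# With the pyodbc package installed, the string could then be used as:
#   pyodbc.connect(conn_str)
```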
Data source/sink: S3 integration¶
Clients who host data in AWS can read source data directly from an S3 bucket, both for training and for the subsequent generation of synthetic data.
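For orientation, S3 objects are addressed by bucket and key. The bucket and key names below are invented, and the commented-out call shows how such an object would typically be fetched with the boto3 library (not bundled here); this is not Hazy's internal access path.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Compose the s3:// URI for an object."""
    return f"s3://{bucket}/{key}"

uri = s3_uri("customer-data", "training/customers.csv")
print(uri)  # s3://customer-data/training/customers.csv

# With boto3 installed and AWS credentials configured:
#   import boto3
#   obj = boto3.client("s3").get_object(Bucket="customer-data",
#                                       Key="training/customers.csv")
#   body = obj["Body"].read()
```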
Data source/sink: Parquet and Avro files¶
Parquet is a free and open-source storage format for fast analytical querying, developed by Apache. Avro, also an Apache project, is a data format that stores its schema in JSON, making it easy for any program to read and interpret. Hazy can now consume both file formats for training, and output synthetic data in the same format.
A Parquet data source can be split across multiple files; Hazy currently supports only a single Parquet file for ingest.
Avro supports customised "structured types"; Hazy only supports standard data types that map to Python dataframes.
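To illustrate why Avro is easy for any program to interpret, its schema is itself plain JSON. The record and field names below are invented for illustration and are not a Hazy schema.

```python
import json

# An Avro-style record schema expressed as JSON.
schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"]},  # nullable union type
    ],
}

# Serialise and re-parse the schema exactly as any consumer would.
text = json.dumps(schema)
parsed = json.loads(text)
print([f["name"] for f in parsed["fields"]])  # ['id', 'name', 'email']
```

Custom structured types beyond records of standard primitives are the part Hazy does not ingest, per the note above.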
Including version-specific documentation in Hazy Hub¶
Clients will be able to see version-specific help files in their on-premises version of the Hazy Hub.
Training Performance Visualisation in Hazy Hub¶
A visual graph now appears in the Hazy Hub showing the discrete steps of the training process, with the elapsed time for each.
Performance Improvement to training¶
Training is significantly faster in one specific use case: when the data has a complex set of inferred rules, training speed is improved by up to 30X.
Docker Registry release process¶
Clients can download new releases via the Hazy docker registry.
- Configuration model performance & quality improvement
- Time data preprocessing separate from reading
- Improvement to the Presence Disclosure Privacy Metric
Hazy Client Library¶
- Allow setting custom environment variables for docker execution
- Fix issues related to pandas upgrade
- Non-lossy single-degree distributions
- Improve composite key generation
- Validate predictor choice
- Add no-repeating edges to configuration model
- Fix entropy estimator
- Fix date format handler error handling
- Fix undefined logger
- Remove unused date_cols assignment
- Remove unused output_path assignment
- Reduce number of python dependencies
- Redirect to previous location when editing user
- Give team members access to metrics page for model
- Change org member display meaning
- Better messaging for org blankslates
- Allow org admins to promote users to admin
- Org admin can now disable users
- Sessions now include the IP address
Hazy Client Library¶
- Ensure table_paths is set
- Update logic to determine paths in multi-table from client library