2.0.0

Features

Hazy Configurator UI

Hazy Configurator UI provides a user interface for quickly creating configurations for synthetic data. It makes configuration more straightforward: users can either follow the user interface or use a cascading configuration method, with each table's configuration kept separate so that larger configurations stay manageable.

Current limitations

The UI currently has the following limitations:

  • Reference and Sequential Tables cannot be set up.
  • Only the default metrics can be used.
  • Files hosted on S3 buckets or databases cannot be configured.
  • Configurations cannot currently be run from the UI.
  • Handlers are not currently supported through the UI; only Hazy data types (dtypes) are.

Python SDK

The Hazy client module has been rewritten to make it easier to use.

The configurator library used by the client checks the consistency of the configuration before it reaches the synthesiser. This enables validation at a much earlier stage and provides a better developer experience when working on configurations, including the possibility of on-the-fly error checking, depending on the IDE in use.

The client also integrates with an enhanced error reporting mechanism provided in the updated multi-table synthesiser, which makes it easier for users to attach complete error information to support tickets.

Improvements to the multi-table configuration make it easy to configure data types without also having to configure handlers, since some handlers have been turned into data types.

Improvements

  • Improved test coverage across the entire SDK.
  • Altered the API to integrate naturally with hazy_configurator.
  • Support for hosting Docker images in a local registry (e.g. Artifactory); the Hub API will remap a public name to use the internal registry.
  • The SDK is now fully documented.
  • Hazy data types simplify configuration since most columns will only have to be specified once. Handlers are preserved for complex use cases.

Fixes

  • Use the requests-toolbelt package for model uploads to the hub to improve reliability.

Metrics

Existing metrics available for sequential data

The following existing metrics are now available for tables with sequential data. They are computed by first aggregating the data.

  • Marginal distribution - Probability distribution of the variables contained in a subset.
  • Mutual information - The concept of similarity is essential to synthetic data. The source data has statistical properties, such as distributions of values. How well are these properties and distributions mirrored in the safe synthetic data? Hazy measures similarity for a single column, pair of columns and a matrix of columns.
  • Query utility - Measures the performance of a downstream model or algorithm when run on the source data versus the synthetic.
  • Density Disclosure - Estimates the risk of an adversary constructing a mapping from the synthetic data points to the real data points by counting how many real data points exist in the neighbourhood of each synthetic data point.
  • Presence Disclosure - Measures the certainty with which an adversary could infer whether an arbitrary data point was present in the real data used to train the synthetic data generator.
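
The neighbourhood count described for Density Disclosure can be illustrated with a small sketch. This is not Hazy's implementation: the function name `neighbourhood_counts`, the Euclidean distance and the fixed `radius` are all assumptions made for the example, and how the counts are turned into a risk estimate is not covered here.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def neighbourhood_counts(
    real: Sequence[Point], synthetic: Sequence[Point], radius: float
) -> List[int]:
    """For each synthetic point, count the real points within `radius` of it.

    Hypothetical helper: Euclidean distance and a fixed radius are assumptions.
    """
    return [
        sum(1 for r in real if math.dist(s, r) <= radius)
        for s in synthetic
    ]


real = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
synthetic = [(0.0, 0.05), (9.0, 9.0)]

# The first synthetic point sits next to two real points; the second has none nearby.
counts = neighbourhood_counts(real, synthetic, radius=0.5)
```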

New Metrics

The following metric has been added for all data:

  • Distance to closest record (DCR) - Compares the distance between records in the source and synthetic datasets. Exact copies or simple perturbations of source records that appear in the synthetic dataset are easily exposed by this metric.
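
As a rough illustration of the idea behind this metric (not Hazy's implementation), the sketch below computes, for each synthetic record, the distance to its nearest source record. The function name, the use of Euclidean distance and the purely numeric records are assumptions made for the example.

```python
import math
from typing import List, Sequence

Record = Sequence[float]


def distance_to_closest_record(
    source: Sequence[Record], synthetic: Sequence[Record]
) -> List[float]:
    """Return, for each synthetic record, the distance to the nearest source record.

    Hypothetical helper: assumes numeric records and Euclidean distance.
    """
    if not source:
        raise ValueError("source must contain at least one record")
    return [min(math.dist(s, r) for r in source) for s in synthetic]


source = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (3.0, 4.1), (10.0, 10.0)]

dcr = distance_to_closest_record(source, synthetic)

# A distance of exactly zero flags a copy of a source record;
# a very small distance flags a simple perturbation of one.
exact_copies = [i for i, d in enumerate(dcr) if d == 0.0]
```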

The following marketing metric has also been added:

  • Index similarity metric - Indexing is a market research approach used to highlight points of interest in a data set. For a given sub-population, the probability of having a particular attribute is divided by the probability of the total population having that attribute. The similarity between the real and synthetic data is then calculated by taking the Jaccard distance between the real and synthetic data sets.
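
The index calculation and the Jaccard distance can be sketched as follows. The release notes do not say which sets are compared, so this example assumes they are the sets of attributes that over-index (index above 1.0) for the sub-population in each data set. That choice, along with all function and column names, is hypothetical.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, bool]


def attribute_index(rows: List[Row], subpop: str, attribute: str) -> float:
    """P(attribute | sub-population) divided by P(attribute) in the whole population."""
    members = [r for r in rows if r[subpop]]
    if not rows or not members:
        raise ValueError("population and sub-population must be non-empty")
    p_sub = sum(r[attribute] for r in members) / len(members)
    p_all = sum(r[attribute] for r in rows) / len(rows)
    if p_all == 0:
        raise ValueError("attribute never occurs in the population")
    return p_sub / p_all


def over_indexed(
    rows: List[Row], subpop: str, attributes: Iterable[str], threshold: float = 1.0
) -> Set[str]:
    """Attributes whose index exceeds `threshold` (an assumed way to build the sets)."""
    return {a for a in attributes if attribute_index(rows, subpop, a) > threshold}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


real = [
    {"young": True, "owns_car": False, "streams_music": True},
    {"young": True, "owns_car": False, "streams_music": True},
    {"young": False, "owns_car": True, "streams_music": True},
    {"young": False, "owns_car": True, "streams_music": False},
]
synthetic = [
    {"young": True, "owns_car": True, "streams_music": True},
    {"young": True, "owns_car": True, "streams_music": True},
    {"young": False, "owns_car": False, "streams_music": True},
    {"young": False, "owns_car": True, "streams_music": False},
]
attributes = ["owns_car", "streams_music"]

distance = jaccard_distance(
    over_indexed(real, "young", attributes),
    over_indexed(synthetic, "young", attributes),
)
```

A distance of 0 means both data sets highlight the same attributes for the sub-population; a distance of 1 means they share none.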

The following metrics have been added to multi-table data:

  • Presence disclosure
  • Predictive utility
  • Density disclosure
  • Query utility

Improvements

  • Improved speed of presence disclosure metric - Added a parameter to sample part of the synthetic data before evaluating the Presence Disclosure metric. The metric is then run on 5 samples of that size and the results are averaged. A warning is raised if the variability of the metric across those samples is above a certain value.
  • Predictive utility - When the augment flag is enabled, the metric assesses predictive utility by combining real and synthetic data, as opposed to using only synthetic data.
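
The sample-and-average behaviour described for the presence disclosure metric can be sketched generically as below. This is a minimal illustration under stated assumptions, not the Hazy code: `averaged_over_samples`, the use of standard deviation as the variability measure, and the `max_std` threshold are all hypothetical, and `metric` stands in for whatever function scores a sample.

```python
import random
import statistics
import warnings
from typing import Callable, List, Sequence


def averaged_over_samples(
    metric: Callable[[List], float],
    synthetic: Sequence,
    sample_size: int,
    n_samples: int = 5,
    max_std: float = 0.05,  # assumed threshold; the real value is not documented here
    seed: int = 0,
) -> float:
    """Score `n_samples` random samples of the synthetic data and average the results."""
    rng = random.Random(seed)
    rows = list(synthetic)
    scores = [metric(rng.sample(rows, sample_size)) for _ in range(n_samples)]

    # Warn when the scores disagree too much for the average to be trusted.
    spread = statistics.pstdev(scores)
    if spread > max_std:
        warnings.warn(
            f"metric varies across samples (std={spread:.3f}); "
            "consider a larger sample size"
        )
    return statistics.fmean(scores)


def toy_metric(sample: List[int]) -> float:
    """Stand-in metric for the example: fraction of even values in the sample."""
    return sum(1 for value in sample if value % 2 == 0) / len(sample)


score = averaged_over_samples(toy_metric, range(1000), sample_size=200)
```

Sampling trades some precision for speed, which is why the spread across samples is checked before the average is reported.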

Key constraints

  • Added support for tables with no primary keys or foreign keys.
  • Added support for tables with a foreign key that is also a primary key.
  • Added support for attribute column values to be copied from a column in another table based on a matching key, to maintain referential integrity.
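
The last item, copying attribute values from another table by matching key, behaves like a keyed lookup. The sketch below shows the idea on plain dictionaries. The table names, column names and the helper `copy_column_by_key` are hypothetical and are not part of the Hazy API.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def copy_column_by_key(
    parent_rows: List[Row], child_rows: List[Row], key: str, column: str
) -> List[Row]:
    """Copy `column` from the parent table into each child row with a matching `key`."""
    lookup = {row[key]: row[column] for row in parent_rows}

    # Every child key must exist in the parent table, otherwise integrity is broken.
    missing = {row[key] for row in child_rows} - set(lookup)
    if missing:
        raise KeyError(f"keys with no matching parent row: {sorted(missing)}")

    return [{**row, column: lookup[row[key]]} for row in child_rows]


customers = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "FR"},
]
orders = [
    {"order_id": 10, "customer_id": 2},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]

# Each order now carries the country of the customer it refers to.
enriched = copy_column_by_key(customers, orders, key="customer_id", column="country")
```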

LDAP Authentication Initial Release

Added the ability to authenticate with LDAP to support single sign-on. This feature is currently experimental and will only be enabled on request.

Improvements

  • Ensure sensitive database credentials are not stored in the model.
  • Warn when generating data with a magnitude != 1.0 when an ‘identity’ adjacency type is specified.
  • Remove excessive debug logs.
  • Fixed a bug when sampling single-table datasets using multi-table.
  • Fixed NaN issue in multi-table metrics.
  • The different Docker images used for generating data have been merged into a single image that can be used for all types of data. The correct features for each customer are enabled using feature flags.

Documentation improvements

  • SDK: New documentation section for the hazy_configurator and hazy_client2 Python packages, with getting started guides.
  • Step-by-step tutorials for configuration, training and generation.
  • Updated information on security and source data persistence.

Known Issues

  • ID Mixture Handler is not available in 2.0.0.
  • Multi-table column mapper does not support primary keys.

Deprecated

  • The “TSV” format used for configuring how to synthesise data will no longer work with 2.0.
  • The previous JSON format used for configuring how to synthesise data has been replaced by a new format. Configurations in the old format will need to be updated before they can be used with 2.0.
  • PatternHandler is no longer supported. Its functionality has been replaced by the use of an ID Handler with a CompoundSampler.

2.0.1

Synth

  • Fixed error where string data exceeded VARCHAR limits in database.
  • Added missing gender map to person handler.
  • Multi-table column mapper now supports primary keys.
  • Fixed out of bounds exception issue.
  • Enabled ID Mixture handler.
  • Support null string/category columns in Db2.
  • Update asset location checksums.
  • Fix for nulls being created in non-null categorical columns.
  • Fix location handler custom columns bug.
  • Added “preserve” option to mismatch attribute of the text category handler.
  • Fix to passport pattern used when generating data.

Python SDK

  • Fix error where directories are created on host if data inputs are not found.

2.0.2

  • Allow table to be configured without a primary key in the Configurator UI.
  • NumericalIdSettings, CPFIdSettings, UUIDSettings and CPRSettings objects now have their unique parameter set to True by default.
  • Improved data location mapping when the UI is running from a Docker image. If the data is in a host directory (e.g. /host/input/volume/data.csv), the data_input can now be configured as DataLocationInput(name="account", location="/host/input/volume/data.csv") instead of DataLocationInput(name="account", location="/configurator-input/data.csv"), provided you add the environment variable CONFIGURATOR_INPUT=/host/input/volume/ to Docker.
  • Bug fix for static bounds in the BoundedRule handler.
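
The CONFIGURATOR_INPUT mapping in the list above amounts to translating a host path into the path the container sees. The sketch below only illustrates that translation and is not the Configurator's code. The function name `to_container_path` is hypothetical, and it assumes the host directory is mounted at /configurator-input, as the example paths suggest.

```python
import os
from pathlib import PurePosixPath
from typing import Optional

# Assumed mount point inside the container, taken from the example paths above.
CONTAINER_INPUT_DIR = PurePosixPath("/configurator-input")


def to_container_path(location: str, host_input_dir: Optional[str] = None) -> str:
    """Translate a host path under CONFIGURATOR_INPUT into its in-container path.

    Paths outside the mapped host directory are returned unchanged.
    """
    host_input_dir = host_input_dir or os.environ.get("CONFIGURATOR_INPUT")
    if not host_input_dir:
        return location
    try:
        relative = PurePosixPath(location).relative_to(PurePosixPath(host_input_dir))
    except ValueError:
        return location  # not under the mapped host directory
    return str(CONTAINER_INPUT_DIR / relative)


mapped = to_container_path(
    "/host/input/volume/data.csv", host_input_dir="/host/input/volume/"
)
```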

2.0.3

  • Bug fix to allow serialisation of IDSampler when generating the following id types:
    • bank_country
    • bban
    • name
    • name_female
    • name_male
    • first_name
    • first_name_female
    • first_name_male
    • last_name
    • iban
    • license_plate
    • password
    • phone_number
    • ssn
    • swift
    • swift11
    • swift8
    • md5
  • Upgraded npm package version.
  • Fixed a bug where a file size in bytes that exceeded the signed 32-bit range of the API int type caused an error.
  • Apply incode and outcode only for UK postcodes.
  • Ensure min-max scaler handles columns of entirely empty data.
  • Fix bug in categorical aggregator leading to KeyError.
  • Ensure unique parameter is surfaced in documentation.
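
For context on the file size fix above: a signed 32-bit integer tops out at 2,147,483,647, so a byte count for any file of 2 GiB or more does not fit. The sketch below demonstrates the limit with Python's struct module. The release notes do not say which type the API uses now, so the 64-bit packing is shown only for contrast.

```python
import struct

INT32_MAX = 2**31 - 1  # 2,147,483,647 bytes, just under 2 GiB


def fits_in_int32(value: int) -> bool:
    """Return True if `value` can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
    except struct.error:
        return False
    return True


three_gib = 3 * 1024**3  # size in bytes of a 3 GiB file

small_ok = fits_in_int32(INT32_MAX)   # the largest value that still fits
large_ok = fits_in_int32(three_gib)   # too large for 32 bits

# A signed 64-bit field ("<q") holds the same value without trouble.
packed_64 = struct.pack("<q", three_gib)
```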