2.1.0

Features¶

Smart sampling¶

Smart Sampling lets the user sample the input (training) data while preserving connected sub-components in the data. Instead of sampling individual rows, it samples whole components, which are identified before the sampling process starts.

An example of a connected sub-component is a collection of individuals (usually a family) that share one account. Previously, rows were sampled at random, which "broke" those components. All members of a component are now sampled together, keeping the component intact.
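The idea can be sketched as follows. This is a minimal illustration rather than the product's implementation: the function names `find_components` and `sample_components`, the (person, account) row layout, and the use of a union-find structure are all assumptions made for the example.

```python
import random
from collections import defaultdict


def find_components(rows):
    """Group row indices into connected components.

    Each row is a (person_id, account_id) pair (a hypothetical layout).
    Two rows belong to the same component when they share a person or
    an account, directly or through a chain of other rows.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # key value -> index of the first row that used it
    for idx, (person, account) in enumerate(rows):
        for key in (("person", person), ("account", account)):
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(rows)):
        groups[find(idx)].append(idx)
    return list(groups.values())


def sample_components(rows, ratio, seed=0):
    """Sample whole components until roughly `ratio` of the rows are kept."""
    components = find_components(rows)
    rng = random.Random(seed)
    rng.shuffle(components)
    target = ratio * len(rows)
    kept = []
    for component in components:
        if len(kept) >= target:
            break
        kept.extend(component)
    return sorted(kept)


rows = [
    ("alice", "acc1"), ("bob", "acc1"),  # a family sharing acc1
    ("carol", "acc2"),
    ("dave", "acc3"), ("dave", "acc4"), ("erin", "acc4"),
]
sampled = sample_components(rows, ratio=0.5)
```

Because components are added whole, the number of sampled rows can overshoot the requested ratio slightly, but no component is ever split.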

Known Limitations¶

Only one target_table can be defined, so components can currently be preserved in only one table.
Disconnected tables (tables with no foreign keys to or from other tables) are acceptable, but disconnected data schemas (multiple independent groups of tables connected among themselves) are not.
The solution will try to sample all tables according to the defined ratio (while keeping foreign key integrity). This includes reference tables, so any reference table that should not be sampled should be left out of the training data.
The solution cannot accommodate foreign composite keys.
The solution cannot accommodate one table having multiple foreign keys to the same primary key in a parent table.
The solution cannot accommodate data schemas where the graph has multiple paths to the same table (e.g. a diamond-shaped schema).

Batch generation¶

Batch generation is a feature, configurable at generation time, that lets the user generate synthetic data and write it to their database in batches. This is useful for reducing the memory requirement of generating large volumes of synthetic data.
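The memory saving can be illustrated with a small sketch. Everything here is an assumption made for the example: `generate_rows` stands in for the synthetic data generator, an in-memory SQLite database stands in for the user's database, and `batch_size` is a hypothetical parameter name.

```python
import itertools
import sqlite3


def generate_rows(n_rows):
    """Stand-in for a synthetic data generator: yields rows lazily."""
    for i in range(n_rows):
        yield (i, f"customer_{i}")


def write_in_batches(conn, rows, batch_size):
    """Write rows to the database one batch at a time.

    Only `batch_size` rows are held in memory at once, rather than the
    full generated dataset.
    """
    n_batches = 0
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO customers VALUES (?, ?)", batch)
        conn.commit()
        n_batches += 1
    return n_batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
n_batches = write_in_batches(conn, generate_rows(10), batch_size=4)
n_written = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

With ten rows and a batch size of four, the rows are written in three batches of four, four and two rows.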

Known Limitations¶

None.

Component handler¶

A component is a set of nodes within a network that are linked only to each other, forming an isolated group. A good example is a collection of individuals that share one account. This handler reproduces redundant information shared within a given component. For example, in a component consisting of a family, the members will likely share details such as last name, address information and home phone number.
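As a rough illustration of this kind of post-processing, the sketch below copies shared fields from the first member of each component to the other members. The function name `share_within_components`, the record layout and the choice of the first member as the reference are assumptions for the example, not the handler's actual behaviour.

```python
def share_within_components(records, component_key, shared_fields):
    """Copy shared fields from the first member of each component to the rest.

    `records` is a list of dicts, `component_key` names the field that
    identifies the component, and `shared_fields` lists the fields that
    every member of a component should have in common.
    """
    reference = {}  # component id -> shared values from its first member
    result = []
    for record in records:
        component = record[component_key]
        if component not in reference:
            reference[component] = {f: record[f] for f in shared_fields}
        # Overwrite the shared fields, leaving all other fields untouched.
        result.append({**record, **reference[component]})
    return result


synthetic = [
    {"family_id": 1, "first_name": "Ana", "last_name": "Silva", "phone": "555-0101"},
    {"family_id": 1, "first_name": "Rui", "last_name": "Jones", "phone": "555-0199"},
    {"family_id": 2, "first_name": "Kim", "last_name": "Lee", "phone": "555-0142"},
]
fixed = share_within_components(synthetic, "family_id", ["last_name", "phone"])
```

After the call, both members of family 1 carry the same last name and phone number, while their first names are unchanged.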

Known Limitations¶

The component handler is only responsible for reproducing redundant details shared within a component and should be considered a post-processing step. It does not affect or improve the similarity of statistical distributions within connected sub-components.

Component-preserving adjacency¶

Component Preserving Adjacency is a new adjacency model that aims to replicate the connected sub-component structures observed in the source data. Connected sub-components can arise when a table contains two or more foreign keys.

Known Limitations¶

Connected sub-component level distributions are not modelled. For example, if families of three earned less on average than families of six, this adjacency model would not capture that difference. Only the structure of the component is modelled.

The adjacency model works best when there are two foreign keys in a single table. As the number of foreign keys in a table increases, so does the complexity of the component structures, and the similarity between the component structures observed in the source and synthetic data deteriorates.
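One simple way to compare component structures between source and synthetic data is to count how many components of each size appear in a table with two foreign keys. The sketch below is only an illustration: the function name `component_sizes`, the (customer, account) column layout and the use of size as the summary are assumptions, and size alone does not capture the full shape of a component.

```python
from collections import Counter, defaultdict


def component_sizes(link_rows):
    """Return a Counter mapping component size to number of components.

    `link_rows` holds (customer_id, account_id) pairs from a table with
    two foreign keys. Size is measured in link rows per component.
    """
    # Bipartite graph: customer and account nodes joined by each link row.
    neighbours = defaultdict(set)
    for customer, account in link_rows:
        neighbours[("c", customer)].add(("a", account))
        neighbours[("a", account)].add(("c", customer))

    seen = set()
    sizes = Counter()
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collects every node reachable from `start`.
        stack, nodes = [start], set()
        while stack:
            node = stack.pop()
            if node in nodes:
                continue
            nodes.add(node)
            stack.extend(neighbours[node] - nodes)
        seen |= nodes
        n_rows = sum(1 for c, _ in link_rows if ("c", c) in nodes)
        sizes[n_rows] += 1
    return sizes


source = [(1, "A"), (2, "A"), (3, "B"), (4, "C"), (4, "D")]
synthetic = [(10, "X"), (11, "X"), (12, "Y"), (13, "Z"), (14, "Z")]
same_size_profile = component_sizes(source) == component_sizes(synthetic)
```

Here both tables contain two components of two rows and one component of a single row, so their size profiles match even though the individual components are wired differently.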

Improvements for foreign keys and composite keys¶

Support for foreign composite keys: A foreign composite key is a set of foreign key columns in a child table that together point to the columns of a composite key in a parent table. The rows in the foreign composite key correspond to whole composite key rows in the parent, as opposed to individual foreign keys pointing to the composite key parts independently.

Support for composite key parts that are not foreign keys: Previously, composite key columns were required to be foreign keys to other tables and could not be standalone IDs. A composite key can now consist of standalone IDs, foreign keys, or a mixture of both.

Support for chains of foreign keys: A foreign key chain arises when a foreign key references a column that is itself both a primary key and a foreign key. Consider the following example, where t1.pk is the primary key of table t1, t2.pfk is a foreign key in table t2 that is also the primary key of that table, and t3.fk is a foreign key that references t2.pfk: t1.pk <- t2.pfk <- t3.fk

This type of relationship was previously unsupported, and can now be configured with arbitrarily long chains of foreign keys, e.g. t1.pk <- t2.pfk <- t3.pfk <- … <- tn.fk

This also extends to foreign composite keys, e.g. (t1.pka, t1.pkb) <- (t2.pfka, t2.pfkb) <- … <- (tn.fka, tn.fkb)
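The relationships described above can be expressed as an ordinary relational schema. The sketch below uses an in-memory SQLite database purely for illustration; the tables `parent` and `child` and their columns are hypothetical, and the snippet shows the key structures themselves, not how they are configured in the product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    -- Chain of foreign keys: t1.pk <- t2.pfk <- t3.fk
    CREATE TABLE t1 (pk INTEGER PRIMARY KEY);
    CREATE TABLE t2 (pfk INTEGER PRIMARY KEY REFERENCES t1 (pk));
    CREATE TABLE t3 (id INTEGER PRIMARY KEY, fk INTEGER REFERENCES t2 (pfk));

    -- Foreign composite key: (child.fka, child.fkb) -> (parent.pka, parent.pkb)
    CREATE TABLE parent (pka INTEGER, pkb INTEGER, PRIMARY KEY (pka, pkb));
    CREATE TABLE child (
        fka INTEGER,
        fkb INTEGER,
        FOREIGN KEY (fka, fkb) REFERENCES parent (pka, pkb)
    );
    """
)

conn.execute("INSERT INTO t1 VALUES (1)")
conn.execute("INSERT INTO t2 VALUES (1)")
conn.execute("INSERT INTO t3 VALUES (100, 1)")
try:
    # No row with pfk = 2 exists in t2, so this would break the chain.
    conn.execute("INSERT INTO t3 VALUES (101, 2)")
    chain_enforced = False
except sqlite3.IntegrityError:
    chain_enforced = True

conn.executemany("INSERT INTO parent VALUES (?, ?)", [(1, 1), (2, 2)])
conn.execute("INSERT INTO child VALUES (1, 1)")
try:
    # 1 and 2 each appear in the parent, but the pair (1, 2) does not.
    conn.execute("INSERT INTO child VALUES (1, 2)")
    composite_enforced = False
except sqlite3.IntegrityError:
    composite_enforced = True
```

The second rejected insert shows the difference between a foreign composite key and independent foreign keys: each part on its own matches a parent value, but the combination must match a whole parent row.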

Known Limitations¶

Composite key columns are generated individually, meaning that no statistical relationships between the columns of a composite key are captured. However, for each individual composite key column, the distribution of ID repetitions is modelled and reproduced in the synthetic data.

Incremental ID generator¶

This new ID generator can be used to generate a sequence of numerical IDs that begin at a specified starting number and increase by a specified increment with each record.
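The behaviour amounts to an arithmetic sequence. A minimal sketch, in which the function name `incremental_ids` and its parameter names are assumptions rather than the generator's actual configuration options:

```python
import itertools


def incremental_ids(start, increment, n_records):
    """Return `n_records` numerical IDs: start, start + increment, ..."""
    return list(itertools.islice(itertools.count(start, increment), n_records))


ids = incremental_ids(start=1000, increment=5, n_records=4)
# ids == [1000, 1005, 1010, 1015]
```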

Known Limitations¶

None.

Support database schema for read/write¶

A schema is a feature of many database management systems that essentially acts as a namespace for tables. Previously it was only possible to read and write using the default database schema, which is dbo for SQL Server and db2inst1 for Db2. The user may now freely specify a different schema as part of the training and generation data location parameters.
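The namespace role of a schema can be shown with a small analogy. SQLite has no SQL Server or Db2 style schemas, so this sketch uses an attached database, which SQLite addresses with the same `schema.table` syntax. The names `staging` and `customers` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite an attached database plays the role of a named schema.
conn.execute("ATTACH DATABASE ':memory:' AS staging")

# The same table name can exist in both namespaces without clashing.
conn.execute("CREATE TABLE main.customers (id INTEGER)")
conn.execute("CREATE TABLE staging.customers (id INTEGER)")

conn.execute("INSERT INTO main.customers VALUES (1)")
conn.executemany("INSERT INTO staging.customers VALUES (?)", [(2,), (3,)])

# An unqualified name resolves to the default schema (`main` here).
default_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
staging_count = conn.execute(
    "SELECT COUNT(*) FROM staging.customers"
).fetchone()[0]
```

Reading or writing a non-default schema therefore comes down to qualifying the table name with the schema the user specifies.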

Known Limitations¶

None.

Parent compression¶

Parent compression is a memory optimization technique that can be used in the multi-table use case to reduce the memory requirement of training the model, potentially at the cost of distribution similarity between training and generated data.

Parent compression is the process of reducing the number of columns from a parent table on which columns in the corresponding child table are conditioned when modelling the distribution of the training data.

This option is currently configurable on a global basis. That is, based on the user’s configuration, all conditioning tables will be compressed to have the same number of parent conditioning columns.
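The global nature of the setting can be sketched as below. How the retained parent columns are chosen or combined is not described here, so the example simply keeps the first columns as a placeholder. The function name `compress_parents`, the `n_columns` parameter and the table and column names are all hypothetical.

```python
def compress_parents(conditioning_columns, n_columns):
    """Limit every conditioning table to at most `n_columns` parent columns.

    `conditioning_columns` maps a child table name to the list of parent
    columns its model is conditioned on. The same global limit applies to
    every table. Keeping the first columns is a placeholder criterion.
    """
    if n_columns < 0:
        raise ValueError("n_columns must be non-negative")
    return {
        child: columns[:n_columns]
        for child, columns in conditioning_columns.items()
    }


conditioning = {
    "orders": ["customer.age", "customer.region", "customer.income"],
    "payments": ["order.amount", "order.currency"],
}
compressed = compress_parents(conditioning, n_columns=1)
```

Fewer conditioning columns mean less memory during training, at the cost of any dependence on the dropped parent columns.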

Known Limitations¶

None.

Hazy Synthesisers¶

Fixes¶

Fixed handling of white space and empty strings.
Added raise_for_status() calls to surface HTTP errors.

Hazy Hub¶

Fixes¶

Fixed a compile/runtime config change error.

Changelog¶

The following features have been added in this release:

Batch generation.
Component Preserving Adjacency.
Parent Compression.
Smart Sampling.
Component handler.
Incremental ID Generator.
Improvements for foreign and composite keys.
