Multi-table

Hazy’s multi-table synthesiser creates synthetic versions of tables that are related by foreign keys. Training data for each table is supplied in the same format as for the Tabular+ synthesiser. The multi-table configuration describes how the tables relate, including:

  • Primary keys
  • Foreign keys
  • Composite keys

Hazy’s multi-table synthesiser is currently undergoing beta testing. Please contact info@hazy.com for more information.

Training parameters

  • multi_table_version: int one of [0, 1]. Version 1 is the latest.
  • multi_table: dict Contains the keys primary_keys, composite_keys, foreign_keys and custom_handlers.
    • primary_keys: dict keyed by table name. Each item contains the keys:
      • name: str column name
      • id_type: str one of Hazy's id handler types, e.g. numerical, uuid
      • id_settings: dict Config used to specify the parameters of a particular id type generator
    • foreign_keys: dict keyed by table name. Each item is a dict keyed by foreign key column name.
      • The format of each item is { column_name: { "column_to": [table_pointed_to, column_name_pointed_to] }}
    • composite_keys: dict keyed by table name.
      • Each item is a list of all column names that are part of the composite key. For example, if the disp table had no primary key and each row had a uniqueness constraint defined by client_id and account_id, the composite keys section would look like:
        { "composite_keys": { "disp": ["client_id", "account_id"] } }
      • Composite keys are handled, but uniqueness across all generated composite keys is currently not guaranteed.
    • custom_handlers: dict keyed by table name. Each item is a dict keyed by handler_type. The value is a list of dicts, each containing the fields set out under the params key in the documentation.
  • adjacency_type: str one of [default, random, degree_preserving]. Defines the method used for generating links between tables.
    • default reuses the edges of the source bipartite graph with anonymised identities, which provides the top baseline. This option gives the highest similarity, with the caveat that it can only generate the same volume of data as it was trained on.
    • random generates an arbitrary set of edges and does not attempt to preserve any attributes of the original structure. It can be used as a bottom baseline, and is used internally to test that the rest of the model can handle the worst-case scenario of an underperforming adjacency model.
    • degree_preserving aims to capture and preserve the joint degree distribution of each table. It allows more data to be generated than was trained on, using the magnitude parameter.
    • This is an active area of machine learning development for Hazy with further graph generation/expansion methods under review.
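
The training parameters above can be written down as a plain Python dictionary. The following is a minimal sketch rather than Hazy's client API: the table names (client, account, disp), the column names and the empty id_settings are hypothetical, and check_foreign_keys is an illustrative helper, not part of the product. It assumes that every foreign key is expected to point at a declared primary key.

```python
# Illustrative training parameters. Table names, column names and the
# empty id_settings are hypothetical placeholders.
training_params = {
    "multi_table_version": 1,
    "adjacency_type": "degree_preserving",
    "multi_table": {
        "primary_keys": {
            "client": {"name": "client_id", "id_type": "numerical", "id_settings": {}},
            "account": {"name": "account_id", "id_type": "numerical", "id_settings": {}},
        },
        "foreign_keys": {
            # { column_name: { "column_to": [table_pointed_to, column_name_pointed_to] } }
            "disp": {
                "client_id": {"column_to": ["client", "client_id"]},
                "account_id": {"column_to": ["account", "account_id"]},
            },
        },
        "composite_keys": {"disp": ["client_id", "account_id"]},
        "custom_handlers": {},
    },
}


def check_foreign_keys(multi_table):
    """Return (table, column, target) for each foreign key whose target
    is not a declared primary key (an assumed consistency rule)."""
    primary = {(table, pk["name"]) for table, pk in multi_table["primary_keys"].items()}
    problems = []
    for table, columns in multi_table["foreign_keys"].items():
        for column, spec in columns.items():
            target = tuple(spec["column_to"])
            if target not in primary:
                problems.append((table, column, target))
    return problems
```

Calling check_foreign_keys(training_params["multi_table"]) on the configuration above returns an empty list; a misspelt target table or column would be reported as a tuple.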

Generation parameters

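By default, the generation process produces the same volume of synthetic data as the training data. The relative volume is controlled by the magnitude parameter: for example, setting magnitude to 1.2 generates synthetic tables with 1.2 times as many rows as the training tables. Generating more data than was trained on requires an adjacency_type that supports it, such as degree_preserving; the default adjacency type can only produce the training volume.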
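
As a worked example of the magnitude arithmetic, the sketch below scales hypothetical training row counts. The function expected_rows and the table names are illustrative only, and rounding to the nearest whole row is an assumption; the synthesiser's actual row counts may differ slightly.

```python
def expected_rows(training_rows, magnitude=1.0):
    """Approximate synthetic row count per table for a given magnitude.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return {table: round(rows * magnitude) for table, rows in training_rows.items()}


# Hypothetical training table sizes.
training_rows = {"client": 5000, "account": 4500}

print(expected_rows(training_rows))                 # {'client': 5000, 'account': 4500}
print(expected_rows(training_rows, magnitude=1.2))  # {'client': 6000, 'account': 5400}
```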

Evaluation parameters

  • evaluate: bool when set to True runs the evaluation pipeline on the generated data.
  • evaluation_exclude_columns: list[str] list of columns to ignore during evaluation; typically these are ID columns.
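
Since the excluded columns are typically ID columns, one way to build the list is to collect the key columns already declared in the multi_table configuration. This is a sketch under assumptions: key_columns is a hypothetical helper, the table and column names are made up, and it is assumed that evaluation_exclude_columns takes bare column names rather than table-qualified ones.

```python
def key_columns(multi_table):
    """Collect every column named as a primary, foreign or composite key."""
    columns = set()
    for pk in multi_table.get("primary_keys", {}).values():
        columns.add(pk["name"])
    for foreign_keys in multi_table.get("foreign_keys", {}).values():
        columns.update(foreign_keys)  # dict keys are the foreign key column names
    for composite in multi_table.get("composite_keys", {}).values():
        columns.update(composite)
    return sorted(columns)


# Hypothetical key configuration.
multi_table = {
    "primary_keys": {
        "client": {"name": "client_id", "id_type": "numerical", "id_settings": {}},
        "account": {"name": "account_id", "id_type": "numerical", "id_settings": {}},
    },
    "foreign_keys": {
        "disp": {
            "client_id": {"column_to": ["client", "client_id"]},
            "account_id": {"column_to": ["account", "account_id"]},
        },
    },
    "composite_keys": {"disp": ["client_id", "account_id"]},
}

evaluation_params = {
    "evaluate": True,
    "evaluation_exclude_columns": key_columns(multi_table),
}
print(evaluation_params["evaluation_exclude_columns"])  # ['account_id', 'client_id']
```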

Differential privacy

The Hazy multi-table synthesiser provides an end-to-end privacy guarantee of ε-DP (epsilon differential privacy) on all relationships between tables, and ε-DP on the attributes of all tables.