Table types

Hazy defines a set of table types. The type you choose for a table determines how the generative modelling treats that table.

Tabular table

Usage

This is the default option. Each row is treated as an independent record. If neither the sequential nor the reference type describes the behaviour you're looking for, this is likely the one you'll want to use.

It has no extra configurable parameters.

Sequential table

Usage

Used when multiple rows in a table should be treated as a sequence of events.

Example

Here's an example table containing transaction information that would be configured using the sequential table type.

Transactions

Transaction ID | Account ID | Amount | Date
0              | 0          | 50     | 04/09/2023
1              | 0          | 36     | 06/09/2023
2              | 1          | 2      | 08/09/2023
3              | 1          | 32     | 10/09/2023
...            | ...        | ...    | ...

Sequential ID

This column is used to identify rows that belong to the same sequence. In the example above, the Account ID column would be set as the sequential ID, i.e. each account defines its own sequence of transactions.
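
To make the idea of a sequential ID concrete, here is a minimal sketch in plain Python that groups the example Transactions rows by Account ID. It is only an illustration of the concept, not Hazy's API or its internal implementation; the function name `group_by_sequential_id` and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict

# Rows from the Transactions example: (transaction_id, account_id, amount, date)
transactions = [
    (0, 0, 50, "04/09/2023"),
    (1, 0, 36, "06/09/2023"),
    (2, 1, 2, "08/09/2023"),
    (3, 1, 32, "10/09/2023"),
]


def group_by_sequential_id(rows, id_index):
    """Collect rows that share the same value in the sequential ID column.

    Hypothetical helper for illustration only; not part of any Hazy API.
    """
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[id_index]].append(row)
    return dict(sequences)


# Account ID is at index 1, so each account becomes one sequence.
sequences = group_by_sequential_id(transactions, id_index=1)
for account_id, rows in sequences.items():
    print(account_id, [row[0] for row in rows])
```

Each key in `sequences` is one account, and its value is the list of rows that make up that account's sequence of events.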

Sort by columns

Defines how data within each sequence should be sorted. In the above example, Date would be set as the only sort by column. However, multiple columns can be provided. For instance, if both Date and Time columns were present, Date would be provided first and then Time.
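
The effect of listing Date before Time can be sketched with a composite sort key. This is an illustration in plain Python rather than Hazy's implementation. The Time column, its values, and the transaction IDs below are hypothetical, and the dates are assumed to be in DD/MM/YYYY format as in the example tables.

```python
from datetime import datetime

# Hypothetical rows for a single account. The Time column and its values are
# invented for illustration; the example table only has a Date column.
rows = [
    {"transaction_id": 11, "date": "06/09/2023", "time": "09:15"},
    {"transaction_id": 10, "date": "04/09/2023", "time": "17:40"},
    {"transaction_id": 12, "date": "04/09/2023", "time": "08:05"},
]


def sort_key(row):
    """Order by Date first, then Time, mirroring the sort-by column order."""
    date = datetime.strptime(row["date"], "%d/%m/%Y").date()  # assumed DD/MM/YYYY
    time = datetime.strptime(row["time"], "%H:%M").time()
    return (date, time)


ordered = sorted(rows, key=sort_key)
ordered_ids = [row["transaction_id"] for row in ordered]
print(ordered_ids)
```

Because the key is the tuple (date, time), Time only breaks ties between rows that share the same Date, which is why the order of the sort by columns matters.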

Reference table

Usage

These are tables within a database schema that contain information the user does not want to be synthesised. They are recreated exactly as they were supplied during training. This presents a privacy risk, so the reference type should only be used for entities the user does not need to apply privacy to.

The size of a generated reference table is fixed: it always contains the same number of rows as the source data.

If you are at all concerned about private information in this table, do not use this table type!

One example is data that is publicly available. Another use case is lookup tables.

By setting a table as a reference table, we include it as conditioning throughout our pipeline, i.e. we can still use the information contained in those tables to provide higher-quality synthetic data for the other downstream tables in the database schema.

Example

Often the entity that needs protecting is the customer. However, if you have other entities, representing your products for instance, you might not wish for Hazy to synthesize products you don't sell.

Products

Product ID | Price | Product Category
0          | 50    | Shoes
1          | 36    | Trousers
2          | 58    | Trousers
3          | 32    | Shirts
...        | ...   | ...

Purchases

Customer ID | Product ID | Date
0           | 1          | 05/09/2023
0           | 2          | 23/09/2023
1           | 3          | 24/09/2023
...         | ...        | ...

In the schema example above, Products can be set as a reference table and Purchases can be set as a tabular table. It's the customer's behaviour we want to protect, not the products our company stocks. In this example the Price and Product Category columns can be used by Hazy's synthesizer to model the customer's behaviour in relation to the types of items the customer purchases.
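
As a rough illustration of what using the reference table as conditioning can mean, the sketch below attaches each product's Price and Product Category to the purchases that refer to it. It is plain Python and does not show how Hazy's synthesizer works internally; the `enrich` helper and the dictionary layout are assumptions made for this example.

```python
# Products is the reference table: it is kept exactly as supplied.
products = {
    0: {"price": 50, "category": "Shoes"},
    1: {"price": 36, "category": "Trousers"},
    2: {"price": 58, "category": "Trousers"},
    3: {"price": 32, "category": "Shirts"},
}

# Purchases is the table describing the customer behaviour to be protected.
purchases = [
    {"customer_id": 0, "product_id": 1, "date": "05/09/2023"},
    {"customer_id": 0, "product_id": 2, "date": "23/09/2023"},
    {"customer_id": 1, "product_id": 3, "date": "24/09/2023"},
]


def enrich(purchase_rows, product_lookup):
    """Attach the referenced product's attributes to each purchase row.

    Hypothetical helper for illustration only; not part of any Hazy API.
    """
    return [{**row, **product_lookup[row["product_id"]]} for row in purchase_rows]


enriched = enrich(purchases, products)
for row in enriched:
    print(row["customer_id"], row["category"], row["price"])
```

The Products rows are never altered here; they only supply extra context (price and category) alongside each purchase.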

The customer would also set up the following Hazy data types: