Table types

Hazy defines three table types: tabular, sequential and reference. The table type you choose affects how the generative modelling takes place for that table.

Tabular table

Usage

This is the default option. Each row is treated as an independent record. If neither sequential nor reference describes the behaviour you're looking for, this is likely the one you'll want to use.

It has no extra configurable parameters.

Sequential table

Usage

Used when multiple rows in a table should be treated as a sequence of events.

Example

Here's an example table containing transaction information that would be configured using the sequential table type.

Transactions

Transaction ID	Account ID	Amount	Date
0	0	50	04/09/2023
1	0	36	06/09/2023
2	1	2	08/09/2023
3	1	32	10/09/2023
...	...	...	...

Sequential ID

This column identifies rows that belong to the same sequence. In the example above, the Account ID column would be set as the sequential ID, i.e. each account has its own sequence of transactions.
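As a rough illustration of what a sequential ID means, the sketch below groups the example Transactions rows by Account ID. It is plain Python rather than anything from Hazy; the function name `group_by_sequential_id` and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict

# Rows from the example Transactions table:
# (Transaction ID, Account ID, Amount, Date)
transactions = [
    (0, 0, 50, "04/09/2023"),
    (1, 0, 36, "06/09/2023"),
    (2, 1, 2, "08/09/2023"),
    (3, 1, 32, "10/09/2023"),
]


def group_by_sequential_id(rows, id_index):
    """Collect rows that share the same value in the sequential ID column."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[id_index]].append(row)
    return dict(sequences)


# Account ID (index 1) plays the role of the sequential ID.
sequences = group_by_sequential_id(transactions, id_index=1)
for account_id, rows in sequences.items():
    print(account_id, [row[0] for row in rows])
```

Each key in `sequences` is one account, and its value is the list of rows that would be treated together as one sequence of events.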

Sort by columns

Defines how data within each sequence should be sorted. In the above example, Date would be set as the only sort by column. However, multiple columns can be provided. For instance, if both Date and Time columns were present, Date would be provided first and then Time.
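The ordering described above, Date first and then Time, can be sketched in plain Python. The rows and the Time column here are hypothetical (the example table has no Time column), and the code only illustrates the ordering, not how Hazy sorts internally. Dates are parsed as day/month/year, matching the example table, because sorting the raw strings would give the wrong order across months.

```python
from datetime import datetime

# Hypothetical rows for a single account: (Transaction ID, Date, Time).
# The Time column is invented for this example to show a two-column sort.
rows = [
    (12, "06/09/2023", "09:15"),
    (10, "04/09/2023", "17:40"),
    (11, "04/09/2023", "08:05"),
]


def sort_key(row):
    """Build a (date, time) key: Date is the first sort by column, Time the second."""
    date = datetime.strptime(row[1], "%d/%m/%Y").date()
    time = datetime.strptime(row[2], "%H:%M").time()
    return (date, time)


ordered = sorted(rows, key=sort_key)
print([row[0] for row in ordered])
```

Rows with the same Date are ordered by Time, so the two transactions on 04/09/2023 come before the one on 06/09/2023, with the morning transaction first.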

Reference table

Usage

Reference tables are tables within a database schema that contain information the user does not want to be synthesised. They are recreated exactly as they were supplied during training. This presents a privacy risk, so the reference type should only be used for entities the user does not need to apply privacy to.

A generated reference table always contains the same number of rows as the source data; this limit is fixed.

If you are at all concerned about private information in a table, do not set it as a reference table!

Reference tables suit data that is already publicly available. Lookup tables are another common use case.

Setting a table as a reference table includes it as conditioning throughout our pipeline, i.e. we can still use the information contained in that table to provide higher-quality synthetic data for the other downstream tables in the database schema.

Example

Often the entity that needs protecting is the customer. However, if you have other entities, for instance ones representing your products, you might not want Hazy to synthesise products you don't sell.

Products

Product ID	Price	Product Category
0	50	Shoes
1	36	Trousers
2	58	Trousers
3	32	Shirts
...	...	...

Purchases

Customer ID	Product ID	Date
0	1	05/09/2023
0	2	23/09/2023
1	3	24/09/2023
...	...	...

In the schema above, Products can be set as a reference table and Purchases as a tabular table. It's the customer's behaviour we want to protect, not the products our company stocks. In this example, the Price and Product Category columns can be used by Hazy's synthesizer to model the customer's behaviour in relation to the types of items the customer purchases.
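To make concrete which information becomes available when Products is a reference table, the sketch below attaches each purchase's Price and Product Category by looking up its Product ID. This is only an illustration in plain Python: the dictionary layout and the helper `with_product_context` are assumptions for this example, and it does not represent how Hazy implements conditioning.

```python
# Products table from the example, keyed by Product ID.
products = {
    0: {"price": 50, "category": "Shoes"},
    1: {"price": 36, "category": "Trousers"},
    2: {"price": 58, "category": "Trousers"},
    3: {"price": 32, "category": "Shirts"},
}

# Purchases table from the example.
purchases = [
    {"customer_id": 0, "product_id": 1, "date": "05/09/2023"},
    {"customer_id": 0, "product_id": 2, "date": "23/09/2023"},
    {"customer_id": 1, "product_id": 3, "date": "24/09/2023"},
]


def with_product_context(purchase, products):
    """Return a copy of the purchase extended with its product's attributes."""
    product = products[purchase["product_id"]]
    return {**purchase, **product}


enriched = [with_product_context(p, products) for p in purchases]
for row in enriched:
    print(row["customer_id"], row["category"], row["price"])
```

The Products rows themselves are left untouched; only the purchase rows gain extra context about the type and price of the item bought.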

The customer would also set up the following Hazy data types:

Products.Product ID -> Real Type which can only be used within a reference table.
Products.Price -> Integer Type
Products.Product Category -> Category Type
Purchases.Customer ID -> ID Type and use the Incremental setting.
Purchases.Product ID -> Foreign Key Type pointing to Products.Product ID.
Purchases.Date -> Datetime Type
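The mapping above can be written down as data and used for a simple sanity check that every foreign key value points at an existing row. The dictionary layout below is hypothetical and is not Hazy's configuration format; only the type names are taken from the list above, and `dangling_foreign_keys` is a helper invented for this sketch.

```python
# Hypothetical plain-dict form of the data types listed above.
# This is NOT Hazy's configuration format; it exists only for this check.
column_types = {
    ("Products", "Product ID"): {"type": "Real"},
    ("Products", "Price"): {"type": "Integer"},
    ("Products", "Product Category"): {"type": "Category"},
    ("Purchases", "Customer ID"): {"type": "ID", "setting": "Incremental"},
    ("Purchases", "Product ID"): {
        "type": "Foreign Key",
        "references": ("Products", "Product ID"),
    },
    ("Purchases", "Date"): {"type": "Datetime"},
}

tables = {
    "Products": [
        {"Product ID": 0, "Price": 50, "Product Category": "Shoes"},
        {"Product ID": 1, "Price": 36, "Product Category": "Trousers"},
        {"Product ID": 2, "Price": 58, "Product Category": "Trousers"},
        {"Product ID": 3, "Price": 32, "Product Category": "Shirts"},
    ],
    "Purchases": [
        {"Customer ID": 0, "Product ID": 1, "Date": "05/09/2023"},
        {"Customer ID": 0, "Product ID": 2, "Date": "23/09/2023"},
        {"Customer ID": 1, "Product ID": 3, "Date": "24/09/2023"},
    ],
}


def dangling_foreign_keys(tables, column_types):
    """Return (table, column, value) for foreign key values with no matching row."""
    problems = []
    for (table, column), spec in column_types.items():
        if spec["type"] != "Foreign Key":
            continue
        ref_table, ref_column = spec["references"]
        valid = {row[ref_column] for row in tables[ref_table]}
        for row in tables[table]:
            if row[column] not in valid:
                problems.append((table, column, row[column]))
    return problems


print(dangling_foreign_keys(tables, column_types))
```

With the example data, every Purchases.Product ID matches a row in Products, so the check reports no problems.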
