Table types
Hazy defines a set of table types. By changing the table type you affect how the generative modelling takes place.
Tabular table
Usage
This is the default option. Each row is treated as an independent record. If sequential or reference don't describe the behaviour you're looking for, this is likely the one you'll want to use.
It has no extra configurable parameters.
Sequential table
Usage
Used when multiple rows in a table should be treated as a sequence of events.
Example
Here's an example table of transaction information that would be configured using the sequential table type.
Transaction ID | Account ID | Amount | Date |
---|---|---|---|
0 | 0 | 50 | 04/09/2023 |
1 | 0 | 36 | 06/09/2023 |
2 | 1 | 2 | 08/09/2023 |
3 | 1 | 32 | 10/09/2023 |
... | ... | ... | ... |
Sequential ID
This column identifies rows which belong to the same sequence. In the case above, the Account ID column would be set as the sequential ID, i.e. each account defines its own sequence of transactions.
Sort by columns
Defines how data within each sequence should be sorted. In the above example, Date would be set as the only sort by column. However, multiple columns can be provided. For instance, if both Date and Time columns were present, Date would be provided first and then Time.
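The effect of the sequential ID and the sort by columns can be sketched in plain Python. This is only an illustration of the grouping and ordering described above, not Hazy's implementation or API; the function names `build_sequences` and `parse_date` are hypothetical, and the dates are assumed to be in DD/MM/YYYY format.

```python
from collections import defaultdict
from datetime import datetime

# Rows from the example transactions table, deliberately out of order.
rows = [
    {"Transaction ID": 1, "Account ID": 0, "Amount": 36, "Date": "06/09/2023"},
    {"Transaction ID": 3, "Account ID": 1, "Amount": 32, "Date": "10/09/2023"},
    {"Transaction ID": 0, "Account ID": 0, "Amount": 50, "Date": "04/09/2023"},
    {"Transaction ID": 2, "Account ID": 1, "Amount": 2, "Date": "08/09/2023"},
]


def parse_date(value):
    """Parse a DD/MM/YYYY string so that dates sort chronologically."""
    return datetime.strptime(value, "%d/%m/%Y")


def build_sequences(rows, sequential_id, sort_by, parsers=None):
    """Group rows by the sequential ID column, then sort each group.

    `sort_by` is an ordered list of column names: earlier columns take
    priority, as with Date followed by Time in the text above.
    `parsers` optionally maps a column name to a function that converts
    its values into something sortable.
    """
    parsers = parsers or {}
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[sequential_id]].append(row)

    def sort_key(row):
        return tuple(parsers.get(col, lambda v: v)(row[col]) for col in sort_by)

    for sequence in sequences.values():
        sequence.sort(key=sort_key)
    return dict(sequences)


# Account ID is the sequential ID; Date is the only sort by column.
sequences = build_sequences(rows, "Account ID", ["Date"], {"Date": parse_date})
```

Each value in `sequences` is one account's transactions in date order, which is the unit a sequential table treats as a single sequence of events.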
Reference table
Usage
Reference tables are tables within a database schema that contain information the user does not want to be synthesised. They are recreated exactly as they were supplied during training. This presents a privacy risk, so this table type should only be used for entities the user does not need to apply privacy to.
The number of rows generated for a reference table is fixed: it always matches the number of rows in the source data.
If you are at all concerned about private information in this table do not use this table type!
One suitable use for a reference table is data which is publicly available. Another is lookup tables.
By setting a table as a reference table, we include it as conditioning throughout our pipeline, i.e. we can still use the information contained in that table to provide higher quality synthetic data for the other downstream tables in the database schema.
Example
Often the entity that needs protecting is the customer. However, if you have other entities, representing your products for instance, you might not wish for Hazy to synthesize products you don't sell.
Product ID | Price | Product Category |
---|---|---|
0 | 50 | Shoes |
1 | 36 | Trousers |
2 | 58 | Trousers |
3 | 32 | Shirts |
... | ... | ... |
Customer ID | Product ID | Date |
---|---|---|
0 | 1 | 05/09/2023 |
0 | 2 | 23/09/2023 |
1 | 3 | 24/09/2023 |
... | ... | ... |
In the schema example above, Products (the first table) can be set as a reference table and Purchases (the second table) as a tabular table. It's the customer's behavior we want to protect, not the products our company stocks. In this example the Price and Product Category columns can be used by Hazy's synthesizer to model the customer's behavior in relation to the types of items the customer purchases.
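To make concrete what information the reference table contributes, the sketch below attaches each purchase's Price and Product Category by looking up its Product ID. This is not how Hazy performs conditioning internally; it only shows, under that assumption, which product attributes become available alongside each purchase. The function name `with_product_context` is hypothetical.

```python
# The Products reference table, keyed by Product ID.
products = {
    0: {"Price": 50, "Product Category": "Shoes"},
    1: {"Price": 36, "Product Category": "Trousers"},
    2: {"Price": 58, "Product Category": "Trousers"},
    3: {"Price": 32, "Product Category": "Shirts"},
}

# The Purchases table from the example above.
purchases = [
    {"Customer ID": 0, "Product ID": 1, "Date": "05/09/2023"},
    {"Customer ID": 0, "Product ID": 2, "Date": "23/09/2023"},
    {"Customer ID": 1, "Product ID": 3, "Date": "24/09/2023"},
]


def with_product_context(purchases, products):
    """Return a copy of each purchase row extended with its product's attributes."""
    return [{**purchase, **products[purchase["Product ID"]]} for purchase in purchases]


enriched = with_product_context(purchases, products)
```

In `enriched`, both of customer 0's purchases carry the category Trousers, which is the kind of relationship between customers and item types that the paragraph above refers to.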
The customer would also set up the following Hazy data types:
- Products.Product ID -> Real Type which can only be used within a reference table.
- Products.Price -> Integer Type
- Products.Product Category -> Category Type
- Purchases.Customer ID -> ID Type and use the Incremental setting.
- Purchases.Product ID -> Foreign Key Type pointing to Products.Product ID.
- Purchases.Date -> Datetime Type
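Because Purchases.Product ID is a foreign key pointing to Products.Product ID, every purchase should refer to a product that exists in the Products table. A minimal sketch of that check in plain Python follows; the helper `missing_foreign_keys` is hypothetical and not part of Hazy.

```python
# Rows from the example Products and Purchases tables.
products = [
    {"Product ID": 0, "Price": 50, "Product Category": "Shoes"},
    {"Product ID": 1, "Price": 36, "Product Category": "Trousers"},
    {"Product ID": 2, "Price": 58, "Product Category": "Trousers"},
    {"Product ID": 3, "Price": 32, "Product Category": "Shirts"},
]
purchases = [
    {"Customer ID": 0, "Product ID": 1, "Date": "05/09/2023"},
    {"Customer ID": 0, "Product ID": 2, "Date": "23/09/2023"},
    {"Customer ID": 1, "Product ID": 3, "Date": "24/09/2023"},
]


def missing_foreign_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return the foreign key values in child_rows that have no matching parent row."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row[fk_column] for row in child_rows if row[fk_column] not in parent_keys]


# An empty list means every purchase points at an existing product.
missing = missing_foreign_keys(purchases, "Product ID", products, "Product ID")
```

The same check could be run on source data before training, or on generated data afterwards, to confirm that the relationship between the two tables holds.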