Denormalisation

Hazy has a set of tools to deal with some types of denormalisation within the user's data.

Repeat by

This is a setting which is available on a number of data types. It should be used when the columns value is the same based on another column - usually an ID column.

Example

Transactions
Transaction ID Customer ID Customer Reference
0 1 004648
1 1 004648
2 2 112084
3 3 666666
4 2 112084
5 1 004648
6 2 112084
7 3 666666
... ... ...

In this example there is a single table Transactions and the Customer Reference column is a denormalisation, as it is always the same depending on the Customer ID column.

In a normalised database, this would be represented by two tables ie.

Customers
Customer ID Customer Reference
1 004648
2 112084
3 666666
... ...
Transactions Normalised
Transaction ID Customer ID
0 1
1 1
2 2
3 3
4 2
5 1
6 2
7 3
... ...

To allow Hazy to treat the first problem like the second. The customer would configure:

  • Transactions.Customer Reference -> ID Type and use the Numerical setting with length=6 and as_string=True. They would also set Repeat By to Transactions.Customer ID.

Copy type

Used when data is repeated on a table which is not the original source of truth for that piece of information.

Example

Customers
Customer ID First Name Last Name
0 James Johnson
1 Toby Zboncak
2 Lauren Kuhn
3 Will Harris
... ... ...
Accounts
Account ID Customer ID First Name
000064 1 Toby
000065 1 Toby
000354 2 Lauren
001984 3 Will
002234 3 Will
... ... ...

In this example there are two tables Customers and Accounts. Customers is normalised but Accounts contains a denormalisation where data from the Customers table is duplicated based on the Customer ID.

In this case the Accounts.First Name column, always matches the Customers.First Name when Accounts.Customer ID is equal to Customers.Customer ID.

If the user wishes to maintain this denormalisation in the synthetic data, it can be guaranteed by setting:

  • Accounts.First Name -> CopyType with the source column set as Customers.FirstName

The other columns would be set up as follows:

  • Customers.Customer ID -> ID Type and use the Incremental setting.
  • Customers.First Name -> Person Type person ID 1, person sub type first_name.
  • Customers.Last Name -> Person Type person ID 1, person sub type last_name.
  • Accounts.Account ID -> ID Type and use the Numerical setting with length=6 and as_string=True.
  • Accounts.Customer ID -> Foreign Key Type pointing to Customers.Customer ID.

Custom handlers

Custom handlers are available through the advanced section on the general settings page. They handle a wide range of scenarios to handle bespoke cases. We recommend talking to Hazy before attempting to use them, or you can read our documentation and supply config in JSON format.