Denormalisation
Hazy has a set of tools to deal with some types of denormalisation within the user's data.
Repeat by¶
This is a setting which is available on a number of data types. It should be used when the columns value is the same based on another column - usually an ID column.
Example¶
Transaction ID | Customer ID | Customer Reference |
---|---|---|
0 | 1 | 004648 |
1 | 1 | 004648 |
2 | 2 | 112084 |
3 | 3 | 666666 |
4 | 2 | 112084 |
5 | 1 | 004648 |
6 | 2 | 112084 |
7 | 3 | 666666 |
... | ... | ... |
In this example there is a single table Transactions
and the Customer Reference
column is a denormalisation, as it is always the same depending on the Customer ID
column.
In a normalised database, this would be represented by two tables ie.
Customer ID | Customer Reference |
---|---|
1 | 004648 |
2 | 112084 |
3 | 666666 |
... | ... |
Transaction ID | Customer ID |
---|---|
0 | 1 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 2 |
5 | 1 |
6 | 2 |
7 | 3 |
... | ... |
To allow Hazy to treat the first problem like the second. The customer would configure:
Transactions.Customer Reference
-> ID Type and use theNumerical
setting withlength=6
andas_string=True
. They would also setRepeat By
toTransactions.Customer ID
.
Copy type¶
Used when data is repeated on a table which is not the original source of truth for that piece of information.
Example¶
Customer ID | First Name | Last Name |
---|---|---|
0 | James | Johnson |
1 | Toby | Zboncak |
2 | Lauren | Kuhn |
3 | Will | Harris |
... | ... | ... |
Account ID | Customer ID | First Name |
---|---|---|
000064 | 1 | Toby |
000065 | 1 | Toby |
000354 | 2 | Lauren |
001984 | 3 | Will |
002234 | 3 | Will |
... | ... | ... |
In this example there are two tables Customers
and Accounts
. Customers
is normalised but Accounts
contains a denormalisation where data from the Customers
table is duplicated based on the Customer ID
.
In this case the Accounts.First Name
column, always matches the Customers.First Name
when Accounts.Customer ID
is equal to Customers.Customer ID
.
If the user wishes to maintain this denormalisation in the synthetic data, it can be guaranteed by setting:
Accounts.First Name
->CopyType
with the source column set asCustomers.FirstName
The other columns would be set up as follows:
Customers.Customer ID
-> ID Type and use theIncremental
setting.Customers.First Name
-> Person Type person ID1
, person sub typefirst_name
.Customers.Last Name
-> Person Type person ID1
, person sub typelast_name
.Accounts.Account ID
-> ID Type and use theNumerical
setting withlength=6
andas_string=True
.Accounts.Customer ID
-> Foreign Key Type pointing toCustomers.Customer ID
.
Custom handlers¶
Custom handlers are available through the advanced section on the general settings page. They handle a wide range of scenarios to handle bespoke cases. We recommend talking to Hazy before attempting to use them, or you can read our documentation and supply config in JSON
format.