Examples
Dealing with people
In the following `People` table
Person ID | Title | Gender | First Name | Last Name |
---|---|---|---|---|
0 | Mr | Male | Cole | Reynolds |
1 | Mrs | Female | Modesta | Sanford |
2 | Ms | Other | Caterina | Rolfson |
3 | Mr | Male | Brooklyn | Bartell |
... | ... | ... | ... | ... |
the types would be configured as:
- `Person ID` -> ID using the incremental ID settings.
- `Title` -> Title with the settings `Person ID=1` (where the 1 would be autogenerated when creating a new person entity).
- `Gender` -> Gender, the gender map would be configured as `Male=m`, `Female=f`, `Other=o` and `Person ID=1` (where the user would select the same person that was just added).
- `First Name` -> Name, with the settings `Name Type=first_name` and `Person ID=1`.
- `Last Name` -> Name, with the settings `Name Type=last_name` and `Person ID=1`.
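The effect of linking every column to the same person entity can be sketched in Python. This is purely illustrative (the titles, names and gender map are taken from the example table, not from Hazy's internals): each person entity is sampled once, and every linked column reads from that same entity, keeping the row internally consistent.

```python
import random

GENDER_MAP = {"Male": "m", "Female": "f", "Other": "o"}
TITLES = {"Male": "Mr", "Female": "Mrs", "Other": "Ms"}
FIRST_NAMES = {"Male": ["Cole", "Brooklyn"], "Female": ["Modesta"], "Other": ["Caterina"]}
LAST_NAMES = ["Reynolds", "Sanford", "Rolfson", "Bartell"]

def sample_person(person_id: int, rng: random.Random) -> dict:
    # Sample one entity; all linked columns derive from it.
    gender = rng.choice(list(GENDER_MAP))
    return {
        "Person ID": person_id,        # incremental ID
        "Title": TITLES[gender],       # consistent with the sampled gender
        "Gender": GENDER_MAP[gender],  # mapped via the gender map
        "First Name": rng.choice(FIRST_NAMES[gender]),
        "Last Name": rng.choice(LAST_NAMES),
    }

rng = random.Random(0)
people = [sample_person(i, rng) for i in range(4)]
```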
Dealing with locations
In the following `Location` table
Address ID | City | Country | Postcode | Address Line 1 |
---|---|---|---|---|
0 | London | United Kingdom | AW9 46J | 62 Reynolds Street |
1 | Manchester | United Kingdom | AW9 46J | 17 Flynn Road |
2 | New York | United States | 16745 | 145 Chapman Way |
3 | Paris | France | 45287 | 234 Georges Clémenceau |
... | ... | ... | ... | ... |
the types would be configured as:
- `Address ID` -> ID using the incremental ID settings.
- `City` -> City with the settings `Location ID=1` (where the 1 would be autogenerated when creating a new location entity).
- `Country` -> Country, with the settings `Location ID=1` (where the user would select the same location that was just added).
- `Postcode` -> Postcode, with the settings `Location ID=1`.
- `Address Line 1` -> Street address, with the settings `Format string="{street_number} {street}"` and `Location ID=1`.

The Entity settings should also be configured as `Locales=["en_GB", "en_US", "fr_FR"]`.
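The `Format string` setting behaves like standard placeholder substitution over sampled components. A minimal illustrative sketch, assuming hypothetical `street_number` and `street` components:

```python
import random

# Illustrative street components; not Hazy's internal data.
STREETS = ["Reynolds Street", "Flynn Road", "Chapman Way"]

def sample_address(fmt: str, rng: random.Random) -> str:
    # Substitute sampled components into the format string.
    return fmt.format(street_number=rng.randint(1, 300), street=rng.choice(STREETS))

rng = random.Random(1)
addr = sample_address("{street_number} {street}", rng)
```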
Dealing with denormalisation
Multiple examples can be found here using the Repeat by setting found on a number of types and also the Copy type.
Dealing with ID formats
The most commonly used ID types for databases and general entities are:
- Incremental: typically used for database autogenerated integer IDs.
- Numerical: also integers, but defined by the length of the number, and can be configured to include leading zeroes.
- UUID: version 4 Universally Unique Identifiers.
- MD5: generates MD5 hashes.
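The four formats above can be illustrated with Python's standard library. This is a sketch of the concepts, not Hazy's implementation:

```python
import hashlib
import itertools
import uuid

def numerical_id(n: int, length: int = 6) -> str:
    # Fixed-length numeric ID, padded with leading zeroes.
    return f"{n:0{length}d}"

incremental = itertools.count(0)                # 0, 1, 2, ... like a DB autoincrement
first_three = [next(incremental) for _ in range(3)]

padded = numerical_id(42, length=6)             # six digits with leading zeroes
uid = str(uuid.uuid4())                         # random version 4 UUID
digest = hashlib.md5(b"record-1").hexdigest()   # 32-character MD5 hash
```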
A set of personally identifiable IDs are also available:
- Passport number: now has its own type.
- Phone number
- Credit card number
- CPF number: generates Brazilian CPF numbers.
- CPR: generates Danish CPR numbers.
- License plate: generates fake license plate numbers.
- SSN: generates Social Security Numbers (or local equivalents for countries other than the USA).
A set of banking IDs/codes are available:
- Sortcode: identifies a bank.
- Bank country: generates ISO 3166-1 alpha-2 country codes.
- BBAN: generates a Basic Bank Account Number.
- SWIFT: generates SWIFT bank codes, typically used to identify overseas bank branches.
- IBAN: generates International Bank Account Numbers.
- Credit card security code: while not strictly an ID, it is also available here.
Miscellaneous:
- Company: generates fake company names.
- Name: designed to be used with some of the more complex types below. The Person type should be used if you're not dealing with a complex ID format involving names.
- Option: generates a random sample from provided options, designed to be used with the more complex types below.
The more complex ID formats are outlined below.
Regex
Generates IDs which conform to a regex pattern.
For example:

- `[a-z]{2}` would generate values such as `ag`, `cb`, `yt`.
- `[0-9A-F]{4}` would generate values such as `0A5D`, `BCAD`, `94FF`.
- `[0-9]{3}-[0-9]{3}-[0-9]{4}` would generate values such as `974-920-8306`.
- `(1[0-2]|0[1-9])(:[0-5]\d){2} (A|P)M` would generate values such as `11:57:12 AM`, `01:16:01 PM`.
The regex patterns use the Python flavour of regular expressions. We recommend using Regex101 for testing out the correct regex to match your data. The analysis will often return valid regular expressions, although the output should be checked to ensure it matches your business logic.
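As an illustration of how regex-driven generation works for the simple patterns above, here is a hand-rolled sketch. A real regex sampler parses arbitrary patterns; this one only handles fixed-length character classes, with `re.fullmatch` used to confirm the output conforms:

```python
import random
import re

rng = random.Random(2)

def sample_class(chars: str, length: int) -> str:
    # Draw `length` characters uniformly from a character class.
    return "".join(rng.choice(chars) for _ in range(length))

hex_id = sample_class("0123456789ABCDEF", 4)                    # matches [0-9A-F]{4}
phone = "-".join(sample_class("0123456789", n) for n in (3, 3, 4))  # matches the phone pattern
```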
Compound

Generates complex ID patterns by providing a pattern alongside a dictionary of sampling behaviours for each component within the pattern. An example would be:

- Pattern set to `{IDTYPE1}/{IDTYPE2}`.
- ID settings set to `{ "IDTYPE1": { "id_type": "name", "name_type": "first_name_male" }, "IDTYPE2": { "id_type": "numerical", "length": 4, "as_str": true } }`
This would generate IDs such as `dave/0453`, `trevor/7361`, `peter/4634`.

All ID settings parameters can be found here. This is where some of the more miscellaneous ID sampler types become more relevant.
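A compound sampler can be sketched as pattern substitution over per-component samplers. The helper names below are hypothetical (the `name_type` setting from the example is ignored for brevity); only the pattern/settings shape mirrors the configuration above:

```python
import random

NAMES = ["dave", "trevor", "peter"]  # illustrative male first names

def make_sampler(settings: dict, rng: random.Random):
    # Build a zero-argument sampler for one component.
    if settings["id_type"] == "name":
        return lambda: rng.choice(NAMES)
    if settings["id_type"] == "numerical":
        n = settings["length"]
        return lambda: "".join(rng.choice("0123456789") for _ in range(n))
    raise ValueError(f"unknown id_type: {settings['id_type']}")

def sample_compound(pattern: str, id_settings: dict, rng: random.Random) -> str:
    # Sample each named component, then substitute into the pattern.
    parts = {key: make_sampler(cfg, rng)() for key, cfg in id_settings.items()}
    return pattern.format(**parts)

rng = random.Random(3)
compound_id = sample_compound(
    "{IDTYPE1}/{IDTYPE2}",
    {"IDTYPE1": {"id_type": "name"},
     "IDTYPE2": {"id_type": "numerical", "length": 4}},
    rng,
)
```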
Conditioned

Generates various ID types based on provided conditions being met. An example would be:

- Conditions set to
[ { "query": "`('table1', 'col1')` == 'A'", "dependencies": [{ "col": "col1", "table": "table1" }], "sampler": { "id_type": "numerical", "length": 6, "as_str": true } }, { "query": "`('table1', 'col1')` == 'B'", "dependencies": [{ "col": "col1", "table": "table1" }], "sampler": { "id_type": "numerical", "length": 3, "as_str": true } } ]
i.e. this reads as: when `table1.col1` is equal to the value `"A"`, the ID sampled in this column would be something like `003653` or `649274`; and when the value in `table1.col1` is equal to `"B"`, the IDs sampled in this column would be of the form `375` or `207`.
- Mismatch: how values in the training data which don't match any query conditions are handled. `replace` means on generation, sample from one of the samplers at random. `preserve` means keep a list of values which didn't match the query conditions and sample from these values on generation.

Note that the `preserve` behaviour does present a privacy risk, so we recommend only using it if you're certain there is no private information in the column.
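The two conditions above amount to choosing a sampler based on the value of `table1.col1`. A minimal illustrative sketch (not Hazy's implementation):

```python
import random

def numerical(length: int, rng: random.Random) -> str:
    # Fixed-length string of digits, as in the numerical ID settings.
    return "".join(rng.choice("0123456789") for _ in range(length))

def sample_conditioned(col1_value: str, rng: random.Random) -> str:
    # Pick the sampler whose query condition the row satisfies.
    if col1_value == "A":
        return numerical(6, rng)
    if col1_value == "B":
        return numerical(3, rng)
    raise ValueError("unmatched condition; see the Mismatch setting")

rng = random.Random(4)
id_a = sample_conditioned("A", rng)
id_b = sample_conditioned("B", rng)
```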
ID Mixture

Generates various ID types based on regex patterns being matched. It provides a way to model the distribution of IDs. An example would be:

- Patterns set to
[ { "match": "A.*", "sampler": { "id_type": "regex", "pattern": "A[0-9]{3}" } }, { "match": "B.*", "sampler": { "id_type": "regex", "pattern": "B[0-9]{3}" } } ]
This means we effectively treat the column as two categories: those matching the prefix `A`, and those matching the prefix `B`. Say 20% of our IDs start with the `A` prefix and 80% start with the `B` prefix. On generation, the user should expect the same distribution, with 20% of the form `A947`, `A047`, `A461`, etc. and 80% of the form `B557`, `B546`, `B223`, etc.
- Mismatch: how values in the training data which don't match any regex patterns are handled. `replace` means on generation, sample from one of the samplers at random. `preserve` means keep a list of values which didn't match any pattern and sample from these values on generation.

Note that the `preserve` behaviour does present a privacy risk, so we recommend only using it if you're certain there is no private information in the column.
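The mixture behaviour can be sketched as: classify the training IDs by the pattern they match, then generate new IDs in the same proportions. Illustrative code only, with a tiny hand-coded training set:

```python
import random
import re

PATTERNS = [("A.*", "A"), ("B.*", "B")]
training = ["A947", "B557", "B546", "B223", "A047"]

def mixture_weights(values, patterns):
    # Proportion of training values matching each pattern.
    counts = {match: 0 for match, _ in patterns}
    for v in values:
        for match, _ in patterns:
            if re.fullmatch(match, v):
                counts[match] += 1
                break
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}

def sample_mixture(patterns, weights, rng):
    # Pick a pattern according to the learned weights, then sample an ID for it.
    match, prefix = rng.choices(patterns, weights=[weights[m] for m, _ in patterns])[0]
    return prefix + "".join(rng.choice("0123456789") for _ in range(3))

weights = mixture_weights(training, PATTERNS)
rng = random.Random(5)
generated = [sample_mixture(PATTERNS, weights, rng) for _ in range(10)]
```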
Split
Split a column into separate components that can either be modelled as categoricals or replaced with a regex pattern.
A good use case might be credit card data where a set of digits contain information about the issuer of the card and you want this distribution to be preserved.
An example would be:

- Split map set to
Key | Value |
---|---|
6 | None |
12 | [0-9]{6} |
18 | None |
This means characters 1-6 and 13-18 inclusive are modelled as categorical variables. Characters 7-12 will be replaced with a selection of characters that match the corresponding regex pattern.

Note that this type does present a privacy risk, so we recommend only using it if you're sure the character ranges you model with `null` don't include PII.
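A sketch of the split map above, assuming hypothetical 18-character card-like values: characters 1-6 and 13-18 are drawn from segments observed in training, while characters 7-12 are regenerated to match `[0-9]{6}`:

```python
import random

# Illustrative training values; not real card numbers.
training = ["411111222222333333", "411111999999333333", "550000222222444444"]

def split_segments(value: str):
    # Split into the three ranges defined by the split map keys 6, 12, 18.
    return value[:6], value[6:12], value[12:18]

prefixes = {split_segments(v)[0] for v in training}   # chars 1-6, categorical
suffixes = {split_segments(v)[2] for v in training}   # chars 13-18, categorical

def sample_split(rng: random.Random) -> str:
    # Middle segment is freshly generated to match [0-9]{6}.
    middle = "".join(rng.choice("0123456789") for _ in range(6))
    return rng.choice(sorted(prefixes)) + middle + rng.choice(sorted(suffixes))

rng = random.Random(6)
card = sample_split(rng)
```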
Dealing with sensitive categorical data
Say you have some retail data in the form of the `Purchases` table below. You wish to share the data with an outside party to do some analysis. You offer voucher codes which could be useful as part of the analysis; however, you don't wish to release the actual voucher codes, as these could be misused if the data were released.
Purchase ID | Product ID | Value | Voucher Code |
---|---|---|---|
0 | 1 | 2000 | |
1 | 5 | 2500 | HALF_STARTER |
2 | 1 | 1800 | 20_OFF |
3 | 5 | 5000 | |
... | ... | ... | ... |
The user might configure the `Voucher Code` column using the Mapped type, with the ID settings set as incremental.
The synthetic data might look something like this:
Purchase ID | Product ID | Value | Voucher Code |
---|---|---|---|
0 | 1 | 1800 | 1 |
1 | 5 | 2500 | 2 |
2 | 1 | 2000 | |
3 | 5 | 5000 | |
... | ... | ... | ... |
The other columns would be set as follows:

- `Purchase ID` -> ID using the incremental ID settings.
- `Product ID` -> Category; since we don't have any other tables in this example, the column can be treated purely as categorical data.
- `Value` -> Integer.
If you wish for more control over the types of patterns that should appear, and how you want to model the IDs, you may be interested in the ID mixture settings inside the ID type.
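The Mapped type with incremental ID settings amounts to replacing each distinct value with an integer assigned in order of first appearance. A minimal sketch of the concept (not Hazy's implementation), using the voucher codes from the table above:

```python
def build_mapping(values):
    # Assign incremental integers to distinct non-empty values,
    # in order of first appearance.
    mapping, next_id = {}, 1
    for v in values:
        if v and v not in mapping:
            mapping[v] = next_id
            next_id += 1
    return mapping

codes = [None, "HALF_STARTER", "20_OFF", None]
mapping = build_mapping(codes)
masked = [mapping.get(c) for c in codes]  # empty cells stay empty
```

Grouping and joining on the masked column still works for analysis, while the real codes never leave the source environment.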
Dealing with enums and sub-enums
Say you had a table with data relating to car models and their corresponding brands. The combinations of values between the brand and model columns only make sense when they are paired together correctly. For example, yielding a result such as `BMW` and `A Class` would not make sense.
Model ID | Brand | Model |
---|---|---|
0 | BMW | M2 |
1 | Mercedes | A Class |
2 | BMW | X5 |
3 | Lotus | Emira |
... | ... | ... |
The following configuration would enforce that only combinations that are observed in the source data will be generated in the synthetic data:

- `Model ID` -> ID using the incremental ID settings.
- `Brand` -> Category with the settings `Combination ID=1` (where the 1 would be autogenerated when creating a new combination entity).
- `Model` -> Category, with the settings `Combination ID=1` (where the user would select the same combination that was just added).
It should be noted that the model will often be able to pick up on the relationships between truthful categorical columns, and therefore automatically enforce this behaviour without the need for linking the columns via the Combination ID. However, as the privacy of the model is increased, the chances of generating unseen categorical combinations in the synthetic data also increase. Therefore, it is recommended to use the Combination ID as outlined here for any columns for which it makes sense to do so.
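Sampling Brand and Model jointly from observed combinations can be sketched as follows (illustrative only; the pairs come from the example table):

```python
import random

# Only pairs observed in the source data can ever be generated,
# so an impossible pair such as ("BMW", "A Class") cannot appear.
observed = [("BMW", "M2"), ("Mercedes", "A Class"), ("BMW", "X5"), ("Lotus", "Emira")]

def sample_combination(rng: random.Random):
    # Brand and Model are drawn together as one combination entity.
    return rng.choice(observed)

rng = random.Random(7)
samples = [sample_combination(rng) for _ in range(20)]
```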
Dealing with free text
Hazy doesn't naturally support free/unstructured text fields such as descriptions. Where data contains unsupported columns these can be dropped or an alternative generator can be used from Hazy's large list of supported types. In particular, the set of ID types described in detail here do not use any of the underlying data by default.
Dealing with high cardinality data
The approach of mapping string values to categories works well when there is a relatively small number of distinct values/categories that contain most of the distribution. Performance varies depending on the algorithms used and a number of settings. The `max_cat` parameter can be used to increase the number of categories modelled accurately; note that increasing this parameter will significantly increase memory usage. 250 is a "rule of thumb" upper bound, but it might make sense to go higher in some circumstances.
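Capping a high-cardinality column at `max_cat` categories can be sketched as keeping the most frequent values and folding the long tail into a single overflow bucket. This is illustrative of the concept only, not Hazy's actual modelling:

```python
from collections import Counter

def cap_categories(values, max_cat):
    # Keep the max_cat most frequent values as categories;
    # replace everything else with an overflow bucket.
    top = [v for v, _ in Counter(values).most_common(max_cat)]
    kept = set(top)
    return ["<other>" if v not in kept else v for v in values]

values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d", "e"]
capped = cap_categories(values, max_cat=3)
```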