Examples

Dealing with people

In the following People table

People

Person ID | Title | Gender | First Name | Last Name
----------|-------|--------|------------|----------
0         | Mr    | Male   | Cole       | Reynolds
1         | Mrs   | Female | Modesta    | Sanford
2         | Ms    | Other  | Caterina   | Rolfson
3         | Mr    | Male   | Brooklyn   | Bartell
...       | ...   | ...    | ...        | ...

the types would be configured as:

  • Person ID -> ID using the incremental ID settings.
  • Title -> Title with the settings Person ID=1 (where the 1 would be autogenerated when creating a new person entity).
  • Gender -> Gender, with the gender map configured as Male=m, Female=f, Other=o, and Person ID=1 (where the user would select the same person that was just added).
  • First Name -> Name, with the settings Name Type=first_name and Person ID=1.
  • Last Name -> Name, with the settings Name Type=last_name and Person ID=1.
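
As a rough sketch, the configuration above might be expressed in the same JSON style used for the ID settings later on this page. The key names here (type, person_id, gender_map and so on) are illustrative assumptions rather than the exact product schema:

    {
        "Person ID":  { "type": "id", "id_type": "incremental" },
        "Title":      { "type": "title", "person_id": 1 },
        "Gender":     { "type": "gender", "person_id": 1,
                        "gender_map": { "Male": "m", "Female": "f", "Other": "o" } },
        "First Name": { "type": "name", "name_type": "first_name", "person_id": 1 },
        "Last Name":  { "type": "name", "name_type": "last_name", "person_id": 1 }
    }

All five columns share person_id 1, which is what ties the title, gender and names together into a single consistent person entity.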

Dealing with locations

In the following Location table

Location

Address ID | City       | Country        | Postcode | Address Line 1
-----------|------------|----------------|----------|----------------------
0          | London     | United Kingdom | AW9 46J  | 62 Reynolds Street
1          | Manchester | United Kingdom | AW9 46J  | 17 Flynn Road
2          | New York   | United States  | 16745    | 145 Chapman Way
3          | Paris      | France         | 45287    | 234 Georges Clémencea
...        | ...        | ...            | ...      | ...

the types would be configured as:

  • Address ID -> ID using the incremental ID settings.
  • City -> City with the settings Location ID=1 (where the 1 would be autogenerated when creating a new location entity).
  • Country -> Country, with the settings Location ID=1 (where the user would select the same location that was just added).
  • Postcode -> Postcode, with the settings Location ID=1.
  • Address Line 1 -> Street address, with the settings Format string="{street_number} {street}" and Location ID=1.

The Entity settings should also be configured as Locales=["en_GB", "en_US", "fr_FR"].
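
Sketched in the same illustrative JSON style (again, the key names are assumptions), the street address column and the entity settings might look like:

    {
        "Address Line 1": {
            "type": "street_address",
            "format_string": "{street_number} {street}",
            "location_id": 1
        },
        "entity_settings": { "locales": ["en_GB", "en_US", "fr_FR"] }
    }

The three locales cover the United Kingdom, United States and France rows seen in the source table.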

Dealing with denormalisation

Multiple examples can be found here using the Repeat by setting, which is found on a number of types, and also the Copy type.

Dealing with ID formats

The most commonly used ID types for databases and general entities are:

  • Incremental Typically used for database autogenerated integer IDs.
  • Numerical These are also integers, but defined by the length of the number, and can be configured to include leading zeroes.
  • UUID Version 4 Universally Unique Identifiers.
  • MD5 Generates MD5 hashes.
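
For instance, using the same ID settings keys that appear in the Conditioned example further down this page, a six-digit numerical ID kept as a string (so that leading zeroes are preserved) could be configured as:

    { "id_type": "numerical", "length": 6, "as_str": true }

which would yield values such as 003653 or 649274.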

A set of personally identifiable IDs are also available:

  • Passport number Generates passport numbers (now a dedicated type).
  • Phone number
  • Credit card number
  • CPF number Generates Brazilian CPF numbers.
  • CPR Generates Danish CPR numbers.
  • License plate Generates fake license plate numbers.
  • SSN Generates Social Security Numbers (or local equivalents for countries other than the USA).

A set of banking IDs/codes are available:

  • Sortcode Identifies a bank.
  • Bank country Generates ISO 3166-1 alpha-2 country codes.
  • BBAN Generates Basic Bank Account Numbers.
  • SWIFT Generates SWIFT bank codes, typically used to identify overseas bank branches.
  • IBAN Generates International Bank Account Numbers.
  • Credit card security code While not strictly an ID, it is also available here.

Miscellaneous:

  • Company Generates fake company names.
  • Name Designed to be used with some of the more complex types below. If you're not dealing with a complex ID format involving names, use the Person type instead.
  • Option Generates a random sample from provided options, designed to be used with the more complex types below.

The more complex ID formats are outlined below.

Regex

Generates IDs which conform to a regex pattern.

For example:

  • [a-z]{2} would generate values such as ag, cb, yt.
  • [0-9A-F]{4} would generate values such as 0A5D, BCAD, 94FF.
  • [0-9]{3}-[0-9]{3}-[0-9]{4} would generate values such as 974-920-8306.
  • (1[0-2]|0[1-9])(:[0-5]\d){2} (A|P)M would generate 11:57:12 AM, 01:16:01 PM.

The regex patterns use the Python flavour of regular expressions. We recommend using Regex101 for testing out the correct regex to match your data.

The analysis will often return valid regular expressions, although you should check that the output matches your business logic.
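
As a sketch, using the sampler settings format that appears in the ID mixture example below, the phone-number-style pattern above could be configured as:

    { "id_type": "regex", "pattern": "[0-9]{3}-[0-9]{3}-[0-9]{4}" }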

Compound

Generates complex ID patterns by providing a pattern alongside a dictionary of sampling behaviours for each component within the pattern. An example would be

  • Pattern set to {IDTYPE1}/{IDTYPE2}.

  • ID settings set to

    {
        "IDTYPE1": {
            "id_type": "name",
            "name_type": "first_name_male"
        },
        "IDTYPE2": {
            "id_type": "numerical",
            "length": 4,
            "as_str" true,
        }
    }
    

    This would generate IDs such as dave/0453, trevor/7361, peter/4634.

    All ID settings parameters can be found here. This is where some of the more miscellaneous ID sampler types become more relevant.

Conditioned

Generates various ID types based on provided conditions being met. An example would be

  • Conditions set to

    [
      {
        "query": "`('table1', 'col1')` == 'A'",
        "dependencies": [{ "col": "col1", "table": "table1" }],
        "sampler": { "id_type": "numerical", "length": 6, "as_str": true }
      },
      {
        "query": "`('table1', 'col1')` == 'B'",
        "dependencies": [{ "col": "col1", "table": "table1" }],
        "sampler": { "id_type": "numerical", "length": 3, "as_str": true }
      }
    ]
    

    i.e. this reads as: when table1.col1 equals "A", the IDs sampled in this column would be something like 003653, 649274; when table1.col1 equals "B", the IDs sampled in this column would be of the form 375, 207.

  • Mismatch Determines how values in the training data which don't match any query conditions are handled. replace means that on generation, values are sampled randomly from one of the samplers. preserve means that a list of values which didn't match the query conditions is kept, and values are sampled from this list on generation.

Note the preserve behaviour does present a privacy risk, so we recommend only using it if you're certain there is no private information in the column.

ID Mixture

Generates various ID types based on regex patterns being matched. It provides a way to model distributions of IDs. An example would be

  • Patterns set to

    [
      {
        "match": "A.*",
        "sampler": {
          "id_type": "regex",
          "pattern": "A[0-9]{3}"
        }
      },
      {
        "match": "B.*",
        "sampler": {
          "id_type": "regex",
          "pattern": "B[0-9]{3}"
        }
      }
    ]
    

    This means we're effectively going to treat the column as two categories: those whose values start with the A prefix, and those whose values start with the B prefix. Let's say 20% of our IDs start with the A prefix and 80% with the B prefix. On generation the user should expect the same distribution, with 20% of the form A947, A047, A461 etc. and 80% of the form B557, B546, B223 etc.

  • Mismatch Determines how values in the training data which don't match any regex patterns are handled. replace means that on generation, values are sampled randomly from one of the samplers. preserve means that a list of values which didn't match the patterns is kept, and values are sampled from this list on generation.

Note the preserve behaviour does present a privacy risk, so we recommend only using it if you're certain there is no private information in the column.

Split

Split a column into separate components that can either be modelled as categoricals or replaced with a regex pattern.

A good use case might be credit card data where a set of digits contain information about the issuer of the card and you want this distribution to be preserved.

An example would be

  • Split map set to
    Key | Value
    ----|---------
    6   | None
    12  | [0-9]{6}
    18  | None

This means characters 1-6 and 13-18 inclusive are modelled as categorical variables. Characters 7-12 will be replaced with a selection of characters that match the corresponding regex pattern.
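
Expressed in the JSON style used elsewhere on this page (the exact representation of a split map is an assumption here), the mapping above might look like:

    {
        "6": null,
        "12": "[0-9]{6}",
        "18": null
    }

where a null value means "model this character range as a categorical" and a regex string means "replace this range with matching characters".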

Note this type does present a privacy risk, so we recommend only using it if you're sure the character ranges you model with None don't include PII.

Dealing with sensitive categorical data

Say you have some retail data in the form of the Purchases table below. You wish to share the data with an outside party to do some analysis. You offer voucher codes which could be used as part of the analysis; however, you don't wish to release the actual voucher codes, as these could be misused if the data was released.

Purchases

Purchase ID | Product ID | Value | Voucher Code
------------|------------|-------|-------------
0           | 1          | 2000  |
1           | 5          | 2500  | HALF_STARTER
2           | 1          | 1800  | 20_OFF
3           | 5          | 5000  |
...         | ...        | ...   | ...

The user might configure the Voucher Code column using the Mapped type, with the ID settings set as incremental.
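
A minimal sketch of that configuration, in the same illustrative JSON style used above (key names are assumptions):

    {
        "Voucher Code": {
            "type": "mapped",
            "id_settings": { "id_type": "incremental" }
        }
    }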

The synthetic data might look something like this:

Purchases

Purchase ID | Product ID | Value | Voucher Code
------------|------------|-------|-------------
0           | 1          | 1800  | 1
1           | 5          | 2500  | 2
2           | 1          | 2000  |
3           | 5          | 5000  |
...         | ...        | ...   | ...

The other columns would be set as follows:

  • Purchase ID -> ID using the incremental ID settings.
  • Product ID -> Category. Since we don't have any other tables in this example, the column can be treated purely as categorical data.
  • Value -> Integer.

If you wish for more control over the types of patterns that should appear and how you want to model the IDs you may be interested in the ID mixture settings inside the ID type.

Dealing with enums and sub-enums

Say you had a table with data relating to car models and their corresponding brands. The combinations of values between the brand and model columns only make sense when they are paired together correctly. For example, yielding a result such as BMW and A Class would not make sense.

Car Models

Model ID | Brand    | Model
---------|----------|--------
0        | BMW      | M2
1        | Mercedes | A Class
2        | BMW      | X5
3        | Lotus    | Emira
...      | ...      | ...

The following configuration would enforce that only combinations that are observed in the source data will be generated in the synthetic data:

  • Model ID -> ID using the incremental ID settings.
  • Brand -> Category with the settings Combination ID=1 (where the 1 would be autogenerated when creating a new combination entity).
  • Model -> Category, with the settings Combination ID=1 (where the user would select the same combination that was just added).
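
As an illustrative JSON sketch (key names are assumptions, not the exact product schema):

    {
        "Model ID": { "type": "id", "id_type": "incremental" },
        "Brand":    { "type": "category", "combination_id": 1 },
        "Model":    { "type": "category", "combination_id": 1 }
    }

Because Brand and Model share combination_id 1, only brand/model pairs observed together in the source data can be emitted.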

It should be noted that the model will often be able to pick up on the relationships between truthful categorical columns, and therefore automatically enforce this behaviour without the need for linking the columns via the Combination ID. However, as the privacy of the model is increased, the chances of generating unseen categorical combinations in the synthetic data also increase. Therefore, it is recommended to use the Combination ID as outlined here for any columns where it makes sense to do so.

Dealing with free text

Hazy doesn't natively support free/unstructured text fields such as descriptions. Where data contains unsupported columns, these can be dropped, or an alternative generator can be used from Hazy's large list of supported types. In particular, the set of ID types described in detail here do not use any of the underlying data by default.

Dealing with high cardinality data

The approach of mapping string values to categories works well when there is a relatively small number of distinct values/categories containing most of the distribution. Performance varies depending on the algorithms used and a number of settings. The max_cat parameter can be used to increase the number of categories modelled accurately, but increasing this parameter will significantly increase memory usage. 250 is a "rule of thumb" upper bound, but it might make sense to go higher in some circumstances.
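
As an illustrative sketch (the surrounding key names are assumptions), a categorical column configured to model up to 500 distinct values accurately might look like:

    { "type": "category", "max_cat": 500 }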