Data Types

The Hazy data types are provided to a Data Table and used to represent columns in the data.

A column should only be represented once in a Data Table.

Classes:

AgeType

Used only when a table contains both a date of birth column and an age column.

CategoryType

Used for categorical data where the column contains values from a fixed-size list.

CityType

Used for city location data.

ConstantType

Uses a single constant value for every row in the column.

CountryType

Used for country data.

CurrencyType

Allows support for currency value columns with a currency unit leading or trailing the numerical value, as well as thousands separators and decimal points such as £250,000.

CustomAddressType

Used for custom address data.

DatetimeType

Used for datetimes.

DistrictType

Used for district location data.

DrivingLicenseNumberType

Used for driving license numbers.

EmailType

Used to describe a column that contains email addresses.

FloatType

Used for floating point data.

ForeignKeyType

Used for foreign key columns.

GenderType

Used to handle gender information in a structured and consistent manner.

IdType

These columns will be uniquely generated to match certain patterns.

IntType

Used for integer data where a wide range of non-unique values can be entered.

ListType

Split a column based on a separator.

MappedType

Used for categorical columns containing sensitive labels that must be obfuscated.

NameType

Used to describe a column that contains names.

PassportType

Used for passport numbers.

PercentageType

Allows support for numerical columns with a leading or trailing % symbol, comma separators and decimal points such as %10,000.

PostcodeType

Used for postcode/zipcode data.

RawType

Used in combination with Custom Handlers to handle bespoke use cases.

RealType

Used for replicating a column exactly.

RegionType

Used for region location data.

StateType

Used for state location data.

StreetAddressType

Used for street address data.

SymbolType

Allows support for numerical columns with a symbol leading or trailing the numerical value, as well as thousand separators and decimal points such as VAL10,000.

TimedeltaType

Used for timedeltas, for example seconds, minutes, hours or days.

TitleType

Used for columns that represent titles such as Mr, Mrs, Dr, etc.

UsernameType

Used to describe a column that contains usernames.

Removed from release 4.0.0 onwards:

CombinationType

An Entity type which requires more than one column.

LocationType

Used to describe consistent addresses.

PersonType

Used to describe columns relating to people.

Age Type

class hazy_configurator.hazy_data_types.age_type.AgeType

Bases: NonEntityType

Used only when a table contains both a date of birth column and an age column.

This type handles a specific denormalisation case, typically when a view of the data has been created to contain both age and date of birth and they need to be consistent. If a date of birth column does not exist and only age, that column should be configured using the IntType instead.

For this type to work, the date of birth column must be configured using the DatetimeType. A reference date must be provided so that the age on a given day can be calculated.
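The relationship between the date of birth column, the reference date and the age column can be sketched in plain Python. This is a minimal illustration rather than Hazy's implementation: the helper name `age_on` is hypothetical, and it assumes both dates use the %Y-%m-%d format described for ref_date.

```python
from datetime import datetime


def age_on(dob: str, ref_date: str, fmt: str = "%Y-%m-%d") -> int:
    """Return the age in whole years on ref_date (hypothetical helper)."""
    born = datetime.strptime(dob, fmt).date()
    ref = datetime.strptime(ref_date, fmt).date()
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (ref.month, ref.day) >= (born.month, born.day)
    return ref.year - born.year - (0 if had_birthday else 1)


# Someone born on 26 December 1990 is still 31 on 25 December 2022.
print(age_on("1990-12-26", "2022-12-25"))
```

Fixing the reference date keeps the two columns consistent regardless of when the synthetic data is generated.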

Examples

from hazy_configurator import AgeType, ColId

AgeType(
    col="Age",
    dob_column=ColId(
        col="date_of_birth",
        table="data",
    ),
    ref_date="2022-12-25",
)
Fields:
field dob_column: ColId [Required]

Column which contains date of birth used to calculate age.

field ref_date: str [Required]

Reference date used for age calculation, i.e. what is their age on this date? Format %Y-%m-%d.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Category Type

class hazy_configurator.hazy_data_types.category_type.CategoryType

Bases: PrimaryCapableEntityPartType

Used for categorical data where the column contains values from a fixed-size list.

In this case the distribution of the top items in the list will be learned, and the relationships to other columns will be learned. The synthetic data will then contain the items from the list in the same frequency as the items in the source data.

The maximum number of categories investigated by the model can be set using the max_cat parameter in the PrivBayesConfig class. If there are more items in the list than the max_cat value, the synthetic data will contain the correct distribution of the top max_cat categories, and the other items will be sampled from the remaining categories.
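The effect of a category cap can be illustrated with a short, library-independent sketch. It only shows how a column's values divide into the most frequent max_cat categories and the remainder; it is not the PrivBayes model itself, and the function name `split_by_max_cat` is hypothetical.

```python
from collections import Counter


def split_by_max_cat(values, max_cat):
    """Split value counts into the top max_cat categories and the rest."""
    ranked = Counter(values).most_common()  # sorted by descending frequency
    top = dict(ranked[:max_cat])    # categories whose distribution is modelled
    rest = dict(ranked[max_cat:])   # remaining categories, sampled separately
    return top, rest


column = ["red"] * 5 + ["blue"] * 3 + ["green"] + ["pink"]
top, rest = split_by_max_cat(column, max_cat=2)
print(top)   # the two most frequent categories with their counts
print(rest)  # the categories that fall outside the cap
```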

If the data is one-hot encoded (i.e. just 0 and 1) it should also be treated as a CategoryType.

When the combination of values across multiple categorical columns must be strictly adhered to, for example with car_model and car_manufacturer, then the categorical columns should be linked via an entity_id. This will ensure that only combinations of values that are observed in the source data are generated in the synthetic data, and that no new combinations are created.
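As a rough illustration of what linking columns guarantees, the sketch below samples whole rows of (car_manufacturer, car_model) pairs so that no unseen pairing can appear. The sample data and the function `sample_linked` are hypothetical and only demonstrate the constraint, not how the synthesiser enforces it.

```python
import random


def sample_linked(rows, n, seed=0):
    """Sample n combinations, drawing only from combinations seen in rows."""
    rng = random.Random(seed)
    # Sampling whole rows keeps linked values together and preserves
    # the relative frequency of each observed combination.
    return [rng.choice(rows) for _ in range(n)]


source = [
    ("Ford", "Fiesta"),
    ("Ford", "Focus"),
    ("Toyota", "Yaris"),
    ("Toyota", "Yaris"),
]
synthetic = sample_linked(source, n=10)
# No new pairing such as ("Ford", "Yaris") can be produced.
print(set(synthetic) <= set(source))
```

Sampling each column independently would not give this guarantee, which is why the columns need to share an entity_id.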

Examples

from hazy_configurator import CategoryType

CategoryType(col="category_column")
Fields:
field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field entity_id: Optional[int] = None

Linking categorical columns via an entity ID will ensure that only combinations of values that are observed in the source data are generated in the synthetic data. This is useful when the combination of values across multiple categorical columns must be strictly adhered to, for example with car_model and car_manufacturer columns.

field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

City Type

class hazy_configurator.hazy_data_types.location.city_type.CityType

Bases: LocationTypeBase

Used for city location data.

Examples

from hazy_configurator import CityType

CityType(col="city_col", entity_id=1)
Fields:
field type: LocationTypes = LocationTypes.REGULAR

The type of city to be generated. regular will generate city names, while code will generate city codes.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Combination Type

class hazy_configurator.hazy_data_types.combination_type.CombinationType

Bases: BaseType

An Entity type which requires more than one column.

Used to define a set of columns for which only certain combinations of the included columns make sense. An example would be state and city. By using this type you would ensure cities can only ever be matched with their corresponding states, even when noise is introduced to the training process.

In order to do this the model will treat these columns as category columns and the same underlying rules for categories also apply to this column.

Examples

from hazy_configurator import CombinationType

CombinationType(cols=["city", "state"])
Fields:
field cols: List[str] [Required]

The list of columns to target which make up the entity.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Constant Type

class hazy_configurator.hazy_data_types.constant_type.ConstantType

Bases: NonEntityType

Uses a single constant value for every row in the column.

Examples

from hazy_configurator import ConstantType

ConstantType(
    col="col1",
    value=5,
)
Fields:
field value: Union[None, StrictInt, StrictFloat, bool, datetime, timedelta, str] = None

Single value to set for the entire column

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Country Type

class hazy_configurator.hazy_data_types.location.country_type.CountryType

Bases: LocationTypeBase

Used for country data.

Examples

from hazy_configurator import CountryType

CountryType(col="country_col", entity_id=1)
Fields:
field type: CountryTypes = CountryTypes.COUNTRY

The type of country information to be generated.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Currency Type

hazy_configurator.hazy_data_types.currency_type.iso_currencies = ['ADP', 'AED', 'AFA', 'AFN', 'ALK', 'ALL', 'AMD', 'ANG', 'AOA', 'AOK', 'AON', 'AOR', 'ARA', 'ARP', 'ARS', 'ARY', 'ATS', 'AUD', 'AWG', 'AYM', 'AZM', 'AZN', 'BAD', 'BAM', 'BBD', 'BDT', 'BEC', 'BEF', 'BEL', 'BGJ', 'BGK', 'BGL', 'BGN', 'BHD', 'BIF', 'BMD', 'BND', 'BOB', 'BOP', 'BOV', 'BRB', 'BRC', 'BRE', 'BRL', 'BRN', 'BRR', 'BSD', 'BTN', 'BUK', 'BWP', 'BYB', 'BYN', 'BYR', 'BZD', 'CAD', 'CDF', 'CHC', 'CHE', 'CHF', 'CHW', 'CLF', 'CLP', 'CNY', 'COP', 'COU', 'CRC', 'CSD', 'CSJ', 'CSK', 'CUC', 'CUP', 'CVE', 'CYP', 'CZK', 'DDM', 'DEM', 'DJF', 'DKK', 'DOP', 'DZD', 'ECS', 'ECV', 'EEK', 'EGP', 'ERN', 'ESA', 'ESB', 'ESP', 'ETB', 'EUR', 'FIM', 'FJD', 'FKP', 'FRF', 'GBP', 'GEK', 'GEL', 'GHC', 'GHP', 'GHS', 'GIP', 'GMD', 'GNE', 'GNF', 'GNS', 'GQE', 'GRD', 'GTQ', 'GWE', 'GWP', 'GYD', 'HKD', 'HNL', 'HRD', 'HRK', 'HTG', 'HUF', 'IDR', 'IEP', 'ILP', 'ILR', 'ILS', 'INR', 'IQD', 'IRR', 'ISJ', 'ISK', 'ITL', 'JMD', 'JOD', 'JPY', 'KES', 'KGS', 'KHR', 'KMF', 'KPW', 'KRW', 'KWD', 'KYD', 'KZT', 'LAJ', 'LAK', 'LBP', 'LKR', 'LRD', 'LSL', 'LSM', 'LTL', 'LTT', 'LUC', 'LUF', 'LUL', 'LVL', 'LVR', 'LYD', 'MAD', 'MDL', 'MGA', 'MGF', 'MKD', 'MLF', 'MMK', 'MNT', 'MOP', 'MRO', 'MRU', 'MTL', 'MTP', 'MUR', 'MVQ', 'MVR', 'MWK', 'MXN', 'MXP', 'MXV', 'MYR', 'MZE', 'MZM', 'MZN', 'NAD', 'NGN', 'NIC', 'NIO', 'NLG', 'NOK', 'NPR', 'NZD', 'OMR', 'PAB', 'PEH', 'PEI', 'PEN', 'PES', 'PGK', 'PHP', 'PKR', 'PLN', 'PLZ', 'PTE', 'PYG', 'QAR', 'RHD', 'ROK', 'ROL', 'RON', 'RSD', 'RUB', 'RUR', 'RWF', 'SAR', 'SBD', 'SCR', 'SDD', 'SDG', 'SDP', 'SEK', 'SGD', 'SHP', 'SIT', 'SKK', 'SLE', 'SLL', 'SOS', 'SRD', 'SRG', 'SSP', 'STD', 'STN', 'SUR', 'SVC', 'SYP', 'SZL', 'THB', 'TJR', 'TJS', 'TMM', 'TMT', 'TND', 'TOP', 'TPE', 'TRL', 'TRY', 'TTD', 'TWD', 'TZS', 'UAH', 'UAK', 'UGS', 'UGW', 'UGX', 'USD', 'USN', 'USS', 'UYI', 'UYN', 'UYP', 'UYU', 'UYW', 'UZS', 'VEB', 'VED', 'VEF', 'VES', 'VNC', 'VND', 'VUV', 'WST', 'XAF', 'XAG', 'XAU', 'XBA', 'XBB', 'XBC', 'XBD', 'XCD', 'XDR', 
'XEU', 'XFO', 'XFU', 'XOF', 'XPD', 'XPF', 'XPT', 'XRE', 'XSU', 'XTS', 'XUA', 'XXX', 'YDD', 'YER', 'YUD', 'YUM', 'YUN', 'ZAL', 'ZAR', 'ZMK', 'ZMW', 'ZRN', 'ZRZ', 'ZWC', 'ZWD', 'ZWL', 'ZWN', 'ZWR']

An enum containing default ISO currency codes. Currency handling is aware of ISO currency codes. Any will be overwritten by data provided.

class hazy_configurator.hazy_data_types.currency_type.CurrencyType

Bases: NonEntityType

Allows support for currency value columns with a currency unit leading or trailing the numerical value, as well as thousands separators and decimal points such as £250,000. Currency values and units may be separate (one column for each), or together.
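To make the parsing problem concrete, here is a minimal sketch of splitting a combined currency string into an amount and an ISO code. It is an assumption-laden illustration, not the library's parser: the function `parse_currency` is hypothetical, and the symbol map is copied from the default currency_map value listed under Fields.

```python
from decimal import Decimal

# Copied from the documented default for currency_map.
CURRENCY_MAP = {"$": "USD", "CHF": "CHF", "kr": "SEK", "£": "GBP",
                "¥": "JPY", "₩": "KRW", "€": "EUR"}


def parse_currency(text, decimal_separator=".", thousand_separator=",",
                   currency_map=CURRENCY_MAP):
    """Return (amount, iso_code) for a string such as '£250,000'."""
    text = text.strip()
    unit = None
    # Try longer symbols first so multi-character units are matched whole.
    for symbol in sorted(currency_map, key=len, reverse=True):
        if text.startswith(symbol):
            unit, text = currency_map[symbol], text[len(symbol):]
            break
        if text.endswith(symbol):
            unit, text = currency_map[symbol], text[:-len(symbol)]
            break
    # Drop thousands separators, then normalise the decimal separator.
    number = text.strip().replace(thousand_separator, "")
    number = number.replace(decimal_separator, ".")
    return Decimal(number), unit


print(parse_currency("£250,000"))
print(parse_currency("1.234,50 kr", decimal_separator=",", thousand_separator="."))
```

The second call shows why both separators are configurable: the same characters swap roles between locales.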

Examples

from hazy_configurator import CurrencyType, ColId

CurrencyType(
    col="transaction_value",
    currency_col=ColId(
        col="transaction_currency",
        table="data",
    ),
    decimal_separator=".",
    thousand_separator=",",
    date_col=ColId(
        col="transaction_time",
        table="data",
    ),
    currency_map={"euro": "EUR", "yen": "JPY"}
)
Fields:
field hazy_dtype: Literal['currency'] = 'currency'
field currency_col: Optional[ColId] = None

Column name in which the currency units are stored. None if currency units are lacking or in the same column.

field decimal_separator: Optional[str] = '.'

String used to separate the integer part from the fractional part. Only used when the currency amount requires parsing.

field thousand_separator: Optional[str] = ','

String used to separate numbers at each 10^{3n}. Only used when the currency amount requires parsing.

field date_col: Optional[ColId] = None

Operation date column

field currency_map: Optional[Dict[str, str]] = {'$': 'USD', 'CHF': 'CHF', 'kr': 'SEK', '£': 'GBP', '¥': 'JPY', '₩': 'KRW', '€': 'EUR'}

Map from custom currency codes to ISO-4217 codes

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Custom Address Type

class hazy_configurator.hazy_data_types.location.custom_address_type.CustomAddressType

Bases: LocationTypeBase

Used for custom address data.

This type allows you to provide a format string that can be used to create custom address values that are a combination of other location values.
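The sketch below shows how such a format string could be expanded into an address value. It assumes the placeholders follow Python's `str.format` syntax, which matches their appearance but is not confirmed here; the location values and the helper `render_address` are made up for illustration.

```python
def render_address(format_string, location):
    """Fill a location format string from a dict of location details."""
    return format_string.format(**location)


# Hypothetical generated location details for one record.
location = {"door": "12", "street": "High Street", "postcode": "AB1 2CD"}
print(render_address("{door}, {street}, {postcode}", location))
# prints: 12, High Street, AB1 2CD
```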

Examples

from hazy_configurator import CustomAddressType

CustomAddressType(
    col="custom_address_col",
    entity_id=1,
    format_string="{door}, {street}, {postcode}",
)
Fields:
field format_string: str [Required]

Location format string e.g. “{door}, {street}, {postcode}”. Any of the following supported location details can be provided in the format string: [‘door’, ‘floor’, ‘street_number’, ‘street’, ‘postcode’, ‘district’, ‘district_code’, ‘city’, ‘city_code’, ‘region’, ‘region_code’, ‘state’, ‘state_code’, ‘country’, ‘iso2’, ‘iso3’, ‘outcode’, ‘incode’].

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Datetime Type

class hazy_configurator.hazy_data_types.datetime_type.DatetimeType

Bases: PrimaryCapableNonEntityType

Used for datetimes.

The format string is a requirement and is used to parse the dates, see https://strftime.org for format codes.
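The format codes behave as they do in Python's standard `datetime.strptime`. The following sketch, with a hypothetical helper `parse_dates`, shows how a format string separates parseable values from invalid ones; it does not reproduce how the synthesiser treats the invalid values afterwards.

```python
from datetime import datetime


def parse_dates(values, fmt):
    """Return (parsed datetimes, values that do not match fmt)."""
    parsed, invalid = [], []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            # Raised for a wrong layout and for impossible dates alike.
            invalid.append(value)
    return parsed, invalid


parsed, invalid = parse_dates(["2021-01-31", "2021-02-30", "n/a"], "%Y-%m-%d")
print(parsed)
print(invalid)
```

Here only the first value parses: 30 February does not exist, and "n/a" does not match the layout.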

Examples

from hazy_configurator import DatetimeType

DatetimeType(
    col="date_recorded",
    format="%Y-%m-%d",
)
Fields:
field format: Optional[str] = None

Format string used to parse datetime e.g. %Y-%m-%d for 2021-01-31 see https://strftime.org for format codes

field formula: Optional[FormulaSetting] = None

Takes a formula setting object to set how this field is calculated.

field bound: Optional[DatetimeBoundedSetting] = None

Bounded setting.

field max_unique_invalid_dates: int = 10

If the number of unique invalid dates is less than or equal to this threshold, invalid dates are preserved in the output. This is useful for when a date column is not nullable and has invalid date values instead of nulls. If False, invalid dates will be output as missing values in the synthetic data.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

District Type

class hazy_configurator.hazy_data_types.location.district_type.DistrictType

Bases: LocationTypeBase

Used for district location data.

Examples

from hazy_configurator import DistrictType

DistrictType(col="district_col", entity_id=1)
Fields:
field type: LocationTypes = LocationTypes.REGULAR

The type of district to be generated. regular will generate district names, while code will generate district codes.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Driving License Number Type

class hazy_configurator.hazy_data_types.driving_license_number_type.DrivingLicenseNumberType

Bases: BaseGeneratorType

Used for driving license numbers.

Example

from hazy_configurator import DrivingLicenseNumberType, DrivingLicenseLocales

DrivingLicenseNumberType(
    col="driving_license_num",
    locales=[DrivingLicenseLocales.US, DrivingLicenseLocales.FR],
)
Fields:
field locales: List[DrivingLicenseLocales] [Required]

The list of locales from which to sample when generating driving license numbers.

field preserve_dist: bool = False

If true, the distribution of the column will be preserved. If false, the generated values will be unique.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Email Type

class hazy_configurator.hazy_data_types.person.email_type.EmailType

Bases: PersonTypeBase

Used to describe a column that contains email addresses.

An entity ID must be provided so that any data generated across the person entity remains coherent.

Examples

from hazy_configurator import EmailType

EmailType(
    col="Email",
    entity_id=1,
)
Fields:
field entity_id: int [Required]

The ID of the person entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Float Type

class hazy_configurator.hazy_data_types.float_type.FloatType

Bases: NumberType

Used for floating point data.

A formula can be provided to this type.

Examples

from hazy_configurator import FloatType

FloatType(col="amount")
Fields:
field formula: Optional[FormulaSetting] = None

Takes a formula setting object to set how this field is calculated.

field bound: Optional[NumericBoundedSetting] = None

Bounded setting.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Foreign Key Type

class hazy_configurator.hazy_data_types.foreign_key_type.ForeignKeyType

Bases: PrimaryCapableNonEntityType

Used for foreign key columns.

A Hazy Synthesiser will match this column's type to that of the referenced column and will maintain referential integrity between the tables.
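Referential integrity here means that every value in the foreign key column also exists in the referenced column. A simple way to check that property on generated data is sketched below; the function `orphaned_keys` and the sample values are hypothetical and not part of the library.

```python
def orphaned_keys(child_values, parent_values):
    """Return foreign key values that have no match in the referenced column."""
    parents = set(parent_values)
    return sorted({value for value in child_values if value not in parents})


customers = [101, 102, 103]       # customers.customer_id
orders = [101, 101, 103, 104]     # orders.customer_id (foreign key)
print(orphaned_keys(orders, customers))
# prints: [104]
```

An empty result indicates that the two tables are consistent.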

Examples

from hazy_configurator import ForeignKeyType

ForeignKeyType(
    col="customer_id",
    ref=("customers", "customer_id")
)
Fields:
field ref: Union[ColId, Tuple[TableName, ColumnName]] [Required]

Takes a tuple of (TableName: str, ColumnName: str) that this column points to

field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Gender Type

class hazy_configurator.hazy_data_types.person.gender_type.GenderType

Bases: EntityPartType

Used to handle gender information in a structured and consistent manner. By specifying the gender map once with GenderType, you do not need to repeat the mapping elsewhere such as when using PersonType or CPRSettings.
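What a gender map does can be shown with a small stand-alone sketch. The helper `normalise_gender` is hypothetical; it assumes, as the gender_map field describes, that every mapped value must be one of 'm', 'f' or 'o'.

```python
ALLOWED = {"m", "f", "o"}


def normalise_gender(values, gender_map):
    """Translate source gender labels into 'm', 'f' or 'o' codes."""
    unknown = set(gender_map.values()) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported gender codes: {sorted(unknown)}")
    # Labels missing from the map become None rather than being guessed.
    return [gender_map.get(value) for value in values]


gender_map = {"Female": "f", "Male": "m", "Other": "o"}
print(normalise_gender(["Male", "Female", "Unknown"], gender_map))
# prints: ['m', 'f', None]
```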

Examples

from hazy_configurator import GenderType

GenderType(
    col="Gender",
    gender_map={"Female": "f", "Male": "m", "Other": "o"},
)
Fields:
field gender_map: GENDER_MAP_TYPING = None

Mapping of gender categories. Each gender category should be a key in the dictionary with the values being one from a selection of ‘m’, ‘f’, ‘o’, which correspond to male, female and other.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field entity_id: Optional[int] = None

Person entity ID that this gender column is associated with.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

ID Type

hazy_configurator.hazy_data_types.id_type.deprecation_warning(settings: IdSettingsUnion) None

Warns the user if they are using a deprecated ID type.

class hazy_configurator.hazy_data_types.id_type.IdType

Bases: IdTypeBase, PrimaryCapableNonEntityType

These columns will be uniquely generated to match certain patterns.

ID type columns support a wide range of ID formats, from straightforward numerical IDs to IDs such as passport numbers and social security numbers. Each ID column needs settings to specify the type of ID generated.

This type can be a primary key. The Hazy synthesiser will ensure the generated IDs maintain referential integrity across the database structure, if this column is referenced by a ForeignKeyType.
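The uniqueness requirement for a numerical ID column can be illustrated as follows. This is only a sketch of the idea behind fixed-length unique IDs: the function `unique_numeric_ids` is hypothetical, and details such as whether leading zeros are allowed are assumptions rather than documented behaviour of NumericalIdSettings.

```python
import random


def unique_numeric_ids(n, length, seed=0):
    """Generate n distinct numeric IDs, each with exactly `length` digits."""
    rng = random.Random(seed)
    lowest, highest = 10 ** (length - 1), 10 ** length - 1
    # Sampling without replacement guarantees uniqueness.
    return [str(value) for value in rng.sample(range(lowest, highest + 1), n)]


ids = unique_numeric_ids(n=5, length=9)
print(ids)
print(len(set(ids)) == len(ids))  # every ID is distinct
```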

Examples

from hazy_configurator import IdType, NumericalIdSettings

IdType(
    col="account_id",
    settings=NumericalIdSettings(length=9),
    primary_key=True,
)
Fields:
field settings: IdSettingsUnion [Required]

Can use any options in Standard IDs, Compound ID, Conditioned ID and Mixture ID

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Integer Type

class hazy_configurator.hazy_data_types.int_type.IntType

Bases: NumberType

Used for integer data where a wide range of non-unique values can be entered.

It is possible to enter a formula so that the value is based on other columns.

If this column is a primary key, use the IdType instead.

Examples

from hazy_configurator import IntType

IntType(col="count")
Fields:
field formula: Optional[FormulaSetting] = None

Takes a formula setting object to set how this field is calculated.

field bound: Optional[NumericBoundedSetting] = None

Bounded setting.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

List Type

class hazy_configurator.hazy_data_types.list_type.ListType

Bases: NonEntityType

Split a column based on a separator. For example a column holding a variable number of comma separated values.
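The splitting behaviour can be pictured with the sketch below. It is an illustration under assumptions, not the library's handler: the function `split_list_column` is hypothetical, and padding short rows with None is a choice made here so that every row has the same width.

```python
def split_list_column(value, separator=",", max_columns=10):
    """Split one cell into at most max_columns items, padding with None."""
    items = value.split(separator)[:max_columns]  # extra items are discarded
    return items + [None] * (max_columns - len(items))


print(split_list_column("a,b,c", max_columns=2))
# prints: ['a', 'b']
print(split_list_column("a,b", max_columns=3))
# prints: ['a', 'b', None]
```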

Examples

from hazy_configurator import ListType

ListType(
    col="labels",
    separator=",",
)
Fields:
field separator: str = ','

Separator used to split column into list items.

Constraints:
  • minLength = 1

field max_columns: int = 10

Maximum number of columns to split (additional will be discarded).

Constraints:
  • minimum = 2

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Location Type

class hazy_configurator.hazy_data_types.location_type.LocationPart

Bases: HazyBaseModel

A component used to represent part of a full address such as street number, country or post code.

Fields:
field col: str [Required]

The column this location part is related to.

field type: Union[LocationPartType, List[LocationPartType]] [Required]

List of location part types to make up the address.

field format_string: Optional[str] = None

Location format string e.g. “{door}, {street}, {postcode}”. If the format_string is not provided the location will be made up of a comma separated list.

class hazy_configurator.hazy_data_types.location_type.LocationType

Bases: BaseType

Used to describe consistent addresses.

This is an entity type for which multiple columns can make up the type. One location part must be a postcode. That is because this feature is used to cluster data and provide location based features to our models.

Hazy generates consistent locations for most columns which match the provided zip/postcode.

If the data contained two addresses, for instance billing and shipping address, two of these types would need to be created, each containing the columns belonging to its part of the address.

Using this data type ensures that a similar distribution of addresses is used in the synth data as was found in the source data. The distribution is based on clusters learned from publicly available address data sets that are clustered into the number of clusters specified. Relationships between address location and other columns will be learned by the synthesiser.

Examples

from hazy_configurator import LocationType, LocationPart, LocationPartType, LocationTypeMismatch, GeoLocales

LocationType(
    parts=[
        LocationPart(
            col="zipcode",
            type=LocationPartType.POSTCODE,
        ),
        LocationPart(
            col="street address",
            type=[LocationPartType.STREET_NUMBER, LocationPartType.STREET],
            format_string="{street_number} | {street}"
        )
    ],
    locales=[GeoLocales.en_US],
    mismatch=LocationTypeMismatch.RANDOM,
    num_clusters=500,
)
Fields:
field locales: List[GeoLocales] = []

Region(s) in which the location data is from. When multiple locales are given, one of the following country fields must be provided [‘country’, ‘iso2’, ‘iso3’]. Additionally, when no locale is given, the default behaviour is to assume that the data is multi-locale and therefore a country field must be provided.

field parts: List[LocationPart] [Required]

List of column identifiers with their corresponding location part.

field mismatch: LocationTypeMismatch = LocationTypeMismatch.RANDOM

When synthesizing data, the algorithm reproduces the geographic distribution of the source data. In order to learn the distribution it has to group records in the source data into the predetermined clusters. Some records will not match a cluster, either due to being a new postcode or because they were mistyped, and this setting decides how to handle those mismatched addresses. The options are: “drop” i.e. ignore this address, “approximate” i.e. find the closest matching address in the public database, “random” i.e. pick a random cluster.

field num_clusters: int = 1000

When synthesizing data, the algorithm reproduces the geographic distribution of the source data. It does this by grouping addresses in the source data into clusters and learning the distribution of addresses between the different clusters. The synthesized records reproduce the distribution of addresses between the clusters. When assigning an address to a synthesized record, the address is assigned randomly within the cluster from the publicly available addresses within that cluster. This setting sets the number of clusters to group the addresses within that locale into. Note: the clustering algorithm is trained on public data and not on the data provided to the pipeline.

Constraints:
  • exclusiveMinimum = 0

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Mapped Type

class hazy_configurator.hazy_data_types.mapped_type.MappedType

Bases: IdTypeBase

Used for categorical columns containing sensitive labels that must be obfuscated.

An ID sampler is configured to generate the values that replace the source category values.
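The obfuscation idea can be sketched as a consistent mapping from each distinct source label to a generated ID. This is a simplified illustration, not the library's sampler: the function `build_label_map` is hypothetical, and numeric IDs of a fixed length are used only as an example of a replacement format.

```python
import random


def build_label_map(values, length=9, seed=0):
    """Map each distinct label to a unique generated numeric ID."""
    rng = random.Random(seed)
    labels = sorted(set(values))
    ids = rng.sample(range(10 ** (length - 1), 10 ** length), len(labels))
    return {label: str(new_id) for label, new_id in zip(labels, ids)}


column = ["acme", "globex", "acme", "initech"]
label_map = build_label_map(column)
obfuscated = [label_map[value] for value in column]
# Repeated labels share one replacement, so the column stays categorical.
print(obfuscated[0] == obfuscated[2])
```

Because the mapping is one-to-one, category frequencies are unchanged while the sensitive labels no longer appear.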

Examples

from hazy_configurator import MappedType, NumericalIdSettings

MappedType(
    col="account_id",
    settings=NumericalIdSettings(length=9),
)
Fields:
field settings: NormalIdSettingsUnion [Required]

See Standard IDs for options.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Name Type

class hazy_configurator.hazy_data_types.person.name_type.NameType

Bases: PersonTypeBase

Used to describe a column that contains names. These names can be of various types, such as first name, last name, full name, and even custom names that are combinations of other name types.

An entity ID must be provided so that any data generated across the person entity remains coherent.
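Custom name formats appear to use Python's format specification syntax, so a precision such as `.1` truncates a value to its first character. The sketch below demonstrates this with a hypothetical person record; the values are illustrative and not generated by Hazy.

```python
# Hypothetical generated details for one person entity.
person = {"title": "Mr", "first_name": "John", "last_name": "Smith"}

format_string = "{title} {first_name:.1} {last_name}"
# ".1" keeps only the first character of first_name.
print(format_string.format(**person))
# prints: Mr J Smith
```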

Examples

from hazy_configurator import NameType

NameType(
    col="Name",
    entity_id=1,
    type="full_name",
)
Fields:
field type: NameTypes [Required]

The type of name to be generated.

field format_string: Optional[str] = None

Person format string used to create custom name types, for example the format string “{title} {first_name:.1} {last_name}” would create a custom name such as “Mr J Smith”. Any of the following supported name details can be provided in the format string: [‘first_name’, ‘second_name’, ‘third_name’, ‘fourth_name’, ‘fifth_name’, ‘sixth_name’, ‘last_name’, ‘initials’, ‘email’, ‘full_name’, ‘user_name’, ‘gender’, ‘title’, ‘custom_columns’].

field entity_id: int [Required]

The ID of the person entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Passport Type

class hazy_configurator.hazy_data_types.passport_type.PassportType

Bases: BaseGeneratorType

Used for passport numbers.

Standard Example

from hazy_configurator import PassportType, PassportCountries

PassportType(
    col="passport_num",
    countries=[PassportCountries.US, PassportCountries.FR],
)

Cross-table Example

In the following example, the country_column exists in a separate table to that of the target column.

from hazy_configurator import PassportType, ColId

PassportType(
    col="passport_num",
    country_column=ColId(col="country", table="table2")
)
Fields:
field countries: List[PassportCountries] = [<PassportCountries.GB: 'GB'>]

The list of countries from which to sample when generating passport numbers.

field country_column: Union[ColId, str] = None

The name of the country column to use when generating passport numbers. For a given record, the generated passport number will match the corresponding country.

field country_map: Dict[str, str] = None

Dictionary mapping each value within the country_column to a 2-letter country code.

field preserve_dist: bool = False

If true, the distribution of the column will be preserved. If false, the generated values will be unique.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Percentage Type

class hazy_configurator.hazy_data_types.percentage_type.PercentageType

Bases: BaseSymbolType

Allows support for numerical columns with a leading or trailing % symbol, comma separators and decimal points such as %10,000.
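A minimal sketch of parsing such values is given below. The function `parse_percentage` is hypothetical and is not the library's parser; its parameter names mirror the decimal and thousand_sep fields documented for this type.

```python
def parse_percentage(text, decimal=".", thousand_sep=","):
    """Parse a string with a leading or trailing % symbol into a float."""
    number = text.strip().strip("%").strip()
    # Drop thousands separators, then normalise the decimal symbol.
    number = number.replace(thousand_sep, "").replace(decimal, ".")
    return float(number)


print(parse_percentage("%10,000"))
# prints: 10000.0
print(parse_percentage("12,5 %", decimal=",", thousand_sep="."))
# prints: 12.5
```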

Examples

from hazy_configurator import PercentageType

PercentageType(
    col="percent_increase",
)
Fields:
field decimal: str = '.'

The symbol used to denote a decimal point.

field thousand_sep: str = ','

The symbol used to denote a thousands separator.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. This should be used when the target column is denormalised.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Person Type

class hazy_configurator.hazy_data_types.person_type.PersonPart

Bases: HazyBaseModel

A component used to represent part of the identity of a person such as first name, gender or initials.

Fields:
field col: str [Required]

The column this person part is related to.

field type: Union[PersonPartType, List[PersonPartType]] [Required]

List of person part types to make up the person.

field format_string: Optional[str] = None

Person format string e.g. “{title} {first_name:.1} {last_name}”. If not specified and type is a list of length greater than one, the column will be generated as a space-separated list of the specified types.

class hazy_configurator.hazy_data_types.person_type.PersonType

Bases: BaseType

Used to describe columns relating to people. Columns specified in this type are purely generated and are not statistically trained.

This is an entity type for which multiple columns can be provided. All parts of a Person will be generated with the aim of being consistent i.e. matching standard gender/title ratios.

Examples

from hazy_configurator import PersonType, PersonPart, PersonPartType, PersonLocales

PersonType(
    parts=[
        PersonPart(
            col="First Name",
            type=PersonPartType.FIRST_NAME,
        ),
        PersonPart(
            col="Surname",
            type=PersonPartType.LAST_NAME,
        ),
        PersonPart(
            col="Email",
            type=PersonPartType.EMAIL,
        )
    ],
    locales=[PersonLocales.en_US],
)
Fields:
field parts: List[PersonPart] [Required]

List of column identifiers with their corresponding person type.

field locales: List[PersonLocales] = [<PersonLocales.en_GB: 'en_GB'>, <PersonLocales.en_US: 'en_US'>]

A set of country locales for names to be picked from.

field gender_column: ColId = None

Gender used for generating person components

field gender_map: Optional[Dict[str, Literal['m', 'f', 'o']]] = None

Mapping of gender categories. Each gender category should be a key in the dictionary with the values being one from a selection of ‘m’, ‘f’, ‘o’, which correspond to male, female and other.

field title_column: ColId = None

Title used for generating person components

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity
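
To make the shape of gender_map concrete, the sketch below builds a mapping from source categories to the codes 'm', 'f' and 'o' and checks it before use. The category labels and the check_gender_map helper are hypothetical; only the three allowed codes come from the field description above.

```python
from typing import Dict, List

ALLOWED_CODES = {"m", "f", "o"}  # male, female, other


def check_gender_map(gender_map: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical validator: every value must be 'm', 'f' or 'o'."""
    bad = {key: value for key, value in gender_map.items() if value not in ALLOWED_CODES}
    if bad:
        raise ValueError(f"Unsupported gender codes: {bad}")
    return gender_map


# Keys are the categories found in the source gender column (illustrative labels).
gender_map = check_gender_map({"Male": "m", "Female": "f", "Non-binary": "o"})

source_values: List[str] = ["Female", "Male", "Non-binary", "Female"]
normalised = [gender_map[value] for value in source_values]
print(normalised)  # ['f', 'm', 'o', 'f']
```

A dictionary of this form is what would be passed as gender_map alongside gender_column.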

Postcode Type

class hazy_configurator.hazy_data_types.location.postcode_type.PostcodeType

Bases: LocationTypeBase

Used for postcode/zipcode data.

Examples

from hazy_configurator import PostcodeType

PostcodeType(col="postcode_col", entity_id=1)
Fields:
field type: PostcodeTypes = PostcodeTypes.POSTCODE

The type of postcode to be generated. Incode/Outcode should only be used for UK postcodes.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity
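
For readers unfamiliar with the incode/outcode distinction mentioned for the type field: a UK postcode ends in a three-character incode, and everything before it is the outcode. The sketch below splits a postcode accordingly. split_uk_postcode is a hypothetical helper, not part of hazy_configurator, and it does not validate that the postcode actually exists.

```python
from typing import Tuple


def split_uk_postcode(postcode: str) -> Tuple[str, str]:
    """Split a UK postcode into (outcode, incode).

    The incode is always the final three characters; the outcode is
    whatever precedes it.
    """
    compact = postcode.replace(" ", "").upper()
    if len(compact) < 5:
        raise ValueError(f"Too short to be a full UK postcode: {postcode!r}")
    return compact[:-3], compact[-3:]


print(split_uk_postcode("SW1A 1AA"))  # ('SW1A', '1AA')
print(split_uk_postcode("m1 1ae"))    # ('M1', '1AE')
```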

Raw Type

class hazy_configurator.hazy_data_types.raw_type.RawType

Bases: NonEntityType

Used in combination with Custom Handlers to handle bespoke use cases.

Prefer a different Hazy dtype where one fits, as configuration will be simpler. Custom Handlers cannot currently be configured through the Configurator UI. Using this type in the UI generates a Placeholder Handler on export, which should be replaced by one of the Custom Handlers.

This dtype reads data in its raw format. For the rest of the pipeline to handle the column, Custom Handlers must be specified that convert it into a form the generative model can process.

Examples

from hazy_configurator import RawType

RawType(col="reference_number")
Fields:
field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Real Type

class hazy_configurator.hazy_data_types.real_type.RealType

Bases: PrimaryCapableNonEntityType

Used for replicating a column exactly.

It can only be used with reference tables, whose columns are then replicated exactly. A column specified as real is not used for conditioning other tables.

Examples

from hazy_configurator import RealType

RealType(col="real_col")
Fields:
field primary_key: bool = False

Is this column a primary key?

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Region Type

class hazy_configurator.hazy_data_types.location.region_type.RegionType

Bases: LocationTypeBase

Used for region location data.

Examples

from hazy_configurator import RegionType

RegionType(col="region_col", entity_id=1)
Fields:
field type: LocationTypes = LocationTypes.REGULAR

The type of region to be generated. regular will generate region names, while code will generate region codes.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

State Type

class hazy_configurator.hazy_data_types.location.state_type.StateType

Bases: LocationTypeBase

Used for state location data.

Examples

from hazy_configurator import StateType

StateType(col="state_col", entity_id=1)
Fields:
field type: LocationTypes = LocationTypes.REGULAR

The type of state to be generated. regular will generate state names, while code will generate state codes.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity
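
The difference between the regular and code settings described for the type field can be pictured with a small lookup. This is only a sketch: the render_state helper, the plain string values "regular" and "code", and the three-state table are illustrative assumptions and not part of hazy_configurator.

```python
# Illustrative subset of US state names and their two-letter codes.
STATE_CODES = {"California": "CA", "New York": "NY", "Texas": "TX"}


def render_state(name: str, kind: str = "regular") -> str:
    """Hypothetical helper: return the state name or its two-letter code."""
    if kind == "regular":
        return name
    if kind == "code":
        return STATE_CODES[name]
    raise ValueError(f"Unknown kind: {kind!r}")


print(render_state("California"))        # California
print(render_state("New York", "code"))  # NY
```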

Street Address Type

class hazy_configurator.hazy_data_types.location.street_address_type.StreetAddressType

Bases: LocationTypeBase

Used for street address data.

Examples

from hazy_configurator import StreetAddressType

StreetAddressType(col="street_address_col", entity_id=1)
Fields:
field type: StreetAddressTypes = StreetAddressTypes.STREET

The type of street address to be generated.

field entity_id: int [Required]

The ID of the location entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity
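
The entity_id field ties location columns together: columns that share an ID are described as belonging to the same location entity. The sketch below groups column names by entity_id and enforces the documented minimum of 0. The column names and the group_by_entity helper are hypothetical, and plain tuples stand in for the Hazy location types.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (column name, entity_id) pairs, written as plain tuples for illustration.
columns: List[Tuple[str, int]] = [
    ("home_street", 1),
    ("home_postcode", 1),
    ("work_street", 2),
    ("work_postcode", 2),
]


def group_by_entity(cols: List[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Hypothetical helper: collect the columns that share a location entity."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for col, entity_id in cols:
        if entity_id < 0:
            # Mirrors the documented constraint: minimum = 0.
            raise ValueError(f"entity_id must be >= 0, got {entity_id} for {col!r}")
        grouped[entity_id].append(col)
    return dict(grouped)


entities = group_by_entity(columns)
print(entities)
```

Here the two "home" columns form one entity and the two "work" columns another, so each address would be expected to stay internally consistent.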

Symbol Type

class hazy_configurator.hazy_data_types.symbol_type.SymbolType

Bases: BaseSymbolType

Allows support for numerical columns with a symbol leading or trailing the numerical value, as well as thousands separators and decimal points such as VAL10,000.

Examples

from hazy_configurator import SymbolType

SymbolType(
    col="value",
    symbol="VAL",
    thousand_sep=","
)
Fields:
field symbol: str [Required]

Leading or trailing symbol or pattern to strip away.

field decimal: str = '.'

The symbol used to denote a decimal point.

field thousand_sep: str = ','

The symbol used to denote a thousands separator.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. Use this when the target column is denormalized.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity
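
One way to picture the symbol field is as a round trip: the symbol is stripped so the value can be treated as a number, then restored when the value is written back out. The sketch below shows that round trip in plain Python. Both helpers are hypothetical and not part of hazy_configurator, and add_symbol assumes whole-number values with a leading symbol.

```python
def strip_symbol(text: str, symbol: str, decimal: str = ".", thousand_sep: str = ",") -> float:
    """Hypothetical helper: remove a leading or trailing symbol and parse the number."""
    cleaned = text.strip()
    if cleaned.startswith(symbol):
        cleaned = cleaned[len(symbol):]
    elif cleaned.endswith(symbol):
        cleaned = cleaned[: len(cleaned) - len(symbol)]
    # Drop thousands separators, then normalise the decimal symbol to ".".
    cleaned = cleaned.strip().replace(thousand_sep, "").replace(decimal, ".")
    return float(cleaned)


def add_symbol(value: float, symbol: str, thousand_sep: str = ",") -> str:
    """Hypothetical helper: format a whole number with a leading symbol."""
    digits = f"{int(value):,}".replace(",", thousand_sep)
    return symbol + digits


number = strip_symbol("VAL10,000", symbol="VAL")
print(number)                     # 10000.0
print(add_symbol(number, "VAL"))  # VAL10,000
```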

Timedelta Type

class hazy_configurator.hazy_data_types.timedelta_type.TimedeltaType

Bases: NonEntityType

Used for timedelta data, for example seconds, minutes, hours or days.

Examples

# TimeDeltaUnit is assumed to be importable from the top-level package.
from hazy_configurator import TimedeltaType, TimeDeltaUnit

TimedeltaType(col="time_elapsed", unit=TimeDeltaUnit.SECOND)
Fields:
field unit: TimeDeltaUnit = None

Unit of time represented by this column

field formula: Optional[FormulaSetting] = None

Takes a formula setting object to set how this field is calculated.

field bound: Optional[TimedeltaBoundedSetting] = None

Bounded setting.

field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. Use this when the target column is denormalized.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity
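
The unit field states what a raw number in the column means. The sketch below shows the idea with the standard library's datetime.timedelta: the same number gives a different duration depending on the unit. The unit names used as dictionary keys and the to_timedelta helper are hypothetical; only TimeDeltaUnit.SECOND appears in the example above.

```python
from datetime import timedelta

# Hypothetical unit names mapped to datetime.timedelta keyword arguments.
UNIT_TO_KWARG = {
    "second": "seconds",
    "minute": "minutes",
    "hour": "hours",
    "day": "days",
}


def to_timedelta(value: float, unit: str) -> timedelta:
    """Interpret a raw column value as a duration in the given unit."""
    return timedelta(**{UNIT_TO_KWARG[unit]: value})


print(to_timedelta(90, "second"))  # 0:01:30
print(to_timedelta(1.5, "hour"))   # 1:30:00
```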

Title Type

class hazy_configurator.hazy_data_types.person.title_type.TitleType

Bases: EntityPartType

Used for columns that represent titles such as Mr, Mrs, Dr, etc.

Examples

from hazy_configurator import TitleType

TitleType(
    col="Title",
)
Fields:
field repeat_by: Optional[RepeatBy] = None

The key by which the target column is repeated. Use this when the target column is denormalized.

field entity_id: Optional[int] = None

Person entity ID that this title column is associated with.

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity

Username Type

class hazy_configurator.hazy_data_types.person.username_type.UsernameType

Bases: PersonTypeBase

Used to describe a column that contains usernames.

An entity ID must be provided so that any data generated across the person entity remains coherent.

Examples

from hazy_configurator import UsernameType

UsernameType(
    col="Name",
    entity_id=1,
)
Fields:
field entity_id: int [Required]

The ID of the person entity that this column belongs to.

Constraints:
  • minimum = 0

field col: str [Required]

Column name.

field pii_type: PIIType = PIIType.NON_PII

Level of PII sensitivity