Data Types¶
The Hazy data types are provided to a Data Table and used to represent columns in the data.
A column should only be represented once in a Data Table.
Classes:
AgeType | Used only when a table contains both a date of birth column and an age column.
CategoryType | Used for categorical data where the column contains values from a fixed sized list.
CityType | Used for city location data.
ConstantType | Uses a single constant value for every row in the column.
CountryType | Used for country data.
CurrencyType | Allows support for currency value columns with a currency unit leading or trailing the numerical value, as well as thousands separators and decimal points such as £250,000.
CustomAddressType | Used for custom address data.
DatetimeType | Used for datetimes.
DistrictType | Used for district location data.
DrivingLicenseNumberType | Used for driving license numbers.
EmailType | Used to describe a column that contains email addresses.
FloatType | Used for floating point data.
ForeignKeyType | Used for foreign key columns.
GenderType | Used to handle gender information in a structured and consistent manner.
IdType | These columns will be uniquely generated to match certain patterns.
IntType | Used for integer data where a wide range of non-unique values can be entered.
ListType | Split a column based on a separator.
MappedType | Used for categorical columns containing sensitive labels that must be obfuscated.
NameType | Used to describe a column that contains names.
PassportType | Used for passport numbers.
PercentageType | Allows support for numerical columns with a leading or trailing % symbol, comma separators and decimal points such as %10,000.
PostcodeType | Used for postcode/zipcode data.
RawType | Used in combination with Custom Handlers to handle bespoke use cases.
RealType | Used for replicating a column exactly.
RegionType | Used for region location data.
StateType | Used for state location data.
StreetAddressType | Used for street address data.
SymbolType | Allows support for numerical columns with a symbol leading or trailing the numerical value, as well as thousand separators and decimal points such as VAL10,000.
TimedeltaType | Used for timedelta data, for example seconds, minutes, hours or days.
TitleType | Used for columns that represent titles such as Mr, Mrs, Dr, etc.
UsernameType | Used to describe a column that contains usernames.
Removed from release 4.0.0 onwards:
CombinationType | An Entity type which requires more than one column.
LocationType | Used to describe consistent addresses.
PersonType | Used to describe columns relating to people.
Age Type¶
- class hazy_configurator.hazy_data_types.age_type.AgeType¶
Bases:
NonEntityType
Used only when a table contains both a date of birth column and an age column.
This type handles a specific denormalisation case, typically when a view of the data has been created to contain both age and date of birth and they need to be consistent. If only an age column exists, with no date of birth column, that column should be configured using the IntType instead.
For this type to work, the date of birth column must be configured using the DatetimeType. A reference date must be provided to allow us to calculate the age on a given day.
Examples
from hazy_configurator import AgeType, ColId AgeType( col="Age", dob_column=ColId( col="date_of_birth", table="data", ), ref_date="2022-12-25", )
{ "col": "Age", "hazy_dtype": "age", "dob_column": { "col": "date_of_birth", "table": "data" }, "ref_date": "2022-12-25" }
- Fields:
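The consistency that AgeType maintains between age and date of birth can be illustrated with a small standalone sketch. This is not Hazy's implementation; the helper name `age_on` is hypothetical, and it assumes age means whole years completed on the reference date.

```python
from datetime import date


def age_on(dob: date, ref_date: date) -> int:
    """Whole years elapsed between dob and ref_date (hypothetical helper)."""
    years = ref_date.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (ref_date.month, ref_date.day) < (dob.month, dob.day):
        years -= 1
    return years


ref = date.fromisoformat("2022-12-25")
print(age_on(date(1990, 12, 25), ref))  # birthday falls on the reference date
print(age_on(date(1990, 12, 26), ref))  # birthday is one day after the reference date
```

Fixing the reference date is what makes the result reproducible: the same date of birth always yields the same age, regardless of when the data is generated.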
Category Type¶
- class hazy_configurator.hazy_data_types.category_type.CategoryType¶
Bases:
PrimaryCapableEntityPartType
Used for categorical data where the column contains values from a fixed sized list.
In this case the distribution of the top items in the list will be learned, and the relationships to other columns will be learned. The synthetic data will then contain the items from the list in the same frequency as the items in the source data.
The maximum number of categories investigated by the model can be set using the max_cat parameter in the PrivBayesConfig class. If there are more items in the list than the max_cat property, then the synthetic data will contain the correct distribution of the max_cat categories, and the other items will be sampled from the remaining categories.
If the data is one-hot encoded (i.e. just 0 and 1) it should also be treated as a CategoryType.
When the combination of values across multiple categorical columns must be strictly adhered to, for example with car_model and car_manufacturer, then the categorical columns should be linked via an entity_id. This will ensure that only combinations of values that are observed in the source data are generated in the synthetic data, and that no new combinations are created.
Examples
from hazy_configurator import CategoryType CategoryType(col="category_column")
{ "hazy_dtype": "category", "col": "category_column" }
- Fields:
- field repeat_by: Optional[RepeatBy] = None¶
The key by which the target column is repeated. This should be used when the target column is denormalised.
- field entity_id: Optional[int] = None¶
Linking categorical columns via an entity ID will ensure that only combinations of values that are observed in the source data are generated in the synthetic data. This is useful when the combination of values across multiple categorical columns must be strictly adhered to, for example with car_model and car_manufacturer columns.
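To make the effect of max_cat more concrete, the following sketch separates the most frequent categories from the rest. It is only an illustration under stated assumptions: `split_top_categories` is a hypothetical helper, and how the model itself selects and samples categories is not described in this page.

```python
from collections import Counter


def split_top_categories(values, max_cat):
    """Return (top, rest): the max_cat most frequent categories and all others."""
    counts = Counter(values)
    top = [category for category, _ in counts.most_common(max_cat)]
    rest = [category for category in counts if category not in top]
    return top, rest


source = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2 + ["teal"]
top, rest = split_top_categories(source, max_cat=2)
print(top)   # the two most frequent categories
print(rest)  # everything outside the top max_cat
```

In this picture, the categories in `top` would keep their learned distribution, while values in `rest` correspond to the "remaining categories" mentioned above.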
City Type¶
- class hazy_configurator.hazy_data_types.location.city_type.CityType¶
Bases:
LocationTypeBase
Used for city location data.
Examples
from hazy_configurator import CityType CityType(col="city_col", entity_id=1)
{ "hazy_dtype": "city", "col": "city_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['city'])
- field type: LocationTypes = LocationTypes.REGULAR¶
The type of city to be generated. regular will generate city names, while code will generate city codes.
Combination Type¶
- class hazy_configurator.hazy_data_types.combination_type.CombinationType¶
Bases:
BaseType
An Entity type which requires more than one column.
Used to define a set of columns for which only certain combinations of the included columns make sense. An example would be state and city. By using this type you would ensure cities can only ever be matched with their corresponding states, even when noise is introduced to the training process.
In order to do this the model will treat these columns as category columns and the same underlying rules for categories also apply to this column.
Examples
from hazy_configurator import CombinationType CombinationType(cols=["city", "state"])
{ "cols": [ "city", "state" ], "hazy_dtype": "combination" }
- Fields:
hazy_dtype (Literal['combination'])
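The idea of only emitting combinations that were observed in the source data can be sketched in plain Python. This is an assumption-laden illustration rather than the model's behaviour: `observed_combinations` is a hypothetical helper, and the sampling here is uniform instead of following the learned frequencies.

```python
import random


def observed_combinations(rows, cols):
    """Collect the distinct value tuples seen across the given columns."""
    return sorted({tuple(row[col] for col in cols) for row in rows})


rows = [
    {"city": "Austin", "state": "Texas"},
    {"city": "Dallas", "state": "Texas"},
    {"city": "Fresno", "state": "California"},
    {"city": "Austin", "state": "Texas"},
]
combos = observed_combinations(rows, ["city", "state"])

# Sampling whole tuples means a city can never be paired with a state it was not seen with.
rng = random.Random(0)
synthetic = [rng.choice(combos) for _ in range(5)]
print(synthetic)
```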
Constant Type¶
- class hazy_configurator.hazy_data_types.constant_type.ConstantType¶
Bases:
NonEntityType
Uses a single constant value for every row in the column.
Examples
from hazy_configurator import ConstantType ConstantType( col="col1", value=5, )
{ "hazy_dtype": "constant", "col": "col1", "value": 5 }
- Fields:
Country Type¶
- class hazy_configurator.hazy_data_types.location.country_type.CountryType¶
Bases:
LocationTypeBase
Used for country data.
Examples
from hazy_configurator import CountryType CountryType(col="country_col", entity_id=1)
{ "hazy_dtype": "country", "col": "country_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['country'])
- field type: CountryTypes = CountryTypes.COUNTRY¶
The type of country information to be generated.
Currency Type¶
- hazy_configurator.hazy_data_types.currency_type.iso_currencies = ['ADP', 'AED', 'AFA', 'AFN', 'ALK', 'ALL', 'AMD', 'ANG', 'AOA', 'AOK', 'AON', 'AOR', 'ARA', 'ARP', 'ARS', 'ARY', 'ATS', 'AUD', 'AWG', 'AYM', 'AZM', 'AZN', 'BAD', 'BAM', 'BBD', 'BDT', 'BEC', 'BEF', 'BEL', 'BGJ', 'BGK', 'BGL', 'BGN', 'BHD', 'BIF', 'BMD', 'BND', 'BOB', 'BOP', 'BOV', 'BRB', 'BRC', 'BRE', 'BRL', 'BRN', 'BRR', 'BSD', 'BTN', 'BUK', 'BWP', 'BYB', 'BYN', 'BYR', 'BZD', 'CAD', 'CDF', 'CHC', 'CHE', 'CHF', 'CHW', 'CLF', 'CLP', 'CNY', 'COP', 'COU', 'CRC', 'CSD', 'CSJ', 'CSK', 'CUC', 'CUP', 'CVE', 'CYP', 'CZK', 'DDM', 'DEM', 'DJF', 'DKK', 'DOP', 'DZD', 'ECS', 'ECV', 'EEK', 'EGP', 'ERN', 'ESA', 'ESB', 'ESP', 'ETB', 'EUR', 'FIM', 'FJD', 'FKP', 'FRF', 'GBP', 'GEK', 'GEL', 'GHC', 'GHP', 'GHS', 'GIP', 'GMD', 'GNE', 'GNF', 'GNS', 'GQE', 'GRD', 'GTQ', 'GWE', 'GWP', 'GYD', 'HKD', 'HNL', 'HRD', 'HRK', 'HTG', 'HUF', 'IDR', 'IEP', 'ILP', 'ILR', 'ILS', 'INR', 'IQD', 'IRR', 'ISJ', 'ISK', 'ITL', 'JMD', 'JOD', 'JPY', 'KES', 'KGS', 'KHR', 'KMF', 'KPW', 'KRW', 'KWD', 'KYD', 'KZT', 'LAJ', 'LAK', 'LBP', 'LKR', 'LRD', 'LSL', 'LSM', 'LTL', 'LTT', 'LUC', 'LUF', 'LUL', 'LVL', 'LVR', 'LYD', 'MAD', 'MDL', 'MGA', 'MGF', 'MKD', 'MLF', 'MMK', 'MNT', 'MOP', 'MRO', 'MRU', 'MTL', 'MTP', 'MUR', 'MVQ', 'MVR', 'MWK', 'MXN', 'MXP', 'MXV', 'MYR', 'MZE', 'MZM', 'MZN', 'NAD', 'NGN', 'NIC', 'NIO', 'NLG', 'NOK', 'NPR', 'NZD', 'OMR', 'PAB', 'PEH', 'PEI', 'PEN', 'PES', 'PGK', 'PHP', 'PKR', 'PLN', 'PLZ', 'PTE', 'PYG', 'QAR', 'RHD', 'ROK', 'ROL', 'RON', 'RSD', 'RUB', 'RUR', 'RWF', 'SAR', 'SBD', 'SCR', 'SDD', 'SDG', 'SDP', 'SEK', 'SGD', 'SHP', 'SIT', 'SKK', 'SLE', 'SLL', 'SOS', 'SRD', 'SRG', 'SSP', 'STD', 'STN', 'SUR', 'SVC', 'SYP', 'SZL', 'THB', 'TJR', 'TJS', 'TMM', 'TMT', 'TND', 'TOP', 'TPE', 'TRL', 'TRY', 'TTD', 'TWD', 'TZS', 'UAH', 'UAK', 'UGS', 'UGW', 'UGX', 'USD', 'USN', 'USS', 'UYI', 'UYN', 'UYP', 'UYU', 'UYW', 'UZS', 'VEB', 'VED', 'VEF', 'VES', 'VNC', 'VND', 'VUV', 'WST', 'XAF', 'XAG', 'XAU', 'XBA', 'XBB', 'XBC', 'XBD', 'XCD', 'XDR', 
'XEU', 'XFO', 'XFU', 'XOF', 'XPD', 'XPF', 'XPT', 'XRE', 'XSU', 'XTS', 'XUA', 'XXX', 'YDD', 'YER', 'YUD', 'YUM', 'YUN', 'ZAL', 'ZAR', 'ZMK', 'ZMW', 'ZRN', 'ZRZ', 'ZWC', 'ZWD', 'ZWL', 'ZWN', 'ZWR']¶
A list of the default ISO currency codes. Currency handling is aware of these ISO currency codes. Any of them will be overwritten by data provided.
- class hazy_configurator.hazy_data_types.currency_type.CurrencyType¶
Bases:
NonEntityType
Allows support for currency value columns with a currency unit leading or trailing the numerical value, as well as thousands separators and decimal points such as £250,000. Currency values and units may be separate (one column for each), or together.
Examples
from hazy_configurator import CurrencyType, ColId CurrencyType( col="transaction_value", currency_col=ColId( col="transaction_currency", table="data", ), decimal_separator=".", thousand_separator=",", date_col=ColId( col="transaction_time", table="data", ), currency_map={"euro": "EUR", "yen": "JPY"} )
{ "hazy_dtype": "currency", "col": "transaction_value", "currency_col": { "col": "transaction_currency", "table": "data" }, "decimal_separator": ".", "thousand_separator": ",", "date_col": { "col": "transaction_time", "table": "data" }, "currency_map": {"euro": "EUR", "yen": "JPY"} }
- Fields:
- field hazy_dtype: Literal['currency'] = 'currency'¶
- field currency_col: Optional[ColId] = None¶
Column name in which the currency units are stored. None if currency units are lacking or in the same column.
- field decimal_separator: Optional[str] = '.'¶
String used to separate the integer part from the fractional part; only used when the currency amount requires parsing.
- field thousand_separator: Optional[str] = ','¶
String used to separate numbers at each 10^{3n}; only used when the currency amount requires parsing.
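The role of decimal_separator and thousand_separator when an amount needs parsing can be shown with a minimal sketch. `parse_currency` is a hypothetical helper, not part of hazy_configurator, and it assumes the unit never contains digits, signs or separator characters.

```python
from decimal import Decimal


def parse_currency(text, decimal_separator=".", thousand_separator=","):
    """Split a string such as '£250,000' into (unit, amount)."""
    thousand_separator = thousand_separator or ""
    number_chars = "0123456789+-" + decimal_separator + thousand_separator
    # Anything that is not part of the number is treated as the currency unit.
    unit = "".join(ch for ch in text if ch not in number_chars).strip()
    number = "".join(ch for ch in text if ch in number_chars)
    # Drop thousand separators first, then normalise the decimal separator.
    if thousand_separator:
        number = number.replace(thousand_separator, "")
    number = number.replace(decimal_separator, ".")
    return unit, Decimal(number)


print(parse_currency("£250,000"))
print(parse_currency("1.234,50 EUR", decimal_separator=",", thousand_separator="."))
```

The order of the two replacements matters: removing thousand separators before converting the decimal separator avoids confusing the two when a locale swaps their roles.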
Custom Address Type¶
- class hazy_configurator.hazy_data_types.location.custom_address_type.CustomAddressType¶
Bases:
LocationTypeBase
Used for custom address data.
This type allows you to provide a format string that can be used to create custom address values that are a combination of other location values.
Examples
from hazy_configurator import CustomAddressType CustomAddressType( col="custom_address_col", entity_id=1, format_string="{door}, {street}, {postcode}", )
{ "hazy_dtype": "custom_address", "col": "custom_address_col", "entity_id": 1, "format_string": "{door}, {street}, {postcode}" }
- Fields:
hazy_dtype (Literal['custom_address'])
- field format_string: str [Required]¶
Location format string e.g. “{door}, {street}, {postcode}”. Any of the following supported location details can be provided in the format string: [‘door’, ‘floor’, ‘street_number’, ‘street’, ‘postcode’, ‘district’, ‘district_code’, ‘city’, ‘city_code’, ‘region’, ‘region_code’, ‘state’, ‘state_code’, ‘country’, ‘iso2’, ‘iso3’, ‘outcode’, ‘incode’].
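Before passing a format string to CustomAddressType, it may help to check that it only uses the supported location details listed above. The sketch below does this with the standard library; `unknown_placeholders` is a hypothetical helper, and it assumes the placeholders follow Python's brace syntax.

```python
from string import Formatter

# Supported location details, copied from the field description above.
SUPPORTED = {
    "door", "floor", "street_number", "street", "postcode", "district",
    "district_code", "city", "city_code", "region", "region_code", "state",
    "state_code", "country", "iso2", "iso3", "outcode", "incode",
}


def unknown_placeholders(format_string):
    """Return placeholder names that are not supported location details."""
    names = {name for _, name, _, _ in Formatter().parse(format_string) if name}
    return sorted(names - SUPPORTED)


print(unknown_placeholders("{door}, {street}, {postcode}"))
print(unknown_placeholders("{door}, {planet}"))
```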
Datetime Type¶
- class hazy_configurator.hazy_data_types.datetime_type.DatetimeType¶
Bases:
PrimaryCapableNonEntityType
Used for datetimes.
The format string is a requirement and is used to parse the dates, see https://strftime.org for format codes.
Examples
from hazy_configurator import DatetimeType DatetimeType( col="date_recorded", format="%Y-%m-%d", )
{ "hazy_dtype": "date_time", "col": "date_recorded", "format": "%Y-%m-%d" }
- Fields:
- field format: Optional[str] = None¶
Format string used to parse datetimes, e.g. %Y-%m-%d for 2021-01-31. See https://strftime.org for format codes.
- field formula: Optional[FormulaSetting] = None¶
Takes a formula setting object to set how this field is calculated.
- field bound: Optional[DatetimeBoundedSetting] = None¶
Bounded setting.
- field max_unique_invalid_dates: int = 10¶
If the number of unique invalid dates is less than or equal to this threshold, invalid dates are preserved in the output. This is useful for when a date column is not nullable and has invalid date values instead of nulls. If False, invalid dates will be output as missing values in the synthetic data.
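The interaction between the format string and max_unique_invalid_dates can be pictured with a short sketch that parses a column and collects the values that fail. It is illustrative only: `split_valid_invalid` is a hypothetical helper, and the way the threshold is applied internally is an assumption based on the description above.

```python
from datetime import datetime


def split_valid_invalid(values, fmt):
    """Parse values with fmt; return (parsed datetimes, set of unparseable values)."""
    parsed, invalid = [], set()
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            invalid.add(value)
    return parsed, invalid


column = ["2021-01-31", "0000-00-00", "2021-02-30", "0000-00-00"]
parsed, invalid = split_valid_invalid(column, "%Y-%m-%d")

max_unique_invalid_dates = 10
# With few distinct invalid values they could be kept as-is; otherwise treated as missing.
preserve_invalid = len(invalid) <= max_unique_invalid_dates
print(parsed, sorted(invalid), preserve_invalid)
```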
District Type¶
- class hazy_configurator.hazy_data_types.location.district_type.DistrictType¶
Bases:
LocationTypeBase
Used for district location data.
Examples
from hazy_configurator import DistrictType DistrictType(col="district_col", entity_id=1)
{ "hazy_dtype": "district", "col": "district_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['district'])
- field type: LocationTypes = LocationTypes.REGULAR¶
The type of district to be generated. regular will generate district names, while code will generate district codes.
Driving License Number Type¶
- class hazy_configurator.hazy_data_types.driving_license_number_type.DrivingLicenseNumberType¶
Bases:
BaseGeneratorType
Used for driving license numbers.
Example
from hazy_configurator import DrivingLicenseNumberType, DrivingLicenseLocales DrivingLicenseNumberType( col="driving_license_num", locales=[DrivingLicenseLocales.US, DrivingLicenseLocales.FR], )
{ "hazy_dtype": "driving_license_number", "col": "driving_license_num", "locales": ["US", "FR"] }
- Fields:
hazy_dtype (Literal['driving_license_number'])
locales (List[hazy_configurator.base.enums.DrivingLicenseLocales])
max_iter (int)
repeat_by (Optional[hazy_configurator.settings.repeat_by.RepeatBy])
- field locales: List[DrivingLicenseLocales] [Required]¶
The list of locales from which to sample when generating driving license numbers.
- field preserve_dist: bool = False¶
If true, the distribution of the column will be preserved. If false, the generated values will be unique.
Email Type¶
- class hazy_configurator.hazy_data_types.person.email_type.EmailType¶
Bases:
PersonTypeBase
Used to describe a column that contains email addresses.
An entity ID must be provided so that any data generated across the person entity remains coherent.
Examples
from hazy_configurator import EmailType
EmailType(
    col="Email",
    entity_id=1,
)
{ "hazy_dtype": "email", "col": "Email", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['email'])
Float Type¶
- class hazy_configurator.hazy_data_types.float_type.FloatType¶
Bases:
NumberType
Used for floating point data.
A formula can be provided to this type.
Examples
from hazy_configurator import FloatType FloatType(col="amount")
{ "hazy_dtype": "float", "col": "amount" }
- Fields:
- field formula: Optional[FormulaSetting] = None¶
Takes a formula setting object to set how this field is calculated.
- field bound: Optional[NumericBoundedSetting] = None¶
Bounded setting.
Foreign Key Type¶
- class hazy_configurator.hazy_data_types.foreign_key_type.ForeignKeyType¶
Bases:
PrimaryCapableNonEntityType
Used for foreign key columns.
A Hazy Synthesiser will match this column's type to the column that is referenced and will maintain referential integrity between the tables.
Examples
from hazy_configurator import ForeignKeyType ForeignKeyType( col="customer_id", ref=("customers", "customer_id") )
{ "hazy_dtype": "foreign_key", "col": "customer_id", "ref": ["customers", "customer_id"] }
- Fields:
hazy_dtype (Literal['foreign_key'])
ref (Union[hazy_configurator.base.col_identifier.ColId, Tuple[str, str]])
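Referential integrity, as maintained for foreign keys, means every value in the referencing column also exists in the referenced column. A minimal standalone check might look like this; `orphaned_keys` and the sample tables are hypothetical and not part of hazy_configurator.

```python
def orphaned_keys(child_values, parent_values):
    """Return foreign key values that have no match in the referenced column."""
    parents = set(parent_values)
    return sorted({value for value in child_values if value not in parents})


customers = [101, 102, 103]   # customers.customer_id
orders_ok = [101, 101, 103]   # every order points at an existing customer
orders_bad = [101, 104]       # 104 does not exist in customers

print(orphaned_keys(orders_ok, customers))
print(orphaned_keys(orders_bad, customers))
```

A check like this can be run on synthetic output to confirm that no orphaned keys were produced.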
Gender Type¶
- class hazy_configurator.hazy_data_types.person.gender_type.GenderType¶
Bases:
EntityPartType
Used to handle gender information in a structured and consistent manner. By specifying the gender map once with GenderType, you do not need to repeat the mapping elsewhere such as when using PersonType or CPRSettings.
Examples
from hazy_configurator import GenderType GenderType( col="Gender", gender_map={"Female": "f", "Male": "m", "Other": "o"}, )
{ "hazy_dtype": "gender", "col": "Gender", "gender_map": {"Female": "f", "Male": "m", "Other": "o"} }
- Fields:
- field gender_map: GENDER_MAP_TYPING = None¶
Mapping of gender categories. Each gender category should be a key in the dictionary with the values being one from a selection of ‘m’, ‘f’, ‘o’, which correspond to male, female and other.
- field repeat_by: Optional[RepeatBy] = None¶
The key by which the target column is repeated. This should be used when the target column is denormalised.
ID Type¶
- hazy_configurator.hazy_data_types.id_type.deprecation_warning(settings: IdSettingsUnion) None ¶
Warns the user if they are using a deprecated ID type.
- class hazy_configurator.hazy_data_types.id_type.IdType¶
Bases:
IdTypeBase, PrimaryCapableNonEntityType
These columns will be uniquely generated to match certain patterns.
ID type columns support a wide range of ID formats, from straightforward numerical IDs to IDs such as passport numbers and social security numbers. Each ID column needs settings to specify the type of ID generated.
This type can be a primary key. The Hazy synthesiser will ensure the generated IDs maintain referential integrity across the database structure, if this column is referenced by a ForeignKeyType.
Examples
from hazy_configurator import IdType, NumericalIdSettings IdType( col="account_id", settings=NumericalIdSettings(length=9), primary_key=True, )
{ "hazy_dtype": "id", "col": "account_id", "settings": { "id_type": "numerical", "length": 9 }, "primary_key": true }
- Fields:
- field settings: IdSettingsUnion [Required]¶
Can use any options in Standard IDs, Compound ID, Conditioned ID and Mixture ID
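As a rough picture of what "uniquely generated to match certain patterns" means for a numerical ID of fixed length, consider the sketch below. It is not how NumericalIdSettings is implemented; `numerical_ids` is a hypothetical helper, and the choice to avoid a leading zero is an assumption.

```python
import random


def numerical_ids(n, length, seed=0):
    """Generate n distinct numeric IDs, each exactly `length` digits long."""
    low, high = 10 ** (length - 1), 10 ** length
    if n > high - low:
        raise ValueError("not enough distinct IDs of this length")
    rng = random.Random(seed)
    # random.sample draws without replacement, which guarantees uniqueness.
    return [str(value) for value in rng.sample(range(low, high), n)]


ids = numerical_ids(100, length=9)
print(ids[:3])
```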
Integer Type¶
- class hazy_configurator.hazy_data_types.int_type.IntType¶
Bases:
NumberType
Used for integer data where a wide range of non-unique values can be entered.
It is possible to enter a formula so that the value is based on other columns.
If this column is a primary key, use the IdType instead.
Examples
from hazy_configurator import IntType IntType(col="count")
{ "hazy_dtype": "int", "col": "count" }
- Fields:
- field formula: Optional[FormulaSetting] = None¶
Takes a formula setting object to set how this field is calculated.
- field bound: Optional[NumericBoundedSetting] = None¶
Bounded setting.
List Type¶
- class hazy_configurator.hazy_data_types.list_type.ListType¶
Bases:
NonEntityType
Split a column based on a separator. For example a column holding a variable number of comma separated values.
Examples
from hazy_configurator import ListType
ListType(
    col="labels",
    separator=",",
)
{ "hazy_dtype": "list", "col": "labels", "separator": "," }
- Fields:
hazy_dtype (Literal['list'])
- field separator: str = ','¶
Separator used to split column into list items.
- Constraints:
minLength = 1
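Splitting a column on a separator, as ListType describes, amounts to turning each cell into a variable-length list. The sketch below shows the idea in plain Python; `split_list_column` is a hypothetical helper, and trimming whitespace around items is an assumption rather than documented behaviour.

```python
def split_list_column(values, separator=","):
    """Split each cell into a list of items; empty cells become empty lists."""
    if len(separator) < 1:
        raise ValueError("separator must contain at least one character")
    return [
        [item.strip() for item in value.split(separator)] if value else []
        for value in values
    ]


labels = ["a,b,c", "x", "", "red, green"]
print(split_list_column(labels))
```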
Location Type¶
- class hazy_configurator.hazy_data_types.location_type.LocationPart¶
Bases:
HazyBaseModel
A component used to represent part of a full address such as street number, country or post code.
- Fields:
- field type: Union[LocationPartType, List[LocationPartType]] [Required]¶
List of location part types to make up the address.
- class hazy_configurator.hazy_data_types.location_type.LocationType¶
Bases:
BaseType
Used to describe consistent addresses.
This is an entity type for which multiple columns can make up the type. One location part must be a postcode. That is because this feature is used to cluster data and provide location based features to our models.
Hazy generates consistent locations for most columns which match the provided zip/postcode.
If the data contained two addresses, for instance billing and shipping address, two of these types would need to be created, each containing the columns belonging to its part of the address.
Using this data type ensures that a similar distribution of addresses is used in the synth data as was found in the source data. The distribution is based on clusters learned from publicly available address data sets that are clustered into the number of clusters specified. Relationships between address location and other columns will be learned by the synthesiser.
Examples
from hazy_configurator import LocationType, LocationPart, LocationPartType, LocationTypeMismatch, GeoLocales
LocationType(
    parts=[
        LocationPart(
            col="zipcode",
            type=LocationPartType.POSTCODE,
        ),
        LocationPart(
            col="street address",
            type=[LocationPartType.STREET_NUMBER, LocationPartType.STREET],
            format_string="{street_number} | {street}",
        ),
    ],
    locales=[GeoLocales.en_US],
    mismatch=LocationTypeMismatch.RANDOM,
    num_clusters=500,
)
{ "hazy_dtype": "location", "parts": [ { "col": "zipcode", "type": "postcode" }, { "col": "street address", "type": ["street_number", "street"], "format_string": "{street_number} | {street}" } ], "locales": ["en_US"], "mismatch": "random", "num_clusters": 500 }
- Fields:
- field locales: List[GeoLocales] = []¶
Region(s) in which the location data is from. When multiple locales are given, one of the following country fields must be provided [‘country’, ‘iso2’, ‘iso3’]. Additionally, when no locale is given, the default behaviour is to assume that the data is multi-locale and therefore a country field must be provided.
- field parts: List[LocationPart] [Required]¶
List of column identifiers with their corresponding location part.
- field mismatch: LocationTypeMismatch = LocationTypeMismatch.RANDOM¶
When synthesizing data, the algorithm reproduces the geographic distribution of the source data. In order to learn the distribution it has to group records in the source data into the predetermined clusters. Some records will not match a cluster, either due to being a new postcode or because they were mistyped, and this setting decides how to handle those mismatched addresses. The options are: “drop” i.e. ignore this address, “approximate” i.e. find the closest matching address in the public database, “random” i.e. pick a random cluster.
- field num_clusters: int = 1000¶
When synthesizing data, the algorithm reproduces the geographic distribution of the source data. It does this by grouping addresses in the source data into clusters and learning the distribution of addresses between the different clusters. The synthesized records reproduce the distribution of addresses between the clusters. When assigning an address to a synthesized record, the address is assigned randomly within the cluster from the publicly available addresses within that cluster. This setting sets the number of clusters to group the addresses within that locale into. Note: the clustering algorithm is trained on public data and not on the data provided to the pipeline.
- Constraints:
exclusiveMinimum = 0
Mapped Type¶
- class hazy_configurator.hazy_data_types.mapped_type.MappedType¶
Bases:
IdTypeBase
Used for categorical columns containing sensitive labels that must be obfuscated.
An ID sampler is configured to generate the values that replace the source category values.
Examples
from hazy_configurator import MappedType, NumericalIdSettings MappedType( col="account_id", settings=NumericalIdSettings(length=9), )
{ "hazy_dtype": "mapped", "col": "account_id", "settings": { "id_type": "numerical", "length": 9 } }
- Fields:
hazy_dtype (Literal['mapped'])
- field settings: NormalIdSettingsUnion [Required]¶
See Standard IDs for options.
Name Type¶
- class hazy_configurator.hazy_data_types.person.name_type.NameType¶
Bases:
PersonTypeBase
Used to describe a column that contains names. These names can be of various types, such as first name, last name, full name, and even custom names that are combinations of other name types.
An entity ID must be provided so that any data generated across the person entity remains coherent.
Examples
from hazy_configurator import NameType NameType( col="Name", entity_id=1, type="full_name", )
{ "hazy_dtype": "name", "col": "Name", "entity_id": 1, "type": "full_name" }
- Fields:
- field format_string: Optional[str] = None¶
Person format string used to create custom name types, for example the format string “{title} {first_name:.1} {last_name}” would create a custom name such as “Mr J Smith”. Any of the following supported name details can be provided in the format string: [‘first_name’, ‘second_name’, ‘third_name’, ‘fourth_name’, ‘fifth_name’, ‘sixth_name’, ‘last_name’, ‘initials’, ‘email’, ‘full_name’, ‘user_name’, ‘gender’, ‘title’, ‘custom_columns’].
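The format string in the description above looks like Python's str.format syntax, where `:.1` truncates a string to its first character. Assuming that is the case (the source does not state it explicitly), the example can be reproduced with the standard library alone; the `person` dictionary is hypothetical sample data.

```python
# Hypothetical person details; in Hazy these would come from the person entity.
person = {"title": "Mr", "first_name": "John", "last_name": "Smith"}

format_string = "{title} {first_name:.1} {last_name}"

# The ".1" precision keeps only the first character of first_name.
custom_name = format_string.format(**person)
print(custom_name)
```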
Passport Type¶
- class hazy_configurator.hazy_data_types.passport_type.PassportType¶
Bases:
BaseGeneratorType
Used for passport numbers.
Standard Example
from hazy_configurator import PassportType, PassportCountries PassportType( col="passport_num", countries=[PassportCountries.US, PassportCountries.FR], )
{ "hazy_dtype": "passport", "col": "passport_num", "countries": ["US", "FR"] }
Cross-table Example
In the following example, the country_column exists in a separate table to that of the target column.
from hazy_configurator import PassportType, ColId PassportType( col="passport_num", country_column=ColId(col="country", table="table2") )
{ "hazy_dtype": "passport", "col": "passport_num", "country_column": {"col": "country", "table": "table2"} }
- Fields:
- field countries: List[PassportCountries] = [<PassportCountries.GB: 'GB'>]¶
The list of countries from which to sample when generating passport numbers.
- field country_column: Union[ColId, str] = None¶
The name of the country column to use when generating passport numbers. For a given record, the generated passport number will match the corresponding country.
- field country_map: Dict[str, str] = None¶
Dictionary mapping each value within the country_column to a 2-letter country code.
- field preserve_dist: bool = False¶
If true, the distribution of the column will be preserved. If false, the generated values will be unique.
Percentage Type¶
- class hazy_configurator.hazy_data_types.percentage_type.PercentageType¶
Bases:
BaseSymbolType
Allows support for numerical columns with a leading or trailing % symbol, comma separators and decimal points such as %10,000.
Examples
from hazy_configurator import PercentageType
PercentageType(
    col="percent_increase",
)
{ "hazy_dtype": "percentage", "col": "percent_increase" }
- Fields:
hazy_dtype (Literal['percentage'])
repeat_by (Optional[hazy_configurator.settings.repeat_by.RepeatBy])
Person Type¶
- class hazy_configurator.hazy_data_types.person_type.PersonPart¶
Bases:
HazyBaseModel
A component used to represent part of the identity of a person such as first name, gender or initials.
- Fields:
- field type: Union[PersonPartType, List[PersonPartType]] [Required]¶
List of person part types to make up the person.
- class hazy_configurator.hazy_data_types.person_type.PersonType¶
Bases:
BaseType
Used to describe columns relating to people. Columns specified in this type are purely generated and are not statistically trained.
This is an entity type for which multiple columns can be provided. All parts of a Person will be generated with the aim of being consistent i.e. matching standard gender/title ratios.
Examples
from hazy_configurator import PersonType, PersonPart, PersonPartType, PersonLocales PersonType( parts=[ PersonPart( col="First Name", type=PersonPartType.FIRST_NAME, ), PersonPart( col="Surname", type=PersonPartType.LAST_NAME, ), PersonPart( col="Email", type=PersonPartType.EMAIL, ) ], locales=[PersonLocales.en_US], )
{ "hazy_dtype": "person", "parts": [ { "col": "First Name", "type": "first_name" }, { "col": "Surname", "type": "last_name" }, { "col": "Email", "type": "email" } ], "locales": ["en_US"] }
- Fields:
hazy_dtype (Literal['person'])
parts (List[hazy_configurator.hazy_data_types.person_type.PersonPart])
- field parts: List[PersonPart] [Required]¶
List of column identifiers with their corresponding person type.
- field locales: List[PersonLocales] = [<PersonLocales.en_GB: 'en_GB'>, <PersonLocales.en_US: 'en_US'>]¶
A set of country locales for names to be picked from.
Postcode Type¶
- class hazy_configurator.hazy_data_types.location.postcode_type.PostcodeType¶
Bases:
LocationTypeBase
Used for postcode/zipcode data.
Examples
from hazy_configurator import PostcodeType PostcodeType(col="postcode_col", entity_id=1)
{ "hazy_dtype": "postcode", "col": "postcode_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['postcode'])
- field type: PostcodeTypes = PostcodeTypes.POSTCODE¶
The type of postcode to be generated. Incode/Outcode should only be used for UK postcodes.
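For UK postcodes, the outcode is the part before the space and the incode is the final three characters. The following sketch splits a postcode on that basis; `split_uk_postcode` is a hypothetical helper that performs no validation of the postcode itself.

```python
def split_uk_postcode(postcode):
    """Return (outcode, incode) for a UK postcode such as 'SW1A 1AA'."""
    compact = postcode.replace(" ", "").upper()
    # The inward code (incode) is always the last three characters.
    return compact[:-3], compact[-3:]


print(split_uk_postcode("SW1A 1AA"))
print(split_uk_postcode("m1 1ae"))
```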
Raw Type¶
- class hazy_configurator.hazy_data_types.raw_type.RawType¶
Bases:
NonEntityType
Used in combination with Custom Handlers to handle bespoke use cases.
A different Hazy dtype is preferred to using this as configuration will be simpler. Custom handlers cannot be configured through configurator UI at the moment. Use of this type inside the GUI will generate a Placeholder Handler on export, which should be replaced by one of the Custom Handlers.
This dtype allows reading data in as its raw format. To be handled by the rest of the pipeline, Custom Handlers must be specified to process the column into a form the generative model can process.
Examples
from hazy_configurator import RawType RawType(col="reference_number")
{ "hazy_dtype": "raw", "col": "reference_number" }
- Fields:
hazy_dtype (Literal['raw'])
Real Type¶
- class hazy_configurator.hazy_data_types.real_type.RealType¶
Bases:
PrimaryCapableNonEntityType
Used for replicating a column exactly.
It can only be used with reference tables; these columns will be replicated exactly. By specifying a column as real, it will not be used for conditioning other tables.
Examples
from hazy_configurator import RealType RealType(col="real_col")
{ "hazy_dtype": "real", "col": "real_col" }
- Fields:
hazy_dtype (Literal['real'])
Region Type¶
- class hazy_configurator.hazy_data_types.location.region_type.RegionType¶
Bases:
LocationTypeBase
Used for region location data.
Examples
from hazy_configurator import RegionType RegionType(col="region_col", entity_id=1)
{ "hazy_dtype": "region", "col": "region_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['region'])
- field type: LocationTypes = LocationTypes.REGULAR¶
The type of region to be generated. regular will generate region names, while code will generate region codes.
State Type¶
- class hazy_configurator.hazy_data_types.location.state_type.StateType¶
Bases:
LocationTypeBase
Used for state location data.
Examples
from hazy_configurator import StateType StateType(col="state_col", entity_id=1)
{ "hazy_dtype": "state", "col": "state_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['state'])
- field type: LocationTypes = LocationTypes.REGULAR¶
The type of state to be generated. regular will generate state names, while code will generate state codes.
Street Address Type¶
- class hazy_configurator.hazy_data_types.location.street_address_type.StreetAddressType¶
Bases:
LocationTypeBase
Used for street address data.
Examples
from hazy_configurator import StreetAddressType
StreetAddressType(col="street_address_col", entity_id=1)
{ "hazy_dtype": "street_address", "col": "street_address_col", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['street_address'])
- field type: StreetAddressTypes = StreetAddressTypes.STREET¶
The type of street address to be generated.
Symbol Type¶
- class hazy_configurator.hazy_data_types.symbol_type.SymbolType¶
Bases:
BaseSymbolType
Allows support for numerical columns with a symbol leading or trailing the numerical value, as well as thousand separators and decimal points such as VAL10,000.
Examples
from hazy_configurator import SymbolType SymbolType( col="value", symbol="VAL", thousand_sep="," )
{ "hazy_dtype": "symbol", "col": "value", "symbol": "VAL", "thousand_sep": "," }
- Fields:
Timedelta Type¶
- class hazy_configurator.hazy_data_types.timedelta_type.TimedeltaType¶
Bases:
NonEntityType
Used for timedelta data, for example seconds, minutes, hours or days.
Examples
from hazy_configurator import TimedeltaType, TimeDeltaUnit
TimedeltaType(col="time_elapsed", unit=TimeDeltaUnit.SECOND)
{ "hazy_dtype": "time_delta", "col": "time_elapsed", "unit": "s" }
- Fields:
- field unit: TimeDeltaUnit = None¶
Unit of time represented by this column
- field formula: Optional[FormulaSetting] = None¶
Takes a formula setting object to set how this field is calculated.
- field bound: Optional[TimedeltaBoundedSetting] = None¶
Bounded setting.
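The unit field tells the pipeline how to interpret the numbers in a timedelta column. A minimal sketch of that interpretation, using the standard library, is shown below. Only the "s" code appears in the example above; the other unit codes and the `to_timedelta` helper are assumptions made for illustration.

```python
from datetime import timedelta

# "s" matches the JSON example above; the remaining codes are assumed.
UNIT_TO_KEYWORD = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def to_timedelta(value, unit):
    """Interpret a numeric cell as a duration in the given unit."""
    return timedelta(**{UNIT_TO_KEYWORD[unit]: value})


print(to_timedelta(90, "s"))
print(to_timedelta(2, "d"))
```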
Title Type¶
- class hazy_configurator.hazy_data_types.person.title_type.TitleType¶
Bases:
EntityPartType
Used for columns that represent titles such as Mr, Mrs, Dr, etc.
Examples
from hazy_configurator import TitleType TitleType( col="Title", )
{ "hazy_dtype": "title", "col": "Title" }
- Fields:
Username Type¶
- class hazy_configurator.hazy_data_types.person.username_type.UsernameType¶
Bases:
PersonTypeBase
Used to describe a column that contains usernames.
An entity ID must be provided so that any data generated across the person entity remains coherent.
Examples
from hazy_configurator import UsernameType
UsernameType(
    col="Username",
    entity_id=1,
)
{ "hazy_dtype": "user_name", "col": "Username", "entity_id": 1 }
- Fields:
hazy_dtype (Literal['user_name'])