Custom Handlers

All handlers are listed here and can be used in place of CustomHandlerConfig. These are provided as a list on Data Schema.

Classes:

CatMapperHandlerConfig

Used when dealing with a column containing values that should be handled differently internally, such as in a gender column where values may not be limited to 'f', 'm', or 'o'.

ConditionedHandlerConfig

Specify when values within a target column are pre-determined whenever a condition is met.

ConditionedIDHandlerConfig

The ConditionedIDHandler can be used when the type of sampler to be applied for the target column is dependent upon the values in other columns.

DateReformatHandlerConfig

The DateReformatHandler can be used to convert datetimes to string representations of dates in a specified format.

DeterminedHandlerConfig

Used to specify when a column is entirely determined by another column.

IDMapperHandlerConfig

Used for modelling ID columns between tables.

OneHotHandlerConfig

Allows modelling of One Hot Encoded columns where only one column in a group of columns can have a positive value.

PatternHandlerConfig

Used when the target column consists of a specific pattern of other columns.

PlaceholderHandlerConfig

Dummy config used to export Raw Type from configurator UI.

SampleHandlerConfig

Used to preserve non-supported columns by randomly sampling values from the source dataset.

SequenceHandlerConfig

Used for orchestrating a sequence of handlers.

SingleColumnNormaliserConfig

Used to handle simple denormalisation cases where redundant information has been copied to child tables.

SymbolHandlerConfig

Allows support for numerical columns with a symbol preceding or following the numerical value, such as £250 or 10%.

TextCategoryHandlerConfig

Used for columns that combine several different IDs into the same column.

CurrencyHandlerConfig

Used to convert all values to a base currency.

Classes Superseded by Hazy Data Types:

The following classes duplicate functionality provided by the Data Types.

AgeHandlerConfig

Used only when a table contains both a date of birth column and an age column.

BoundedHandlerConfig

Specify when a column is bounded by either other columns or a fixed value.

CombinationHandlerConfig

Used to define a set of columns for which only certain combinations of the included columns make sense.

DateFormatHandlerConfig

Used to convert object type data into datetime.

FormulaHandlerConfig

Models a column as a function of other columns.

IdHandlerConfig

Used to generate ID columns.

LocationHandlerConfig

Allows the modelling of location information.

MappedHandlerConfig

Used to generate ID columns.

PersonHandlerConfig

The PersonHandler can be used to generate all attributes of an individual that relate to their name.

TimeDeltaHandlerConfig

Converts object type columns to time delta.

Age Handler

class hazy_configurator.processing.age_handler_config.AgeHandlerConfig

Bases: ProcessingConfigItem

Used only when a table contains both a date of birth column and an age column.

Standard Examples

from hazy_configurator import AgeHandlerConfig

AgeHandlerConfig(
    age_column="Age",
    dob_column="Date Of Birth",
    ref_date="2022-12-25",
    table_name="table1"
)

Cross-table Example

In the following example, the Date of Birth column exists in a separate table to that of the Age column.

from hazy_configurator import AgeHandlerConfig, ColId

AgeHandlerConfig(
    age_column="Age",
    dob_column=ColId(col="Date Of Birth", table="table2"),
    ref_date="2022-12-25",
    table_name="table1"
)
Fields:
field type: Literal['age'] = 'age'
field age_column: str [Required]

Age column.

field dob_column: Union[ColId, str] [Required]

Date of birth column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.

field ref_date: str [Required]

Reference date, for example "2022-12-25".

field format: str = '%Y-%m-%d'

Format string.

field table_name: str [Required]

Target table name

property mapped_dob_column
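
The relationship between the two columns can be sketched in plain Python. This is only an illustration and not Hazy's implementation: the helper name age_at is hypothetical, and it assumes that age is counted in whole years at ref_date and that both dates are parsed with the format string.

```python
from datetime import datetime

def age_at(dob: str, ref_date: str, fmt: str = "%Y-%m-%d") -> int:
    """Whole years between dob and ref_date (hypothetical helper)."""
    born = datetime.strptime(dob, fmt).date()
    ref = datetime.strptime(ref_date, fmt).date()
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (ref.month, ref.day) >= (born.month, born.day)
    return ref.year - born.year - (0 if had_birthday else 1)

print(age_at("1990-12-26", "2022-12-25"))  # 31
print(age_at("1990-12-25", "2022-12-25"))  # 32
```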

Bounded Handler

class hazy_configurator.processing.bounded_handler_config.BoundedHandlerConfig

Bases: StandardHandlerConfigItem

Specify when a column is bounded by either other columns or a fixed value.

Standard Examples

from hazy_configurator import (
    BoundedHandlerConfig,
    ColumnBound,
    StaticBound,
)

BoundedHandlerConfig(
    target="rent",
    upper=ColumnBound(value="income"),
    lower=StaticBound(value=500),
    table_name="table1"
)

Cross-table Example

In the following example the upper bound column, income, exists in a different table to the target column rent.

from hazy_configurator import (
    BoundedHandlerConfig,
    ColId,
    ColumnBound,
    StaticBound,
)

BoundedHandlerConfig(
    target="rent",
    table_name="table1",
    upper=ColumnBound(value=ColId(col="income", table="table2")),
    lower=StaticBound(value=500)
)
Fields:
field type: Literal['bounded'] = 'bounded'
field upper: Optional[Union[ColumnBound, StaticBound]] = None
field lower: Optional[Union[ColumnBound, StaticBound]] = None
field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
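
As a rough sketch of the constraint being described (not of how the handler enforces it), the check below tests whether rent lies between a static lower bound and a per-row column bound. The function within_bounds is hypothetical, and treating the bounds as inclusive is an assumption.

```python
def within_bounds(value, lower=None, upper=None):
    """True when value respects whichever bounds are set (None means unbounded)."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

rows = [
    {"rent": 300, "income": 2000},   # below the static lower bound of 500
    {"rent": 2500, "income": 1800},  # above the column bound (income)
    {"rent": 900, "income": 3000},   # satisfies both bounds
]
checks = [within_bounds(r["rent"], lower=500, upper=r["income"]) for r in rows]
print(checks)  # [False, False, True]
```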

Cat Mapper Handler

class hazy_configurator.processing.cat_mapper_handler_config.CatMapperHandlerConfig

Bases: StandardHandlerConfigItem

Used when dealing with a column containing values that should be handled differently internally, such as in a gender column where values may not be limited to ‘f’, ‘m’, or ‘o’. In such cases, you can utilize the category_map feature to specify how specific values should be categorized as ‘f’, ‘m’, or ‘o’.

Examples

from hazy_configurator import CatMapperHandlerConfig

CatMapperHandlerConfig(
    target="Gender",
    category_map={"Male": "m", "Female":"f", "Other": "o"},
    table_name="table1"
)
Fields:
field type: Literal['category_mapper'] = 'category_mapper'
field category_map: Dict[str, str] [Required]

Mapping that defines the relationship between the values found in a column and how they should be used internally. In the category map, the keys represent the original values found in the columns, while the corresponding values indicate how those values should be interpreted internally.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
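
The effect of a category_map can be pictured as a dictionary lookup over the column. The snippet below is a plain-Python sketch, not the handler itself; passing unmapped values through unchanged is an assumption made for illustration.

```python
category_map = {"Male": "m", "Female": "f", "Other": "o"}

raw_values = ["Male", "f", "Female", "Other"]

# Keys are the original values; dictionary values are the internal categories.
# Values without an entry are passed through unchanged (assumed behaviour).
internal = [category_map.get(value, value) for value in raw_values]
print(internal)  # ['m', 'f', 'f', 'o']
```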

Combination Handler

class hazy_configurator.processing.combination_handler_config.CombinationHandlerConfig

Bases: ProcessingConfigItem

Used to define a set of columns for which only certain combinations of the included columns make sense.

An example would be state and city. By using this type you would ensure cities can only ever be matched with their corresponding states, even when noise is introduced to the training process.

Examples

from hazy_configurator import CombinationHandlerConfig

CombinationHandlerConfig(target_columns=["city", "state"], table_name="table1")
Fields:
field type: Literal['combination'] = 'combination'
field target_columns: List[str] [Required]

List of target columns to treat unique combinations between columns as single entities.

field table_name: str [Required]

Target table name
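
One way to picture "treating unique combinations as single entities" is to sample whole (city, state) pairs rather than each column independently. The sketch below uses made-up data and the standard library only; it is not how the handler is implemented.

```python
import random

source_rows = [
    ("Austin", "TX"),
    ("Dallas", "TX"),
    ("Miami", "FL"),
    ("Austin", "TX"),
]

# Each distinct pair observed in the source is treated as one entity.
valid_pairs = sorted(set(source_rows))

rng = random.Random(0)
synthetic_rows = [rng.choice(valid_pairs) for _ in range(5)]

# Because pairs are sampled whole, a city never appears with the wrong state.
print(all(pair in valid_pairs for pair in synthetic_rows))  # True
```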

Component Handler

class hazy_configurator.processing.component_handler_config.ComponentHandlerNodeSource

Bases: HazyBaseModel

Fields:
field sources: List[ColId] [Required]

List of ColId objects, each being a representation of the same ID column in different tables. For example, a primary/foreign key pair would be part of the same source list.

class hazy_configurator.processing.component_handler_config.ComponentHandlerConfig

Bases: StandardHandlerConfigItem

The component handler can be used to model redundant information that is sometimes shared across a single connected sub-component. For example, in a component that is formed from family members living in the same household, information such as last names, home telephone numbers and address data may be shared for each of these individuals.

For connected sub-components to be present in a database, at least one table must have two foreign keys present. If this condition is met then the ComponentHandler will be able to identify keys that belong to the same component, and therefore can model how much redundancy exists across the components.
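
To illustrate what "keys that belong to the same component" means, the sketch below groups keys with a small union-find over the rows of a table holding two foreign keys. The data, the function names and the choice of union-find are all assumptions made for illustration; the handler's actual algorithm is not described here.

```python
def find(parent, node):
    """Return the representative of node's set, shortening paths as it goes."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def connected_components(edges):
    """Group nodes joined, directly or indirectly, by the given edges."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())

# Each row of the linking table pairs an AccountID with a CustomerID.
edges = [
    (("account", "A1"), ("customer", "C1")),
    (("account", "A1"), ("customer", "C2")),  # joint account shared by C1 and C2
    (("account", "A2"), ("customer", "C3")),
    (("account", "A3"), ("customer", "C2")),
]
components = connected_components(edges)
print(sorted(len(c) for c in components))  # [2, 4]
```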

Examples

from hazy_configurator import (
    ComponentHandlerConfig,
    ComponentHandlerNodeSource,
    ColId,
)

ComponentHandlerConfig(
    target="CustomerID",
    table_name="table1",
    node_sources=[
        ComponentHandlerNodeSource(
            sources=[ColId(col="AccountID", table="table2")],
        ),
        ComponentHandlerNodeSource(
            sources=[
                ColId(col="CustomerID", table="table1"),
                ColId(col="CustomerID", table="table2"),
            ],
        ),
    ],
    redundant_columns=["LastName", "HomePhoneNumber"],
)
Fields:
field type: Literal['component'] = 'component'
field node_sources: List[ComponentHandlerNodeSource] [Required]

List of ComponentHandlerNodeSource where each object corresponds to a single type of node.

field redundant_columns: List[str] = []

Redundant columns are columns that contain duplicate information across records that belong to the same connected sub-component. For example, families that have joint accounts would belong to the same component, and the redundant columns in this case could be any last name or address features, as these may be shared across records. Note that any columns included in this parameter must not exist in the node sources.

field threshold: float = 0.1

A float between 0 and 1. If no redundant columns are provided, the handler will search for redundant columns by using a threshold value. For a given column, each component group is iterated over and the ratio of unique values to group size is calculated (ignoring missing values). Any column whose average ratio is greater than the threshold is then considered a redundant column.
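
The per-column score described above can be sketched as follows. The function name average_unique_ratio and the sample records are hypothetical, and counting only non-missing values in the group size is one possible reading of "ignoring missing values". The sketch computes the score only; the comparison with threshold is as described above.

```python
from statistics import mean

def average_unique_ratio(records, component_key, column):
    """Mean over components of (unique values / group size), skipping missing values."""
    groups = {}
    for record in records:
        groups.setdefault(record[component_key], []).append(record[column])
    ratios = []
    for values in groups.values():
        present = [v for v in values if v is not None]
        if present:  # skip components where the column is entirely missing
            ratios.append(len(set(present)) / len(present))
    return mean(ratios)

records = [
    {"component": 1, "LastName": "Smith", "FirstName": "Ann"},
    {"component": 1, "LastName": "Smith", "FirstName": "Bob"},
    {"component": 1, "LastName": None, "FirstName": "Cy"},
    {"component": 2, "LastName": "Jones", "FirstName": "Dee"},
    {"component": 2, "LastName": "Jones", "FirstName": "Ed"},
]
print(average_unique_ratio(records, "component", "LastName"))   # 0.5
print(average_unique_ratio(records, "component", "FirstName"))  # 1.0
```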

field categorical: bool = True

When set to False, categorical columns will not be included in the search for redundant columns.

property node_sources_as_dict: dict

Converts provided node sources to dictionary with node labels as the keys.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

Conditioned Handler

class hazy_configurator.processing.conditioned_handler_config.ConditionedHandlerValueMap

Bases: HazyBaseModel

Fields:
field value: Union[None, int, float, str, datetime] = None

Condition value.

field mapped_value: Union[None, int, float, str, datetime] = None

Value that target column will be set to if condition is met.

class hazy_configurator.processing.conditioned_handler_config.ConditionedHandlerCondition

Bases: HazyBaseModel

Fields:
field column: Union[ColId, str] [Required]

Name of the condition column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.

field value_mappers: List[ConditionedHandlerValueMap] [Required]

List of individual condition mappings for the condition column.

class hazy_configurator.processing.conditioned_handler_config.ConditionedHandlerConfig

Bases: StandardHandlerConfigItem

Specify when values within a target column are pre-determined whenever a condition is met.

The examples below can be read as follows. The column joint_income:

  • equals null when column status is equal to “single”

  • equals null when column status is equal to “divorced”

  • equals null when column emp_title is equal to “Unemployed”

Standard Example

from hazy_configurator import (
    ConditionedHandlerConfig,
    ConditionedHandlerCondition,
    ConditionedHandlerValueMap
)

ConditionedHandlerConfig(
    target="joint_income",
    table_name="table1",
    condition_map=[
        ConditionedHandlerCondition(
            column="status",
            value_mappers=[
                ConditionedHandlerValueMap(
                    value="single",
                    mapped_value=None
                ),
                ConditionedHandlerValueMap(
                    value="divorced",
                    mapped_value=None
                )
            ]
        ),
        ConditionedHandlerCondition(
            column="emp_title",
            value_mappers=[
                ConditionedHandlerValueMap(
                    value="Unemployed",
                    mapped_value=None
                )
            ]
        )
    ]
)

Cross-table Example

In the following example, the condition columns, status and emp_title, exist in a different table to the target column, joint_income.

from hazy_configurator import (
    ColId,
    ConditionedHandlerConfig,
    ConditionedHandlerCondition,
    ConditionedHandlerValueMap,
)

ConditionedHandlerConfig(
    target="joint_income",
    table_name="table1",
    condition_map=[
        ConditionedHandlerCondition(
            column=ColId(col="status", table="table2"),
            value_mappers=[
                ConditionedHandlerValueMap(
                    value="single",
                    mapped_value=None
                ),
                ConditionedHandlerValueMap(
                    value="divorced",
                    mapped_value=None
                )
            ]
        ),
        ConditionedHandlerCondition(
            column=ColId(col="emp_title", table="table2"),
            value_mappers=[
                ConditionedHandlerValueMap(
                    value="Unemployed",
                    mapped_value=None
                )
            ]
        )
    ]
)
Fields:
field type: Literal['conditioned_rule'] = 'conditioned_rule'
field condition_map: List[ConditionedHandlerCondition] [Required]

A set of conditions which define what the target column should be when each condition is met.

property condition_map_as_dict
field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
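
The rule expressed by a condition_map can be sketched with plain dictionaries. This is an illustration of the semantics only, using hypothetical names (apply_conditions, and a nested dict in place of the config objects), not the handler's implementation.

```python
# condition column -> {condition value -> value the target is set to}
condition_map = {
    "status": {"single": None, "divorced": None},
    "emp_title": {"Unemployed": None},
}

def apply_conditions(row, target, condition_map):
    """Return a copy of row with target overwritten wherever a condition matches."""
    result = dict(row)
    for column, value_map in condition_map.items():
        if row.get(column) in value_map:
            result[target] = value_map[row[column]]
    return result

rows = [
    {"status": "single", "emp_title": "Engineer", "joint_income": 50000},
    {"status": "married", "emp_title": "Engineer", "joint_income": 90000},
    {"status": "married", "emp_title": "Unemployed", "joint_income": 40000},
]
print([apply_conditions(r, "joint_income", condition_map)["joint_income"] for r in rows])
# [None, 90000, None]
```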

Conditioned ID Handler

class hazy_configurator.processing.conditioned_id_handler_config.ConditionedIDHandlerConfig

Bases: StandardHandlerConfigItem

The ConditionedIDHandler can be used when the type of sampler to be applied for the target column is dependent upon the values in other columns. A query is used to search for matches and upon a match being found, a corresponding sampler will be used to populate the target column.

Standard Examples

In the following examples, when col1 == 'A' a numerical ID of length 5 is generated and when col1 == 'B' a numerical ID of length 6 is generated.

from hazy_configurator import (
    ConditionedIDHandlerConfig,
    ConditionedIdCondition,
    NumericalIdSettings,
)

ConditionedIDHandlerConfig(
    target='col2',
    table_name='table1',
    mismatch='replace',
    conditions=[
        ConditionedIdCondition(
            query="col1 == 'A'",
            dependencies=['col1'],
            sampler=NumericalIdSettings(length=5)
        ),
        ConditionedIdCondition(
            query="col1 == 'B'",
            dependencies=['col1'],
            sampler=NumericalIdSettings(length=6)
        )
    ]
)

Cross-table Example

The following example shows how to configure the handler when the condition column exists in a separate table from the target column: col1 exists in table1 and col2 exists in table2. Again, when col1 == 'A' a numerical ID of length 5 is generated and when col1 == 'B' a numerical ID of length 6 is generated.

from hazy_configurator import (
    ConditionedIDHandlerConfig,
    ConditionedIdCondition,
    NumericalIdSettings,
    ColId
)

ConditionedIDHandlerConfig(
    target='col2',
    table_name="table2",
    mismatch='replace',
    conditions=[
        ConditionedIdCondition(
            query="`('table1', 'col1')` == 'A'",
            dependencies=[ColId(col="col1", table="table1")],
            sampler=NumericalIdSettings(length=5)
        ),
        ConditionedIdCondition(
            query="`('table1', 'col1')` == 'B'",
            dependencies=[ColId(col="col1", table="table1")],
            sampler=NumericalIdSettings(length=6)
        )
    ]
)
Fields:
field type: Literal['conditioned_id'] = 'conditioned_id'
field conditions: List[ConditionedIdCondition] [Required]

Parameters for matching text and sampling.

field mismatch: IdMismatchBehaviour = IdMismatchBehaviour.REPLACE

Behaviour when there are values that do not match any of the specified conditions. 'replace' will replace unmatched values using the other conditions. 'preserve' will leave any unmatched values as they are and treat them as categories.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

field unique: bool = True

When set to True the generated values will be unique. This is not guaranteed as the number of records generated could be greater than the possible amount of unique values.
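
The idea of choosing a sampler per row can be sketched as below. Python predicates stand in for the query strings, and numerical_id is a hypothetical stand-in for a numerical ID sampler. The last line also shows why uniqueness cannot always be guaranteed: a numerical ID of length 5 has only 10**5 possible values.

```python
import random

def numerical_id(length, rng):
    """Random string of decimal digits (hypothetical stand-in for a sampler)."""
    return "".join(rng.choice("0123456789") for _ in range(length))

# (predicate, ID length) pairs; the first matching predicate wins.
conditions = [
    (lambda row: row["col1"] == "A", 5),
    (lambda row: row["col1"] == "B", 6),
]

rng = random.Random(42)
rows = [{"col1": "A"}, {"col1": "B"}, {"col1": "A"}]
for row in rows:
    for predicate, length in conditions:
        if predicate(row):
            row["col2"] = numerical_id(length, rng)
            break

print([len(row["col2"]) for row in rows])  # [5, 6, 5]
print(10 ** 5)  # 100000 distinct IDs of length 5 at most
```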

Currency Handler

class hazy_configurator.processing.currency_handler_config.CurrencyHandlerConfig

Bases: ProcessingConfigItem

Used to convert all values to a base currency.

Examples

from hazy_configurator import CurrencyHandlerConfig

CurrencyHandlerConfig(
    table_name="table1",
    amount_col="transaction_value",
    currency_col="transaction_currency",
    decimal_separator=".",
    thousand_separator=",",
    date_col="transaction_time",
    currency_map={"euro": "EUR", "yen": "JPY"}
)
Fields:
field type: Literal['currency'] = 'currency'
field amount_col: str [Required]

Column name in which the amounts are stored.

field currency_col: Optional[str] = None

Column name in which the currency units are stored. None if currency units are lacking or in the same column.

field decimal_separator: Optional[str] = '.'

String used to separate the integer part from the fractional part; only used when the currency amount requires parsing.

field thousand_separator: Optional[str] = ''

String used to separate groups of digits at each 10^{3n}; only used when the currency amount requires parsing.
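
How the two separators affect parsing can be sketched with the standard library. The helper parse_amount is hypothetical and is not part of hazy_configurator; it only shows the roles the two settings play.

```python
from decimal import Decimal

def parse_amount(text, decimal_separator=".", thousand_separator=""):
    """Parse an amount string into a Decimal using the given separators."""
    cleaned = text.strip()
    if thousand_separator:
        # Remove grouping characters first so they cannot be mistaken for the decimal point.
        cleaned = cleaned.replace(thousand_separator, "")
    if decimal_separator and decimal_separator != ".":
        cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)

print(parse_amount("1,234.56", decimal_separator=".", thousand_separator=","))  # 1234.56
print(parse_amount("1.234,56", decimal_separator=",", thousand_separator="."))  # 1234.56
```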

field date_col: Optional[str] = None

Operation date column

field currency_map: Optional[Dict[str, str]] = None

Map from custom currency codes to ISO-4217 codes

field table_name: str [Required]

Target table name

Date Format Handler

class hazy_configurator.processing.date_format_handler_config.DateFormatMapping

Bases: HazyBaseModel

Fields:
field dest_col: str [Required]

Destination column to be formatted.

field format: str [Required]

Strftime format string used to reformat the destination column.

class hazy_configurator.processing.date_format_handler_config.DateFormatHandlerConfig

Bases: StandardHandlerConfigItem

Used to convert object type data into datetime.

Examples

from hazy_configurator import DateFormatHandlerConfig

DateFormatHandlerConfig(target="date_recorded", format="%Y-%m-%d", table_name="table1")
Fields:
field type: Literal['date_format'] = 'date_format'
field format: Optional[str] = None

Format string.

field handle_errors: PandasSupportedInvalidType = PandasSupportedInvalidType.COERCE

How errors are handled.
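
As a sketch of what the default COERCE setting suggests, the helper below turns values that cannot be parsed into missing values, similar in spirit to errors="coerce" in pandas.to_datetime. That the handler behaves exactly this way is an assumption, and parse_or_none is a hypothetical name.

```python
from datetime import datetime

def parse_or_none(text, fmt="%Y-%m-%d"):
    """Parse text with fmt, returning None when the value is not a valid date."""
    try:
        return datetime.strptime(text, fmt)
    except (TypeError, ValueError):
        return None

values = ["2021-03-01", "not a date", "2021-13-40", None]
parsed = [parse_or_none(v) for v in values]
print([p.isoformat() if p else None for p in parsed])
# ['2021-03-01T00:00:00', None, None, None]
```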

field return_as_dt: bool = False

Return as datetime at the end of generation.

field mappings: List[DateFormatMapping] = []

List of columns and date formats, to create from the same datetime on generation

field max_unique_invalid_dates: int = 10

If the number of unique invalid dates is less than or equal to this threshold, invalid dates are preserved in the output. This is useful for when a date column is not nullable and has invalid date values instead of nulls. If False, invalid dates will be output as missing values in the synthetic data.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

Date Reformat Handler

class hazy_configurator.processing.date_reformat_handler_config.DateReformatHandlerConfig

Bases: ProcessingConfigItem

The DateReformatHandler can be used to convert datetimes to string representations of dates in a specified format. It is used if a table contains two different representations of the same date.

Standard Examples

from hazy_configurator import DateReformatHandlerConfig

DateReformatHandlerConfig(
    source_col="date1",
    dest_col="date2",
    format="%Y-%m-%d",
    table_name="table1"
)

Cross-table Example

from hazy_configurator import DateReformatHandlerConfig, ColId

DateReformatHandlerConfig(
    source_col=ColId(col="date1", table="table2"),
    dest_col="date2",
    format="%Y-%m-%d",
    table_name="table1"
)
Fields:
field type: Literal['date_reformat'] = 'date_reformat'
field source_col: Union[ColId, str] [Required]

Source column containing datetime. If the column exists in a table that is different to the destination column then use a ColId object and provide the name of the column and the table that it is in.

field dest_col: str [Required]

Destination column to be formatted.

field format: str [Required]

Format string used to reformat the destination column.

field table_name: str [Required]

Target table name

property mapped_source_col
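
The conversion itself is the kind of operation performed by datetime.strftime in the Python standard library. The sketch below assumes the format field accepts strftime directives, as the examples above suggest.

```python
from datetime import datetime

source_value = datetime(2022, 12, 25, 14, 30)

# The same datetime rendered in two different string formats.
print(source_value.strftime("%Y-%m-%d"))  # 2022-12-25
print(source_value.strftime("%d/%m/%Y"))  # 25/12/2022
```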

Determined Handler

class hazy_configurator.processing.determined_handler_config.DeterminedHandlerValueMap

Bases: HazyBaseModel

Fields:
field value: Union[None, int, float, str, datetime] = None

Condition value.

field mapped_value: Union[None, int, float, str, datetime] = None

Value that target column will be set to if condition is met.

class hazy_configurator.processing.determined_handler_config.DeterminedHandlerConfig

Bases: StandardHandlerConfigItem

Used to specify when a column is entirely determined by another column. For example, country is entirely determined by city.

Standard Examples

from hazy_configurator import DeterminedHandlerConfig, DeterminedHandlerValueMap

DeterminedHandlerConfig(
    target="country",
    table_name="table1",
    condition_column="city",
    condition_map=[
        DeterminedHandlerValueMap(value="Paris", mapped_value="France"),
        DeterminedHandlerValueMap(value="London", mapped_value="United Kingdom"),
        DeterminedHandlerValueMap(value="New York", mapped_value="United States")
    ]
)

Cross-table Example

In the following example the condition column, city, exists in a different table to the target column country.

from hazy_configurator import (
    DeterminedHandlerConfig,
    DeterminedHandlerValueMap,
    ColId
)

DeterminedHandlerConfig(
    target="country",
    table_name="table1",
    condition_column=ColId(col="city", table="table2"),
    condition_map=[
        DeterminedHandlerValueMap(value="Paris", mapped_value="France"),
        DeterminedHandlerValueMap(value="London", mapped_value="United Kingdom"),
        DeterminedHandlerValueMap(value="New York", mapped_value="United States")
    ]
)
Fields:
field type: Literal['determined_rule'] = 'determined_rule'
field condition_column: Union[ColId, str] [Required]

Name of the condition column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.

field condition_map: List[DeterminedHandlerValueMap] [Required]

A set of conditions which define what the target column should be when each condition is met.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
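
In plain Python, "entirely determined by another column" amounts to a lookup keyed on the condition column. The sketch below illustrates the rule only; leaving the target as None for unlisted cities is an assumption.

```python
city_to_country = {
    "Paris": "France",
    "London": "United Kingdom",
    "New York": "United States",
}

cities = ["London", "Paris", "London", "Oslo"]

# The target value is read off the condition column; "Oslo" has no mapping here.
countries = [city_to_country.get(city) for city in cities]
print(countries)  # ['United Kingdom', 'France', 'United Kingdom', None]
```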

Formula Handler

class hazy_configurator.processing.formula_handler_config.FormulaHandlerConfig

Bases: StandardHandlerConfigItem

Models a column as a function of other columns.

Standard Examples

from hazy_configurator import FormulaHandlerConfig

FormulaHandlerConfig(
    target="joint_income",
    table_name="table1",
    expression="a + b",
    column_map={
        "a": "income1",
        "b": "income2"
    }
)

Cross-table Example

In the following example the dependency columns of the expression, income1 and income2, exist in a different table to the target column joint_income.

from hazy_configurator import FormulaHandlerConfig, ColId

FormulaHandlerConfig(
    target="joint_income",
    table_name="table1",
    expression="a + b",
    column_map={
        "a": ColId(col="income1", table="table2"),
        "b": ColId(col="income2", table="table2")
    }
)
Fields:
field type: Literal['formula_rule'] = 'formula_rule'
field expression: str [Required]

Formula to apply to the column.

Some examples of valid formulas are "a + b + c" or "if(is_last(x), y, z)". Note that the formula cannot contain static values such as integers or strings. If this is required use a static value in the column map. See Expression syntax for available syntax and examples.
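
The interplay between an expression and its column map can be sketched as variable substitution followed by evaluation. This is illustrative only: Hazy uses its own expression syntax, and the evaluate helper below, which relies on Python's eval with builtins disabled, is a hypothetical stand-in that should not be used on untrusted input.

```python
def evaluate(expression, column_map, row):
    """Bind each formula variable to its column's value, then evaluate the expression."""
    variables = {name: row[column] for name, column in column_map.items()}
    # Builtins are removed so that only the bound variables and operators are available.
    return eval(expression, {"__builtins__": {}}, variables)

row = {"income1": 30000, "income2": 25000}
print(evaluate("a + b", {"a": "income1", "b": "income2"}, row))  # 55000
```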

field column_map: Dict[str, Union[str, StaticValue, ColId]] [Required]

A mapping between variables defined in the formula and the columns of the provided data.

The dictionary value can be a string specifying the name of a column, a ColId object (for when the column exists in a different table to the target column), or another dictionary specifying a static value. If it is a dictionary specifying a static value then it should be in the format StaticValue(value="30 days, 2 hours", dtype="timedelta"), where the type is always static, value is the static value itself, and dtype is a required parameter used to convert the provided value into a datatype. The dtype can be either "string", "float", "integer", "boolean", "datetime" or "timedelta".

When using “timedelta” in the column_map the following units can be used in strings:

  • W

  • D / days / day

  • hours / hour / hr / h

  • m / minute / min / minutes / T

  • S / seconds / sec / second

  • ms / milliseconds / millisecond / milli / millis / L

  • us / microseconds / microsecond / micro / micros / U

  • ns / nanoseconds / nano / nanos / nanosecond / N

It is recommended to use these in a comma separated list. Some examples are:

  • 30 days, 2 hours

  • 1W - i.e. 1 week

  • 1 hour, 30 mins, 30 seconds
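
For orientation, the durations in the three examples above correspond to the following datetime.timedelta values in the Python standard library. This shows only what the strings mean, not how they are parsed.

```python
from datetime import timedelta

examples = {
    "30 days, 2 hours": timedelta(days=30, hours=2),
    "1W": timedelta(weeks=1),
    "1 hour, 30 mins, 30 seconds": timedelta(hours=1, minutes=30, seconds=30),
}

for text, delta in examples.items():
    print(text, "->", int(delta.total_seconds()), "seconds")
# 30 days, 2 hours -> 2599200 seconds
# 1W -> 604800 seconds
# 1 hour, 30 mins, 30 seconds -> 5430 seconds
```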

When using a “datetime” value in the column_map it is recommended to use isoformat following YYYY-MM-DD[*HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]]] where * can match any single character.

Some examples are:

  • 2011-11-04

  • 2011-11-04T00:05:23

  • 2011-11-04 00:05:23.283

  • 2011-11-04 00:05:23.283+00:00

  • 2011-11-04T00:05:23+04:00
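
The examples above follow the ISO layout accepted by datetime.fromisoformat in the Python standard library, so they can be checked locally before being placed in a column_map. Whether the handler uses this function internally is not stated and is not assumed here.

```python
from datetime import datetime

samples = [
    "2011-11-04",
    "2011-11-04T00:05:23",
    "2011-11-04 00:05:23.283",
    "2011-11-04 00:05:23.283+00:00",
    "2011-11-04T00:05:23+04:00",
]

for text in samples:
    parsed = datetime.fromisoformat(text)
    # tzinfo is None for values without a UTC offset.
    print(text, "->", parsed.isoformat(), "| has offset:", parsed.tzinfo is not None)
```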

field condition: str = None

Query condition to use when attempting to apply the formula to only a subset of the data verifying the specified condition.

Examples of condition syntax, where a, b, c, d, col with spaces are column names. Note backticks ` should be used around column names which contain spaces:

  • (a < b) & (b < c) i.e. apply formula when column b is between a and c.

  • a not in b i.e. apply formula when the value in column a is not a value in column b.

  • a in b and c < d i.e. apply formula when the value in column a is in column b and value in c is less than value in d.

  • a in (b + c + d) i.e. apply formula when value in a is in a column which is the sum of b, c and d.

  • b == ["a", "b", "c"] i.e. apply formula when column b is equal to the value “a”, “b” or “c”. Notice quotes are used for possible values.

  • c != [1, 2] i.e. apply formula when column c is not equal to 1 or 2.

  • [1, 2] in c i.e. apply formula when column c is equal to 1 or 2. Can also be written c == [1, 2].

  • `col with spaces` < b i.e. apply formula when values in column col with spaces are less than values in column b.

field model_error: bool = False

When set to True the system examines the source data to see if there is any difference between the result of the calculation in the source data and the actual value seen in the source data. If there is a difference then this difference will be modelled and the difference replicated in the synthetic data.

An example of where this is useful is when looking at bank transactions where each transaction row includes the account balance which is a sum of the transactions so far plus a starting balance. In this case the error will be the starting balance and so if this parameter is set to true then the synthetic data will have a similar distribution of starting balances. The system will either model the difference for each row, or if the error in the source is constant for all the records in a sequence then the error added in the synthetic data will be constant over the sequence.
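
The bank balance case can be made concrete with a few made-up numbers. The sketch below computes the formula result (a running sum of transactions) and the difference from the observed balance. Here the difference is constant across the sequence and equals the starting balance. The data and variable names are hypothetical.

```python
from itertools import accumulate

transactions = [100, -30, 50]
observed_balance = [1100, 1070, 1120]  # values as they appear in the source data

# Result of the formula alone: cumulative sum of transactions so far.
formula_result = list(accumulate(transactions))

# The "error" is what the formula fails to explain in each row.
errors = [obs - calc for obs, calc in zip(observed_balance, formula_result)]
print(formula_result)  # [100, 70, 120]
print(errors)          # [1000, 1000, 1000]
```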

field group_by: List[Union[str, ExpressionConfig]] = []

When provided the data is grouped according to the provided set of keys or expressions before applying the handler to each group.

This parameter must be provided as a list of either:

  • Column name as a string

  • A dictionary to be used to specify an expression. For example, this could be used to group rows by the month of the year based on a datetime column e.g.:

{
    "type": "expression",
    "expression": "a+b",
    "column_map": {
        "a": "col_1",
        "b": "col_2",
    }
}
field sort_by: List[Union[str, ExpressionConfig]] = []

When provided the data is sorted by the provided columns or expressions before applying the formula.

This parameter follows the same syntax as group_by.

property column_map_as_dict: Dict[str, Union[str, dict]]

ID Handler

class hazy_configurator.processing.id_handler_config.IdHandlerConfig

Bases: StandardHandlerConfigItem

Used to generate ID columns.

Examples

from hazy_configurator import IdHandlerConfig, NumericalIdSettings

IdHandlerConfig(
    target="account_id",
    id_settings=NumericalIdSettings(length=9),
    table_name="table1"
)
Fields:
field type: Literal['id'] = 'id'
field id_settings: IdHandlerSettingsUnion [Required]

Can use any options in Standard IDs, Real ID, Compound ID and Mixture ID.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

ID Mapper Handler

class hazy_configurator.processing.id_mapper_handler_config.IDMapperHandlerConfig

Bases: ProcessingConfigItem

Used for modelling ID columns between tables.

An advanced feature which should only be used with Hazy supervision.

Fields:
field id_settings: IdHandlerSettingsUnion [Required]

Can use any options in Standard IDs, Real ID, Compound ID and Mixture ID.

field source: ColId [Required]

The table and column name of the source id column.

field ref_map: List[ColId] [Required]
field type: Literal['id_mapper'] = 'id_mapper'

Location Handler

class hazy_configurator.processing.location_handler_config.LocationHandlerConfig

Bases: ProcessingConfigItem

Allows the modelling of location information.

Examples

from hazy_configurator import LocationHandlerConfig, GeoLocales

LocationHandlerConfig(
    country="Country",
    postcode="Zipcode",
    locales=[GeoLocales.en_US],
    mismatch="random",
    num_clusters=500,
    table_name="table1"
)
Fields:
field type: Literal['geo_cluster'] = 'geo_cluster'
field table_name: str [Required]

Target table name

field locales: Optional[List[GeoLocales]] = None

Region(s) in which the location data is from - can be any selection from ['en_GB', 'en_US', 'en_CA', 'en_AU', 'en_IE', 'es_ES', 'fr_FR', 'da_DK', 'de_DE', 'sv_SE', 'no_NO', 'fi_FI', 'cs_CZ']. When multiple locales are given, one of the following country fields must be provided ['country', 'iso2', 'iso3']. Additionally, when no locale is given, the default behaviour is to assume that the data is multi-locale and therefore a country field must be provided.

field postcode: Optional[str] = None

Post code column.

field door: Optional[str] = None

Door number column.

field floor: Optional[str] = None

Floor column.

field street_number: Optional[str] = None

Street number column.

field street: Optional[str] = None

Street column.

field district: Optional[str] = None

District column.

field district_code: Optional[str] = None

District code column.

field city: Optional[str] = None

City column.

field city_code: Optional[str] = None

City code column.

field region: Optional[str] = None

Region column.

field region_code: Optional[str] = None

Region code column.

field state: Optional[str] = None

State column.

field state_code: Optional[str] = None

State code column.

field country: Optional[str] = None

Country column.

field iso2: Optional[str] = None

ISO2 country code column.

field iso3: Optional[str] = None

ISO3 country code column.

field outcode: Optional[str] = None

Outcode column.

field incode: Optional[str] = None

Incode column.

field custom_columns: dict = {}

Custom configuration used to create a feature that is composed of multiple other location features. For example, if we wanted to create a full address column that combines door, street and postcode we can add a custom_columns config - {"FullAddress": {"dependencies": ["door", "street", "postcode"], "pattern": "{door} {street} {postcode}"}}. The dependencies here are not columns but are items in the location handler.
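
As a rough illustration of how such a pattern could be resolved for one record, the sketch below formats the dependencies with plain Python string formatting. The record values and the render_custom_column helper are hypothetical and are not part of hazy_configurator; the handler's internal behaviour may differ.

```python
# Hypothetical illustration only: not the hazy_configurator implementation.
custom_columns = {
    "FullAddress": {
        "dependencies": ["door", "street", "postcode"],
        "pattern": "{door} {street} {postcode}",
    }
}


def render_custom_column(config, record):
    """Format the pattern using only the location items listed as dependencies."""
    values = {name: record[name] for name in config["dependencies"]}
    return config["pattern"].format(**values)


# Made-up location items for a single synthesised record.
record = {"door": "12", "street": "High Street", "postcode": "AB1 2CD", "city": "Exampleton"}
full_address = render_custom_column(custom_columns["FullAddress"], record)
print(full_address)
```

Items that are not listed in dependencies (city in this sketch) are simply ignored when the pattern is rendered.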

field mismatch: Union[Literal['random'], Literal['drop'], Literal['approximate']] = 'random'

When synthesizing data, the algorithm reproduces the geographic distribution of the source data. In order to learn the distribution it has to group records in the source data into the predetermined clusters. Some records will not match a cluster, either because the postcode is new or because it was mistyped. This setting decides how to handle those mismatched addresses. The options are: 'drop', i.e. ignore this address; 'approximate', i.e. find the closest matching address in the public database; 'random', i.e. pick a random cluster.

field num_clusters: int = 500

When synthesizing data, the algorithm reproduces the geographic distribution of the source data. It does this by grouping addresses in the source data into clusters and learning the distribution of addresses between the different clusters. The synthesized records reproduce the distribution of addresses between the clusters. When assigning an address to a synthesized record, the address is chosen at random from the publicly available addresses within that cluster. This setting sets the number of clusters into which the addresses within a locale are grouped. Note: the clustering algorithm is trained on public data and not on the data provided to the pipeline.
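
The description above can be pictured with the following minimal sketch. It is not Hazy's implementation: the postcode-to-cluster lookup, the source postcodes and the synthesise helper are all made up for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter

# Hypothetical stand-ins: a public postcode -> cluster lookup and some source postcodes.
public_clusters = {"AA1": 0, "AA2": 0, "BB1": 1, "BB2": 1, "CC1": 2}
source_postcodes = ["AA1", "AA1", "AA2", "BB1", "CC1", "AA2"]

# 1. Learn how the source records are distributed between clusters.
cluster_counts = Counter(public_clusters[p] for p in source_postcodes)

# 2. Group the public postcodes by cluster so they can be sampled from.
members = {}
for postcode, cluster in public_clusters.items():
    members.setdefault(cluster, []).append(postcode)


def synthesise(n, seed=0):
    """Draw n postcodes that follow the learned cluster distribution."""
    rng = random.Random(seed)
    clusters = list(cluster_counts)
    weights = [cluster_counts[c] for c in clusters]
    # 3. Draw a cluster per record following the learned distribution,
    #    then pick any public postcode within that cluster.
    drawn = rng.choices(clusters, weights=weights, k=n)
    return [rng.choice(members[c]) for c in drawn]


synthetic = synthesise(5)
print(synthetic)
```

In this sketch a larger number of clusters would give a finer-grained geographic distribution, at the cost of fewer source records per cluster.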

field territory_modelling: LocationTerritoryModellingType = LocationTerritoryModellingType.ASSET_SAMPLING

How locations of lower specificity than post/zip code (i.e. country, state, district) are modelled. 'combination' means sample from the combinations of country/state/district provided in the source data. This preserves the source distributions and allows locations outside of Hazy's known locales. 'asset_sampling' means sample from Hazy location assets using the provided locales.

Mapped Handler

class hazy_configurator.processing.mapped_handler_config.MappedHandlerConfig

Bases: IdHandlerConfig

Used to generate ID columns.

Examples

from hazy_configurator import MappedHandlerConfig, NumericalIdSettings

MappedHandlerConfig(
    target="account_id",
    id_settings=NumericalIdSettings(length=9),
    table_name="table1"
)
Fields:
field type: Literal['mapped'] = 'mapped'
field id_settings: Union[NormalIdSettingsUnion, DrivingLicenseNumberSettings] [Required]
field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

One-Hot Handler

class hazy_configurator.processing.one_hot_handler_config.OneHotHandlerConfig

Bases: ProcessingConfigItem

Allows modelling of One Hot Encoded columns where only one column in a group of columns can have a positive value (i.e. boolean / binary columns that are part of a mutually exclusive group).

Note that it is possible to model these columns without this handler, but there is no guarantee that the mutual exclusivity in a group of columns would be adhered to. Increasing nparents reduces the probability of a violation, but the probability remains non-zero and higher values quickly become infeasible due to memory limitations.

Using this handler on mutually exclusive columns enforces the mutual exclusivity (as the model can only generate categories that are observed in the source data), therefore eliminating the need to increase the nparents parameter.
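
The idea of treating a one-hot group as a single category can be sketched as follows. This is an illustration under stated assumptions rather than the handler's actual code: the column names, rows and helper functions are hypothetical.

```python
# Hypothetical one-hot group: exactly one of these columns is 1 in each row.
group = ["is_red", "is_green", "is_blue"]
rows = [
    {"is_red": 1, "is_green": 0, "is_blue": 0},
    {"is_red": 0, "is_green": 0, "is_blue": 1},
]


def to_category(row):
    """Collapse the group into the name of the single active column."""
    active = [col for col in group if row[col] == 1]
    if len(active) != 1:
        raise ValueError("row violates mutual exclusivity")
    return active[0]


def to_one_hot(category):
    """Expand a category back into the group; only one column can be 1."""
    return {col: int(col == category) for col in group}


categories = [to_category(r) for r in rows]
restored = [to_one_hot(c) for c in categories]
print(categories)
```

Because a model only ever generates one category per row, expanding it back always yields exactly one positive column in the group.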

Fields:
field type: Literal['one_hot'] = 'one_hot'
field column_map: dict [Required]
field as_category: bool [Required]
field table_name: str [Required]

Target table name

Person Handler

class hazy_configurator.processing.person_handler_config.PersonHandlerCustomColumnsConfig

Bases: HazyBaseModel

Fields:
field col: str [Required]

Name of the target custom column

field pattern: str [Required]

Python string formatting pattern to construct the custom column. For example, if we wanted to create a print name column that combines title, the initials of the first and second name and then the last name we can use the following pattern {title} {first_name:.1}{second_name:.1} {last_name}.
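
The pattern uses standard Python format specifications, where a precision such as .1 truncates a string to its first character. A quick check with made-up values (the names below are illustrative only):

```python
# Standard Python str.format: ".1" keeps only the first character of a string.
pattern = "{title} {first_name:.1}{second_name:.1} {last_name}"

print_name = pattern.format(
    title="Dr",
    first_name="Ada",
    second_name="Beth",
    last_name="Lovelace",
)
print(print_name)
```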

class hazy_configurator.processing.person_handler_config.PersonHandlerConfig

Bases: ProcessingConfigItem

The PersonHandler can be used to generate all attributes of an individual that relate to their name. This includes names (first, middle, last) as well as email addresses and more.

If gender or title attributes are specified, then they are treated as categories, and the distributions of their corresponding columns in the source data will be learned during training. As such, these columns should be specified as CategoryType in the corresponding data schema. If the title and gender attributes are not specified, then they will be generated by the PersonHandler in the background to ensure that all the names of an individual are consistently gendered, and so that titles can be used in any custom column configuration.

Standard Examples

from hazy_configurator import (
    PersonHandlerConfig,
    PersonHandlerCustomColumnsConfig,
    PersonLocales
)

PersonHandlerConfig(
    first_name="FirstName",
    second_name="SecondName",
    third_name="ThirdName",
    last_name="FamilyName",
    gender="Gender",
    title="Title",
    email="email",
    custom_columns=[
        PersonHandlerCustomColumnsConfig(
            col="CommsName",
            pattern="{title} {first_name:.1} {last_name}",
            dependencies=["title", "first_name", "last_name"]
        )
    ],
    locales=[PersonLocales.en_GB, PersonLocales.en_US],
    table_name="table1"
)

Cross-table Example

In the following example the gender and title columns exist in a separate table to that of the other person features.

from hazy_configurator import (
    PersonHandlerConfig,
    PersonHandlerCustomColumnsConfig,
    PersonLocales,
    ColId
)

PersonHandlerConfig(
    first_name="FirstName",
    second_name="SecondName",
    third_name="ThirdName",
    last_name="FamilyName",
    gender=ColId(col="Gender", table="table2"),
    title=ColId(col="Title", table="table2"),
    email="email",
    custom_columns=[
        PersonHandlerCustomColumnsConfig(
            col="CommsName",
            pattern="{title} {first_name:.1} {last_name}",
            dependencies=["title", "first_name", "last_name"]
        )
    ],
    locales=[PersonLocales.en_GB, PersonLocales.en_US],
    table_name="table1"
)
Fields:
field type: Literal['person'] = 'person'
field table_name: str [Required]

Target table name

field first_name: str = None
field second_name: str = None
field third_name: str = None
field fourth_name: str = None
field fifth_name: str = None
field sixth_name: str = None
field last_name: str = None
field title: Union[ColId, str] = None

Name of the title column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.

field initials: str = None

Initials column

field gender: Union[ColId, str] = None

Name of the gender column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.

field gender_map: Dict[str, Literal['m', 'f', 'o']] = {}

Mapping of gender categories. Each gender category found in the data should be a key in the dictionary, and each value must be one of 'm', 'f' or 'o', which correspond to male, female and other.

field full_name: str = None
field user_name: str = None

This is the typical format for a user's login.

field email: str = None
field custom_columns: List[PersonHandlerCustomColumnsConfig] = None

Custom configuration used to create a feature that is comprised of multiple other person features.

field locales: List[PersonLocales] = [<PersonLocales.en_GB: 'en_GB'>, <PersonLocales.en_US: 'en_US'>]

A set of locales from PersonLocales to be provided.

Pattern Handler

class hazy_configurator.processing.pattern_handler_config.PatternHandlerConfig

Bases: StandardHandlerConfigItem

Used when the target column consists of a specific pattern of other columns.

Examples

from hazy_configurator import PatternHandlerConfig


PatternHandlerConfig(
    target="FullName",
    pattern="{title}. {first_name:.1} {last_name}",
    column_map={
        "title": {"id_type": "column", "id_settings": {"column": "Title"}},
        "first_name": {"id_type": "column", "id_settings": {"column": "FirstName"}},
        "last_name": {"id_type": "column", "id_settings": {"column": "LastName"}},
    },
    table_name="table1"
)
Fields:
field type: Literal['pattern'] = 'pattern'
field pattern: str [Required]

Python string formatting pattern to construct the target column.

field column_map: dict [Required]

A mapping between variables defined in the pattern and their respective column and settings.

field fill_char: str = ''

String used to replace null values.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
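
The way pattern, column_map and fill_char fit together can be pictured with the sketch below. It is an assumption-laden illustration, not the handler's implementation: column_map is simplified to a plain variable-to-column mapping, the rows are made up, and it is assumed that nulls are replaced by fill_char before the pattern is applied.

```python
pattern = "{title}. {first_name:.1} {last_name}"
# Simplified stand-in for column_map: pattern variable -> source column name.
column_map = {"title": "Title", "first_name": "FirstName", "last_name": "LastName"}
fill_char = ""


def build_target(row):
    """Fill nulls with fill_char (assumed behaviour), then apply the pattern."""
    values = {}
    for variable, column in column_map.items():
        value = row.get(column)
        values[variable] = fill_char if value is None else str(value)
    return pattern.format(**values)


# Hypothetical source rows; the second one has a null first name.
rows = [
    {"Title": "Ms", "FirstName": "Grace", "LastName": "Hopper"},
    {"Title": "Mr", "FirstName": None, "LastName": "Turing"},
]
full_names = [build_target(r) for r in rows]
print(full_names)
```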

Placeholder Handler

class hazy_configurator.processing.placeholder_handler_config.PlaceholderHandlerConfig

Bases: StandardHandlerConfigItem

Dummy config used to export Raw Type from configurator UI.

Custom Handlers are not available through the configurator UI, so this class allows the export to take place and pass validation.

Note

Training will break if this class is present. This class should be replaced with a different handler before training.

Examples

from hazy_configurator import PlaceholderHandlerConfig

PlaceholderHandlerConfig(target="raw_column_requiring_processing", table_name="table1")
Fields:
field type: Literal['placeholder'] = 'placeholder'
field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

Sample Handler

class hazy_configurator.processing.sample_handler_config.SampleHandlerConfig

Bases: StandardHandlerConfigItem

Used to preserve non-supported columns by randomly sampling values from the source dataset.

In most cases a CategoryType should be used for this column as it is very similar. This handler should be used when you don’t want to condition against any other columns and you just want to randomly sample entirely from the distribution observed in the source data.
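
Unconditional sampling of this kind can be sketched with the standard library as below. The source values and the sample_column helper are hypothetical, and the treatment of preserve_dist=False (each distinct value equally likely) is an assumption, since the reference does not spell out that case.

```python
import random

# Hypothetical source column with a skewed distribution.
source_values = ["a", "a", "a", "b"]


def sample_column(values, n, preserve_dist=True, seed=0):
    """Sample n values independently of every other column."""
    rng = random.Random(seed)
    # Assumption: with preserve_dist=False each distinct value is equally likely.
    pool = list(values) if preserve_dist else sorted(set(values))
    return [rng.choice(pool) for _ in range(n)]


synthetic = sample_column(source_values, 1000)
print(synthetic[:10])
```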

Examples

from hazy_configurator import SampleHandlerConfig

SampleHandlerConfig(
    target="foo",
    preserve_dist=True,
    table_name="table1"
)
Fields:
field type: Literal['sample'] = 'sample'
field preserve_dist: bool = True

Preserve distribution?

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

Sequence Handler

class hazy_configurator.processing.main.SequenceHandlerConfig

Bases: ProcessingConfigItem

Used for orchestrating a sequence of handlers.

Fields:
field type: Literal['sequence'] = 'sequence'
field handlers: List[CustomHandlerConfig] [Required]

List of handlers.

field table_name: str [Required]

Target table name

Single Column Normaliser Handler

class hazy_configurator.processing.single_column_normaliser_config.SingleColumnNormaliserConfig

Bases: ProcessingConfigItem

Used to handle simple denormalisation cases where redundant information has been copied to child tables.

Examples

from hazy_configurator import SingleColumnNormaliserConfig

SingleColumnNormaliserConfig(
    source_table_name="Child table",
    source_key="Parent ID",
    source_column="Parent Attribute on Child",
    target_table_name="Parent table",
    target_key="Parent ID",
    target_column="Parent Attribute",
)
Fields:
field type: Literal['static_attribute'] = 'static_attribute'
field source_table_name: str [Required]

Name of the table containing the source attribute to match against the referring attribute column in the target table.

field source_key: str [Required]

Name of column with uniqueness constraint used to look up attribute column values in the source table.

field source_column: str [Required]

Name of the static attribute column in the source table.

field target_table_name: str [Required]

Name of the table containing the referring attribute.

field target_key: str [Required]

Name of foreign key column that points to the source key column in the source table and is used to look up attribute values.

field target_column: str [Required]

Name of the referring attribute column in the target table. This column should consist of the same set of values as the source attribute column.

field merge_column: bool = False

Merges the target column to the source table. Should be used when the column is not present in the parent table already.
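
The kind of denormalisation this handler addresses can be illustrated without the library. In the sketch below, a parent-level attribute is repeated on every child row; keeping one value per parent key and copying it back guarantees that all children of a parent agree. The table contents and column names are made up, and this is not the handler's actual implementation.

```python
# Hypothetical denormalised child table: the parent's tier is repeated on every child row.
child_rows = [
    {"Parent ID": 1, "Tier": "gold", "Amount": 10},
    {"Parent ID": 1, "Tier": "gold", "Amount": 25},
    {"Parent ID": 2, "Tier": "silver", "Amount": 7},
]

# Normalise: keep a single value of the static attribute per parent key.
tier_by_parent = {}
for row in child_rows:
    tier_by_parent.setdefault(row["Parent ID"], row["Tier"])

# Denormalise again: copy the single parent-level value back onto each child row,
# so every child of the same parent carries the same value.
rebuilt = [{**row, "Tier": tier_by_parent[row["Parent ID"]]} for row in child_rows]
print(tier_by_parent)
```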

Symbol Handler

class hazy_configurator.processing.symbol_handler_config.SymbolHandlerConfig

Bases: StandardHandlerConfigItem

Allows support for numerical columns with a symbol leading or trailing the numerical value such as £250 or 10%.

Examples

from hazy_configurator import SymbolHandlerConfig

SymbolHandlerConfig(
    target="rate",
    symbol="%",
    table_name="table1"
)
Fields:
field type: Literal['symbol'] = 'symbol'
field symbol: str [Required]

Symbol or pattern to strip away.

field decimal: str = '.'

Decimal symbol.

field thousand_sep: str = ''

Thousand symbol.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
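
The symbol, decimal and thousand_sep fields can be pictured with the parsing sketch below. It is only an illustration: parse_symbol_number is a hypothetical helper, and the symbol is treated as a literal string rather than a pattern.

```python
def parse_symbol_number(text, symbol, decimal=".", thousand_sep=""):
    """Strip the symbol and separators, then parse what is left as a float."""
    stripped = text.replace(symbol, "").strip()  # symbol treated as a literal here
    if thousand_sep:
        stripped = stripped.replace(thousand_sep, "")
    if decimal != ".":
        stripped = stripped.replace(decimal, ".")
    return float(stripped)


print(parse_symbol_number("10%", "%"))
print(parse_symbol_number("£1,250.50", "£", thousand_sep=","))
print(parse_symbol_number("1.250,50 €", "€", decimal=",", thousand_sep="."))
```

The thousands separator is removed before the decimal symbol is normalised, so that formats such as 1.250,50 are not misread.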

Text Category Handler

class hazy_configurator.processing.text_category_handler_config.TextCategoryHandlerConfig

Bases: StandardHandlerConfigItem

Used for columns that combine several different IDs into the same column. For example, this could be used for an ID column containing IDs that consist of a letter followed by 5 digits, where the letter is either an A or a B. This could be modelled with an IDType using a regex, but that would not model the distribution between IDs starting with an A and those starting with a B. The TextCategoryHandler will model this.

It could also be used for a column that contains two different IDs such as either a passport number or a social security number.

The example below shows the configuration for a target column, ID. It is set up so that when a value in the target column matches the regex pattern with the prefix A, an ID with the prefix A will be generated. Similarly, when a value in the target column matches the regex pattern with the prefix B, an ID with the prefix B will be generated.

Examples

from hazy_configurator import (
    IdMismatchBehaviour,
    IdMixturePatternConfig,
    RegexIdSettings,
    TextCategoryHandlerConfig,
)

TextCategoryHandlerConfig(
    target="ID",
    patterns=[
        IdMixturePatternConfig(
            match="^A[1-5]{5}$",
            label=None,
            case=False,
            sampler=RegexIdSettings(id_type="regex", pattern="A[1-5]{5}", unique=True),
        ),
        IdMixturePatternConfig(
            match="^B[1-5]{5}$",
            label=None,
            case=False,
            sampler=RegexIdSettings(id_type="regex", pattern="B[1-5]{5}", unique=True),
        ),
    ],
    mismatch=IdMismatchBehaviour.REPLACE,
    table_name="table1",
)
Fields:
field type: Literal['text_category'] = 'text_category'
field patterns: List[IdMixturePatternConfig] [Required]

Parameters for matching text and sampling.

field mismatch: IdMismatchBehaviour = IdMismatchBehaviour.REPLACE

Behaviour when there are values that do not match any of the specified patterns. 'replace' will replace unmatched values with values generated from the other patterns. 'preserve' will leave any unmatched values as they are and treat them as categories.

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name

field unique: bool = False

When set to True the generated values will be unique. Uniqueness is not guaranteed, as the number of records generated could be greater than the number of possible unique values.
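
The distribution between patterns that this handler models can be pictured with a small standard-library sketch. The source IDs and the classify helper are hypothetical, case handling is not modelled, and this is not the handler's implementation.

```python
import re
from collections import Counter

# The two match patterns from the example above, keyed by an illustrative label.
patterns = {
    "A": re.compile(r"^A[1-5]{5}$"),
    "B": re.compile(r"^B[1-5]{5}$"),
}

# Hypothetical source column; the last value matches neither pattern.
source_ids = ["A12345", "A54321", "B11111", "A22222", "n/a"]


def classify(value):
    """Return the label of the first pattern that matches, or None."""
    for label, regex in patterns.items():
        if regex.match(value):
            return label
    return None


counts = Counter(classify(v) for v in source_ids)
print(counts)
```

The counts per label are the kind of distribution a single regex-based ID would lose; the unmatched value is the case governed by the mismatch setting.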

Timedelta Handler

class hazy_configurator.processing.time_delta_handler_config.TimeDeltaHandlerConfig

Bases: StandardHandlerConfigItem

Converts object type columns to time delta.

Examples

from hazy_configurator import TimeDeltaHandlerConfig, TimeDeltaUnit

TimeDeltaHandlerConfig(target="time_elapsed", unit=TimeDeltaUnit.SECOND, table_name="table1")
Fields:
field type: Literal['time_delta'] = 'time_delta'
field unit: TimeDeltaUnit [Required]

Taken from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html

field target: str [Required]

Target column name.

field table_name: str [Required]

Target table name
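
As a rough standard-library analogue of the conversion, the sketch below interprets a made-up column of string values as seconds. The handler itself refers to pandas.to_timedelta for its units; the values and variable names here are illustrative assumptions.

```python
from datetime import timedelta

# Hypothetical raw column stored as strings (object dtype), interpreted as seconds.
raw = ["90", "3600", "12.5"]

deltas = [timedelta(seconds=float(value)) for value in raw]
print(deltas)
```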