Custom Handlers¶
All handlers are listed here and can be used in place of CustomHandlerConfig.
These are provided as a list on Data Schema.
Classes:
- Used when dealing with a column containing values that should be handled differently internally, such as in a gender column where values may not be limited to 'f', 'm', or 'o'.
- Specify when values within a target column are pre-determined whenever a condition is met.
- The ConditionedIDHandler can be used when the type of sampler to be applied for the target column is dependent upon the values in other columns.
- The DateReformatHandler can be used to convert datetimes to string representations of dates in a specified format.
- Used to specify when a column is entirely determined by another column.
- Used for modelling ID columns between tables.
- Allows modelling of One Hot Encoded columns where only one column in a group of columns can have a positive value (i.e. boolean / binary columns that are part of a mutually exclusive group).
- Used when the target column consists of a specific pattern of other columns.
- Dummy config used to export Raw Type from configurator UI.
- Used to preserve non-supported columns by randomly sampling values from the source dataset.
- Used for orchestrating a sequence of handlers.
- Used to handle simple denormalisation cases where redundant information has been copied to child tables.
- Allows support for numerical columns with a symbol preceding or following the numerical value such as 10% or £250.
- Used for columns that combine several different IDs into the same column.
- Used to convert all values to a base currency.
Classes Superseded by Hazy Data Types:
The following classes duplicate functionality provided by the Data Types.
- Used only when a table contains both a date of birth column and an age column.
- Specify when a column is bounded by either other columns or a fixed value.
- Used to define a set of columns for which only certain combinations of the included columns make sense.
- Used to convert object type data into datetime.
- Models a column as a function of other columns.
- Used to generate ID columns.
- Allows the modelling of location information.
- Used to generate ID columns.
- The PersonHandler can be used to generate all attributes of an individual that relate to their name.
- Converts object type columns to time delta.
Age Handler¶
- class hazy_configurator.processing.age_handler_config.AgeHandlerConfig¶
Bases:
ProcessingConfigItem
Used only when a table contains both a date of birth column and an age column.
Standard Examples
from hazy_configurator import AgeHandlerConfig

AgeHandlerConfig(
    age_column="Age",
    dob_column="Date Of Birth",
    ref_date="2022-12-25",
    table_name="table1"
)

{
    "type": "age",
    "age_column": "Age",
    "dob_column": "Date Of Birth",
    "ref_date": "2022-12-25",
    "table_name": "table1"
}
Cross-table Example
In the following example, the Date of Birth column exists in a separate table to that of the Age column.
from hazy_configurator import AgeHandlerConfig, ColId

AgeHandlerConfig(
    age_column="Age",
    dob_column=ColId(col="Date Of Birth", table="table2"),
    ref_date="2022-12-25",
    table_name="table1"
)
- Fields:
- field dob_column: Union[ColId, str] [Required]¶
Date of birth column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.
- property mapped_dob_column¶
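The relationship between the two columns can be illustrated without the library. The sketch below assumes that the age is the number of completed years between the date of birth and ref_date; this is an assumption about the handler, and the helper name age_at is hypothetical and not part of hazy_configurator.

```python
from datetime import date


def age_at(dob: date, ref_date: date) -> int:
    """Completed years between dob and ref_date (hypothetical helper)."""
    years = ref_date.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (ref_date.month, ref_date.day) < (dob.month, dob.day):
        years -= 1
    return years


ref = date(2022, 12, 25)
print(age_at(date(1990, 12, 25), ref))  # birthday falls on the reference date
print(age_at(date(1990, 12, 26), ref))  # birthday is one day after it
```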
Bounded Handler¶
- class hazy_configurator.processing.bounded_handler_config.BoundedHandlerConfig¶
Bases:
StandardHandlerConfigItem
Specify when a column is bounded by either other columns or a fixed value.
Standard Examples
from hazy_configurator import (
    BoundedHandlerConfig,
    ColumnBound,
    StaticBound,
)

BoundedHandlerConfig(
    target="rent",
    upper=ColumnBound(value="income"),
    lower=StaticBound(value=500),
    table_name="table1"
)

{
    "type": "bounded",
    "target": "rent",
    "upper": {"type": "column", "value": "income"},
    "lower": {"type": "static", "value": 500},
    "table_name": "table1"
}
Cross-table Example
In the following example the upper bound column, income, exists in a different table to the target column rent.
from hazy_configurator import (
    BoundedHandlerConfig,
    ColId,
    ColumnBound,
    StaticBound,
)

BoundedHandlerConfig(
    target="rent",
    table_name="table1",
    upper=ColumnBound(value=ColId(col="income", table="table2")),
    lower=StaticBound(value=500)
)
- Fields:
- field upper: Optional[Union[ColumnBound, StaticBound]] = None¶
- field lower: Optional[Union[ColumnBound, StaticBound]] = None¶
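To make the constraint concrete, here is a minimal library-free sketch of what "bounded" means for the rent example: the lower bound is a static value and the upper bound is read from another column of the same row. The function name within_bounds and the choice of inclusive bounds are assumptions made for illustration; the source does not state how the handler enforces the bounds.

```python
from typing import Optional


def within_bounds(value: float, lower: Optional[float], upper: Optional[float]) -> bool:
    """Return True when value respects whichever bounds are set (inclusive)."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


rows = [
    {"rent": 800, "income": 2500},
    {"rent": 450, "income": 2000},   # below the static lower bound of 500
    {"rent": 3000, "income": 2800},  # above the per-row upper bound (income)
]
# lower is a static value; upper comes from the income column of each row.
valid = [within_bounds(r["rent"], lower=500, upper=r["income"]) for r in rows]
print(valid)
```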
Cat Mapper Handler¶
- class hazy_configurator.processing.cat_mapper_handler_config.CatMapperHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used when dealing with a column containing values that should be handled differently internally, such as in a gender column where values may not be limited to ‘f’, ‘m’, or ‘o’. In such cases, you can utilize the category_map feature to specify how specific values should be categorized as ‘f’, ‘m’, or ‘o’.
Examples
from hazy_configurator import CatMapperHandlerConfig

CatMapperHandlerConfig(
    target="Gender",
    category_map={"Male": "m", "Female": "f", "Other": "o"},
    table_name="table1"
)

{
    "type": "category_mapper",
    "col": "Gender",
    "category_map": {"Female": "f", "Male": "m", "Other": "o"},
    "table_name": "table1"
}
- Fields:
- field category_map: Dict[str, str] [Required]¶
Mapping that defines the relationship between the values found in a column and how they should be used internally. In the category map, the keys represent the original values found in the columns, while the corresponding values indicate how those values should be interpreted internally.
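The effect of a category map can be sketched in plain Python. This only illustrates the key/value direction described above; passing unmapped values through unchanged is an assumption, as the source does not say how the handler treats them.

```python
category_map = {"Male": "m", "Female": "f", "Other": "o"}
raw_values = ["Male", "Female", "Other", "Female"]

# Keys are the original column values; values are the internal categories.
# Assumption: values missing from the map are passed through unchanged.
internal = [category_map.get(value, value) for value in raw_values]
print(internal)
```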
Combination Handler¶
- class hazy_configurator.processing.combination_handler_config.CombinationHandlerConfig¶
Bases:
ProcessingConfigItem
Used to define a set of columns for which only certain combinations of the included columns make sense.
An example would be state and city. By using this type you would ensure cities can only ever be matched with their corresponding states, even when noise is introduced to the training process.
Examples
from hazy_configurator import CombinationHandlerConfig

CombinationHandlerConfig(target_columns=["city", "state"], table_name="table1")

{
    "type": "combination",
    "target_columns": ["city", "state"],
    "table_name": "table1"
}
Component Handler¶
- class hazy_configurator.processing.component_handler_config.ComponentHandlerNodeSource¶
Bases:
HazyBaseModel
- class hazy_configurator.processing.component_handler_config.ComponentHandlerConfig¶
Bases:
StandardHandlerConfigItem
The component handler can be used to model redundant information that is sometimes shared across a single connected sub-component. For example, in a component that is formed from family members living in the same household, information such as last names, home telephone numbers and address data may be shared for each of these individuals.
For connected sub-components to be present in a database, at least one table must have two foreign keys present. If this condition is met then the ComponentHandler will be able to identify keys that belong to the same component, and therefore can model how much redundancy exists across the components.
Examples
from hazy_configurator import (
    ComponentHandlerConfig,
    ComponentHandlerNodeSource,
    ColId,
)

ComponentHandlerConfig(
    target="CustomerID",
    table_name="table1",
    node_sources=[
        ComponentHandlerNodeSource(
            sources=[ColId(col="AccountID", table="table2")],
        ),
        ComponentHandlerNodeSource(
            sources=[
                ColId(col="CustomerID", table="table1"),
                ColId(col="CustomerID", table="table2"),
            ],
        ),
    ],
    redundant_columns=["LastName", "HomePhoneNumber"],
)

{
    "type": "component",
    "target": "CustomerID",
    "table_name": "table1",
    "node_sources": [
        {
            "sources": [{"col": "AccountID", "table": "table2"}]
        },
        {
            "sources": [
                {"col": "CustomerID", "table": "table1"},
                {"col": "CustomerID", "table": "table2"}
            ]
        }
    ],
    "redundant_columns": ["LastName", "HomePhoneNumber"]
}
- Fields:
- field node_sources: List[ComponentHandlerNodeSource] [Required]¶
List of ComponentHandlerNodeSource where each object corresponds to a single type of node.
- field redundant_columns: List[str] = []¶
Redundant columns are columns that contain duplicate information across records that belong to the same connected sub-component. For example, families that have joint accounts would belong to the same component, and a redundant column in this case could be any last name or address feature, as these may be shared across records. It should be noted that any columns included in this param must not exist in the node sources.
- field threshold: float = 0.1¶
A float between 0 and 1. If no redundant columns are provided, the handler will search for redundant columns by using a threshold value. For a given column, each component group is iterated over and the ratio of unique values to group size is calculated (while ignoring missing values). Any column whose average ratio is greater than the threshold is then considered a redundant column.
- field categorical: bool = True¶
When set to False, categorical columns will not be included in the search for redundant columns.
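The two ideas in this section, grouping keys into connected components and computing the unique values / group size statistic per component, can be sketched in plain Python. The link table, the last_names lookup and the union-find approach are all assumptions made for illustration; the source does not describe how the handler identifies components internally.

```python
from collections import defaultdict

# Hypothetical link table with two foreign keys: (AccountID, CustomerID).
links = [("A1", "C1"), ("A1", "C2"), ("A2", "C3"), ("A3", "C3"), ("A3", "C4")]
last_names = {"C1": "Smith", "C2": "Smith", "C3": "Jones", "C4": None}

parent = {}


def find(node):
    """Return the representative of the component containing node."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node


def union(a, b):
    """Merge the components containing a and b."""
    parent[find(a)] = find(b)


for account, customer in links:
    union(("account", account), ("customer", customer))

# Group customers by the component they belong to.
components = defaultdict(set)
for kind, key in list(parent):
    if kind == "customer":
        components[find((kind, key))].add(key)


def unique_ratio(values):
    """Unique values / group size, ignoring missing values."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else None


ratios = [unique_ratio([last_names[c] for c in members]) for members in components.values()]
ratios = [r for r in ratios if r is not None]
average = sum(ratios) / len(ratios)
print(sorted(sorted(m) for m in components.values()), average)
```

Here the customers C1 and C2 share account A1, and C3 and C4 are linked through account A3, so two components are found and the statistic is averaged over them.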
Conditioned Handler¶
- class hazy_configurator.processing.conditioned_handler_config.ConditionedHandlerValueMap¶
Bases:
HazyBaseModel
- Fields:
- class hazy_configurator.processing.conditioned_handler_config.ConditionedHandlerCondition¶
Bases:
HazyBaseModel
- Fields:
- field column: Union[ColId, str] [Required]¶
Name of the condition column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.
- field value_mappers: List[ConditionedHandlerValueMap] [Required]¶
List of individual condition mappings for the condition column.
- class hazy_configurator.processing.conditioned_handler_config.ConditionedHandlerConfig¶
Bases:
StandardHandlerConfigItem
Specify when values within a target column are pre-determined whenever a condition is met.
In the examples below, this can be read as the column joint_income:
- equals null when column status is equal to “single”
- equals null when column status is equal to “divorced”
- equals null when column emp_title is equal to “Unemployed”
Standard Example
from hazy_configurator import (
    ConditionedHandlerConfig,
    ConditionedHandlerCondition,
    ConditionedHandlerValueMap
)

ConditionedHandlerConfig(
    target="joint_income",
    table_name="table1",
    condition_map=[
        ConditionedHandlerCondition(
            column="status",
            value_mappers=[
                ConditionedHandlerValueMap(value="single", mapped_value=None),
                ConditionedHandlerValueMap(value="divorced", mapped_value=None)
            ]
        ),
        ConditionedHandlerCondition(
            column="emp_title",
            value_mappers=[
                ConditionedHandlerValueMap(value="Unemployed", mapped_value=None)
            ]
        )
    ]
)

{
    "type": "conditioned_rule",
    "target": "joint_income",
    "table_name": "table1",
    "condition_map": [
        {
            "column": "status",
            "value_mappers": [
                {"value": "single", "mapped_value": null},
                {"value": "divorced", "mapped_value": null}
            ]
        },
        {
            "column": "emp_title",
            "value_mappers": [
                {"value": "Unemployed", "mapped_value": null}
            ]
        }
    ]
}
Cross-table Example
In the following example, the condition columns, status and emp_title, exist in a different table to the target column, joint_income.
from hazy_configurator import (
    ColId,
    ConditionedHandlerConfig,
    ConditionedHandlerCondition,
    ConditionedHandlerValueMap,
)

ConditionedHandlerConfig(
    target="joint_income",
    table_name="table1",
    condition_map=[
        ConditionedHandlerCondition(
            column=ColId(col="status", table="table2"),
            value_mappers=[
                ConditionedHandlerValueMap(value="single", mapped_value=None),
                ConditionedHandlerValueMap(value="divorced", mapped_value=None)
            ]
        ),
        ConditionedHandlerCondition(
            column=ColId(col="emp_title", table="table2"),
            value_mappers=[
                ConditionedHandlerValueMap(value="Unemployed", mapped_value=None)
            ]
        )
    ]
)
- Fields:
- field condition_map: List[ConditionedHandlerCondition] [Required]¶
A set of conditions which define what the target column should be when each condition is met.
- property condition_map_as_dict¶
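The rule expressed by the examples above can be sketched as a plain-Python function. The nested-dictionary layout and the name apply_conditions are hypothetical stand-ins for illustration; they are not the library's internal representation.

```python
# Plain-Python stand-in for the condition_map in the examples above.
condition_map = {
    "status": {"single": None, "divorced": None},
    "emp_title": {"Unemployed": None},
}


def apply_conditions(row: dict, target: str) -> dict:
    """Overwrite row[target] when any condition column holds a mapped value."""
    result = dict(row)
    for column, value_map in condition_map.items():
        if row.get(column) in value_map:
            result[target] = value_map[row[column]]
    return result


rows = [
    {"status": "married", "emp_title": "Engineer", "joint_income": 90000},
    {"status": "single", "emp_title": "Engineer", "joint_income": 50000},
    {"status": "married", "emp_title": "Unemployed", "joint_income": 30000},
]
print([apply_conditions(r, "joint_income")["joint_income"] for r in rows])
```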
Conditioned ID Handler¶
- class hazy_configurator.processing.conditioned_id_handler_config.ConditionedIDHandlerConfig¶
Bases:
StandardHandlerConfigItem
The ConditionedIDHandler can be used when the type of sampler to be applied for the target column is dependent upon the values in other columns. A query is used to search for matches and upon a match being found, a corresponding sampler will be used to populate the target column.
Standard Examples
In the following examples, when col1 == 'A' a numerical ID of length 5 is generated and when col1 == 'B' a numerical ID of length 6 is generated.

from hazy_configurator import (
    ConditionedIDHandlerConfig,
    ConditionedIdCondition,
    NumericalIdSettings,
)

ConditionedIDHandlerConfig(
    target='col2',
    table_name='table1',
    mismatch='replace',
    conditions=[
        ConditionedIdCondition(
            query="col1 == 'A'",
            dependencies=['col1'],
            sampler=NumericalIdSettings(length=5)
        ),
        ConditionedIdCondition(
            query="col1 == 'B'",
            dependencies=['col1'],
            sampler=NumericalIdSettings(length=6)
        )
    ]
)

{
    "type": "conditioned_id",
    "target": "col2",
    "table_name": "table1",
    "mismatch": "replace",
    "conditions": [
        {
            "query": "col1 == 'A'",
            "sampler": {"id_type": "numerical", "id_settings": {"length": 5}},
            "dependencies": ["col1"]
        },
        {
            "query": "col1 == 'B'",
            "sampler": {"id_type": "numerical", "id_settings": {"length": 6}},
            "dependencies": ["col1"]
        }
    ]
}
Cross-table Example
The following example shows how to configure the handler when the condition column exists in a separate table from the target column - col1 exists in table1 and col2 exists in table2. Again, when col1 == A a numerical ID of length 5 is generated and when col1 == B a numerical ID of length 6 is generated.
from hazy_configurator import (
    ConditionedIDHandlerConfig,
    ConditionedIdCondition,
    NumericalIdSettings,
    ColId
)

ConditionedIDHandlerConfig(
    target='col2',
    table_name="table2",
    mismatch='replace',
    conditions=[
        ConditionedIdCondition(
            query="`('table1', 'col1')` == 'A'",
            dependencies=[ColId(col="col1", table="table1")],
            sampler=NumericalIdSettings(length=5)
        ),
        ConditionedIdCondition(
            query="`('table1', 'col1')` == 'B'",
            dependencies=[ColId(col="col1", table="table1")],
            sampler=NumericalIdSettings(length=6)
        )
    ]
)
- Fields:
- field conditions: List[ConditionedIdCondition] [Required]¶
Parameters for matching text and sampling.
- field mismatch: IdMismatchBehaviour = IdMismatchBehaviour.REPLACE¶
Behaviour when there are values that do not match any of the specified conditions. 'replace' will replace unmatched values using the other conditions. 'preserve' will leave any unmatched values as they are and treat them as categories.
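The idea of choosing a sampler per row can be sketched without the library. In the sketch below the lambdas stand in for the query strings, numerical_id stands in for a numerical ID sampler, and mismatch handling is left out; all of these names are hypothetical and the real handler may work differently.

```python
import random

rng = random.Random(0)  # seeded so the sketch is repeatable

# Each condition pairs a predicate on the row with an ID length; the
# predicates stand in for the query strings "col1 == 'A'" and "col1 == 'B'".
conditions = [
    (lambda row: row["col1"] == "A", 5),
    (lambda row: row["col1"] == "B", 6),
]


def numerical_id(length: int) -> str:
    """Random string of digits with the requested length."""
    return "".join(rng.choice("0123456789") for _ in range(length))


def generate_id(row: dict):
    """Use the sampler of the first matching condition, or None if none match."""
    for predicate, length in conditions:
        if predicate(row):
            return numerical_id(length)
    return None  # mismatch handling is out of scope for this sketch


ids = [generate_id({"col1": value}) for value in ["A", "B", "C"]]
print([None if i is None else len(i) for i in ids])
```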
Currency Handler¶
- class hazy_configurator.processing.currency_handler_config.CurrencyHandlerConfig¶
Bases:
ProcessingConfigItem
Used to convert all values to a base currency.
Examples
from hazy_configurator import CurrencyHandlerConfig

CurrencyHandlerConfig(
    table_name="table1",
    amount_col="transaction_value",
    currency_col="transaction_currency",
    decimal_separator=".",
    thousand_separator=",",
    date_col="transaction_time",
    currency_map={"euro": "EUR", "yen": "JPY"}
)

{
    "type": "currency",
    "table_name": "table1",
    "amount_col": "transaction_value",
    "currency_col": "transaction_currency",
    "decimal_separator": ".",
    "thousand_separator": ",",
    "date_col": "transaction_time",
    "currency_map": {"euro": "EUR", "yen": "JPY"}
}
- Fields:
- field currency_col: Optional[str] = None¶
Column name in which the currency units are stored. None if currency units are lacking or in the same column.
- field decimal_separator: Optional[str] = '.'¶
String used to separate the integer part from the fractional part. Only used when the currency amount requires parsing.
- field thousand_separator: Optional[str] = ''¶
String used to separate groups of digits at each power of one thousand (10^{3n}). Only used when the currency amount requires parsing.
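As an illustration of what the two separator settings mean when an amount needs parsing, here is a small sketch using only the standard library. The helper parse_amount is hypothetical and is not part of hazy_configurator; how the handler parses amounts internally is not stated in the source.

```python
from decimal import Decimal


def parse_amount(text: str, decimal_separator: str = ".", thousand_separator: str = "") -> Decimal:
    """Parse a formatted amount string into a Decimal (hypothetical helper)."""
    cleaned = text.strip()
    if thousand_separator:
        # Drop the grouping characters first so they cannot be confused
        # with the decimal separator.
        cleaned = cleaned.replace(thousand_separator, "")
    if decimal_separator != ".":
        cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)


print(parse_amount("1,234.56", decimal_separator=".", thousand_separator=","))
print(parse_amount("1.234,56", decimal_separator=",", thousand_separator="."))
```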
Date Format Handler¶
- class hazy_configurator.processing.date_format_handler_config.DateFormatMapping¶
Bases:
HazyBaseModel
- Fields:
- class hazy_configurator.processing.date_format_handler_config.DateFormatHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used to convert object type data into datetime.
Examples
from hazy_configurator import DateFormatHandlerConfig

DateFormatHandlerConfig(target="date_recorded", format="%Y-%m-%d", table_name="table1")

{
    "type": "date_format",
    "target": "date_recorded",
    "format": "%Y-%m-%d",
    "table_name": "table1"
}
- Fields:
- field handle_errors: PandasSupportedInvalidType = PandasSupportedInvalidType.COERCE¶
How errors are handled.
- field mappings: List[DateFormatMapping] = []¶
List of columns and date formats, to create from the same datetime on generation
- field max_unique_invalid_dates: int = 10¶
If the number of unique invalid dates is less than or equal to this threshold, invalid dates are preserved in the output. This is useful when a date column is not nullable and has invalid date values instead of nulls. If False, invalid dates will be output as missing values in the synthetic data.
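The conversion this handler performs, parsing strings with a format and treating unparseable values as missing, can be sketched with the standard library. The helper parse_date is hypothetical; returning None for invalid values mirrors the idea of the COERCE default, but the handler's actual implementation is an assumption here.

```python
from datetime import datetime
from typing import Optional


def parse_date(text: str, fmt: str = "%Y-%m-%d") -> Optional[datetime]:
    """Parse text with fmt, returning None for invalid values ("coerce")."""
    try:
        return datetime.strptime(text, fmt)
    except (ValueError, TypeError):
        return None


values = ["2022-12-25", "not recorded", "2022-02-30"]
parsed = [parse_date(v) for v in values]
print(parsed)
```

Both the free-text value and the impossible date 2022-02-30 fail to parse and come back as None, while the valid date becomes a datetime.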
Date Reformat Handler¶
- class hazy_configurator.processing.date_reformat_handler_config.DateReformatHandlerConfig¶
Bases:
ProcessingConfigItem
The DateReformatHandler can be used to convert datetimes to string representations of dates in a specified format. It is used if a table contains two different representations of the same date.
Standard Examples
from hazy_configurator import DateReformatHandlerConfig

DateReformatHandlerConfig(
    source_col="date1",
    dest_col="date2",
    format="%Y-%m-%d",
    table_name="table1"
)

{
    "type": "date_reformat",
    "source_col": "date1",
    "dest_col": "date2",
    "format": "%Y-%m-%d",
    "table_name": "table1"
}
Cross Table Example
from hazy_configurator import DateReformatHandlerConfig, ColId

DateReformatHandlerConfig(
    source_col=ColId(col="date1", table="table2"),
    dest_col="date2",
    format="%Y-%m-%d",
    table_name="table1"
)
- Fields:
- field source_col: Union[ColId, str] [Required]¶
Source column containing datetime. If the column exists in a table that is different to the destination column then use a ColId object and provide the name of the column and the table that it is in.
- property mapped_source_col¶
Determined Handler¶
- class hazy_configurator.processing.determined_handler_config.DeterminedHandlerValueMap¶
Bases:
HazyBaseModel
- Fields:
- class hazy_configurator.processing.determined_handler_config.DeterminedHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used to specify when a column is entirely determined by another column. For example, country is entirely determined by city.
Standard Examples
from hazy_configurator import DeterminedHandlerConfig, DeterminedHandlerValueMap

DeterminedHandlerConfig(
    target="country",
    table_name="table1",
    condition_column="city",
    condition_map=[
        DeterminedHandlerValueMap(value="Paris", mapped_value="France"),
        DeterminedHandlerValueMap(value="London", mapped_value="United Kingdom"),
        DeterminedHandlerValueMap(value="New York", mapped_value="United States")
    ]
)

{
    "type": "determined",
    "target": "country",
    "table_name": "table1",
    "condition_column": "city",
    "condition_map": {
        "Paris": "France",
        "London": "United Kingdom",
        "New York": "United States"
    }
}
Cross-table example
In the following example the condition column, city, exists in a different table to the target column country.
from hazy_configurator import (
    DeterminedHandlerConfig,
    DeterminedHandlerValueMap,
    ColId
)

DeterminedHandlerConfig(
    target="country",
    table_name="table1",
    condition_column=ColId(col="city", table="table2"),
    condition_map=[
        DeterminedHandlerValueMap(value="Paris", mapped_value="France"),
        DeterminedHandlerValueMap(value="London", mapped_value="United Kingdom"),
        DeterminedHandlerValueMap(value="New York", mapped_value="United States")
    ]
)
- Fields:
- field condition_column: Union[ColId, str] [Required]¶
Name of the condition column. If the column exists in a table different to that of the target column then use a ColId object and provide the name of the column and the table that it is in.
- field condition_map: List[DeterminedHandlerValueMap] [Required]¶
A set of conditions which define what the target column should be when each condition is met.
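Before configuring this handler it can help to confirm that the target really is entirely determined by the condition column in the source data. The check below is a library-free sketch; the function is_determined_by and the sample rows are hypothetical.

```python
from collections import defaultdict

rows = [
    {"city": "Paris", "country": "France"},
    {"city": "London", "country": "United Kingdom"},
    {"city": "Lyon", "country": "France"},
    {"city": "New York", "country": "United States"},
    {"city": "Paris", "country": "France"},
]


def is_determined_by(rows, target: str, condition_column: str) -> bool:
    """True when each condition value maps to exactly one target value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[condition_column]].add(row[target])
    return all(len(targets) == 1 for targets in seen.values())


print(is_determined_by(rows, target="country", condition_column="city"))
print(is_determined_by(rows, target="city", condition_column="country"))
```

Country is determined by city, but not the other way round, because France appears with both Paris and Lyon.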
Formula Handler¶
- class hazy_configurator.processing.formula_handler_config.FormulaHandlerConfig¶
Bases:
StandardHandlerConfigItem
Models a column as a function of other columns.
Standard Examples
from hazy_configurator import FormulaHandlerConfig

FormulaHandlerConfig(
    target="joint_income",
    table_name="table1",
    expression="a + b",
    column_map={
        "a": "income1",
        "b": "income2"
    }
)

{
    "type": "formula_rule",
    "target": "joint_income",
    "table_name": "table1",
    "expression": "a + b",
    "column_map": {
        "a": "income1",
        "b": "income2"
    }
}
Cross-table Example
In the following example the dependency columns of the expression, income1 and income2, exist in a different table to the target column joint_income.
from hazy_configurator import FormulaHandlerConfig, ColId

FormulaHandlerConfig(
    target="joint_income",
    table_name="table1",
    expression="a + b",
    column_map={
        "a": ColId(col="income1", table="table2"),
        "b": ColId(col="income2", table="table2")
    }
)
- Fields:
- field expression: str [Required]¶
Formula to apply to the column.
Some examples of valid formulas are "a + b + c" or "if(is_last(x), y, z)". Note that the formula cannot contain static values such as integers or strings. If this is required use a static value in the column map. See Expression syntax for available syntax and examples.
- field column_map: Dict[str, Union[str, StaticValue, ColId]] [Required]¶
A mapping between variables defined in the formula and the columns of the provided data.
The dictionary value can either be a string specifying the name of a column, a ColId object (for when the column exists in a different table to the target column), or another dictionary specifying a static value. If it is a dictionary specifying a static value then it should be in the format StaticValue(value="30 days, 2 hours", dtype="timedelta") where the type is always static, the value is the value and the dtype is a required parameter which is used to convert the provided value into a datatype. The dtype can be either "string", "float", "integer", "boolean", "datetime" or "timedelta".
When using “timedelta” in the column_map the following units can be used in strings:
- W
- D / days / day
- hours / hour / hr / h
- m / minute / min / minutes / T
- S / seconds / sec / second
- ms / milliseconds / millisecond / milli / millis / L
- us / microseconds / microsecond / micro / micros / U
- ns / nanoseconds / nano / nanos / nanosecond / N
It is recommended to use these in a comma separated list. Some examples are:
- 30 days, 2 hours
- 1W - i.e. 1 week
- 1 hour, 30 mins, 30 seconds
When using a “datetime” value in the column_map it is recommended to use isoformat following YYYY-MM-DD[*HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]]] where * can match any single character. Some examples are:
- 2011-11-04
- 2011-11-04T00:05:23
- 2011-11-04 00:05:23.283
- 2011-11-04 00:05:23.283+00:00
- 2011-11-04T00:05:23+04:00
- field condition: str = None¶
Query condition to use when the formula should be applied only to the subset of the data that satisfies the specified condition.
Examples of condition syntax, where a, b, c, d, col with spaces are column names. Note backticks ` should be used around column names which contain spaces:
- (a < b) & (b < c) i.e. apply formula when column b is between a and c.
- a not in b i.e. apply formula when the value in column a is not a value in column b.
- a in b and c < d i.e. apply formula when the value in column a is in column b and value in c is less than value in d.
- a in (b + c + d) i.e. apply formula when value in a is in a column which is the sum of b, c and d.
- b == ["a", "b", "c"] i.e. apply formula when column b is equal to the value “a”, “b” or “c”. Notice quotes are used for possible values.
- c != [1, 2] i.e. apply formula when column c is not equal to 1 or 2.
- [1, 2] in c i.e. apply formula when column c is equal to 1 or 2. Can also be written c == [1, 2].
- `col with spaces` < b i.e. apply formula when values in column col with spaces are less than values in column b.
- field model_error: bool = False¶
When set to True the system examines the source data to see if there is any difference between the result of the calculation in the source data and the actual value seen in the source data. If there is a difference then this difference will be modelled and the difference replicated in the synthetic data.
An example of where this is useful is when looking at bank transactions where each transaction row includes the account balance which is a sum of the transactions so far plus a starting balance. In this case the error will be the starting balance and so if this parameter is set to true then the synthetic data will have a similar distribution of starting balances. The system will either model the difference for each row, or if the error in the source is constant for all the records in a sequence then the error added in the synthetic data will be constant over the sequence.
- field group_by: List[Union[str, ExpressionConfig]] = []¶
When provided the data is grouped according to the provided set of keys or expressions before applying the handler to each group.
This parameter must be provided as a list of either:
Column name as a string
A dictionary to be used to specify an expression. For example, this could be used to group rows by the month of the year based on a datetime column e.g.:

{
    "type": "expression",
    "expression": "a+b",
    "column_map": {
        "a": "col_1",
        "b": "col_2"
    }
}
- field sort_by: List[Union[str, ExpressionConfig]] = []¶
When provided the data is sorted by the provided columns or expressions before applying the formula.
This parameter follows the same syntax as group_by.
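The bank balance scenario described under model_error can be worked through with a few numbers. The sketch below uses made-up transactions for a single account and only computes the difference between the recorded and calculated values; how the system then models that difference is not shown and is not specified here.

```python
from itertools import accumulate

# Hypothetical source data for one account: transaction amounts and the
# balance recorded on each row.
transactions = [100, -30, 50]
recorded_balance = [600, 570, 620]

# The formula alone (a running sum of transactions) gives the computed value.
computed = list(accumulate(transactions))

# The "error" is the gap between the recorded and computed values.
errors = [actual - calc for actual, calc in zip(recorded_balance, computed)]
print(computed, errors)
```

The difference is the same on every row of the sequence, which corresponds to the starting balance in the description above.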
ID Handler¶
- class hazy_configurator.processing.id_handler_config.IdHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used to generate ID columns.
Examples
from hazy_configurator import IdHandlerConfig, NumericalIdSettings

IdHandlerConfig(
    target="account_id",
    id_settings=NumericalIdSettings(length=9),
    table_name="table1"
)

{
    "type": "id",
    "target": "account_id",
    "id_settings": {
        "id_type": "numerical",
        "length": 9
    },
    "table_name": "table1"
}
- field id_settings: IdHandlerSettingsUnion [Required]¶
Can use any options in Standard IDs, Real ID, Compound ID and Mixture ID.
ID Mapper Handler¶
- class hazy_configurator.processing.id_mapper_handler_config.IDMapperHandlerConfig¶
Bases:
ProcessingConfigItem
Used for modelling ID columns between tables.
An advanced feature which should only be used with Hazy supervision.
- Fields:
- field id_settings: IdHandlerSettingsUnion [Required]¶
Can use any options in Standard IDs, Real ID, Compound ID and Mixture ID.
Location Handler¶
- class hazy_configurator.processing.location_handler_config.LocationHandlerConfig¶
Bases:
ProcessingConfigItem
Allows the modelling of location information.
Examples
from hazy_configurator import LocationHandlerConfig, GeoLocales

LocationHandlerConfig(
    country="Country",
    postcode="Zipcode",
    locales=[GeoLocales.en_US],
    mismatch="random",
    num_clusters=500,
    table_name="table1"
)

{
    "type": "geo_cluster",
    "country": "Country",
    "postcode": "Zipcode",
    "locales": ["en_US"],
    "mismatch": "random",
    "num_clusters": 500,
    "table_name": "table1"
}
- Fields:
- field locales: Optional[List[GeoLocales]] = None¶
Region(s) in which the location data is from - can be any selection from ['en_GB', 'en_US', 'en_CA', 'en_AU', 'en_IE', 'es_ES', 'fr_FR', 'da_DK', 'de_DE', 'sv_SE', 'no_NO', 'fi_FI', 'cs_CZ']. When multiple locales are given, one of the following country fields must be provided: ['country', 'iso2', 'iso3']. Additionally, when no locale is given, the default behaviour is to assume that the data is multi-locale and therefore a country field must be provided.
- field custom_columns: dict = {}¶
Custom configuration used to create a feature that is comprised of multiple other location features. For example, if we wanted to create a full address column that combines door, street and postcode we can add a custom_columns config - {“FullAddress”: {“dependencies”: [“door”, “street”, “postcode”], “pattern”: “{door} {street} {postcode}”}}. The dependencies here are not columns but are items in the location handler.
- field mismatch: Union[Literal['random'], Literal['drop'], Literal['approximate']] = 'random'¶
When synthesizing data, the algorithm reproduces the geographic distribution of the source data. In order to learn the distribution it has to group records in the source data into the predetermined clusters. Some records will not match a cluster, either due to being a new postcode or because they were mistyped, and this setting decides how to handle those mismatched addresses. The options are: “drop” - i.e. ignore this address, “approximate” - i.e. find the closest matching address in the public database, “random” - i.e. pick a random cluster.
- field num_clusters: int = 500¶
When synthesizing data, the algorithm reproduces the geographic distribution of the source data. It does this by grouping addresses in the source data into clusters and learning the distribution of addresses between the different clusters. The synthesized records reproduce the distribution of addresses between the clusters. When assigning an address to a synthesized record, the address is assigned randomly within the cluster from the publicly available addresses within that cluster. This setting sets the number of clusters to group the addresses within that locale into. Note: the clustering algorithm is trained on public data and not on the data provided to the pipeline.
- field territory_modelling: LocationTerritoryModellingType = LocationTerritoryModellingType.ASSET_SAMPLING¶
How locations with lower specificity than post/zip code, i.e. country, state, district, are modelled. ‘combination’ means sample from combinations of the source country/state/district provided. This means source distributions are preserved and allows locations outside of Hazy’s known locales. ‘asset_sampling’ means sample from Hazy location assets using the provided locales.
Mapped Handler¶
- class hazy_configurator.processing.mapped_handler_config.MappedHandlerConfig¶
Bases:
IdHandlerConfig
Used to generate ID columns.
Examples
from hazy_configurator import MappedHandlerConfig, NumericalIdSettings

MappedHandlerConfig(
    target="account_id",
    id_settings=NumericalIdSettings(length=9),
    table_name="table1"
)

{
    "type": "mapped",
    "target": "account_id",
    "id_settings": {
        "id_type": "numerical",
        "length": 9
    },
    "table_name": "table1"
}
- Fields:
One-Hot Handler¶
- class hazy_configurator.processing.one_hot_handler_config.OneHotHandlerConfig¶
Bases:
ProcessingConfigItem
Allows modelling of One Hot Encoded columns where only one column in a group of columns can have a positive value (i.e. boolean / binary columns that are part of a mutually exclusive group).
Note that it is possible to model these columns without this handler, but there is no guarantee that the mutual exclusivity in a group of columns would be adhered to. Increasing nparents reduces the probability of this happening; however, it is still non-zero and quickly becomes infeasible due to memory limitations.
Using this handler on mutually exclusive columns enforces the mutual exclusivity (as the model can only generate categories that are observed in the source data), therefore eliminating the need to increase the nparents parameter.
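One way to see why treating the group as a single category enforces mutual exclusivity is to collapse the one-hot columns into a category and expand them again. The sketch below is an illustration only; the column names and the helpers to_category and to_one_hot are hypothetical, and the source does not confirm that the handler is implemented this way.

```python
columns = ["is_red", "is_green", "is_blue"]
rows = [
    {"is_red": 1, "is_green": 0, "is_blue": 0},
    {"is_red": 0, "is_green": 0, "is_blue": 1},
]


def to_category(row: dict) -> str:
    """Collapse a one-hot group into the name of its single active column."""
    active = [c for c in columns if row[c]]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active column, got {active}")
    return active[0]


def to_one_hot(category: str) -> dict:
    """Expand a category back into one-hot columns; exactly one is set."""
    return {c: int(c == category) for c in columns}


categories = [to_category(r) for r in rows]
restored = [to_one_hot(c) for c in categories]
print(categories, restored == rows)
```

Because every generated value is a single category, expanding it can never set two columns of the group at once.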
Person Handler¶
- class hazy_configurator.processing.person_handler_config.PersonHandlerCustomColumnsConfig¶
Bases:
HazyBaseModel
- Fields:
- field pattern: str [Required]¶
Python string formatting pattern to construct the custom column. For example, if we wanted to create a print name column that combines title, the initials of the first and second name and then the last name we can use the following pattern {title} {first_name:.1}{second_name:.1} {last_name}.
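The pattern is ordinary Python format-string syntax, so its effect can be checked directly with str.format. The name values below are made up for illustration.

```python
pattern = "{title} {first_name:.1}{second_name:.1} {last_name}"

# ":.1" is a precision specifier; for strings it keeps only the first character.
print_name = pattern.format(
    title="Dr",
    first_name="Jane",
    second_name="Mary",
    last_name="Smith",
)
print(print_name)  # Dr JM Smith
```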
- class hazy_configurator.processing.person_handler_config.PersonHandlerConfig¶
Bases:
ProcessingConfigItem
The PersonHandler can be used to generate all attributes of an individual that relate to their name. This includes names (first, middle, last) as well as email addresses and more.
If gender or title attributes are specified, then they are treated as categories, and the distributions of their corresponding columns in the source data will be learned during training. As such, these columns should be specified as CategoryType in the corresponding data schema. If the title and gender attributes are not specified, then they will be generated by the PersonHandler in the background to ensure that all the names of an individual are consistently gendered, and so that titles can be used in any custom column configuration.
Standard Examples
from hazy_configurator import (
    PersonHandlerConfig,
    PersonHandlerCustomColumnsConfig,
    PersonLocales,
)

PersonHandlerConfig(
    first_name="FirstName",
    second_name="SecondName",
    third_name="ThirdName",
    last_name="FamilyName",
    gender="Gender",
    title="Title",
    email="email",
    custom_columns=[
        PersonHandlerCustomColumnsConfig(
            col="CommsName",
            pattern="{title} {first_name:.1} {last_name}",
            dependencies=["title", "first_name", "last_name"]
        )
    ],
    locales=[PersonLocales.en_GB, PersonLocales.en_US],
    table_name="table1"
)

{
    "type": "person",
    "first_name": "FirstName",
    "second_name": "SecondName",
    "third_name": "ThirdName",
    "last_name": "FamilyName",
    "gender": "Gender",
    "title": "Title",
    "email": "email",
    "custom_columns": [
        {
            "col": "CommsName",
            "pattern": "{title} {first_name:.1} {last_name}",
            "dependencies": ["title", "first_name", "last_name"]
        }
    ],
    "locales": ["en_GB", "en_US"],
    "table_name": "table1"
}
Cross-table Example
In the following example the gender and title columns exist in a separate table to that of the other person features.
from hazy_configurator import (
    PersonHandlerConfig,
    PersonHandlerCustomColumnsConfig,
    PersonLocales,
    ColId,
)

PersonHandlerConfig(
    first_name="FirstName",
    second_name="SecondName",
    third_name="ThirdName",
    last_name="FamilyName",
    gender=ColId(col="Gender", table="table2"),
    title=ColId(col="Title", table="table2"),
    email="email",
    custom_columns=[
        PersonHandlerCustomColumnsConfig(
            col="CommsName",
            pattern="{title} {first_name:.1} {last_name}",
            dependencies=["title", "first_name", "last_name"]
        )
    ],
    locales=[PersonLocales.en_GB, PersonLocales.en_US],
    table_name="table1"
)
- Fields:
- field title: Union[ColId, str] = None¶
Name of the title column. If the column exists in a table different to that of the target column, then use a ColId object and provide the name of the column and the table that it is in.
- field gender: Union[ColId, str] = None¶
Name of the gender column. If the column exists in a table different to that of the target column, then use a ColId object and provide the name of the column and the table that it is in.
- field gender_map: Dict[str, Literal['m', 'f', 'o']] = {}¶
Mapping of gender categories. Each gender category should be a key in the dictionary with the values being one from a selection of ‘m’, ‘f’, ‘o’, which correspond to male, female and other.
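As an illustration of the shape this mapping takes, the sketch below builds a gender_map for some hypothetical source categories and checks that every value is one of the three accepted codes. The category names are assumptions made for the example, and the check is not the library's own validation.

```python
# Hypothetical source categories mapped onto the three accepted codes.
gender_map = {"Male": "m", "Female": "f", "Non-binary": "o", "Unknown": "o"}

ALLOWED_CODES = {"m", "f", "o"}  # male, female, other

# Collect any entries whose value is not one of the accepted codes.
invalid = {k: v for k, v in gender_map.items() if v not in ALLOWED_CODES}
if invalid:
    raise ValueError(f"unsupported gender codes: {invalid}")

# Applying the mapping to a few source values.
source_values = ["Male", "Unknown", "Female"]
normalised = [gender_map[v] for v in source_values]
print(normalised)  # ['m', 'o', 'f']
```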
- field custom_columns: List[PersonHandlerCustomColumnsConfig] = None¶
Custom configuration used to create a feature that is comprised of multiple other person features.
- field locales: List[PersonLocales] = [<PersonLocales.en_GB: 'en_GB'>, <PersonLocales.en_US: 'en_US'>]¶
A set of locales from PersonLocales to be provided.
Pattern Handler¶
- class hazy_configurator.processing.pattern_handler_config.PatternHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used when the target column consists of a specific pattern of other columns.
Examples
from hazy_configurator import PatternHandlerConfig

PatternHandlerConfig(
    target="FullName",
    pattern="{title}. {first_name:.1} {last_name}",
    column_map={
        "title": {"id_type": "column", "id_settings": {"column": "Title"}},
        "first_name": {"id_type": "column", "id_settings": {"column": "FirstName"}},
        "last_name": {"id_type": "column", "id_settings": {"column": "LastName"}},
    },
    table_name="table1"
)

{
    "target": "FullName",
    "pattern": "{title}. {first_name:.1} {last_name}",
    "column_map": {
        "title": {"id_type": "column", "id_settings": {"column": "Title"}},
        "first_name": {"id_type": "column", "id_settings": {"column": "FirstName"}},
        "last_name": {"id_type": "column", "id_settings": {"column": "LastName"}}
    },
    "table_name": "table1"
}
- Fields:
Placeholder Handler¶
- class hazy_configurator.processing.placeholder_handler_config.PlaceholderHandlerConfig¶
Bases:
StandardHandlerConfigItem
Dummy config used to export Raw Type from configurator UI.
Custom Handlers are not available through the configurator UI, so this class allows the export to take place and pass validation.
Note
Training will break if this class is present. This class should be replaced with a different handler before training.
Examples
from hazy_configurator import PlaceholderHandlerConfig

PlaceholderHandlerConfig(target="raw_column_requiring_processing", table_name="table1")

{
    "type": "placeholder",
    "target": "raw_column_requiring_processing",
    "table_name": "table1"
}
Sample Handler¶
- class hazy_configurator.processing.sample_handler_config.SampleHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used to preserve non-supported columns by randomly sampling values from the source dataset.
In most cases a CategoryType should be used for this column as it is very similar. This handler should be used when you don’t want to condition against any other columns and you just want to randomly sample entirely from the distribution observed in the source data.
Examples
from hazy_configurator import SampleHandlerConfig

SampleHandlerConfig(
    target="foo",
    preserve_dist=True,
    table_name="table1"
)

{
    "type": "sample",
    "target": "foo",
    "preserve_dist": true,
    "table_name": "table1"
}
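Sampling a column independently of all other columns can be sketched with the standard library alone. The snippet below illustrates the idea only and is not the handler's implementation; the column values are hypothetical.

```python
import random

# Hypothetical source column that is not otherwise supported.
source_column = ["gold", "silver", "silver", "bronze", "silver", "gold"]

rng = random.Random(42)

# Drawing with replacement from the raw values follows the observed
# frequencies in expectation and ignores every other column.
synthetic_column = [rng.choice(source_column) for _ in range(1000)]
```

Only values present in the source can appear in the output, which is why this approach suits columns that cannot be modelled in any other way.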
Sequence Handler¶
- class hazy_configurator.processing.main.SequenceHandlerConfig¶
Bases:
ProcessingConfigItem
Used for orchestrating a sequence of handlers.
Single Column Normaliser Handler¶
- class hazy_configurator.processing.single_column_normaliser_config.SingleColumnNormaliserConfig¶
Bases:
ProcessingConfigItem
Used to handle simple denormalisation cases where redundant information has been copied to child tables.
Examples
from hazy_configurator import SingleColumnNormaliserConfig

SingleColumnNormaliserConfig(
    source_table_name="Child table",
    source_key="Parent ID",
    source_column="Parent Attribute on Child",
    target_table_name="Parent table",
    target_key="Parent ID",
    target_column="Parent Attribute",
)

{
    "type": "static_attribute",
    "source_table_name": "Child table",
    "source_key": "Parent ID",
    "source_column": "Parent Attribute on Child",
    "target_table_name": "Parent table",
    "target_key": "Parent ID",
    "target_column": "Parent Attribute"
}
- Fields:
- field source_table_name: str [Required]¶
Name of the table containing the source attribute to match against the referring attribute column in the target table.
- field source_key: str [Required]¶
Name of column with uniqueness constraint used to look up attribute column values in the source table.
- field target_key: str [Required]¶
Name of foreign key column that points to the source key column in the source table and is used to look up attribute values.
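The kind of denormalisation this handler addresses can be illustrated with two small in-memory tables: the child table carries a redundant copy of a parent attribute, which can always be rebuilt by a key lookup. This is a sketch of the concept under that assumption, not the handler's implementation; the table contents are hypothetical, and the column names follow the example above.

```python
# Hypothetical parent table with a unique key and an attribute.
parent_table = [
    {"Parent ID": 1, "Parent Attribute": "UK"},
    {"Parent ID": 2, "Parent Attribute": "US"},
]

# Hypothetical child table that refers to the parent via a foreign key.
child_table = [
    {"Child ID": 10, "Parent ID": 1},
    {"Child ID": 11, "Parent ID": 2},
    {"Child ID": 12, "Parent ID": 1},
]

# Index the parent attribute by its unique key.
attribute_by_parent = {
    row["Parent ID"]: row["Parent Attribute"] for row in parent_table
}

# Rebuild the redundant copy on each child row by following the foreign key,
# so the copies can never disagree with the parent table.
for row in child_table:
    row["Parent Attribute on Child"] = attribute_by_parent[row["Parent ID"]]
```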
Symbol Handler¶
- class hazy_configurator.processing.symbol_handler_config.SymbolHandlerConfig¶
Bases:
StandardHandlerConfigItem
Allows support for numerical columns with a symbol preceding or following the numerical value, such as £250 or 10%.
Examples
from hazy_configurator import SymbolHandlerConfig

SymbolHandlerConfig(
    target="rate",
    symbol="%",
    table_name="table1"
)

{
    "type": "symbol",
    "target": "rate",
    "symbol": "%",
    "table_name": "table1"
}
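Conceptually, handling such a column means separating the symbol from the number, working with the number, and re-attaching the symbol afterwards. The sketch below shows one way this could look in plain Python; `strip_symbol` and `attach_symbol` are hypothetical helpers, not part of hazy_configurator.

```python
def strip_symbol(value: str, symbol: str):
    """Split a value such as '10%' or '£250' into (number, symbol position)."""
    if value.startswith(symbol):
        return float(value[len(symbol):]), "prefix"
    if value.endswith(symbol):
        return float(value[:-len(symbol)]), "suffix"
    raise ValueError(f"{value!r} does not contain the symbol {symbol!r}")


def attach_symbol(number: float, symbol: str, position: str) -> str:
    """Re-attach the symbol on the side it was originally found."""
    text = f"{number:g}"  # compact formatting, e.g. 250.0 -> '250'
    return symbol + text if position == "prefix" else text + symbol


rate, rate_position = strip_symbol("10%", "%")
price, price_position = strip_symbol("£250", "£")
```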
- Fields:
Text Category Handler¶
- class hazy_configurator.processing.text_category_handler_config.TextCategoryHandlerConfig¶
Bases:
StandardHandlerConfigItem
Used for columns that combine several different IDs into the same column. For example, this could be used for an ID column that contains IDs consisting of a letter followed by 5 numbers, where the letter is either an A or a B. This could be modelled with an IDType using a regex, but that would not model the distribution between IDs starting with an A and those starting with a B. The TextCategoryHandler will model this.
It could also be used for a column that contains two different IDs such as either a passport number or a social security number.
The example below shows the configuration for a target column, ID. It is set up so that when a value in the target column matches the regex pattern with the prefix A, an ID with the prefix A will be generated. Similarly, when a value in the target column matches the regex pattern with the prefix B, an ID with the prefix B will be generated.
Examples
from hazy_configurator import (
    IdMismatchBehaviour,
    IdMixturePatternConfig,
    RegexIdSettings,
    TextCategoryHandlerConfig,
)

TextCategoryHandlerConfig(
    target="ID",
    patterns=[
        IdMixturePatternConfig(
            match="^A[1-5]{5}$",
            label=None,
            case=False,
            sampler=RegexIdSettings(id_type="regex", pattern="A[1-5]{5}", unique=True),
        ),
        IdMixturePatternConfig(
            match="^B[1-5]{5}$",
            label=None,
            case=False,
            sampler=RegexIdSettings(id_type="regex", pattern="B[1-5]{5}", unique=True),
        ),
    ],
    mismatch=IdMismatchBehaviour.REPLACE,
    table_name="table1",
)

{
    "type": "text_category",
    "target": "ID",
    "table_name": "table1",
    "patterns": [
        {
            "match": "^A[1-5]{5}$",
            "label": null,
            "case": false,
            "sampler": {
                "id_type": "regex",
                "pattern": "A[1-5]{5}",
                "unique": true
            }
        },
        {
            "match": "^B[1-5]{5}$",
            "label": null,
            "case": false,
            "sampler": {
                "id_type": "regex",
                "pattern": "B[1-5]{5}",
                "unique": true
            }
        }
    ],
    "mismatch": "replace"
}
- Fields:
- field patterns: List[IdMixturePatternConfig] [Required]¶
Parameters for matching text and sampling.
- field mismatch: IdMismatchBehaviour = IdMismatchBehaviour.REPLACE¶
Behaviour when there are values that do not match any of the specified patterns.
'replace' will replace unmatched values with values drawn from the specified patterns.
'preserve' will leave any unmatched values as they are and treat them as categories.
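The idea of modelling the mix between ID patterns can be sketched with the standard library: classify each source value by the pattern it matches, record the proportions, then generate new IDs with prefixes drawn in those proportions. This is a conceptual illustration only, not the handler's implementation; `classify`, `generate` and the sample values are hypothetical, and uniqueness is not enforced.

```python
import random
import re
from collections import Counter

# Patterns mirroring the example: prefix A or B followed by five digits in 1-5.
patterns = {
    "A": re.compile(r"^A[1-5]{5}$"),
    "B": re.compile(r"^B[1-5]{5}$"),
}

source_ids = ["A12345", "B54321", "A11111", "A55555", "n/a", "B13524"]


def classify(value):
    """Return the label of the first pattern the value matches, or None."""
    for label, regex in patterns.items():
        if regex.match(value):
            return label
    return None


# Count how often each pattern occurs; None collects the unmatched values.
counts = Counter(classify(v) for v in source_ids)
labels = [label for label in counts if label is not None]
weights = [counts[label] for label in labels]

rng = random.Random(0)


def generate(label):
    """Generate a new ID for the given prefix (uniqueness is not enforced)."""
    return label + "".join(rng.choice("12345") for _ in range(5))


# Draw a prefix in proportion to its observed frequency, then generate an ID.
synthetic_ids = [generate(rng.choices(labels, weights=weights)[0]) for _ in range(10)]
```

In this sketch the unmatched value "n/a" never appears in the output, which loosely corresponds to the 'replace' behaviour; keeping it as its own category would correspond to 'preserve'.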
Timedelta Handler¶
- class hazy_configurator.processing.time_delta_handler_config.TimeDeltaHandlerConfig¶
Bases:
StandardHandlerConfigItem
Converts object type columns to time delta.
Examples
from hazy_configurator import TimeDeltaHandlerConfig, TimeDeltaUnit

TimeDeltaHandlerConfig(target="time_elapsed", unit=TimeDeltaUnit.SECOND, table_name="table1")

{
    "type": "time_delta",
    "target": "time_elapsed",
    "unit": "s",
    "table_name": "table1"
}
- Fields:
- field unit: TimeDeltaUnit [Required]¶
Unit of the time delta values. The available units are taken from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html
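To show what a unit-based conversion of object (string) values looks like, here is a standard-library sketch using `datetime.timedelta`. The `UNIT_TO_KWARG` mapping and `to_timedeltas` helper are assumptions made for this example; the handler itself refers to the units accepted by pandas.

```python
from datetime import timedelta

# Assumed subset of unit strings, mapped to timedelta keyword arguments.
UNIT_TO_KWARG = {"s": "seconds", "min": "minutes", "h": "hours", "D": "days"}


def to_timedeltas(values, unit):
    """Convert string or numeric values to timedeltas in the given unit."""
    kwarg = UNIT_TO_KWARG[unit]
    return [timedelta(**{kwarg: float(v)}) for v in values]


# Hypothetical object-typed column holding elapsed seconds as strings.
raw = ["30", "90", "3600"]
deltas = to_timedeltas(raw, "s")
```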