Model Parameters

This page describes the full set of model training parameters, including generative model hyperparameters, processing parameters, and multi-table model training parameters.

Model Parameters

class hazy_configurator.general_params.model_parameters.ModelParameters

Bases: HazyBaseModel

Container for all training and model parameters.

Examples

from hazy_configurator import (
    ModelParameters,
    PrivBayesConfig,
    MultiTableTrainingParams,
    AdjacencyType,
)

model_params = ModelParameters(
    generator=PrivBayesConfig(
        epsilon=0.001,
        n_parents=2,
        n_bins=50,
        max_cat=100,
    ),
    multi_table=MultiTableTrainingParams(
        adjacency_type=AdjacencyType.DEGREE_PRESERVING,
    ),
)
Fields:
field generator: GeneratorConfigUnion = PrivBayesConfig(skew_threshold=None, split_time=False, n_bins=100, single_threshold=None, max_cat=100, bin_strategy_default=<BinningStrategyType.UNIFORM: 'uniform'>, drop_empty_bins=False, processing_epsilon=None, preserve_datetime_range=True, generator_type='priv_bayes', epsilon=1000.0, n_parents=3, network_sample_rows=None, sample_parents=25, default_strategy=<PrivBayesUnknownCombinationStrategyType.UNIFORM: 'uniform'>)

Generative model hyperparameters and processing requirements. Choose from: PrivBayesConfig, CLFSamplerConfig, MSTConfig, or AIMConfig.

field multi_table: MultiTableTrainingParams = MultiTableTrainingParams(adjacency_type=<AdjacencyType.DEGREE_PRESERVING: 'degree_preserving'>, parent_compression=None, core_version=1, use_cache=False)

Multi-table training parameters.

field sequential: Union[SequentialTrainingParams, SequentialDGANTrainingParams, SequentialRNNTrainingParams] = SequentialTrainingParams(window_size=20, n_predict=1)

Sequential table training parameters. Choose from SequentialTrainingParams, SequentialDGANTrainingParams, or SequentialRNNTrainingParams.
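
For example, a minimal sketch of overriding the default sequential model alongside a PrivBayes generator (parameter values here are illustrative):

from hazy_configurator import (
    ModelParameters,
    PrivBayesConfig,
    SequentialTrainingParams,
)

model_params = ModelParameters(
    generator=PrivBayesConfig(epsilon=1.0),
    sequential=SequentialTrainingParams(
        window_size=10,
        n_predict=1,
    ),
)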

Multi Table Training Parameters

class hazy_configurator.general_params.model_parameters.MultiTableTrainingParams

Bases: HazyBaseModel

Multi-table-specific training parameters.

Examples

from hazy_configurator import MultiTableTrainingParams, AdjacencyType

multi_params = MultiTableTrainingParams(
    adjacency_type=AdjacencyType.DEGREE_PRESERVING,
)
Fields:
field adjacency_type: AdjacencyType = AdjacencyType.DEGREE_PRESERVING

Adjacency method to use. Degree preserving is the best general-purpose option. Component preserving works best when connected sub-components exist in the data, which may be the case when a single table has two or more foreign keys. Identity gives the best results by reusing the same bipartite graph, with the caveat that it can only generate data of a fixed magnitude.

field parent_compression: Optional[LowMIParentColumnCompressionConfig] = None

Optional configuration that limits the number of parent column candidates to condition on for any child table. See LowMIParentColumnCompressionConfig below.

field core_version: int = 1

Version of the multi-table core algorithm to use for training and generation. Version 2 is an improvement over the original, allowing the use of advanced features such as caching and incremental training & generation.

Constraints:
  • minimum = 1

  • maximum = 2

field use_cache: bool = False

Specifies whether or not to cache and reuse table sub-models during training and generation. Requires the HAZY_CACHE_FOLDER and HAZY_CACHE_PASSWORD environment variables to be set during training and generation. Caching is only available for core version 2, and is ineffective unless a random state is specified for training and generation. Table sub-models are cached so that subsequent training runs involving these tables reuse existing cached sub-models, provided the tables and their configurations have not been modified. Similarly, for generation, synthetic data is cached and reused in subsequent generation runs, provided that the source data, training configuration, and generation configuration for those tables have not been modified.
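
A minimal sketch of enabling caching with core version 2 (the cache folder and password values below are illustrative and would normally be set in the runtime environment rather than in code):

import os

from hazy_configurator import MultiTableTrainingParams

# Required at both training and generation time for caching to work.
os.environ["HAZY_CACHE_FOLDER"] = "/path/to/cache"    # illustrative path
os.environ["HAZY_CACHE_PASSWORD"] = "example-secret"  # illustrative value

multi_params = MultiTableTrainingParams(
    core_version=2,  # caching is only available for core version 2
    use_cache=True,
)
# Note: a random state should also be fixed for training and generation,
# otherwise the cache will be ineffective.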

class hazy_configurator.general_params.model_parameters.LowMIParentColumnCompressionConfig

Bases: HazyBaseModel

The configuration for specifying how to compress the number of parent columns that the model uses to condition on for a column in a child table.

Parent columns are ranked by their mutual information scores with the child columns, and the lowest columns are dropped from the conditioning set.

Applying parent compression is a memory optimization technique that can be used to reduce the memory footprint of the training program, potentially at the cost of distribution similarity between training and generated data.

Note that there can be cases where the parent-child relationship between tables in a database can be ambiguous, and either ordering is valid. In this case the ordering is defined by the order of Foreign Key Type objects in the junction tables of the user-specified Data Schema. The first foreign key column will belong to the parent table, and the second foreign key column will belong to the child table.

Examples

from hazy_configurator import (
    LowMIParentColumnCompressionConfig,
    ReductionType,
)

pc_config = LowMIParentColumnCompressionConfig(
    max_parent_columns=4,
    reduction=ReductionType.MAX,
)
Fields:
field max_parent_columns: int [Required]

The maximum number of columns from parent tables to condition on.

field reduction: Optional[ReductionType] = ReductionType.MEAN

How scores for parent columns are reduced over child columns.

field sample_records: Optional[int] = 100000

If the child table has more than sample_records rows, then a sample of this size is taken for computing the parent compression.
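
To apply parent compression, pass the configuration through the parent_compression field of MultiTableTrainingParams, for example:

from hazy_configurator import (
    LowMIParentColumnCompressionConfig,
    MultiTableTrainingParams,
    ReductionType,
)

multi_params = MultiTableTrainingParams(
    parent_compression=LowMIParentColumnCompressionConfig(
        max_parent_columns=4,
        reduction=ReductionType.MAX,
    ),
)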

Generator Configuration

Priv Bayes
class hazy_configurator.general_params.model_parameters.PrivBayesConfig

Bases: DiscretisorParams, BasicProcessingParams, SaasPrivBayesConfig

PrivBayes is an algorithm introduced in 2017 that relies on discretising the data and approximating its low-dimensional marginals by fitting an optimal and differentially private Bayesian network, allowing for efficient data generation.

Examples

from hazy_configurator import PrivBayesConfig, PrivBayesUnknownCombinationStrategyType

generative_model_config = PrivBayesConfig(
    epsilon=0.001,
    n_parents=2,
    default_strategy=PrivBayesUnknownCombinationStrategyType.MARGINAL,
    n_bins=50,
    max_cat=100,
)
Fields:
field generator_type: Literal['priv_bayes'] = 'priv_bayes'
field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees, and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy.

Constraints:
  • exclusiveMinimum = 0.0

field n_parents: int = 3

Number of parents per node to use in network building.

Constraints:
  • exclusiveMinimum = 0

field network_sample_rows: Optional[int] = None

Sets the number of rows to sample for the network building step. A small value will create a less accurate model but will reduce training time.

Constraints:
  • exclusiveMinimum = 0

field sample_parents: Optional[int] = 25

Sets the number of parents to consider for each node during the network building step. A small value will create a less accurate model but will reduce training time.

Constraints:
  • exclusiveMinimum = 0

field default_strategy: PrivBayesUnknownCombinationStrategyType = PrivBayesUnknownCombinationStrategyType.UNIFORM

Strategy used on generation when the combination of variables sampled from the learnt network has not been seen during training. The marginal option gives higher statistical similarity but is not differentially private.

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which individual numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a synthetic min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed on skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

MST
class hazy_configurator.general_params.model_parameters.MSTConfig

Bases: DiscretisorParams, BasicProcessingParams

MST is an algorithm introduced in 2021 that relies on discretising the data and fitting a differentially private graphical model to low-dimensional marginals in order to allow for efficient data generation.

Examples

from hazy_configurator import MSTConfig

generative_model_config = MSTConfig(
    epsilon=0.001,
    delta=1e-5,
    n_iters=5_000,
    compress_domain=True,
)
Fields:
field generator_type: Literal['mst'] = 'mst'
field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees, and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy.

Constraints:
  • exclusiveMinimum = 0.0

field delta: Optional[float] = 1e-05

Probability of information accidentally being leaked. This is the second differential privacy parameter.

Constraints:
  • exclusiveMinimum = 0.0

  • exclusiveMaximum = 1.0

field n_iters: Optional[int] = 1000

Number of training iterations.

Constraints:
  • exclusiveMinimum = 0

field compress_domain: Optional[bool] = True

When set to True the domain of the 1-way marginals is compressed according to the available privacy budget.

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which individual numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a synthetic min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed on skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

AIM
class hazy_configurator.general_params.model_parameters.AIMConfig

Bases: DiscretisorParams, BasicProcessingParams

Adaptive and iterative mechanism (AIM) for differentially private synthetic data. It relies on a graphical model approach to select a workload defined as a set of queries to approximate.

Examples

from hazy_configurator import AIMConfig

generative_model_config = AIMConfig(
    epsilon=0.001,
    delta=1e-5
)
Fields:
field generator_type: Literal['aim'] = 'aim'
field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees, and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy. Note that for AIM, training becomes slower as epsilon increases; training with epsilon=None takes the longest.

Constraints:
  • exclusiveMinimum = 0.0

field delta: Optional[float] = 1e-05

Probability of information accidentally being leaked. This is the second differential privacy parameter.

Constraints:
  • exclusiveMinimum = 0.0

  • exclusiveMaximum = 1.0

field max_model_size: Optional[int] = 512

Maximum size (in megabytes) that the trained model can occupy in memory.

Constraints:
  • exclusiveMinimum = 0

field degree: Optional[int] = 2

Maximum size of the marginals to be modelled.

Constraints:
  • exclusiveMinimum = 0

  • maximum = 3

field num_marginals: Optional[int] = None

Maximum number of marginals to model.

Constraints:
  • exclusiveMinimum = 0

field n_iters: Optional[int] = 1000

Number of training iterations.

Constraints:
  • exclusiveMinimum = 0

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which individual numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a synthetic min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed on skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

CLF Sampler
class hazy_configurator.general_params.model_parameters.CLFSamplerConfig

Bases: DiscretisorParams, BasicProcessingParams

CLF Sampler is an algorithm that makes use of predictive models to conditionally generate data by synthesising columns based on pairwise mutual information. The algorithm is somewhat inspired by PrivBayes and shares some similarities including conditional generation and discretisation of data.

Note

CLF Sampler currently does not provide any differential privacy guarantees.

Please use PrivBayes (via PrivBayesConfig) if differential privacy is a requirement for your synthetic data use case.

Examples

from hazy_configurator import (
    CLFSamplerConfig,
    LGBMConfig,
)

generative_model_config = CLFSamplerConfig(
    classifier_params=LGBMConfig(
        boosting_type="rf",
        n_estimators=25,
    ),
    sort_visit=True,
    sample_parents=5,
    n_bins=50,
    max_cat=100,
)
Fields:
field generator_type: Literal['clf_sampler'] = 'clf_sampler'
field classifier_params: Union[DecisionTreeConfig, LGBMConfig, LogisticRegressionConfig, RandomForestConfig] = DecisionTreeConfig(classifier_type='decision_tree', criterion='gini', splitter='best', max_depth=None, min_samples_split_count=2, min_samples_split_frac=None, min_samples_leaf_count=1, min_samples_leaf_frac=None, min_weight_fraction_leaf=0.0, max_features_func=None, max_features_count=None, max_features_frac=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)

Additional arguments to be provided during initialisation of the base classifier. Choose from DecisionTreeConfig, LGBMConfig, LogisticRegressionConfig or RandomForestConfig.

field visit_order: Optional[List[ColId]] = None

The order in which to generate the columns in a table. Note that providing a sample_parents value is likely to override any ordering specified by visit_order. Likewise, setting sort_visit to True will attempt to automatically determine the optimal visit order.

field sort_visit: bool = False

Whether or not to automatically sort table columns in order of importance during generation. When enabled this aims to optimise the quality of the generated data.

field sample_parents: Optional[int] = None

Sets the number of parents to consider for each node during the network building step. A small value will create a less accurate model but will reduce training time.

Constraints:
  • exclusiveMinimum = 0

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which individual numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a synthetic min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed on skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

Classifiers

CLF Sampler relies on a base classifier whose hyperparameters are specified in the classifier_params field.

The supported classifiers are a selection from scikit-learn and LightGBM, each having their own additional parameters that can be configured.

Classes:

DecisionTreeConfig

Parameter configuration for the Decision Tree classifier.

LGBMConfig

Parameter configuration for the LGBM classifier.

LogisticRegressionConfig

Parameter configuration for the Logistic Regression classifier.

RandomForestConfig

Parameter configuration for the Random Forest classifier.

class hazy_configurator.general_params.generators.clf_sampler.DecisionTreeConfig

Bases: BaseClassifierConfig

Parameter configuration for the Decision Tree classifier. See the scikit-learn decision tree documentation for more details.
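
Examples

A brief sketch of passing a decision tree base classifier to CLF Sampler (this assumes DecisionTreeConfig is importable from hazy_configurator in the same way as LGBMConfig; parameter values are illustrative):

from hazy_configurator import CLFSamplerConfig, DecisionTreeConfig

generative_model_config = CLFSamplerConfig(
    classifier_params=DecisionTreeConfig(
        criterion="entropy",
        max_depth=8,
        min_samples_leaf_count=5,
    ),
)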

Fields:
  • ccp_alpha (float)

  • class_weight (Optional[Literal['balanced']])

  • classifier_type (Literal['decision_tree'])

  • criterion (Literal['gini', 'entropy', 'log_loss'])

  • max_depth (Optional[int])

  • max_features_count (Optional[int])

  • max_features_frac (Optional[float])

  • max_features_func (Optional[Literal['auto', 'sqrt', 'log2']])

  • max_leaf_nodes (Optional[int])

  • min_impurity_decrease (float)

  • min_samples_leaf_count (int)

  • min_samples_leaf_frac (Optional[float])

  • min_samples_split_count (int)

  • min_samples_split_frac (Optional[float])

  • min_weight_fraction_leaf (float)

  • splitter (Literal['best', 'random'])

field classifier_type: Literal['decision_tree'] = 'decision_tree'
field criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'

The function to measure the quality of a split.

field splitter: Literal['best', 'random'] = 'best'

The strategy used to choose the split at each node. "best" chooses the best split, "random" chooses the best random split.

field max_depth: Optional[int] = None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split_count (or min_samples_split_frac) samples.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_count: int = 2

The minimum number of samples required to split an internal node. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_frac: Optional[float] = None

The minimum number of samples required to split an internal node, as a fraction of the total number of samples. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_samples_leaf_count: int = 1

The minimum number of samples required to be at a leaf node. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_leaf_frac: Optional[float] = None

The minimum number of samples required to be at a leaf node, as a fraction of the total number of samples. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_weight_fraction_leaf: float = 0.0

The minimum weighted fraction of the sum total of weights required to be at a leaf node.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_features_func: Optional[Literal['auto', 'sqrt', 'log2']] = None

The number of features to consider when looking for the best split at each node, as a function of the total number of features. "auto"/"sqrt" and "log2" use the square root and base-2 logarithm of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

field max_features_count: Optional[int] = None

The number of features to consider when looking for the best split at each node. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • exclusiveMinimum = 0

field max_features_frac: Optional[float] = None

The number of features to consider when looking for the best split at each node, as a fraction of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_leaf_nodes: Optional[int] = None

Grows a tree in best-first fashion with the specified maximum number of leaf nodes. If None, the number of leaf nodes is unlimited.

Constraints:
  • exclusiveMinimum = 0

field min_impurity_decrease: float = 0.0

A node will be split if the split induces a decrease of the impurity greater than or equal to this value.

Constraints:
  • minimum = 0.0

field class_weight: Optional[Literal['balanced']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.

field ccp_alpha: float = 0.0

Complexity parameter used for Minimal Cost-Complexity Pruning.

Constraints:
  • minimum = 0.0

class hazy_configurator.general_params.generators.clf_sampler.LGBMConfig

Bases: BaseClassifierConfig

Parameter configuration for the LGBM classifier. See the LightGBM documentation for more details.
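
Examples

A brief sketch of configuring an LGBM base classifier for CLF Sampler (parameter values here are illustrative):

from hazy_configurator import CLFSamplerConfig, LGBMConfig

generative_model_config = CLFSamplerConfig(
    classifier_params=LGBMConfig(
        boosting_type="gbdt",
        n_estimators=100,
        learning_rate=0.05,
        num_leaves=31,
    ),
)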

Fields:
  • boosting_type (Optional[Literal['gbdt', 'dart', 'goss', 'rf']])

  • class_weight (Optional[Literal['balanced']])

  • classifier_type (Literal['lgbm'])

  • colsample_bytree (float)

  • importance_type (Literal['split', 'gain'])

  • learning_rate (float)

  • max_depth (int)

  • min_child_samples (int)

  • min_child_weight (float)

  • min_split_gain (float)

  • n_estimators (int)

  • n_jobs (Optional[int])

  • num_leaves (int)

  • objective (Optional[Literal['binary', 'multiclass']])

  • reg_alpha (float)

  • reg_lambda (float)

  • subsample (float)

  • subsample_for_bin (int)

  • subsample_freq (int)

field classifier_type: Literal['lgbm'] = 'lgbm'
field boosting_type: Optional[Literal['gbdt', 'dart', 'goss', 'rf']] = 'gbdt'

Type of gradient boosting model to use. "gbdt" is traditional Gradient Boosting Decision Tree. "dart" is Dropout meets Multiple Additive Regression Trees. "goss" is Gradient-based One-Side Sampling. "rf" is Random Forest.

field num_leaves: int = 31

Maximum number of leaf nodes for base learners.

Constraints:
  • exclusiveMinimum = 0

field max_depth: int = -1

Maximum depth of the tree for base learners.

field learning_rate: float = 0.1

Learning rate used for training.

Constraints:
  • exclusiveMinimum = 0.0

field n_estimators: int = 100

Number of boosted trees to fit.

Constraints:
  • exclusiveMinimum = 0

field subsample_for_bin: int = 200000

Number of samples for constructing bins.

Constraints:
  • exclusiveMinimum = 0

field objective: Optional[Literal['binary', 'multiclass']] = None

Learning objective function to be used for training.

field class_weight: Optional[Literal['balanced']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.

field min_split_gain: float = 0.0

Minimum loss reduction required to make a further partition on a leaf node of a tree.

Constraints:
  • minimum = 0.0

field min_child_weight: float = 0.001

Minimum sum of instance weights required in a child node.

Constraints:
  • minimum = 0.0

field min_child_samples: int = 20

Minimum number of samples required in a child node.

Constraints:
  • exclusiveMinimum = 0

field subsample: float = 1.0

Subsample ratio of the training instance.

Constraints:
  • exclusiveMinimum = 0.0

field subsample_freq: int = 0

Subsample frequency. To disable, set a value less than or equal to zero.

field colsample_bytree: float = 1.0

Subsample ratio of columns when constructing each tree.

Constraints:
  • exclusiveMinimum = 0.0

field reg_alpha: float = 0.0

L1 regularisation term on weights.

Constraints:
  • minimum = 0.0

field reg_lambda: float = 0.0

L2 regularisation term on weights.

Constraints:
  • minimum = 0.0

field n_jobs: Optional[int] = None

Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, the number of physical cores in the system is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).

field importance_type: Literal['split', 'gain'] = 'split'

Type of feature importance to be used as a criterion to measure the quality of a split. "split" will measure the number of times the feature is used in a model. "gain" will measure the total gain of splits which use the feature.

class hazy_configurator.general_params.generators.clf_sampler.LogisticRegressionConfig

Bases: BaseClassifierConfig

Parameter configuration for the Logistic Regression classifier. See the scikit-learn logistic regression documentation for more details.
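
Examples

A brief sketch of passing a logistic regression base classifier to CLF Sampler (this assumes LogisticRegressionConfig is importable from hazy_configurator in the same way as LGBMConfig; parameter values are illustrative):

from hazy_configurator import CLFSamplerConfig, LogisticRegressionConfig

generative_model_config = CLFSamplerConfig(
    classifier_params=LogisticRegressionConfig(
        penalty="l2",
        C=0.5,
        max_iter=200,
    ),
)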

Fields:
  • C (float)

  • class_weight (Optional[Literal['balanced']])

  • classifier_type (Literal['logistic_regression'])

  • dual (bool)

  • fit_intercept (bool)

  • intercept_scaling (float)

  • l1_ratio (Optional[float])

  • max_iter (int)

  • multi_class (Literal['auto', 'ovr', 'multinomial'])

  • n_jobs (Optional[int])

  • penalty (Literal['l1', 'l2', 'elasticnet', 'none'])

  • solver (Literal['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'])

  • tol (float)

  • verbose (int)

  • warm_start (bool)

field classifier_type: Literal['logistic_regression'] = 'logistic_regression'
field penalty: Literal['l1', 'l2', 'elasticnet', 'none'] = 'l2'

Norm of the penalty used for regularisation during training.

field dual: bool = False

Whether to use dual or primal formulation for regularisation. Dual formulation is only supported when using an "l2" penalty and "liblinear" solver.

field tol: float = 0.0001

Tolerance for the stopping criteria used during training.

Constraints:
  • exclusiveMinimum = 0.0

field C: float = 1.0

Inverse of regularisation strength. Smaller values specify stronger regularisation.

Constraints:
  • exclusiveMinimum = 0.0

field fit_intercept: bool = True

Specifies whether an intercept term should be fitted.

field intercept_scaling: float = 1

Amount to scale the intercept term by. Intercept scaling is only applicable when fit_intercept is True and a "liblinear" solver is used.

field class_weight: Optional[Literal['balanced']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.

field solver: Literal['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'] = 'lbfgs'

Algorithm to use in the optimization problem.

field max_iter: int = 100

Maximum number of iterations taken for the solver to converge.

Constraints:
  • exclusiveMinimum = 0

field multi_class: Literal['auto', 'ovr', 'multinomial'] = 'auto'

Multi-class classification problem approach. If "ovr", then a binary problem is fit for each label. If "multinomial", then the multinomial loss across all classes is used for optimization.

field verbose: int = 0

Verbosity level for logging optimization progress. Only supported for the "liblinear" and "lbfgs" solvers.

field warm_start: bool = False

Whether or not to re-use the solution of the previous call to fit as initialization. Not supported for "liblinear" solver.

field n_jobs: Optional[int] = None

Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, a single thread is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).

field l1_ratio: Optional[float] = None

The Elastic-Net mixing parameter between 0 and 1. Only used if an "elasticnet" penalty is specified.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

class hazy_configurator.general_params.generators.clf_sampler.RandomForestConfig

Bases: BaseClassifierConfig

Parameter configuration for the Random Forest classifier. See the scikit-learn random forest classifier documentation for more details.
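
Examples

A brief sketch of passing a random forest base classifier to CLF Sampler (this assumes RandomForestConfig is importable from hazy_configurator in the same way as LGBMConfig; parameter values are illustrative):

from hazy_configurator import CLFSamplerConfig, RandomForestConfig

generative_model_config = CLFSamplerConfig(
    classifier_params=RandomForestConfig(
        n_estimators=50,
        max_depth=10,
        n_jobs=-1,
    ),
)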

Fields:
  • bootstrap (bool)

  • ccp_alpha (float)

  • class_weight (Optional[Literal['balanced', 'balanced_subsample']])

  • classifier_type (Literal['random_forest'])

  • criterion (Literal['gini', 'entropy', 'log_loss'])

  • max_depth (Optional[int])

  • max_features_count (Optional[int])

  • max_features_frac (Optional[float])

  • max_features_func (Optional[Literal['auto', 'sqrt', 'log2']])

  • max_leaf_nodes (Optional[int])

  • max_samples_count (Optional[int])

  • max_samples_frac (Optional[float])

  • min_impurity_decrease (float)

  • min_samples_leaf_count (int)

  • min_samples_leaf_frac (Optional[float])

  • min_samples_split_count (int)

  • min_samples_split_frac (Optional[float])

  • min_weight_fraction_leaf (float)

  • n_estimators (int)

  • n_jobs (Optional[int])

  • verbose (int)

  • warm_start (bool)

field classifier_type: Literal['random_forest'] = 'random_forest'
field n_estimators: int = 100

Number of trees in the forest.

Constraints:
  • exclusiveMinimum = 0

field criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'

The function to measure the quality of a split.

field max_depth: Optional[int] = None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split_count (or min_samples_split_frac) samples.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_count: int = 2

The minimum number of samples required to split an internal node. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_frac: Optional[float] = None

The minimum number of samples required to split an internal node, as a fraction of the total number of samples. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_samples_leaf_count: int = 1

The minimum number of samples required to be at a leaf node. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_leaf_frac: Optional[float] = None

The minimum number of samples required to be at a leaf node, as a fraction of the total number of samples. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_weight_fraction_leaf: float = 0.0

The minimum weighted fraction of the sum total of weights required to be at a leaf node.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_features_func: Optional[Literal['auto', 'sqrt', 'log2']] = 'sqrt'

The number of features to consider when looking for the best split at each node, as a function of the total number of features. "auto"/"sqrt" and "log2" use the square root and base-2 logarithm of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

field max_features_count: Optional[int] = None

The number of features to consider when looking for the best split at each node. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • exclusiveMinimum = 0

field max_features_frac: Optional[float] = None

The number of features to consider when looking for the best split at each node, as a fraction of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_leaf_nodes: Optional[int] = None

Grows a tree in best-first fashion with the specified maximum number of leaf nodes. If None, the number of leaf nodes is unlimited.

Constraints:
  • exclusiveMinimum = 0

field min_impurity_decrease: float = 0.0

A node will be split if the split induces a decrease of the impurity greater than or equal to this value.

Constraints:
  • minimum = 0.0

field bootstrap: bool = True

Whether bootstrap samples are used when building trees.

field n_jobs: Optional[int] = None

Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, the number of physical cores in the system is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).

field verbose: int = 0

Verbosity level for fitting and prediction progress.

field warm_start: bool = False

Whether to re-use the solution of the previous call to fit and add more estimators to the ensemble or fit a whole new forest.

field class_weight: Optional[Literal['balanced', 'balanced_subsample']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data. If "balanced_subsample", this is the same as "balanced" with weights computed for every tree grown.

field ccp_alpha: float = 0.0

Complexity parameter used for Minimal Cost-Complexity Pruning.

Constraints:
  • minimum = 0.0

field max_samples_count: Optional[int] = None

Number of training samples to draw to train each base estimator (only if bootstrap is True). None corresponds to using all samples. Note that you may only set one of max_samples_count and max_samples_frac.

Constraints:
  • exclusiveMinimum = 0

field max_samples_frac: Optional[float] = None

Number of training samples to draw to train each base estimator, as a fraction of the total number of samples (only if bootstrap is True). None corresponds to using all samples. Note that you may only set one of max_samples_count and max_samples_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

Sequential Training Parameters

class hazy_configurator.general_params.model_parameters.SequentialTrainingParams

Bases: SaasSequentialTrainingParams

Training parameters for the basic sequential model.

Examples

from hazy_configurator import SequentialTrainingParams

seq_params = SequentialTrainingParams(
    window_size=6,
    n_predict=2,
)
Fields:
field window_size: int = 20

The number of previous elements in the sequence to condition on during each sequential generation step.

Constraints:
  • exclusiveMinimum = 0

field n_predict: int = 1

The number of elements in the sequence to predict at each sequential generation step.

Constraints:
  • exclusiveMinimum = 0

class hazy_configurator.general_params.model_parameters.SequentialDGANTrainingParams

Bases: SaasSequentialDGANTrainingParams

Training parameters for the DoppelGANger sequential model. DoppelGANger (DGAN) is a generative adversarial network with a specialized architecture for modelling time series data.

Examples

from hazy_configurator import SequentialDGANTrainingParams

seq_params = SequentialDGANTrainingParams(
    nb_batch_generation=4,
    max_cat=100,
)
Fields:
field nb_batch_generation: int = 4

Number of records generated at each RNN pass.

Constraints:
  • exclusiveMinimum = 0

field max_cat: int = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • exclusiveMinimum = 0

field num_epochs: int = 500

Number of epochs to run training for. An epoch is one passing of the entire training table through the model.

Constraints:
  • exclusiveMinimum = 0

class hazy_configurator.sequential_params.SequentialRNNTrainingParams

Bases: HazyBaseModel

Configuration for the sequential RNN pipeline.

The RNN pipeline is an autoregressive model that generates sequences conditioned on static attributes.

Examples

from hazy_configurator import SequentialRNNTrainingParams

seq_params = SequentialRNNTrainingParams(
    dropout=0.1,
    learning_rate=0.0003,
    hidden_size=256,
    num_layers=2,
)
Fields:
  • batch_size (int)

  • device (hazy_configurator.base.enums.TorchDevice)

  • dropout (float)

  • hidden_size (int)

  • learning_rate (float)

  • max_cat (int)

  • max_epochs (int)

  • num_layers (int)

  • patience (int)

  • shuffle (bool)

  • val_size (float)

field batch_size: int = 64

Number of sequences per training (and generation) batch. This should be decreased if running with constrained memory limits.

Constraints:
  • minimum = 1

field max_epochs: int = 100

Maximum number of epochs to train each neural network for. Note that training is usually terminated before this number is reached, based on the patience setting.

Constraints:
  • minimum = 1

field patience: int = 7

Number of epochs after which training is terminated if there has been no improvement in the validation loss.

Constraints:
  • minimum = 1

field learning_rate: float = 0.001

Initial learning rate for the Adam optimizer.

Constraints:
  • minimum = 0.0

field device: TorchDevice = TorchDevice.CPU

Accelerator to use for training the model. NOTE: As this model relies on deep learning, it will take longer to train on a CPU.

field val_size: float = 0.2

Proportion of data to use for validation.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field shuffle: bool = True

Whether or not to shuffle training data at every epoch. Also controls shuffling before splitting data into training and validation sets.

field hidden_size: int = 512

Number of units in each hidden layer. A higher number allows for more complex patterns to be captured at the cost of longer training times and potential over-fitting.

Constraints:
  • minimum = 2

field num_layers: int = 3

Number of hidden layers in each network. A higher number allows for more complex patterns to be captured at the cost of longer training times and potential overfitting.

Constraints:
  • minimum = 1

field dropout: float = 0.0

Probability of masking input values during training. Useful for mitigating overfitting and encouraging generalization.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_cat: int = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1