Model Parameters

This page documents the full set of model training parameters, including generative model hyperparameters, processing parameters and multi-table model training parameters.

Model Parameters

class hazy_configurator.general_params.model_parameters.ModelParameters

Bases: HazyBaseModel

Container for all training and model parameters.

Examples

from hazy_configurator import (
    ModelParameters,
    PrivBayesConfig,
    MultiTableTrainingParams,
    AdjacencyType,
)

model_params = ModelParameters(
    generator=PrivBayesConfig(
        epsilon=0.001,
        n_parents=2,
        n_bins=50,
        max_cat=100,
    ),
    multi_table=MultiTableTrainingParams(
        adjacency_type=AdjacencyType.DEGREE_PRESERVING,
    ),
)
Fields:
field generator: GeneratorConfigUnion = PrivBayesConfig(skew_threshold=None, split_time=False, n_bins=100, single_threshold=None, max_cat=100, bin_strategy_default=<BinningStrategyType.UNIFORM: 'uniform'>, drop_empty_bins=False, processing_epsilon=None, preserve_datetime_range=True, generator_type='priv_bayes', epsilon=1000.0, n_parents=3, network_sample_rows=None, sample_parents=25, default_strategy=<PrivBayesUnknownCombinationStrategyType.UNIFORM: 'uniform'>)

Generative model hyperparameters and processing requirements. Choose from PrivBayesConfig, CLFSamplerConfig, MSTConfig, AIMConfig or DPGANConfig.

field multi_table: MultiTableTrainingParams = MultiTableTrainingParams(adjacency_type=<AdjacencyType.DEGREE_PRESERVING: 'degree_preserving'>, parent_compression=None, core_version=1, use_cache=False)

Multi-table training parameters.

field sequential: Union[SequentialTrainingParams, SequentialDGANTrainingParams, SequentialRNNTrainingParams] = SequentialTrainingParams(window_size=20, n_predict=1)

Sequential table training parameters. Choose from SequentialTrainingParams, SequentialDGANTrainingParams, or SequentialRNNTrainingParams.
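For example, sequential parameters can be supplied alongside the generator configuration (a minimal sketch; the values chosen here are illustrative only):

from hazy_configurator import (
    ModelParameters,
    PrivBayesConfig,
    SequentialTrainingParams,
)

model_params = ModelParameters(
    generator=PrivBayesConfig(epsilon=1.0),
    sequential=SequentialTrainingParams(
        window_size=10,  # condition on the previous 10 elements
        n_predict=1,  # predict one element per generation step
    ),
)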

Multi Table Training Parameters

class hazy_configurator.general_params.model_parameters.MultiTableTrainingParams

Bases: HazyBaseModel

Multi-table-specific training parameters.

Examples

from hazy_configurator import MultiTableTrainingParams, AdjacencyType

multi_params = MultiTableTrainingParams(
    adjacency_type=AdjacencyType.DEGREE_PRESERVING,
)
Fields:
field adjacency_type: AdjacencyType = AdjacencyType.DEGREE_PRESERVING

Adjacency method to use. Degree preserving is the best general-purpose option. Component preserving works best when connected sub-components exist in the data, which may be the case when a single table has two or more foreign keys present. Identity gives the best results by reusing the same bipartite graph, but with the caveat that it can only generate data of a fixed magnitude.

field parent_compression: Optional[LowMIParentColumnCompressionConfig] = None

Configuration specifying how to limit the number of parent column candidates to condition on for any child table. See LowMIParentColumnCompressionConfig below.

field core_version: int = 1

Version of the multi-table core algorithm to use for training and generation. Version 2 is an improvement over the original, allowing the use of advanced features such as caching and incremental training & generation.

Constraints:
  • minimum = 1

  • maximum = 2

field use_cache: bool = False

Specifies whether or not to cache and reuse table sub-models during training and generation. Requires the HAZY_CACHE_FOLDER and HAZY_CACHE_PASSWORD environment variables to be set during training and generation. Caching is only available for core version 2, and is ineffective unless a random state is specified for training and generation. Table sub-models are cached so that subsequent training runs involving these tables reuse existing cached sub-models if the tables and their configurations have not been modified. Similarly, for generation, synthetic data is cached and reused for tables in subsequent generation runs, provided that the source data, training configuration and generation configuration for those tables have not been modified.
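A minimal sketch of a multi-table configuration that opts into core version 2 with caching (this assumes the HAZY_CACHE_FOLDER and HAZY_CACHE_PASSWORD environment variables are set and that a random state is fixed for training and generation):

from hazy_configurator import MultiTableTrainingParams, AdjacencyType

multi_params = MultiTableTrainingParams(
    adjacency_type=AdjacencyType.DEGREE_PRESERVING,
    core_version=2,  # caching requires core version 2
    use_cache=True,  # reuse cached table sub-models when inputs are unchanged
)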

class hazy_configurator.general_params.model_parameters.LowMIParentColumnCompressionConfig

Bases: HazyBaseModel

The configuration for specifying how to compress the number of parent columns that the model uses to condition on for a column in a child table.

Parent columns are ranked by their mutual information scores with the child columns, and the lowest columns are dropped from the conditioning set.

Applying parent compression is a memory optimization technique that can be used to reduce the memory footprint of the training program, potentially at the cost of distribution similarity between training and generated data.

Note that there are cases where the parent-child relationship between tables in a database is ambiguous and either ordering is valid. In this case the ordering is defined by the order of the Foreign Key Type objects in the junction tables of the user-specified Data Schema: the first foreign key column belongs to the parent table, and the second foreign key column belongs to the child table.

Examples

from hazy_configurator import (
    LowMIParentColumnCompressionConfig,
    ReductionType,
)

pc_config = LowMIParentColumnCompressionConfig(
    max_parent_columns=4,
    reduction=ReductionType.MAX,
)
Fields:
field max_parent_columns: int [Required]

The maximum number of columns from parent tables to condition on.

field reduction: Optional[ReductionType] = ReductionType.MEAN

How scores for parent columns are reduced over child columns.

field sample_records: Optional[int] = 100000

If the child table has more than sample_records rows, then a sample of this size is taken for computing the parent compression.
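Parent compression is applied by passing this configuration to MultiTableTrainingParams, for example (a minimal sketch; the values are illustrative):

from hazy_configurator import (
    MultiTableTrainingParams,
    LowMIParentColumnCompressionConfig,
    ReductionType,
)

multi_params = MultiTableTrainingParams(
    parent_compression=LowMIParentColumnCompressionConfig(
        max_parent_columns=4,  # keep at most 4 parent columns per child table
        reduction=ReductionType.MAX,  # reduce mutual information scores over child columns by max
    ),
)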

Generator Configuration

Priv Bayes
class hazy_configurator.general_params.model_parameters.PrivBayesConfig

Bases: DiscretisorParams, BasicProcessingParams, SaasPrivBayesConfig

PrivBayes is an algorithm introduced in 2017 that relies on discretising the data and approximating its low-dimensional marginals by fitting an optimal, differentially private Bayesian network, allowing for efficient data generation.

Examples

from hazy_configurator import PrivBayesConfig, PrivBayesUnknownCombinationStrategyType

generative_model_config = PrivBayesConfig(
    epsilon=0.001,
    n_parents=2,
    default_strategy=PrivBayesUnknownCombinationStrategyType.MARGINAL,
    n_bins=50,
    max_cat=100,
)
Fields:
field generator_type: Literal['priv_bayes'] = 'priv_bayes'
field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set as None to turn off differential privacy.

Constraints:
  • exclusiveMinimum = 0.0

field n_parents: int = 3

Number of parents per node to use in network building.

Constraints:
  • exclusiveMinimum = 0

field network_sample_rows: Optional[int] = None

Sets the number of rows to sample for the network-building step. A small value will create a less accurate model but will reduce training time.

Constraints:
  • exclusiveMinimum = 0

field sample_parents: Optional[int] = 25

Sets the number of parents to consider for each node during the network building step. A small value will create a less accurate model but will reduce training time.

Constraints:
  • exclusiveMinimum = 0

field default_strategy: PrivBayesUnknownCombinationStrategyType = PrivBayesUnknownCombinationStrategyType.UNIFORM

Strategy used on generation when the combination of variables sampled from the learnt network has not been seen during training. The marginal option gives higher statistical similarity but is not differentially private.

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which singular numerical values are considered as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed for skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

MST
class hazy_configurator.general_params.model_parameters.MSTConfig

Bases: DiscretisorParams, BasicProcessingParams

MST is an algorithm introduced in 2021 that relies on discretising the data and fitting a differentially private graphical model to low-dimensional marginals in order to allow for efficient data generation.

Examples

from hazy_configurator import MSTConfig

generative_model_config = MSTConfig(
    epsilon=0.001,
    delta=1e-5,
    n_iters=5_000,
    compress_domain=True,
)
Fields:
field generator_type: Literal['mst'] = 'mst'
field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set as None to turn off differential privacy.

Constraints:
  • exclusiveMinimum = 0.0

field delta: Optional[float] = 1e-05

Probability of information accidentally being leaked; the second differential privacy parameter.

Constraints:
  • exclusiveMinimum = 0.0

  • exclusiveMaximum = 1.0

field n_iters: Optional[int] = 1000

Number of training iterations.

Constraints:
  • exclusiveMinimum = 0

field compress_domain: Optional[bool] = True

When set to True the domain of the 1-way marginals is compressed according to the available privacy budget.

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which singular numerical values are considered as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed for skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

AIM
class hazy_configurator.general_params.model_parameters.AIMConfig

Bases: DiscretisorParams, BasicProcessingParams

Adaptive and iterative mechanism (AIM) for differentially private synthetic data. It relies on a graphical model approach to select a workload defined as a set of queries to approximate.

Examples

from hazy_configurator import AIMConfig

generative_model_config = AIMConfig(
    epsilon=0.001,
    delta=1e-5
)
Fields:
field generator_type: Literal['aim'] = 'aim'
field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set as None to turn off differential privacy. It should be noted that for AIM, as epsilon increases, training becomes slower. Training time with epsilon=None will take the longest.

Constraints:
  • exclusiveMinimum = 0.0

field delta: Optional[float] = 1e-05

Probability of information accidentally being leaked; the second differential privacy parameter.

Constraints:
  • exclusiveMinimum = 0.0

  • exclusiveMaximum = 1.0

field max_model_size: Optional[int] = 512

Maximum size (in megabytes) that the trained model can occupy in memory.

Constraints:
  • exclusiveMinimum = 0

field degree: Optional[int] = 2

Maximum size of the marginals to be modelled.

Constraints:
  • exclusiveMinimum = 0

  • maximum = 3

field num_marginals: Optional[int] = None

Maximum number of marginals to model.

Constraints:
  • exclusiveMinimum = 0

field n_iters: Optional[int] = 1000

Number of training iterations.

Constraints:
  • exclusiveMinimum = 0

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which singular numerical values are considered as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed for skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

CLF Sampler
class hazy_configurator.general_params.model_parameters.CLFSamplerConfig

Bases: DiscretisorParams, BasicProcessingParams

CLF Sampler is an algorithm that makes use of predictive models to conditionally generate data by synthesising columns based on pairwise mutual information. The algorithm is somewhat inspired by PrivBayes and shares some similarities including conditional generation and discretisation of data.

Note

CLF Sampler currently does not provide any differential privacy guarantees.

Please use PrivBayes (via PrivBayesConfig) if differential privacy is a requirement for your synthetic data use case.

Examples

from hazy_configurator import (
    CLFSamplerConfig,
    LGBMConfig,
)

generative_model_config = CLFSamplerConfig(
    classifier_params=LGBMConfig(
        boosting_type="rf",
        n_estimators=25,
    ),
    sort_visit=True,
    sample_parents=5,
    n_bins=50,
    max_cat=100,
)
Fields:
field generator_type: Literal['clf_sampler'] = 'clf_sampler'
field classifier_params: Union[DecisionTreeConfig, LGBMConfig, LogisticRegressionConfig, RandomForestConfig] = DecisionTreeConfig(classifier_type='decision_tree', criterion='gini', splitter='best', max_depth=None, min_samples_split_count=2, min_samples_split_frac=None, min_samples_leaf_count=1, min_samples_leaf_frac=None, min_weight_fraction_leaf=0.0, max_features_func=None, max_features_count=None, max_features_frac=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)

Additional arguments to be provided during initialisation of the base classifier. Choose from DecisionTreeConfig, LGBMConfig, LogisticRegressionConfig or RandomForestConfig.

field visit_order: Optional[List[ColId]] = None

The order in which to generate the columns in a table. Note that providing a sample_parents value is likely to undo any changes made by visit_order. Similarly, setting sort_visit to True will attempt to automatically determine the optimal visit order.

field sort_visit: bool = False

Whether or not to automatically sort table columns in order of importance during generation. When enabled this aims to optimise the quality of the generated data.

field sample_parents: Optional[int] = None

Sets the number of parents to consider for each node during the network building step. A small value will create a less accurate model but will reduce training time.

Constraints:
  • exclusiveMinimum = 0

field n_bins: int = 100

The number of bins to use when discretizing numerical data.

Constraints:
  • minimum = 1

field single_threshold: Optional[float] = None

Frequency above which singular numerical values are considered as separate categories. If not provided, it is set to the reciprocal of the number of bins.

Constraints:
  • minimum = 0

field max_cat: Optional[int] = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1

field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM

Method to use to discretise the data.

field drop_empty_bins: bool = False

When set to True, ensures that no empty bins are used.

field processing_epsilon: Optional[float] = None

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.

field preserve_datetime_range: bool = True

Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can cause the marginal similarity of those columns to perform poorly.

field skew_threshold: Optional[float] = None

Skewness value above which a column is considered skewed. If None, no transforms are performed for skewed columns.

Constraints:
  • minimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

DPGAN
class hazy_configurator.general_params.model_parameters.DPGANConfig

Bases: GANProcessingParams

DPGAN is a Generative Adversarial Network (GAN) trained under differential privacy. A generator network learns to produce synthetic records while a discriminator network learns to distinguish synthetic records from real ones, with the privacy guarantee controlled by the epsilon and delta parameters.

Examples

from hazy_configurator import DPGANConfig

generative_model_config = DPGANConfig(
    epsilon=0.001,
    delta=1e-5
)
Fields:
field generator_type: Literal['dpgan'] = 'dpgan'
field n_iter: Optional[int] = 2000

Number of training iterations to run (in number of batches).

Constraints:
  • exclusiveMinimum = 0

field generator_n_layers_hidden: Optional[int] = 2

Number of hidden layers in the generator module

Constraints:
  • exclusiveMinimum = 0

field generator_n_units_hidden: Optional[int] = 500

Size of the hidden layers in the generator module

Constraints:
  • exclusiveMinimum = 0

field generator_dropout: Optional[float] = 0.1

Dropout rate applied in the generator module.

Constraints:
  • minimum = 0

field discriminator_n_layers_hidden: Optional[int] = 2

Number of hidden layers in the Discriminator module

Constraints:
  • exclusiveMinimum = 0

field discriminator_n_units_hidden: Optional[int] = 500

Size of the hidden layers in the Discriminator module

Constraints:
  • exclusiveMinimum = 0

field discriminator_dropout: Optional[float] = 0.1

Dropout rate applied in the discriminator module.

Constraints:
  • minimum = 0

field discriminator_n_iter: Optional[int] = 1

Number of discriminator iterations per training step.

Constraints:
  • exclusiveMinimum = 0

field lr: Optional[float] = 0.001

Overall learning rate during training

Constraints:
  • minimum = 0

field weight_decay: Optional[float] = 0.001

Weight decay (L2 regularisation) applied during training.

Constraints:
  • minimum = 0

field batch_size: Optional[int] = 200

Number of records per training batch.

Constraints:
  • exclusiveMinimum = 0

field clipping_value: Optional[int] = 1

Gradient clipping value used during training.

Constraints:
  • exclusiveMinimum = 0

field lambda_gradient_penalty: Optional[float] = 10

Weight (lambda) of the gradient penalty term.

Constraints:
  • exclusiveMinimum = 0.0

field encoder_max_clusters: Optional[int] = 5

Maximum number of clusters used by the data encoder.

Constraints:
  • exclusiveMinimum = 0

field epsilon: Optional[float] = 1000

Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set as None to turn off differential privacy.

Constraints:
  • exclusiveMinimum = 0.0

field delta: Optional[float] = 1e-05

Probability of information accidentally being leaked; the second differential privacy parameter.

Constraints:
  • exclusiveMinimum = 0.0

field dp_max_grad_norm: Optional[float] = 2

Maximum gradient norm used for differentially private training (gradient clipping).

Constraints:
  • exclusiveMinimum = 0.0

field dp_secure_mode: Optional[bool] = False

Whether to enable secure mode for differential privacy.

field patience: Optional[int] = 5

Number of iterations without improvement before early stopping is triggered.

Constraints:
  • exclusiveMinimum = 0

field n_iter_min: Optional[int] = 100

Minimum number of training iterations before early stopping can be triggered.

Constraints:
  • exclusiveMinimum = 0

field compress_dataset: Optional[bool] = False

Whether to compress the dataset before training.

field sampling_patience: Optional[int] = 100

Maximum number of sampling iterations allowed when generating data.

Constraints:
  • exclusiveMinimum = 0

field split_time: bool = False

Controls whether to model date/datetime columns as separate components (year, week, …, etc.) or as a singular value.

Classifiers

CLF Sampler relies on a base classifier whose hyperparameters are specified in the classifier_params field.

The supported classifiers are a selection from scikit-learn and LightGBM, each having their own additional parameters that can be configured.
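For example, a base classifier is configured via the classifier_params field (a minimal sketch using a decision tree; the hyperparameter values are illustrative, and DecisionTreeConfig is assumed to be importable from the top-level package in the same way as LGBMConfig):

from hazy_configurator import CLFSamplerConfig, DecisionTreeConfig

generative_model_config = CLFSamplerConfig(
    classifier_params=DecisionTreeConfig(
        max_depth=8,  # limit tree depth
        min_samples_leaf_count=10,  # require at least 10 samples per leaf
    ),
)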

Classes:

DecisionTreeConfig

Parameter configuration for the Decision Tree classifier.

LGBMConfig

Parameter configuration for the LGBM classifier.

LogisticRegressionConfig

Parameter configuration for the Logistic Regression classifier.

RandomForestConfig

Parameter configuration for the Random Forest classifier.

class hazy_configurator.general_params.generators.clf_sampler.DecisionTreeConfig

Bases: BaseClassifierConfig

Parameter configuration for the Decision Tree classifier. See the scikit-learn decision tree documentation for more details.

Fields:
  • ccp_alpha (float)

  • class_weight (Optional[Literal['balanced']])

  • classifier_type (Literal['decision_tree'])

  • criterion (Literal['gini', 'entropy', 'log_loss'])

  • max_depth (Optional[int])

  • max_features_count (Optional[int])

  • max_features_frac (Optional[float])

  • max_features_func (Optional[Literal['auto', 'sqrt', 'log2']])

  • max_leaf_nodes (Optional[int])

  • min_impurity_decrease (float)

  • min_samples_leaf_count (int)

  • min_samples_leaf_frac (Optional[float])

  • min_samples_split_count (int)

  • min_samples_split_frac (Optional[float])

  • min_weight_fraction_leaf (float)

  • splitter (Literal['best', 'random'])

field classifier_type: Literal['decision_tree'] = 'decision_tree'
field criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'

The function to measure the quality of a split.

field splitter: Literal['best', 'random'] = 'best'

The strategy used to choose the split at each node. "best" chooses the best split, "random" chooses the best random split.

field max_depth: Optional[int] = None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split_count (or min_samples_split_frac) samples.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_count: int = 2

The minimum number of samples required to split an internal node. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_frac: Optional[float] = None

The minimum number of samples required to split an internal node, as a fraction of the total number of samples. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_samples_leaf_count: int = 1

The minimum number of samples required to be at a leaf node. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_leaf_frac: Optional[float] = None

The minimum number of samples required to be at a leaf node, as a fraction of the total number of samples. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_weight_fraction_leaf: float = 0.0

The minimum weighted fraction of the sum total of weights required to be at a leaf node.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_features_func: Optional[Literal['auto', 'sqrt', 'log2']] = None

The number of features to consider when looking for the best split at each node, as a function of the total number of features. "auto"/"sqrt" and "log2" use the square root and base-2 logarithm of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

field max_features_count: Optional[int] = None

The number of features to consider when looking for the best split at each node. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • exclusiveMinimum = 0

field max_features_frac: Optional[float] = None

The number of features to consider when looking for the best split at each node, as a fraction of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_leaf_nodes: Optional[int] = None

Grows a tree in best-first fashion with the specified maximum number of leaf nodes. If None, the number of leaf nodes is unlimited.

Constraints:
  • exclusiveMinimum = 0

field min_impurity_decrease: float = 0.0

A node will be split if the split induces a decrease of the impurity greater than or equal to this value.

Constraints:
  • minimum = 0.0

field class_weight: Optional[Literal['balanced']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.

field ccp_alpha: float = 0.0

Complexity parameter used for Minimal Cost-Complexity Pruning.

Constraints:
  • minimum = 0.0

class hazy_configurator.general_params.generators.clf_sampler.LGBMConfig

Bases: BaseClassifierConfig

Parameter configuration for the LGBM classifier. See the LightGBM documentation for more details.

Fields:
  • boosting_type (Optional[Literal['gbdt', 'dart', 'goss', 'rf']])

  • class_weight (Optional[Literal['balanced']])

  • classifier_type (Literal['lgbm'])

  • colsample_bytree (float)

  • importance_type (Literal['split', 'gain'])

  • learning_rate (float)

  • max_depth (int)

  • min_child_samples (int)

  • min_child_weight (float)

  • min_split_gain (float)

  • n_estimators (int)

  • n_jobs (Optional[int])

  • num_leaves (int)

  • objective (Optional[Literal['binary', 'multiclass']])

  • reg_alpha (float)

  • reg_lambda (float)

  • subsample (float)

  • subsample_for_bin (int)

  • subsample_freq (int)

field classifier_type: Literal['lgbm'] = 'lgbm'
field boosting_type: Optional[Literal['gbdt', 'dart', 'goss', 'rf']] = 'gbdt'

Type of gradient boosting model to use. "gbdt" is traditional Gradient Boosting Decision Tree. "dart" is Dropout meets Multiple Additive Regression Trees. "goss" is Gradient-based One-Side Sampling. "rf" is Random Forest.

field num_leaves: int = 31

Maximum number of leaf nodes for base learners.

Constraints:
  • exclusiveMinimum = 0

field max_depth: int = -1

Maximum depth of the tree for base learners.

field learning_rate: float = 0.1

Learning rate used for training.

Constraints:
  • exclusiveMinimum = 0.0

field n_estimators: int = 100

Number of boosted trees to fit.

Constraints:
  • exclusiveMinimum = 0

field subsample_for_bin: int = 200000

Number of samples for constructing bins.

Constraints:
  • exclusiveMinimum = 0

field objective: Optional[Literal['binary', 'multiclass']] = None

Learning objective function to be used for training.

field class_weight: Optional[Literal['balanced']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.

field min_split_gain: float = 0.0

Minimum loss reduction required to make a further partition on a leaf node of a tree.

Constraints:
  • minimum = 0.0

field min_child_weight: float = 0.001

Minimum sum of instance weights required in a child node.

Constraints:
  • minimum = 0.0

field min_child_samples: int = 20

Minimum number of samples required in a child node.

Constraints:
  • exclusiveMinimum = 0

field subsample: float = 1.0

Subsample ratio of the training instance.

Constraints:
  • exclusiveMinimum = 0.0

field subsample_freq: int = 0

Subsample frequency. To disable, set a value less than or equal to zero.

field colsample_bytree: float = 1.0

Subsample ratio of columns when constructing each tree.

Constraints:
  • exclusiveMinimum = 0.0

field reg_alpha: float = 0.0

L1 regularisation term on weights.

Constraints:
  • minimum = 0.0

field reg_lambda: float = 0.0

L2 regularisation term on weights.

Constraints:
  • minimum = 0.0

field n_jobs: Optional[int] = None

Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, the number of physical cores in the system is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).

field importance_type: Literal['split', 'gain'] = 'split'

Type of feature importance to be used as a criterion to measure the quality of a split. "split" will measure the number of times the feature is used in a model. "gain" will measure the total gain of splits which use the feature.

class hazy_configurator.general_params.generators.clf_sampler.LogisticRegressionConfig

Bases: BaseClassifierConfig

Parameter configuration for the Logistic Regression classifier. See the scikit-learn logistic regression documentation for more details.

Fields:
  • C (float)

  • class_weight (Optional[Literal['balanced']])

  • classifier_type (Literal['logistic_regression'])

  • dual (bool)

  • fit_intercept (bool)

  • intercept_scaling (float)

  • l1_ratio (Optional[float])

  • max_iter (int)

  • multi_class (Literal['auto', 'ovr', 'multinomial'])

  • n_jobs (Optional[int])

  • penalty (Literal['l1', 'l2', 'elasticnet', 'none'])

  • solver (Literal['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'])

  • tol (float)

  • verbose (int)

  • warm_start (bool)

field classifier_type: Literal['logistic_regression'] = 'logistic_regression'
field penalty: Literal['l1', 'l2', 'elasticnet', 'none'] = 'l2'

Norm of the penalty used for regularisation during training

field dual: bool = False

Whether to use dual or primal formulation for regularisation. Dual formulation is only supported when using an "l2" penalty and "liblinear" solver.

field tol: float = 0.0001

Tolerance for the stopping criteria used during training.

Constraints:
  • exclusiveMinimum = 0.0

field C: float = 1.0

Inverse of regularisation strength. Smaller values specify stronger regularisation.

Constraints:
  • exclusiveMinimum = 0.0

field fit_intercept: bool = True

Specifies whether an intercept term should be fitted.

field intercept_scaling: float = 1

Amount to scale the intercept term by. Intercept scaling is only applicable when fit_intercept is True and a "liblinear" solver is used.

field class_weight: Optional[Literal['balanced']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.

field solver: Literal['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'] = 'lbfgs'

Algorithm to use in the optimization problem.

field max_iter: int = 100

Maximum number of iterations taken for the solver to converge.

Constraints:
  • exclusiveMinimum = 0

field multi_class: Literal['auto', 'ovr', 'multinomial'] = 'auto'

Multi-class classification problem approach. If "ovr", then a binary problem is fit for each label. If "multinomial", then the multinomial loss across all classes is used for optimization.

field verbose: int = 0

Verbosity level for logging optimization progress. Only supported for the "liblinear" and "lbfgs" solvers.

field warm_start: bool = False

Whether or not to re-use the solution of the previous call to fit as initialization. Not supported for "liblinear" solver.

field n_jobs: Optional[int] = None

Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, a single thread is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).

field l1_ratio: Optional[float] = None

The Elastic-Net mixing parameter between 0 and 1. Only used if an "elasticnet" penalty is specified.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

class hazy_configurator.general_params.generators.clf_sampler.RandomForestConfig

Bases: BaseClassifierConfig

Parameter configuration for the Random Forest classifier. See the scikit-learn random forest classifier documentation for more details.

Fields:
  • bootstrap (bool)

  • ccp_alpha (float)

  • class_weight (Optional[Literal['balanced', 'balanced_subsample']])

  • classifier_type (Literal['random_forest'])

  • criterion (Literal['gini', 'entropy', 'log_loss'])

  • max_depth (Optional[int])

  • max_features_count (Optional[int])

  • max_features_frac (Optional[float])

  • max_features_func (Optional[Literal['auto', 'sqrt', 'log2']])

  • max_leaf_nodes (Optional[int])

  • max_samples_count (Optional[int])

  • max_samples_frac (Optional[float])

  • min_impurity_decrease (float)

  • min_samples_leaf_count (int)

  • min_samples_leaf_frac (Optional[float])

  • min_samples_split_count (int)

  • min_samples_split_frac (Optional[float])

  • min_weight_fraction_leaf (float)

  • n_estimators (int)

  • n_jobs (Optional[int])

  • verbose (int)

  • warm_start (bool)

field classifier_type: Literal['random_forest'] = 'random_forest'
field n_estimators: int = 100

Number of trees in the forest.

Constraints:
  • exclusiveMinimum = 0

field criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'

The function to measure the quality of a split.

field max_depth: Optional[int] = None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split_count (or min_samples_split_frac) samples.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_count: int = 2

The minimum number of samples required to split an internal node. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_split_frac: Optional[float] = None

The minimum number of samples required to split an internal node, as a fraction of the total number of samples. Note that you may only set one of min_samples_split_count and min_samples_split_frac.

Constraints:
  • minimum = 0

  • maximum = 1

field min_samples_leaf_count: int = 1

The minimum number of samples required to be at a leaf node. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • exclusiveMinimum = 0

field min_samples_leaf_frac: Optional[float] = None

The minimum number of samples required to be at a leaf node, as a fraction of the total number of samples. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field min_weight_fraction_leaf: float = 0.0

The minimum weighted fraction of the sum total of weights required to be at a leaf node.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_features_func: Optional[Literal['auto', 'sqrt', 'log2']] = 'sqrt'

The number of features to consider when looking for the best split at each node, as a function of the total number of features. "auto"/"sqrt" and "log2" use the square root and base-2 logarithm of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

field max_features_count: Optional[int] = None

The number of features to consider when looking for the best split at each node. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • exclusiveMinimum = 0

field max_features_frac: Optional[float] = None

The number of features to consider when looking for the best split at each node, as a fraction of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_leaf_nodes: Optional[int] = None

Grows a tree in best-first fashion with the specified maximum number of leaf nodes. If None, the number of leaf nodes is unlimited.

Constraints:
  • exclusiveMinimum = 0

field min_impurity_decrease: float = 0.0

A node will be split if the split induces a decrease of the impurity greater than or equal to this value.

Constraints:
  • minimum = 0.0

field bootstrap: bool = True

Whether bootstrap samples are used when building trees.

field n_jobs: Optional[int] = None

Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, the number of physical cores in the system is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).

field verbose: int = 0

Verbosity level for fitting and prediction progress.

field warm_start: bool = False

Whether to re-use the solution of the previous call to fit and add more estimators to the ensemble or fit a whole new forest.

field class_weight: Optional[Literal['balanced', 'balanced_subsample']] = None

Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data. If "balanced_subsample", this is the same as "balanced" with weights computed for every tree grown.

field ccp_alpha: float = 0.0

Complexity parameter used for Minimal Cost-Complexity Pruning.

Constraints:
  • minimum = 0.0

field max_samples_count: Optional[int] = None

Number of training samples to draw to train each base estimator (only if bootstrap is True). None corresponds to using all samples.

Constraints:
  • exclusiveMinimum = 0

field max_samples_frac: Optional[float] = None

Number of training samples to draw to train each base estimator, as a fraction of the total number of samples (only if bootstrap is True). None corresponds to using all samples. Note that you may only set one of max_samples_count and max_samples_frac.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

Sequential Training Parameters

class hazy_configurator.general_params.model_parameters.SequentialTrainingParams

Bases: SaasSequentialTrainingParams

Training parameters for the basic sequential model.

Examples

from hazy_configurator import SequentialTrainingParams

seq_params = SequentialTrainingParams(
    window_size=6,
    n_predict=2,
)
Fields:
field window_size: int = 20

The number of previous elements in the sequence to condition on during each sequential generation step.

Constraints:
  • exclusiveMinimum = 0

field n_predict: int = 1

The number of elements in the sequence to predict at each sequential generation step.

Constraints:
  • exclusiveMinimum = 0

class hazy_configurator.general_params.model_parameters.SequentialDGANTrainingParams

Bases: SaasSequentialDGANTrainingParams

Training parameters for the DoppelGANger sequential model. DoppelGANger (DGAN) is a generative adversarial network with a specialized architecture for modelling time series data.

Examples

from hazy_configurator import SequentialDGANTrainingParams

seq_params = SequentialDGANTrainingParams(
    nb_batch_generation=4,
    max_cat=100,
)
Fields:
field nb_batch_generation: int = 4

Number of records generated at each RNN pass.

Constraints:
  • exclusiveMinimum = 0

field max_cat: int = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • exclusiveMinimum = 0

field num_epochs: int = 500

Number of epochs to run training for. An epoch is one passing of the entire training table through the model.

Constraints:
  • exclusiveMinimum = 0

class hazy_configurator.sequential_params.SequentialRNNTrainingParams

Bases: HazyBaseModel

Configuration for the sequential RNN pipeline.

The RNN pipeline is an autoregressive model that generates sequences conditioned on static attributes.

Examples

from hazy_configurator import SequentialRNNTrainingParams

seq_params = SequentialRNNTrainingParams(
    dropout=0.1,
    learning_rate=0.0003,
    hidden_size=256,
    num_layers=2,
)
Fields:
  • batch_size (int)

  • device (hazy_configurator.base.enums.TorchDevice)

  • dropout (float)

  • hidden_size (int)

  • learning_rate (float)

  • max_cat (int)

  • max_epochs (int)

  • num_layers (int)

  • patience (int)

  • shuffle (bool)

  • val_size (float)

field batch_size: int = 64

Number of sequences per training (and generation) batch. This should be decreased if running with constrained memory limits.

Constraints:
  • minimum = 1

field max_epochs: int = 100

Maximum number of epochs to train each neural network for. Note that training is usually terminated before this number is reached, based on the patience setting.

Constraints:
  • minimum = 1

field patience: int = 7

Number of epochs after which training is terminated if there has been no improvement in the validation loss.

Constraints:
  • minimum = 1

field learning_rate: float = 0.001

Initial learning rate for the Adam optimizer.

Constraints:
  • minimum = 0.0

field device: TorchDevice = TorchDevice.CPU

Accelerator to use for training the model. NOTE: As this model relies on deep learning, it will take longer to train on a CPU.

field val_size: float = 0.2

Proportion of data to use for validation.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field shuffle: bool = True

Whether or not to shuffle training data at every epoch. Also controls shuffling before splitting data into training and validation sets.

field hidden_size: int = 512

Number of units in each hidden layer. A higher number allows for more complex patterns to be captured at the cost of longer training times and potential over-fitting.

Constraints:
  • minimum = 2

field num_layers: int = 3

Number of hidden layers in each network. A higher number allows for more complex patterns to be captured at the cost of longer training times and potential overfitting.

Constraints:
  • minimum = 1

field dropout: float = 0.0

Probability of masking input values during training. Useful for mitigating overfitting and encouraging generalization.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field max_cat: int = 100

Maximum amount of categorical information to preserve in each categorical column.

Constraints:
  • minimum = 1