Model Parameters¶
The full model training parameters, including generative model hyperparameters, processing parameters and multi table model training parameters.
Model Parameters¶
- class hazy_configurator.general_params.model_parameters.ModelParameters¶
Bases:
HazyBaseModel
Container for all training and model parameters.
Examples
from hazy_configurator import (
    ModelParameters,
    PrivBayesConfig,
    MultiTableTrainingParams,
    AdjacencyType,
)

model_params = ModelParameters(
    generator=PrivBayesConfig(
        epsilon=0.001,
        n_parents=2,
        n_bins=50,
        max_cat=100,
    ),
    multi_table=MultiTableTrainingParams(
        adjacency_type=AdjacencyType.DEGREE_PRESERVING,
    ),
)

{
    "generator": {
        "generator_type": "priv_bayes",
        "epsilon": 0.001,
        "n_parents": 2,
        "n_bins": 50,
        "max_cat": 100
    },
    "multi_table": {
        "adjacency_type": "degree_preserving"
    }
}
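The JSON form of the configuration can also be assembled without the library, for example when templating configuration files. The following is a minimal sketch using only the standard library. The helper name `model_params_json` is hypothetical, and the keys simply mirror the JSON example in this section rather than a schema guaranteed by hazy_configurator.

```python
import json


def model_params_json(epsilon, n_parents, n_bins, max_cat,
                      adjacency_type="degree_preserving"):
    """Build the JSON form of a PrivBayes model configuration as a plain dict.

    Hypothetical helper: the key names are copied from the documented JSON example.
    """
    # epsilon must be strictly positive, or None to turn off differential privacy.
    if epsilon is not None and epsilon <= 0:
        raise ValueError("epsilon must be > 0 or None")
    return {
        "generator": {
            "generator_type": "priv_bayes",
            "epsilon": epsilon,
            "n_parents": n_parents,
            "n_bins": n_bins,
            "max_cat": max_cat,
        },
        "multi_table": {"adjacency_type": adjacency_type},
    }


config = model_params_json(epsilon=0.001, n_parents=2, n_bins=50, max_cat=100)
config_text = json.dumps(config, indent=2)
```

Serialising through `json.dumps` keeps the file valid JSON even when values such as `None` (written as `null`) are involved.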
- field generator: GeneratorConfigUnion = PrivBayesConfig(skew_threshold=None, split_time=False, n_bins=100, single_threshold=None, max_cat=100, bin_strategy_default=<BinningStrategyType.UNIFORM: 'uniform'>, drop_empty_bins=False, processing_epsilon=None, preserve_datetime_range=True, generator_type='priv_bayes', epsilon=1000.0, n_parents=3, network_sample_rows=None, sample_parents=25, default_strategy=<PrivBayesUnknownCombinationStrategyType.UNIFORM: 'uniform'>)¶
Generative model hyperparameters and processing requirements. Choose from PrivBayesConfig, CLFSamplerConfig, MSTConfig, AIMConfig or DPGANConfig.
- field multi_table: MultiTableTrainingParams = MultiTableTrainingParams(adjacency_type=<AdjacencyType.DEGREE_PRESERVING: 'degree_preserving'>, parent_compression=None, core_version=1, use_cache=False)¶
Multi table training parameters.
- field sequential: Union[SequentialTrainingParams, SequentialDGANTrainingParams, SequentialRNNTrainingParams] = SequentialTrainingParams(window_size=20, n_predict=1)¶
Sequential table training parameters. Choose from SequentialTrainingParams, SequentialDGANTrainingParams or SequentialRNNTrainingParams.
Multi Table Training Parameters¶
- class hazy_configurator.general_params.model_parameters.MultiTableTrainingParams¶
Bases:
HazyBaseModel
Multi table specific training parameters.
Examples
from hazy_configurator import MultiTableTrainingParams, AdjacencyType

multi_params = MultiTableTrainingParams(
    adjacency_type=AdjacencyType.DEGREE_PRESERVING,
)

{
    "adjacency_type": "degree_preserving"
}
- Fields:
- field adjacency_type: AdjacencyType = AdjacencyType.DEGREE_PRESERVING¶
Adjacency method to use. Degree preserving is the best general-purpose choice. Component preserving is best when connected sub-components exist in the data, which may be the case when a single table has two or more foreign keys. Identity gives the best results by reusing the same bipartite graph, with the caveat that it can only generate data of fixed magnitude.
- field parent_compression: Optional[LowMIParentColumnCompressionConfig] = None¶
Optional configuration that caps the number of parent column candidates to condition on for any child table.
- field core_version: int = 1¶
Version of the multi-table core algorithm to use for training and generation. Version 2 is an improvement over the original, allowing the use of advanced features such as caching and incremental training & generation.
- Constraints:
minimum = 1
maximum = 2
- field use_cache: bool = False¶
Specifies whether or not to cache and reuse table sub-models during training and generation. Requires the HAZY_CACHE_FOLDER and HAZY_CACHE_PASSWORD environment variables to be set during training and generation. Caching is only available for core version 2, and is ineffective unless a random state is specified for training and generation. When enabled, subsequent training runs reuse existing cached sub-models for tables that, along with their configurations, have not been modified. Similarly, subsequent generation runs reuse cached synthetic data for tables whose source data, training configuration and generation configuration have not been modified.
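The preconditions for caching can be checked before a run is launched. The sketch below is one possible pre-flight check, not part of hazy_configurator: the function name `cache_config_problems` and the `random_state` argument are hypothetical, while the environment variable names and the core version requirement come from the description of use_cache.

```python
import os

# Environment variables named in the use_cache description.
REQUIRED_CACHE_ENV_VARS = ("HAZY_CACHE_FOLDER", "HAZY_CACHE_PASSWORD")


def cache_config_problems(core_version, use_cache, random_state, env=None):
    """Return a list of reasons why caching would fail or be ineffective."""
    env = os.environ if env is None else env
    problems = []
    if not use_cache:
        return problems  # nothing to check when caching is off
    if core_version != 2:
        problems.append("use_cache requires core_version=2")
    for name in REQUIRED_CACHE_ENV_VARS:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    if random_state is None:
        problems.append("caching is ineffective without a random state")
    return problems


good_env = {"HAZY_CACHE_FOLDER": "/tmp/hazy-cache", "HAZY_CACHE_PASSWORD": "secret"}
ok = cache_config_problems(core_version=2, use_cache=True, random_state=42, env=good_env)
bad = cache_config_problems(core_version=1, use_cache=True, random_state=None, env={})
```

Passing `env` explicitly keeps the check easy to test; omitting it falls back to the real process environment.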
- class hazy_configurator.general_params.model_parameters.LowMIParentColumnCompressionConfig¶
Bases:
HazyBaseModel
The configuration for specifying how to compress the number of parent columns that the model uses to condition on for a column in a child table.
Parent columns are ranked by their mutual information scores with the child columns, and the lowest-scoring columns are dropped from the conditioning set.
Applying parent compression is a memory optimization technique that can be used to reduce the memory footprint of the training program, potentially at the cost of distribution similarity between training and generated data.
Note that there can be cases where the parent-child relationship between tables in a database can be ambiguous, and either ordering is valid. In this case the ordering is defined by the order of Foreign Key Type objects in the junction tables of the user-specified Data Schema. The first foreign key column will belong to the parent table, and the second foreign key column will belong to the child table.
Examples
from hazy_configurator import (
    LowMIParentColumnCompressionConfig,
    ReductionType,
)

pc_config = LowMIParentColumnCompressionConfig(
    max_parent_columns=4,
    reduction=ReductionType.MAX,
)
- Fields:
- field max_parent_columns: int [Required]¶
The maximum number of columns from parent tables to condition on.
- field reduction: Optional[ReductionType] = ReductionType.MEAN¶
How scores for parent columns are reduced over child columns.
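To make the ranking concrete, the sketch below scores each parent column by its mutual information with every child column, reduces the scores with a mean or max, and keeps the top max_parent_columns. It is an illustration under stated assumptions, not Hazy's implementation: the function names are hypothetical, columns are assumed to be discrete and already aligned row by row (parent values joined onto child rows), and the exact estimator used by the library is not documented here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_parent_columns(parent_cols, child_cols, max_parent_columns, reduction="mean"):
    """Rank parent columns by reduced MI with the child columns and keep the best."""
    reduce_fn = {"mean": lambda s: sum(s) / len(s), "max": max}[reduction]
    scores = {
        name: reduce_fn([mutual_information(values, child)
                         for child in child_cols.values()])
        for name, values in parent_cols.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_parent_columns], scores


parents = {"region": ["N", "N", "S", "S"], "noise": ["a", "b", "a", "b"]}
children = {"product": ["x", "x", "y", "y"]}
kept, scores = select_parent_columns(parents, children, max_parent_columns=1, reduction="max")
```

Here `region` fully determines `product` while `noise` is independent of it, so only `region` survives the compression.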
Generator Configuration¶
Priv Bayes¶
- class hazy_configurator.general_params.model_parameters.PrivBayesConfig¶
Bases:
DiscretisorParams, BasicProcessingParams, SaasPrivBayesConfig
PrivBayes is an algorithm introduced in 2017 that relies on discretising the data and approximating its low-dimensional marginals by fitting an optimal & differentially private Bayesian network in order to allow for efficient data generation.
Examples
from hazy_configurator import PrivBayesConfig, PrivBayesUnknownCombinationStrategyType

generative_model_config = PrivBayesConfig(
    epsilon=0.001,
    n_parents=2,
    default_strategy=PrivBayesUnknownCombinationStrategyType.MARGINAL,
    n_bins=50,
    max_cat=100,
)

{
    "generator_type": "priv_bayes",
    "epsilon": 0.001,
    "n_parents": 2,
    "default_strategy": "marginal",
    "n_bins": 50,
    "max_cat": 100
}
- Fields:
- field epsilon: Optional[float] = 1000¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy.
- Constraints:
exclusiveMinimum = 0.0
- field n_parents: int = 3¶
Number of parents per node to use in network building.
- Constraints:
exclusiveMinimum = 0
- field network_sample_rows: Optional[int] = None¶
Sets the number of rows to sample for the network building step. A small value will create a less accurate model but will reduce training time.
- Constraints:
exclusiveMinimum = 0
- field sample_parents: Optional[int] = 25¶
Sets the number of parents to consider for each node during the network building step. A small value will create a less accurate model but will reduce training time.
- Constraints:
exclusiveMinimum = 0
- field default_strategy: PrivBayesUnknownCombinationStrategyType = PrivBayesUnknownCombinationStrategyType.UNIFORM¶
Strategy used on generation when the combination of variables sampled from the learnt network has not been seen during training. The marginal option gives higher statistical similarity but is not differentially private.
- field n_bins: int = 100¶
The number of bins to use when discretizing numerical data.
- Constraints:
minimum = 1
- field single_threshold: Optional[float] = None¶
Frequency above which singular numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.
- Constraints:
minimum = 0
- field max_cat: Optional[int] = 100¶
Maximum amount of categorical information to preserve in each categorical column.
- Constraints:
minimum = 1
- field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM¶
Method to use to discretise the data.
- field processing_epsilon: Optional[float] = None¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.
- field preserve_datetime_range: bool = True¶
Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. setting processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can lead to poor marginal similarity for those columns.
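The interaction between n_bins and single_threshold can be sketched as follows. This is an illustration only, under two assumptions that the text does not confirm: that the uniform strategy means equal-width bins, and that singular values are separated out before the remaining values are binned. The function name `discretise_uniform` is hypothetical.

```python
from collections import Counter


def discretise_uniform(values, n_bins=100, single_threshold=None):
    """Label each value as a frequent single value or an equal-width bin index."""
    if n_bins < 1:
        raise ValueError("n_bins must be >= 1")
    if single_threshold is None:
        # Documented default: the reciprocal of the number of bins.
        single_threshold = 1.0 / n_bins
    total = len(values)
    counts = Counter(values)
    # Values whose relative frequency exceeds the threshold get their own category.
    singles = {v for v, c in counts.items() if c / total > single_threshold}
    rest = [v for v in values if v not in singles]
    lo = min(rest) if rest else 0.0
    hi = max(rest) if rest else 0.0
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if v in singles:
            labels.append(("single", v))
        elif width == 0:
            labels.append(("bin", 0))
        else:
            # Clip so that the maximum value falls in the last bin.
            labels.append(("bin", min(int((v - lo) / width), n_bins - 1)))
    return labels


values = [0.0] * 6 + [1.0, 2.0, 3.0, 4.0]
labels = discretise_uniform(values, n_bins=4)
```

With four bins the default threshold is 0.25, so the value 0.0 (60% of rows) becomes its own category and the remaining values are spread over the four bins.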
MST¶
- class hazy_configurator.general_params.model_parameters.MSTConfig¶
Bases:
DiscretisorParams, BasicProcessingParams
MST is an algorithm introduced in 2021 that relies on discretising the data and fitting a differentially private graphical model to low-dimensional marginals in order to allow for efficient data generation.
Examples
from hazy_configurator import MSTConfig

generative_model_config = MSTConfig(
    epsilon=0.001,
    delta=1e-5,
    n_iters=5_000,
    compress_domain=True,
)

{
    "generator_type": "mst",
    "epsilon": 0.001,
    "delta": 0.00001,
    "n_iters": 5000,
    "compress_domain": true
}
- field epsilon: Optional[float] = 1000¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy.
- Constraints:
exclusiveMinimum = 0.0
- field delta: Optional[float] = 1e-05¶
Probability of information accidentally being leaked. This is the second differential privacy parameter.
- Constraints:
exclusiveMinimum = 0.0
exclusiveMaximum = 1.0
- field n_iters: Optional[int] = 1000¶
Number of training iterations.
- Constraints:
exclusiveMinimum = 0
- field compress_domain: Optional[bool] = True¶
When set to True the domain of the 1-way marginals is compressed according to the available privacy budget.
- field n_bins: int = 100¶
The number of bins to use when discretizing numerical data.
- Constraints:
minimum = 1
- field single_threshold: Optional[float] = None¶
Frequency above which singular numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.
- Constraints:
minimum = 0
- field max_cat: Optional[int] = 100¶
Maximum amount of categorical information to preserve in each categorical column.
- Constraints:
minimum = 1
- field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM¶
Method to use to discretise the data.
- field processing_epsilon: Optional[float] = None¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.
- field preserve_datetime_range: bool = True¶
Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. setting processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can lead to poor marginal similarity for those columns.
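The constraints listed for the MST privacy and training fields can be checked ahead of time, for instance when configurations are generated programmatically. The sketch below restates those documented bounds in plain Python; the function name `validate_mst_params` is hypothetical and the library performs its own validation.

```python
def validate_mst_params(epsilon=1000.0, delta=1e-5, n_iters=1000):
    """Return a list of constraint violations for the given MST settings."""
    errors = []
    # epsilon: exclusiveMinimum = 0.0, or None to turn off differential privacy
    if epsilon is not None and not epsilon > 0:
        errors.append("epsilon must be > 0 (or None to turn off differential privacy)")
    # delta: exclusiveMinimum = 0.0, exclusiveMaximum = 1.0
    if delta is not None and not 0 < delta < 1:
        errors.append("delta must satisfy 0 < delta < 1")
    # n_iters: exclusiveMinimum = 0
    if n_iters is not None and not n_iters > 0:
        errors.append("n_iters must be > 0")
    return errors


defaults_ok = validate_mst_params()
all_wrong = validate_mst_params(epsilon=0, delta=1.0, n_iters=0)
```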
AIM¶
- class hazy_configurator.general_params.model_parameters.AIMConfig¶
Bases:
DiscretisorParams, BasicProcessingParams
Adaptive and iterative mechanism (AIM) for differentially private synthetic data. It relies on a graphical model approach to select a workload defined as a set of queries to approximate.
Examples
from hazy_configurator import AIMConfig

generative_model_config = AIMConfig(
    epsilon=0.001,
    delta=1e-5,
)

{
    "generator_type": "aim",
    "epsilon": 0.001,
    "delta": 0.00001
}
- field epsilon: Optional[float] = 1000¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy. For AIM, training becomes slower as epsilon increases, and training with epsilon=None takes the longest.
- Constraints:
exclusiveMinimum = 0.0
- field delta: Optional[float] = 1e-05¶
Probability of information accidentally being leaked. This is the second differential privacy parameter.
- Constraints:
exclusiveMinimum = 0.0
exclusiveMaximum = 1.0
- field max_model_size: Optional[int] = 512¶
Maximum size (in megabytes) that the trained model can occupy in memory.
- Constraints:
exclusiveMinimum = 0
- field degree: Optional[int] = 2¶
Maximum size of the marginals to be modelled.
- Constraints:
exclusiveMinimum = 0
maximum = 3
- field num_marginals: Optional[int] = None¶
Maximum number of marginals to model.
- Constraints:
exclusiveMinimum = 0
- field n_iters: Optional[int] = 1000¶
Number of training iterations.
- Constraints:
exclusiveMinimum = 0
- field n_bins: int = 100¶
The number of bins to use when discretizing numerical data.
- Constraints:
minimum = 1
- field single_threshold: Optional[float] = None¶
Frequency above which singular numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.
- Constraints:
minimum = 0
- field max_cat: Optional[int] = 100¶
Maximum amount of categorical information to preserve in each categorical column.
- Constraints:
minimum = 1
- field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM¶
Method to use to discretise the data.
- field processing_epsilon: Optional[float] = None¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.
- field preserve_datetime_range: bool = True¶
Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. setting processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can lead to poor marginal similarity for those columns.
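The effect of degree on the size of the problem can be seen by enumerating candidate marginals, i.e. column subsets of size up to degree. The sketch below is only an illustration: the function name `candidate_marginals` is hypothetical, and truncating the list to num_marginals is a simplification, since how AIM actually chooses which marginals to keep is not described here.

```python
from itertools import combinations


def candidate_marginals(columns, degree=2, num_marginals=None):
    """List column subsets of size 1..degree, optionally capped at num_marginals."""
    if not 0 < degree <= 3:
        raise ValueError("degree must be 1, 2 or 3")  # documented bounds
    candidates = [
        combo
        for size in range(1, degree + 1)
        for combo in combinations(columns, size)
    ]
    if num_marginals is not None:
        # Simplified cap; the real selection strategy is an open assumption.
        candidates = candidates[:num_marginals]
    return candidates


columns = ["age", "income", "city", "job"]
two_way = candidate_marginals(columns, degree=2)
three_way = candidate_marginals(columns, degree=3)
capped = candidate_marginals(columns, degree=3, num_marginals=5)
```

For four columns, moving from degree 2 to degree 3 grows the candidate set from 10 to 14 marginals; the growth is much steeper on wide tables, which is one reason to keep degree low or set num_marginals.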
CLF Sampler¶
- class hazy_configurator.general_params.model_parameters.CLFSamplerConfig¶
Bases:
DiscretisorParams, BasicProcessingParams
CLF Sampler is an algorithm that makes use of predictive models to conditionally generate data by synthesising columns based on pairwise mutual information. The algorithm is somewhat inspired by PrivBayes and shares some similarities including conditional generation and discretisation of data.
Note
CLF Sampler currently does not provide any differential privacy guarantees.
Please use PrivBayes (via PrivBayesConfig) if differential privacy is a requirement for your synthetic data use case.
Examples
from hazy_configurator import (
    CLFSamplerConfig,
    LGBMConfig,
)

generative_model_config = CLFSamplerConfig(
    classifier_params=LGBMConfig(
        boosting_type="rf",
        n_estimators=25,
    ),
    sort_visit=True,
    sample_parents=5,
    n_bins=50,
    max_cat=100,
)

{
    "generator_type": "clf_sampler",
    "classifier_params": {
        "classifier_type": "lgbm",
        "boosting_type": "rf",
        "n_estimators": 25
    },
    "sort_visit": true,
    "sample_parents": 5,
    "n_bins": 50,
    "max_cat": 100
}
- Fields:
- field classifier_params: Union[DecisionTreeConfig, LGBMConfig, LogisticRegressionConfig, RandomForestConfig] = DecisionTreeConfig(classifier_type='decision_tree', criterion='gini', splitter='best', max_depth=None, min_samples_split_count=2, min_samples_split_frac=None, min_samples_leaf_count=1, min_samples_leaf_frac=None, min_weight_fraction_leaf=0.0, max_features_func=None, max_features_count=None, max_features_frac=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)¶
Additional arguments to be provided during initialisation of the base classifier. Choose from DecisionTreeConfig, LGBMConfig, LogisticRegressionConfig or RandomForestConfig.
- field visit_order: Optional[List[ColId]] = None¶
The order in which to generate the columns in a table. Note that providing a sample_parents value is likely to undo any changes done by visit_order. Similarly, setting sort_visit to true will attempt to automatically determine the optimal visit order.
- field sort_visit: bool = False¶
Whether or not to automatically sort table columns in order of importance during generation. When enabled this aims to optimise the quality of the generated data.
- field sample_parents: Optional[int] = None¶
Sets the number of parents to consider for each node during the network building step. A small value will create a less accurate model but will reduce training time.
- Constraints:
exclusiveMinimum = 0
- field n_bins: int = 100¶
The number of bins to use when discretizing numerical data.
- Constraints:
minimum = 1
- field single_threshold: Optional[float] = None¶
Frequency above which singular numerical values are treated as separate categories. If not provided, it is set to the reciprocal of the number of bins.
- Constraints:
minimum = 0
- field max_cat: Optional[int] = 100¶
Maximum amount of categorical information to preserve in each categorical column.
- Constraints:
minimum = 1
- field bin_strategy_default: BinningStrategyType = BinningStrategyType.UNIFORM¶
Method to use to discretise the data.
- field processing_epsilon: Optional[float] = None¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. This is an extra budget spent per table in processing.
- field preserve_datetime_range: bool = True¶
Leaves datetime columns out of the differentially private approximate min/max bound calculation. Turn this setting off to ensure all numeric columns use the processing privacy budget. Using a processing privacy budget (i.e. setting processing_epsilon) causes the range of the synthetic data to differ from the original. If the real data comes from a specific extracted datetime range, a min/max that is not close to the original can lead to poor marginal similarity for those columns.
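The idea of generating columns one at a time, each conditioned on columns generated earlier in the visit order, can be sketched with simple frequency tables standing in for the base classifier. This is not the CLF Sampler implementation: the function names are hypothetical, every earlier column is used as a parent (there is no sample_parents limit), and a real classifier would generalise to unseen combinations where this lookup table cannot.

```python
import random
from collections import Counter, defaultdict


def fit_conditional_tables(rows, visit_order):
    """For each column, count its values given the columns visited before it."""
    tables = []
    for i, col in enumerate(visit_order):
        parents = visit_order[:i]
        table = defaultdict(Counter)
        for row in rows:
            key = tuple(row[p] for p in parents)
            table[key][row[col]] += 1
        tables.append((col, parents, table))
    return tables


def sample_row(tables, rng):
    """Generate one synthetic row column by column, following the visit order."""
    row = {}
    for col, parents, table in tables:
        counts = table[tuple(row[p] for p in parents)]
        values = list(counts)
        weights = [counts[v] for v in values]
        row[col] = rng.choices(values, weights=weights, k=1)[0]
    return row


training_rows = [
    {"country": "UK", "city": "London", "tier": "A"},
    {"country": "UK", "city": "Leeds", "tier": "B"},
    {"country": "FR", "city": "Paris", "tier": "A"},
]
rng = random.Random(0)  # fixed seed for reproducibility
tables = fit_conditional_tables(training_rows, ["country", "city", "tier"])
samples = [sample_row(tables, rng) for _ in range(20)]
```

Because each column is drawn conditionally on its predecessors, a sampled city always belongs to the sampled country. The visit order matters for the same reason: columns placed early are generated with less context.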
DPGAN¶
- class hazy_configurator.general_params.model_parameters.DPGANConfig¶
Bases:
GANProcessingParams
Differentially private generative adversarial network (DPGAN) for synthetic data. It trains a generator module against a discriminator module, with the privacy budget controlled by epsilon and delta.
Examples
from hazy_configurator.general_params.model_parameters import DPGANConfig

generative_model_config = DPGANConfig(
    epsilon=0.001,
    delta=1e-5,
)
- Fields:
- field n_iter: Optional[int] = 2000¶
Number of training iterations to run during training (in number of batches)
- Constraints:
exclusiveMinimum = 0
Number of hidden layers in the generator module
- Constraints:
exclusiveMinimum = 0
Size of the hidden layers in the generator module
- Constraints:
exclusiveMinimum = 0
- field generator_dropout: Optional[float] = 0.1¶
Dropout rate applied in the generator module.
- Constraints:
minimum = 0
Number of hidden layers in the Discriminator module
- Constraints:
exclusiveMinimum = 0
Size of the hidden layers in the Discriminator module
- Constraints:
exclusiveMinimum = 0
- field discriminator_dropout: Optional[float] = 0.1¶
Dropout rate applied in the discriminator module.
- Constraints:
minimum = 0
- field discriminator_n_iter: Optional[int] = 1¶
Discriminator number of iterations during training loop
- Constraints:
exclusiveMinimum = 0
- field weight_decay: Optional[float] = 0.001¶
Weight decay applied by the optimiser during training.
- Constraints:
minimum = 0
- field lambda_gradient_penalty: Optional[float] = 10¶
Lambda Gradient Penalty
- Constraints:
exclusiveMinimum = 0.0
- field encoder_max_clusters: Optional[int] = 5¶
Encoder max clusters
- Constraints:
exclusiveMinimum = 0
- field epsilon: Optional[float] = 1000¶
Privacy budget parameter. Use small values such as 0.01 for extremely high privacy guarantees, around 1-25 for standard privacy guarantees and large values such as 10000 for low privacy but greater accuracy. Set to None to turn off differential privacy.
- Constraints:
exclusiveMinimum = 0.0
- field delta: Optional[float] = 1e-05¶
Probability of information accidentally being leaked. This is the second differential privacy parameter.
- Constraints:
exclusiveMinimum = 0.0
- field n_iter_min: Optional[int] = 100¶
Minimum number of training iterations.
- Constraints:
exclusiveMinimum = 0
Classifiers¶
CLF Sampler relies on a base classifier whose hyperparameters are specified in the classifier_params field.
The supported classifiers are a selection from scikit-learn and LightGBM, each having their own additional parameters that can be configured.
Classes:
DecisionTreeConfig | Parameter configuration for the Decision Tree classifier.
LGBMConfig | Parameter configuration for the LGBM classifier.
LogisticRegressionConfig | Parameter configuration for the Logistic Regression classifier.
RandomForestConfig | Parameter configuration for the Random Forest classifier.
- class hazy_configurator.general_params.generators.clf_sampler.DecisionTreeConfig¶
Bases:
BaseClassifierConfig
Parameter configuration for the Decision Tree classifier. See the scikit-learn decision tree documentation for more details.
- Fields:
ccp_alpha (float)
class_weight (Optional[Literal['balanced']])
classifier_type (Literal['decision_tree'])
criterion (Literal['gini', 'entropy', 'log_loss'])
max_depth (Optional[int])
max_features_count (Optional[int])
max_features_frac (Optional[float])
max_features_func (Optional[Literal['auto', 'sqrt', 'log2']])
max_leaf_nodes (Optional[int])
min_impurity_decrease (float)
min_samples_leaf_count (int)
min_samples_leaf_frac (Optional[float])
min_samples_split_count (int)
min_samples_split_frac (Optional[float])
min_weight_fraction_leaf (float)
splitter (Literal['best', 'random'])
- field criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'¶
The function to measure the quality of a split.
- field splitter: Literal['best', 'random'] = 'best'¶
The strategy used to choose the split at each node. "best" chooses the best split, "random" chooses the best random split.
- field max_depth: Optional[int] = None¶
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split_count (or min_samples_split_frac) samples.
- Constraints:
exclusiveMinimum = 0
- field min_samples_split_count: int = 2¶
The minimum number of samples required to split an internal node. Note that you may only set one of min_samples_split_count and min_samples_split_frac.
- Constraints:
exclusiveMinimum = 0
- field min_samples_split_frac: Optional[float] = None¶
The minimum number of samples required to split an internal node, as a fraction of the total number of samples. Note that you may only set one of min_samples_split_count and min_samples_split_frac.
- Constraints:
minimum = 0.0
maximum = 1.0
- field min_samples_leaf_count: int = 1¶
The minimum number of samples required to be at a leaf node. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.
- Constraints:
exclusiveMinimum = 0
- field min_samples_leaf_frac: Optional[float] = None¶
The minimum number of samples required to be at a leaf node, as a fraction of the total number of samples. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.
- Constraints:
minimum = 0.0
maximum = 1.0
- field min_weight_fraction_leaf: float = 0.0¶
The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- Constraints:
minimum = 0.0
maximum = 1.0
- field max_features_func: Optional[Literal['auto', 'sqrt', 'log2']] = None¶
The number of features to consider when looking for the best split at each node, as a function of the total number of features. "auto" and "sqrt" use the square root of the total number of features, and "log2" uses the base-2 logarithm. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.
- field max_features_count: Optional[int] = None¶
The number of features to consider when looking for the best split at each node. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.
- Constraints:
exclusiveMinimum = 0
- field max_features_frac: Optional[float] = None¶
The number of features to consider when looking for the best split at each node, as a fraction of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.
- Constraints:
minimum = 0.0
maximum = 1.0
- field max_leaf_nodes: Optional[int] = None¶
Grows a tree in best-first fashion with the specified maximum number of leaf nodes. If None, the number of leaf nodes is unlimited.
- Constraints:
exclusiveMinimum = 0
- field min_impurity_decrease: float = 0.0¶
A node will be split if the split induces a decrease of the impurity greater than or equal to this value.
- Constraints:
minimum = 0.0
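The three max_features_* fields are mutually exclusive ways of expressing a single quantity. The sketch below shows one way they could be resolved into a feature count, following the rounding that scikit-learn documents for its max_features parameter. The function name `resolve_max_features` is hypothetical, and how hazy_configurator itself passes these values to scikit-learn is an assumption.

```python
import math


def resolve_max_features(n_features, func=None, count=None, frac=None):
    """Resolve the mutually exclusive max_features_* settings to a feature count."""
    given = [name for name, value in (
        ("max_features_func", func),
        ("max_features_count", count),
        ("max_features_frac", frac),
    ) if value is not None]
    if len(given) > 1:
        raise ValueError(f"only one of {given} may be set")
    if func is not None:
        if func in ("auto", "sqrt"):
            return max(1, int(math.sqrt(n_features)))
        if func == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unknown max_features_func: {func!r}")
    if count is not None:
        return count
    if frac is not None:
        return max(1, int(frac * n_features))
    return n_features  # None everywhere: use all features


by_sqrt = resolve_max_features(16, func="sqrt")
by_log2 = resolve_max_features(16, func="log2")
by_frac = resolve_max_features(16, frac=0.5)
all_features = resolve_max_features(16)
```

The same "set at most one" pattern applies to the min_samples_split_* and min_samples_leaf_* pairs.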
- class hazy_configurator.general_params.generators.clf_sampler.LGBMConfig¶
Bases:
BaseClassifierConfig
Parameter configuration for the LGBM classifier. See the LightGBM documentation for more details.
- Fields:
boosting_type (Optional[Literal['gbdt', 'dart', 'goss', 'rf']])
class_weight (Optional[Literal['balanced']])
classifier_type (Literal['lgbm'])
colsample_bytree (float)
importance_type (Literal['split', 'gain'])
learning_rate (float)
max_depth (int)
min_child_samples (int)
min_child_weight (float)
min_split_gain (float)
n_estimators (int)
n_jobs (Optional[int])
num_leaves (int)
objective (Optional[Literal['binary', 'multiclass']])
reg_alpha (float)
reg_lambda (float)
subsample (float)
subsample_for_bin (int)
subsample_freq (int)
- field boosting_type: Optional[Literal['gbdt', 'dart', 'goss', 'rf']] = 'gbdt'¶
Type of gradient boosting model to use. "gbdt" is traditional Gradient Boosting Decision Tree. "dart" is Dropouts meet Multiple Additive Regression Trees. "goss" is Gradient-based One-Side Sampling. "rf" is Random Forest.
- field num_leaves: int = 31¶
Maximum number of leaf nodes for base learners.
- Constraints:
exclusiveMinimum = 0
- field learning_rate: float = 0.1¶
Learning rate used for training.
- Constraints:
exclusiveMinimum = 0.0
- field subsample_for_bin: int = 200000¶
Number of samples for constructing bins.
- Constraints:
exclusiveMinimum = 0
- field objective: Optional[Literal['binary', 'multiclass']] = None¶
Learning objective function to be used for training.
- field class_weight: Optional[Literal['balanced']] = None¶
Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.
- field min_split_gain: float = 0.0¶
Minimum loss reduction required to make a further partition on a leaf node of a tree.
- Constraints:
minimum = 0.0
- field min_child_weight: float = 0.001¶
Minimum sum of instance weights required in a child node.
- Constraints:
minimum = 0.0
- field min_child_samples: int = 20¶
Minimum number of samples required in a child node.
- Constraints:
exclusiveMinimum = 0
- field subsample: float = 1.0¶
Subsample ratio of the training instance.
- Constraints:
exclusiveMinimum = 0.0
- field subsample_freq: int = 0¶
Subsample frequency. To disable, set a value less than or equal to zero.
- field colsample_bytree: float = 1.0¶
Subsample ratio of columns when constructing each tree.
- Constraints:
exclusiveMinimum = 0.0
- field n_jobs: Optional[int] = None¶
Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, the number of physical cores in the system is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).
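The rule for negative n_jobs values can be worked through with a small helper. This is a sketch of the arithmetic described for n_jobs, not library code: the function name `effective_n_jobs` and the `default_when_none` argument are hypothetical, and a value of 0 is rejected here simply because the description does not cover it.

```python
def effective_n_jobs(n_jobs, n_cpus, default_when_none):
    """Translate an n_jobs setting into a concrete thread count."""
    if n_jobs is None:
        # The default differs per classifier, so the caller supplies it.
        return default_when_none
    if n_jobs == 0:
        raise ValueError("n_jobs=0 is not covered by the documented rule")
    if n_jobs < 0:
        # Documented interpretation: n_cpus + 1 + n_jobs
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


all_threads = effective_n_jobs(-1, n_cpus=8, default_when_none=4)
all_but_one = effective_n_jobs(-2, n_cpus=8, default_when_none=4)
fallback = effective_n_jobs(None, n_cpus=8, default_when_none=4)
```

On an 8-thread machine, -1 resolves to 8 threads and -2 to 7, which matches the statement that -1 uses all threads.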
- class hazy_configurator.general_params.generators.clf_sampler.LogisticRegressionConfig¶
Bases:
BaseClassifierConfig
Parameter configuration for the Logistic Regression classifier. See the scikit-learn logistic regression documentation for more details.
- Fields:
C (float)
class_weight (Optional[Literal['balanced']])
classifier_type (Literal['logistic_regression'])
dual (bool)
fit_intercept (bool)
intercept_scaling (float)
l1_ratio (Optional[float])
max_iter (int)
multi_class (Literal['auto', 'ovr', 'multinomial'])
n_jobs (Optional[int])
penalty (Literal['l1', 'l2', 'elasticnet', 'none'])
solver (Literal['newton-cg', 'lbfgs', 'liblibear', 'sag', 'saga'])
tol (float)
verbose (int)
warm_start (bool)
- field penalty: Literal['l1', 'l2', 'elasticnet', 'none'] = 'l2'¶
Norm of the penalty used for regularisation during training
- field dual: bool = False¶
Whether to use dual or primal formulation for regularisation. Dual formulation is only supported when using an "l2" penalty and "liblinear" solver.
- field tol: float = 0.0001¶
Tolerance for the stopping criteria used during training.
- Constraints:
exclusiveMinimum = 0.0
- field C: float = 1.0¶
Inverse of regularisation strength. Smaller values specify stronger regularisation.
- Constraints:
exclusiveMinimum = 0.0
- field intercept_scaling: float = 1¶
Amount to scale the intercept term by. Intercept scaling is only applicable when fit_intercept is True and a "liblinear" solver is used.
- field class_weight: Optional[Literal['balanced']] = None¶
Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data.
- field solver: Literal['newton-cg', 'lbfgs', 'liblibear', 'sag', 'saga'] = 'lbfgs'¶
Algorithm to use in the optimization problem.
- field max_iter: int = 100¶
Maximum number of iterations taken for the solver to converge.
- Constraints:
exclusiveMinimum = 0
- field multi_class: Literal['auto', 'ovr', 'multinomial'] = 'auto'¶
Multi-class classification problem approach. If "ovr", then a binary problem is fit for each label. If "multinomial", then the multinomial loss across all classes is used for optimization.
- field verbose: int = 0¶
Verbosity level for logging optimization progress. Only supported when solver is "liblinear" or "lbfgs".
- field warm_start: bool = False¶
Whether or not to re-use the solution of the previous call to fit as initialization. Not supported for the "liblinear" solver.
- field n_jobs: Optional[int] = None¶
Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, a single thread is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).
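The "balanced" option for class_weight weights classes in inverse proportion to their frequencies. A minimal sketch of that calculation, using the formula scikit-learn documents for its balanced heuristic (n_samples / (n_classes * class_count)), is shown below; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute per-class weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


labels = ["a"] * 6 + ["b"] * 2
weights = balanced_class_weights(labels)
```

With six samples of "a" and two of "b", the rarer class receives three times the weight of the common one, so both classes contribute equally to the training loss overall.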
- class hazy_configurator.general_params.generators.clf_sampler.RandomForestConfig¶
Bases:
BaseClassifierConfig
Parameter configuration for the Random Forest classifier. See the scikit-learn random forest classifier documentation for more details.
- Fields:
bootstrap (bool)
ccp_alpha (float)
class_weight (Optional[Literal['balanced', 'balanced_subsample']])
classifier_type (Literal['random_forest'])
criterion (Literal['gini', 'entropy', 'log_loss'])
max_depth (Optional[int])
max_features_count (Optional[int])
max_features_frac (Optional[float])
max_features_func (Optional[Literal['auto', 'sqrt', 'log2']])
max_leaf_nodes (Optional[int])
max_samples_count (Optional[int])
max_samples_frac (Optional[float])
min_impurity_decrease (float)
min_samples_leaf_count (int)
min_samples_leaf_frac (Optional[float])
min_samples_split_count (int)
min_samples_split_frac (Optional[float])
min_weight_fraction_leaf (float)
n_estimators (int)
n_jobs (Optional[int])
verbose (int)
warm_start (bool)
- field criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'¶
The function to measure the quality of a split.
- field max_depth: Optional[int] = None¶
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split_count (or min_samples_split_frac) samples.
- Constraints:
exclusiveMinimum = 0
- field min_samples_split_count: int = 2¶
The minimum number of samples required to split an internal node. Note that you may only set one of min_samples_split_count and min_samples_split_frac.
- Constraints:
exclusiveMinimum = 0
- field min_samples_split_frac: Optional[float] = None¶
The minimum number of samples required to split an internal node, as a fraction of the total number of samples. Note that you may only set one of min_samples_split_count and min_samples_split_frac.
- Constraints:
minimum = 0
maximum = 1
- field min_samples_leaf_count: int = 1¶
The minimum number of samples required to be at a leaf node. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.
- Constraints:
exclusiveMinimum = 0
- field min_samples_leaf_frac: Optional[float] = None¶
The minimum number of samples required to be at a leaf node, as a fraction of the total number of samples. Note that you may only set one of min_samples_leaf_count and min_samples_leaf_frac.
- Constraints:
minimum = 0.0
maximum = 1.0
- field min_weight_fraction_leaf: float = 0.0¶
The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- Constraints:
minimum = 0.0
maximum = 1.0
- field max_features_func: Optional[Literal['auto', 'sqrt', 'log2']] = 'sqrt'¶
The number of features to consider when looking for the best split at each node, as a function of the total number of features. "auto"/"sqrt" and "log2" use the square root and base-2 logarithm of the total number of features, respectively. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.
- field max_features_count: Optional[int] = None¶
The number of features to consider when looking for the best split at each node. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.
- Constraints:
exclusiveMinimum = 0
- field max_features_frac: Optional[float] = None¶
The number of features to consider when looking for the best split at each node, as a fraction of the total number of features. None corresponds to using all features. Note that you may only set one of max_features_func, max_features_count and max_features_frac.
- Constraints:
minimum = 0.0
maximum = 1.0
- field max_leaf_nodes: Optional[int] = None¶
Grows a tree in best-first fashion with the specified maximum number of leaf nodes. If None, the number of leaf nodes is unlimited.
- Constraints:
exclusiveMinimum = 0
- field min_impurity_decrease: float = 0.0¶
A node will be split if the split induces a decrease of the impurity greater than or equal to this value.
- Constraints:
minimum = 0.0
- field n_jobs: Optional[int] = None¶
Number of parallel threads to use for training. For better performance, it is recommended to set this to the number of physical cores in the CPU. If -1, all threads are used. If None, the number of physical cores in the system is used. Negative integers are interpreted as (n_cpus + 1 + n_jobs).
- field warm_start: bool = False¶
Whether to re-use the solution of the previous call to fit and add more estimators to the ensemble or fit a whole new forest.
- field class_weight: Optional[Literal['balanced', 'balanced_subsample']] = None¶
Weights associated with classes. If None, all classes have an equal weight of one. If "balanced", the classes are weighted in inverse proportion to the class frequencies in the data. If "balanced_subsample", this is the same as "balanced" with weights computed for every tree grown.
- field ccp_alpha: float = 0.0¶
Complexity parameter used for Minimal Cost-Complexity Pruning.
- Constraints:
minimum = 0.0
- field max_samples_count: Optional[int] = None¶
Number of training samples to draw to train each base estimator (only if bootstrap is True). None corresponds to using all samples.
- Constraints:
exclusiveMinimum = 0
- field max_samples_frac: Optional[float] = None¶
Number of training samples to draw to train each base estimator, as a fraction of the total number of samples (only if bootstrap is True). None corresponds to using all samples. Note that you may only set one of max_samples_count and max_samples_frac.
- Constraints:
minimum = 0.0
maximum = 1.0
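As a rough illustration of how the three max_features options relate, the sketch below converts each one into a number of features per split. The helper resolve_max_features is hypothetical and is not part of hazy_configurator; the rounding follows the convention scikit-learn uses for its own max_features parameter, and all three options default to None here so that the "only one may be set" rule can be checked directly.

```python
import math
from typing import Optional


def resolve_max_features(
    n_features: int,
    func: Optional[str] = None,
    count: Optional[int] = None,
    frac: Optional[float] = None,
) -> int:
    """Hypothetical helper: number of features considered at each split."""
    chosen = [value for value in (func, count, frac) if value is not None]
    if len(chosen) > 1:
        raise ValueError("set only one of func, count and frac")
    if count is not None:
        return count
    if frac is not None:
        # Fraction of the total, rounded down but never below one feature.
        return max(1, int(frac * n_features))
    if func is None:
        # Nothing set: every feature is considered.
        return n_features
    if func in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if func == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unknown func: {func!r}")


# With 64 input features:
by_sqrt = resolve_max_features(64, func="sqrt")
by_log2 = resolve_max_features(64, func="log2")
by_frac = resolve_max_features(64, frac=0.25)
```

The same count-or-fraction pattern applies to the min_samples_split, min_samples_leaf and max_samples pairs, although how the configurator validates them internally is not shown here.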
Sequential Training Parameters¶
- class hazy_configurator.general_params.model_parameters.SequentialTrainingParams¶
Bases:
SaasSequentialTrainingParams
Training parameters for the basic sequential model.
Examples
from hazy_configurator import SequentialTrainingParams

seq_params = SequentialTrainingParams(
    window_size=6,
    n_predict=2,
)
{ "window_size": 6, "n_predict": 2 }
- Fields:
- class hazy_configurator.general_params.model_parameters.SequentialDGANTrainingParams¶
Bases:
SaasSequentialDGANTrainingParams
Training parameters for the DoppelGANger sequential model. DoppelGANger (DGAN) is a generative adversarial network with a specialized architecture for modelling time series data.
Examples
from hazy_configurator import SequentialDGANTrainingParams

seq_params = SequentialDGANTrainingParams(
    nb_batch_generation=4,
    max_cat=100,
)
{ "nb_batch_generation": 4, "max_cat": 100 }
- field nb_batch_generation: int = 4¶
Number of records generated at each RNN pass.
- Constraints:
exclusiveMinimum = 0
- class hazy_configurator.sequential_params.SequentialRNNTrainingParams¶
Bases:
HazyBaseModel
Configuration for the sequential RNN pipeline.
The RNN pipeline is an autoregressive model that generates sequences conditioned on static attributes.
Examples
from hazy_configurator import SequentialRNNTrainingParams

seq_params = SequentialRNNTrainingParams(
    dropout=0.1,
    learning_rate=0.0003,
    hidden_size=256,
    num_layers=2,
)
{ "dropout": 0.1, "learning_rate": 0.0003, "hidden_size": 256, "num_layers": 2 }
- Fields:
batch_size (int)
device (hazy_configurator.base.enums.TorchDevice)
dropout (float)
hidden_size (int)
learning_rate (float)
max_cat (int)
max_epochs (int)
num_layers (int)
patience (int)
shuffle (bool)
val_size (float)
- field batch_size: int = 64¶
Number of sequences per training (and generation) batch. Decrease this value when memory is constrained.
- Constraints:
minimum = 1
- field max_epochs: int = 100¶
Maximum number of epochs to train each neural network for. Note that training is usually terminated before this number is reached, based on the patience setting.
- Constraints:
minimum = 1
- field patience: int = 7¶
Number of epochs after which training is terminated if there has been no improvement in the validation loss.
- Constraints:
minimum = 1
- field learning_rate: float = 0.001¶
Initial learning rate for the Adam optimizer.
- Constraints:
minimum = 0.0
- field device: TorchDevice = TorchDevice.CPU¶
Accelerator to use for training the model. NOTE: As this model relies on deep learning, it will take longer to train on a CPU.
- field val_size: float = 0.2¶
Proportion of data to use for validation.
- Constraints:
minimum = 0.0
maximum = 1.0
- field shuffle: bool = True¶
Whether or not to shuffle training data at every epoch. Also controls shuffling before splitting data into training and validation sets.
- field hidden_size: int¶
Number of units in each hidden layer. A higher number allows for more complex patterns to be captured at the cost of longer training times and potential overfitting.
- Constraints:
minimum = 2
- field num_layers: int = 3¶
Number of hidden layers in each network. A higher number allows for more complex patterns to be captured at the cost of longer training times and potential overfitting.
- Constraints:
minimum = 1
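The interaction between max_epochs and patience can be sketched as a simple early-stopping loop. This is a minimal illustration under stated assumptions, not the pipeline's actual training code: the function epochs_until_stop is hypothetical, and "improvement" is taken to mean a strictly lower validation loss than the best seen so far.

```python
from typing import Iterable


def epochs_until_stop(
    val_losses: Iterable[float], patience: int, max_epochs: int
) -> int:
    """Hypothetical helper: count the epochs run before training stops."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    epochs_run = 0
    for loss in val_losses:
        if epochs_run >= max_epochs:
            break
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run


# Made-up validation losses, one per epoch.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
stopped_early = epochs_until_stop(losses, patience=3, max_epochs=100)
ran_to_end = epochs_until_stop(losses, patience=7, max_epochs=100)
capped = epochs_until_stop(losses, patience=7, max_epochs=2)
```

With patience=3 the loop stops after the sixth epoch and never sees the later improvement to 0.6, which shows why a very small patience can end training prematurely.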