Hazy Configuration¶
These configuration objects are provided to the Client for running training and generation jobs.
Training Configuration¶
- class hazy_configurator.general_params.training_config.TrainingConfig¶
Bases:
HazyBaseModel
Main training config used to initiate a training job.
Examples
For a more detailed example see Examples.
from datetime import datetime

from hazy_configurator import TrainingConfig

train_config = TrainingConfig(
    data_schema=get_data_schema(),
    data_input=get_data_input(),
    created_at=datetime.now(),
    model_output="model.hmf",
)
{
    "action": "train",
    "data_schema": {...},
    "data_input": [{...}],
    "created_at": "2022-08-11T19:51:35.816163",
    "model_output": "model.hmf"
}
- Fields:
action (Literal['train'])
configurator_version (str)
configurator_version_used (Optional[hazy_configurator.base.extra_schema_params.SchemaVersion])
data_input (List[hazy_configurator.data_schema.data_location.io.DataLocationInput])
data_schema (hazy_configurator.data_schema.data_schema.DataSchema)
evaluation (Optional[hazy_configurator.general_params.evaluation_config.EvaluationConfig])
experimental (Optional[hazy_configurator.general_params.training_config.ExperimentalTrainingParameters])
model_parameters (hazy_configurator.general_params.model_parameters.ModelParameters)
sample_params (Optional[hazy_configurator.general_params.sample_generation_config.SampleParams])
schema_version_used (Optional[hazy_configurator.base.extra_schema_params.SchemaVersion])
- field created_at: Optional[datetime] = None¶
Datetime of when the configuration was created; use datetime.now() in Python to set it to the time the object was created.
- field model_output: Optional[Union[PathWriteTableConfig, GenericEndpoint, str]] = None¶
Path to save the output model file (.hmf). It should follow the format ‘/path/to/directory/modelname.hmf’. If saving to S3, a full path should be specified. If training via SynthDocker, this must be set.
- field data_schema: DataSchema [Required]¶
A description of the data defining the data types and table configuration.
- field data_input: List[DataLocationInput] [Required]¶
A list of data locations defining where the data is to be read from. See Data Connectors.
- field data_sources: List[Union[SecretDataSource, SecretDataSourceIO]] = []¶
List of data sources that can be referred to by DataLocationInputs under the data_input parameter. Only SecretDataSource should be used.
- field model_parameters: ModelParameters = ModelParameters(generator=PrivBayesConfig(skew_threshold=None, split_time=False, n_bins=100, single_threshold=None, max_cat=100, bin_strategy_default=<BinningStrategyType.UNIFORM: 'uniform'>, drop_empty_bins=False, processing_epsilon=None, preserve_datetime_range=True, generator_type='priv_bayes', epsilon=1000.0, n_parents=3, network_sample_rows=None, sample_parents=25, default_strategy=<PrivBayesUnknownCombinationStrategyType.UNIFORM: 'uniform'>), multi_table=MultiTableTrainingParams(adjacency_type=<AdjacencyType.DEGREE_PRESERVING: 'degree_preserving'>, parent_compression=None, core_version=1, use_cache=False), sequential=SequentialTrainingParams(window_size=20, n_predict=1))¶
The full model training parameters, including generative model hyperparameters, processing parameters and multi table model training parameters.
- field evaluation: Optional[EvaluationConfig] = EvaluationConfig(metrics=[HistogramSimilarityParams(metric_type=<MetricType.HISTOGRAM_SIMILARITY: 'histogram_similarity'>, table=None), MutualInformationSimilarityParams(metric_type=<MetricType.MUTUAL_INFORMATION_SIMILARITY: 'mutual_information_similarity'>, table=None), CrossTableMutualInformationSimilarityParams(metric_type=<MetricType.CROSS_TABLE_MUTUAL_INFORMATION_SIMILARITY: 'cross_table_mutual_information_similarity'>, table=None), DegreeDistributionSimilarityParams(metric_type=<MetricType.DEGREE_DISTRIBUTION_SIMILARITY: 'degree_distribution_similarity'>, table=None)], eval_sample_params=EvalSampleParams(magnitude=1.0))¶
The evaluation parameters used to configure the evaluation metrics generated.
- field sample_params: Optional[SampleParams] = SampleParams(magnitude=1.0, auto_magnitude=True)¶
Parameters used to generate a synthetic sample of the data to be stored in the model for inspection.
- field train_test_split: bool = False¶
Perform cross validation by splitting data into train and test sets.
- field database_subsetting_params: Optional[DatabaseSubsettingParams] = None¶
Parameters used to downsample/train-test split a multi-table dataset.
- field random_state: Optional[int] = None¶
Random seed to use during the training process. Will make the training process deterministic.
- field title: str = ''¶
A title given to each specific configuration. Should help the user distinguish different configurations.
- field metadata: Dict[str, str] = {}¶
Key value store of any metadata you wish to be associated with this model. Values are stored as a string, they can be cast as different types in the Hub for model comparison.
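Because metadata values are stored as strings, numeric values must be stringified on write and cast back when comparing models. A minimal plain-Python illustration (the keys "epochs" and "auc" are hypothetical examples, not Hazy-defined names):

```python
# Metadata values are stored as strings; cast numeric values yourself.
# The keys "epochs" and "auc" are hypothetical examples.
metadata = {"epochs": "10", "auc": "0.93"}

# Later, cast back to the appropriate type for comparison:
epochs = int(metadata["epochs"])   # 10
auc = float(metadata["auc"])       # 0.93
```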
- field nrows: Optional[int] = None¶
Limits the number of rows read into memory from each table and used for sampling. Note that this is likely to break referential integrity, so it should be used with caution; database subsetting provides a more robust alternative that can preserve referential integrity. This option is not currently available when reading data from a database.
- Constraints:
exclusiveMinimum = 0
- field strip_strings: bool = False¶
When True, left/right whitespace padding is stripped from strings.
- field empty_strings_as_null: bool = True¶
When True, empty strings are treated as null values. Any column containing empty strings and no missing values will be affected.
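A rough plain-Python sketch of what these two flags imply for a single value (illustrative only; the actual preprocessing is internal to Hazy):

```python
def preprocess(value, strip_strings=False, empty_strings_as_null=True):
    """Illustrative sketch of the strip_strings / empty_strings_as_null flags."""
    if isinstance(value, str):
        if strip_strings:
            # Remove left/right whitespace padding.
            value = value.strip()
        if empty_strings_as_null and value == "":
            # Treat the empty string as a null value.
            return None
    return value

preprocess("  padded  ", strip_strings=True)   # "padded"
preprocess("", empty_strings_as_null=True)     # None
```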
- field acknowledgement: bool = False¶
As a Data Controller, I acknowledge that I have configured all requisite settings diligently and correctly as defined in the Hazy documentation.
- sanitize() TrainingConfig ¶
Sanitize data_sources
- class hazy_configurator.general_params.sample_generation_config.SampleParams¶
Bases: SaasSampleParams, BaseSampleParams
Sample parameters.
These define how data should be generated for producing a sample of data.
- field magnitude: float = 1.0¶
Amount of synthetic data generated as a proportion of the number of rows in the training data. That is, the number of rows after any optional subsampling and train-test-splitting, as specified by the user, has been performed. For example, a value of 1.0 will generate as much data as the training data for every table. A value of 2.0 will generate twice as many rows.
- Constraints:
exclusiveMinimum = 0
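magnitude scales per-table row counts linearly with the training data. The expected number of generated rows can be sketched in plain Python as (a simplified illustration, not Hazy's internal logic):

```python
def expected_rows(train_rows: int, magnitude: float) -> int:
    # Rows generated per table as a proportion of (post-split) training rows.
    return round(train_rows * magnitude)

expected_rows(10_000, 1.0)  # 10000 rows, same size as the training data
expected_rows(10_000, 2.0)  # 20000 rows, twice as many
```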
Generation Configuration¶
- class hazy_configurator.general_params.generation_config.GenerationConfig¶
Bases:
VersionedHazyBaseModel
Main generation config used to initiate a generation job.
Examples
from datetime import datetime

from hazy_configurator import DataLocationOutput, GenerationConfig, GenSampleParams

gen_config = GenerationConfig(
    created_at=datetime.now(),
    model="model.hmf",
    data_output=[
        DataLocationOutput(
            name="output-table",
            location="/data/output.csv",
        )
    ],
    generate_params=GenSampleParams(magnitude=2.0),
)
{
    "action": "generate",
    "created_at": "2022-08-11T19:51:35.816163",
    "model": "model.hmf",
    "data_output": [
        {
            "name": "output-table",
            "location": "/data/output.csv"
        }
    ],
    "generate_params": {
        "magnitude": 2.0
    }
}
- Fields:
action (Literal['generate'])
configurator_version (str)
configurator_version_used (Optional[hazy_configurator.base.extra_schema_params.SchemaVersion])
data_output (List[hazy_configurator.data_schema.data_location.io.DataLocationOutput])
generate_params (hazy_configurator.general_params.sample_generation_config.GenSampleParams)
schema_version_used (Optional[hazy_configurator.base.extra_schema_params.SchemaVersion])
validation (Optional[hazy_configurator.general_params.validation_config.ValidationConfig])
- field created_at: Optional[datetime] = None¶
Datetime of when the configuration was created; use datetime.now() in Python to set it to the time the object was created.
- field data_output: List[DataLocationOutput] [Required]¶
List of output data storage locations for each table.
- field data_sources: List[Union[SecretDataSource, SecretDataSourceIO]] = []¶
List of data sources that can be referred to by DataLocationOutputs under the data_output parameter. Only SecretDataSource should be used.
- field random_state: Optional[int] = None¶
Random seed to use during the generation process. Will make the generation process deterministic so that rerunning the generation process with the same parameters produces identical data.
- field generate_params: GenSampleParams = GenSampleParams(magnitude=1.0, post_generation=None, limit_strings=False, drop_unique_violation=False, batch_magnitude=None)¶
Parameters which specify how data should be generated.
- field validation: Optional[ValidationConfig] = ValidationConfig(enable=True)¶
Options for configuring post-generation validation.
- field generation_state_path: Optional[Union[PathWriteTableConfig, str]] = None¶
Path to save output file containing the metadata JSON for the generation run. If no value is specified, the generation state will not be saved.
- sanitize() GenerationConfig ¶
Sanitize data_sources
- class hazy_configurator.general_params.sample_generation_config.GenSampleParams¶
Bases: SaasGenSampleParams, BaseSampleParams
Generation sample parameters.
These define how data should be generated.
- Fields:
- field magnitude: float = 1.0¶
Amount of synthetic data generated as a proportion of the number of rows in the training data. That is, the number of rows after any optional subsampling and train-test-splitting, as specified by the user, has been performed. For example, a value of 1.0 will generate as much data as the training data for every table. A value of 2.0 will generate twice as many rows.
- Constraints:
exclusiveMinimum = 0
- field post_generation: Optional[PostGenerationConfig] = None¶
Configuration for applying post-generation to a set of tables. See Post Generation.
- field limit_strings: bool = False¶
Whether or not generated strings should be restricted to have a length no longer than the maximum string length in the original data.
- field drop_unique_violation: bool = False¶
When True, drops duplicates from columns that had no duplicate values when read from disk, or from columns that had unique constraints in the database.
- field batch_magnitude: float = None¶
If enabled, data is generated batch-wise. The batch magnitude is the amount of synthetic data generated per batch as a proportion of the number of rows in the training data. Must be less than or equal to magnitude.
If this parameter is set and you are writing to a database server, note the following behaviours:
The supplied if_exists write parameter is overridden for batch generation indices > 0, as subsequent batches of generated data are appended to tables. Also, the ability to ‘rollback’ SQL transactions from a set of writes to a database server if any write is unsuccessful applies on a per-batch basis; it is therefore possible for preceding batches to remain written to the server if one batch-write fails.
Alternatively, if a fail-safe path has been set in the write config, then a subset of batches can be written to the database server, and the remainder to the fail-safe path.
- Constraints:
exclusiveMinimum = 0
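The batching arithmetic implied above can be sketched in plain Python (a simplified illustration of how magnitude and batch_magnitude relate, not Hazy's internal scheduler):

```python
import math

def batch_plan(train_rows: int, magnitude: float, batch_magnitude: float):
    """Illustrative sketch: split a run of `magnitude` into batches of at
    most `batch_magnitude` each (batch_magnitude must not exceed magnitude)."""
    assert 0 < batch_magnitude <= magnitude
    n_batches = math.ceil(magnitude / batch_magnitude)
    rows_per_full_batch = round(train_rows * batch_magnitude)
    total_rows = round(train_rows * magnitude)
    return n_batches, rows_per_full_batch, total_rows

# Generate 2x the training data in batches of 0.5x each:
batch_plan(100_000, magnitude=2.0, batch_magnitude=0.5)  # (4, 50000, 200000)
```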
- class hazy_configurator.general_params.validation_config.ValidationConfig¶
Bases:
VersionedHazyBaseModel
Class for configuring validation of generated data. For more information on the set of validators run, see functional validation.
- Fields:
- field enable: bool = True¶
If enabled, run validation on the generated data, and produce a validation report. This uses the configuration, as well metadata from the training data, stored in the model file, to check the integrity of the generated data. This can currently only be viewed in the Hub user interface.