Hazy Configuration

These configuration objects are provided to the Client for running training and generation jobs.

Training Configuration

class hazy_configurator.general_params.training_config.TrainingConfig

Bases: HazyBaseModel

Main training config used to initiate a training job.

Examples

For a more detailed example see Examples.

from datetime import datetime
from hazy_configurator import TrainingConfig

train_config = TrainingConfig(
    data_schema=get_data_schema(),
    data_input=get_data_input(),
    created_at=datetime.now(),
    model_output="model.hmf",
)
Fields:
field created_at: Optional[datetime] = None

Datetime of when the configuration was created. Use datetime.now() in Python to set this to the time the object was created.

field model_output: Optional[Union[PathWriteTableConfig, GenericEndpoint, str]] = None

Path to save the output model file (.hmf). It should follow the format ‘/path/to/directory/modelname.hmf’. If saving to S3, a full path should be specified. If training via SynthDocker, this must be set.
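As a minimal sketch, the two common string forms look like the following (the paths and bucket name are hypothetical, and the `s3://` URI scheme is an assumption):

```python
# Local filesystem output: '/path/to/directory/modelname.hmf'.
model_output = "/models/customers.hmf"

# S3 output: a full path should be specified (bucket name is hypothetical).
model_output = "s3://my-bucket/models/customers.hmf"

print(model_output.endswith(".hmf"))  # True
```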

field data_schema: DataSchema [Required]

A description of the data defining the data types and table configuration.

field data_input: List[DataLocationInput] [Required]

A list of data locations defining where the data is to be read from. See Data Connectors.

field data_sources: List[Union[SecretDataSource, SecretDataSourceIO]] = []

List of data sources that can be referred to by DataLocationInputs under the data_input parameter. Only SecretDataSource should be used.

field model_parameters: ModelParameters = ModelParameters(
    generator=PrivBayesConfig(skew_threshold=None, split_time=False, n_bins=100, single_threshold=None, max_cat=100, bin_strategy_default='uniform', drop_empty_bins=False, processing_epsilon=None, preserve_datetime_range=True, generator_type='priv_bayes', epsilon=1000.0, n_parents=3, network_sample_rows=None, sample_parents=25, default_strategy='uniform'),
    multi_table=MultiTableTrainingParams(adjacency_type='degree_preserving', parent_compression=None, core_version=1, use_cache=False),
    sequential=SequentialTrainingParams(window_size=20, n_predict=1),
)

The full model training parameters, including generative model hyperparameters, processing parameters and multi table model training parameters.
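As a sketch, overriding a few generator hyperparameters while keeping the remaining defaults might look like this. The field names are taken from the defaults shown above, but whether `ModelParameters` and `PrivBayesConfig` are importable from the top-level package is an assumption:

```python
# Assumed top-level imports; the classes may live in a submodule instead.
from hazy_configurator import ModelParameters, PrivBayesConfig

model_params = ModelParameters(
    generator=PrivBayesConfig(
        epsilon=100.0,   # tighter differential-privacy budget than the 1000.0 default
        n_parents=2,     # fewer parents per node in the Bayesian network
        n_bins=50,       # coarser binning of continuous columns
    ),
)
```

The resulting object would then be passed as `TrainingConfig(model_parameters=model_params, ...)`.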

field evaluation: Optional[EvaluationConfig] = EvaluationConfig(
    metrics=[
        HistogramSimilarityParams(metric_type='histogram_similarity', table=None),
        MutualInformationSimilarityParams(metric_type='mutual_information_similarity', table=None),
        CrossTableMutualInformationSimilarityParams(metric_type='cross_table_mutual_information_similarity', table=None),
        DegreeDistributionSimilarityParams(metric_type='degree_distribution_similarity', table=None),
    ],
    eval_sample_params=EvalSampleParams(magnitude=1.0),
)

The evaluation parameters used to configure the evaluation metrics generated.

field sample_params: Optional[SampleParams] = SampleParams(magnitude=1.0, auto_magnitude=True)

Parameters used to generate a synthetic sample of the data to be stored in the model for inspection.

field train_test_split: bool = False

Perform cross validation by splitting data into train and test sets.

field database_subsetting_params: Optional[DatabaseSubsettingParams] = None

Parameters used to downsample/train-test split a multi-table dataset.

field random_state: Optional[int] = None

Random seed to use during the training process. Will make the training process deterministic.

field title: str = ''

A title given to each specific configuration. Should help the user distinguish different configurations.

field description: str = ''

Description of model to differentiate it from others.

field metadata: Dict[str, str] = {}

Key value store of any metadata you wish to be associated with this model. Values are stored as a string, they can be cast as different types in the Hub for model comparison.

field nrows: Optional[int] = None

Limits the number of rows read into memory from each table and used for sampling. Please note that this is likely to break referential integrity, so it should be used with caution. Database subsetting provides a more robust alternative that handles these scenarios correctly. This option is not currently available when reading data from a database.

Constraints:
  • exclusiveMinimum = 0

field strip_strings: bool = False

When True, left/right whitespace padding is stripped from strings.

field empty_strings_as_null: bool = True

When True, empty strings are treated as null values. Any column containing empty strings and no missing values will be affected.

field acknowledgement: bool = False

As a Data Controller, I acknowledge that I have configured all requisite settings diligently and correctly as defined in the Hazy documentation.

sanitize() → TrainingConfig

Sanitize data_sources

class hazy_configurator.general_params.sample_generation_config.SampleParams

Bases: SaasSampleParams, BaseSampleParams

Sample parameters.

These define how data should be generated for producing a sample of data.

Fields:
field magnitude: float = 1.0

Amount of synthetic data generated as a proportion of the number of rows in the training data. That is, the number of rows remaining after any optional subsampling and train-test splitting specified by the user has been performed. For example, a value of 1.0 will generate as much data as the training data for every table. A value of 2.0 will generate twice as many rows.

Constraints:
  • exclusiveMinimum = 0

field auto_magnitude: bool = True

Automatically set magnitude based on an estimate so that a minimum of 25 rows of sample synthetic data are stored in the model. When set to True, this takes precedence and magnitude is ignored.
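As a sketch, pinning the stored sample to a fixed size requires disabling the automatic estimate so that magnitude is honoured (assumes `SampleParams` is importable from the top-level package):

```python
# Assumed top-level import.
from hazy_configurator import SampleParams

# Store a sample twice the size of the training data; auto_magnitude is
# disabled so the explicit magnitude is not overridden.
sample_params = SampleParams(magnitude=2.0, auto_magnitude=False)
```

The object is then passed as `TrainingConfig(sample_params=sample_params, ...)`.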

Generation Configuration

class hazy_configurator.general_params.generation_config.GenerationConfig

Bases: VersionedHazyBaseModel

Main generation config used to initiate a generation job.

Examples

from datetime import datetime
from hazy_configurator import DataLocationOutput, GenerationConfig, GenSampleParams

gen_config = GenerationConfig(
    created_at=datetime.now(),
    model="model.hmf",
    data_output=[
        DataLocationOutput(
            name="output-table",
            location="/data/output.csv",
        )
    ],
    generate_params=GenSampleParams(magnitude=2.0),
)
Fields:
field created_at: Optional[datetime] = None

Datetime of when the configuration was created. Use datetime.now() in Python to set this to the time the object was created.

field model: Optional[Union[PathReadTableConfig, str, Path]] = None

Path to the model file (.hmf).

field data_output: List[DataLocationOutput] [Required]

List of output data storage locations for each table.

field data_sources: List[Union[SecretDataSource, SecretDataSourceIO]] = []

List of data sources that can be referred to by DataLocationOutputs under the data_output parameter. Only SecretDataSource should be used.

field random_state: Optional[int] = None

Random seed to use during the generation process. Will make the generation process deterministic so that rerunning the generation process with the same parameters produces identical data.

field generate_params: GenSampleParams = GenSampleParams(magnitude=1.0, post_generation=None, limit_strings=False, drop_unique_violation=False, batch_magnitude=None)

Parameters which specify how data should be generated.

field validation: Optional[ValidationConfig] = ValidationConfig(enable=True)

Options for configuring post-generation validation.

field generation_state_path: Optional[Union[PathWriteTableConfig, str]] = None

Path to save output file containing the metadata JSON for the generation run. If no value is specified, the generation state will not be saved.

sanitize() → GenerationConfig

Sanitize data_sources

class hazy_configurator.general_params.sample_generation_config.GenSampleParams

Bases: SaasGenSampleParams, BaseSampleParams

Generation sample parameters.

These define how data should be generated.

Fields:
field magnitude: float = 1.0

Amount of synthetic data generated as a proportion of the number of rows in the training data. That is, the number of rows remaining after any optional subsampling and train-test splitting specified by the user has been performed. For example, a value of 1.0 will generate as much data as the training data for every table. A value of 2.0 will generate twice as many rows.

Constraints:
  • exclusiveMinimum = 0

field post_generation: Optional[PostGenerationConfig] = None

Configuration for applying post-generation to a set of tables. See Post Generation.

field limit_strings: bool = False

Whether or not generated strings should be restricted to have a length no longer than the maximum string length in the original data.

field drop_unique_violation: bool = False

When True, duplicates will be dropped from columns that contained no duplicate values when read from disk, or from columns that had unique constraints in the database.

field batch_magnitude: Optional[float] = None

If enabled, data is generated batch-wise. The batch magnitude is the amount of synthetic data generated per batch as a proportion of the number of rows in the training data. Must be less than or equal to magnitude.

If this parameter is set and you are writing to a database server, note the following behaviours:

The supplied if_exists write parameter is overridden for batch generation indices greater than 0, because subsequent batches of generated data are appended to tables.

Also, the ability to ‘rollback’ SQL transactions from a set of writes to a database server applies on a per-batch basis. It is therefore possible that preceding batches remain written to the server if a later batch-write fails.

Alternatively, if a fail-safe path has been set in the write config, a subset of batches can be written to the database server and the remainder to the fail-safe path.

Constraints:
  • exclusiveMinimum = 0
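Since each batch generates batch_magnitude times the training-data row count until magnitude is reached, the implied batch count can be sketched in plain Python. The exact batching arithmetic is an assumption, not documented behaviour:

```python
import math

def n_batches(magnitude: float, batch_magnitude: float) -> int:
    """Rough number of generation batches, assuming each batch produces
    batch_magnitude * n_training_rows rows until magnitude is reached."""
    if batch_magnitude > magnitude:
        raise ValueError("batch_magnitude must be less than or equal to magnitude")
    return math.ceil(magnitude / batch_magnitude)

# e.g. magnitude=2.0 generated in batches of 0.5 -> 4 batches,
# with the last batch possibly partial when the ratio is not integral.
print(n_batches(2.0, 0.5))  # 4
```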

class hazy_configurator.general_params.validation_config.ValidationConfig

Bases: VersionedHazyBaseModel

Class for configuring validation of generated data. For more information on the set of validators run, see functional validation.

Fields:
field enable: bool = True

If enabled, run validation on the generated data and produce a validation report. This uses the configuration, as well as metadata from the training data stored in the model file, to check the integrity of the generated data. The report can currently only be viewed in the Hub user interface.
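As a sketch, validation can be switched off for a generation run by passing an explicit config (assumes `ValidationConfig` is importable from the top-level package):

```python
# Assumed top-level import.
from hazy_configurator import ValidationConfig

# Skip post-generation validation and its report entirely.
validation = ValidationConfig(enable=False)
```

The object is then passed as `GenerationConfig(validation=validation, ...)`.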