Evaluation

Evaluation Configuration

class hazy_configurator.general_params.evaluation_config.EvaluationConfig

Bases: HazyBaseModel

Defines how to run evaluation at the end of a training job.

Examples

from hazy_configurator import (
    EvaluationConfig,
    HistogramSimilarityParams,
    MutualInformationSimilarityParams,
    CrossTableMutualInformationSimilarityParams,
    DegreeDistributionSimilarityParams,
    PresenceDisclosureParams,
    DensityDisclosureParams,
    EvalSampleParams,
)

eval_config = EvaluationConfig(
    metrics=[
        HistogramSimilarityParams(),
        MutualInformationSimilarityParams(),
        CrossTableMutualInformationSimilarityParams(),
        DegreeDistributionSimilarityParams(),
        PresenceDisclosureParams(table="customer_marketing"),
        DensityDisclosureParams(table="customer_marketing"),
    ],
    eval_sample_params=EvalSampleParams(magnitude=0.2),
)
Fields:
field metrics: Optional[List[MetricParamsUnion]] = [HistogramSimilarityParams(metric_type=<MetricType.HISTOGRAM_SIMILARITY: 'histogram_similarity'>, table=None), MutualInformationSimilarityParams(metric_type=<MetricType.MUTUAL_INFORMATION_SIMILARITY: 'mutual_information_similarity'>, table=None), CrossTableMutualInformationSimilarityParams(metric_type=<MetricType.CROSS_TABLE_MUTUAL_INFORMATION_SIMILARITY: 'cross_table_mutual_information_similarity'>, table=None), DegreeDistributionSimilarityParams(metric_type=<MetricType.DEGREE_DISTRIBUTION_SIMILARITY: 'degree_distribution_similarity'>, table=None)]

A list of metrics and their parameters to run. See Metrics. By default, histogram similarity, mutual information similarity, cross-table mutual information similarity and degree distribution similarity are run.

field eval_sample_params: EvalSampleParams = EvalSampleParams(magnitude=1.0)

These parameters describe how to generate the data for evaluation.

class hazy_configurator.general_params.sample_generation_config.EvalSampleParams

Bases: SaasEvalSampleParams, BaseSampleParams

Evaluation sample parameters.

These define how data should be generated for evaluation.

Fields:
field magnitude: float = 1.0

Amount of synthetic data generated as a proportion of the number of rows in the training data, counted after any optional subsampling and train-test splitting specified by the user has been performed. For example, a value of 1.0 generates as many rows as the training data for every table, and a value of 2.0 generates twice as many.

Constraints:
  • exclusiveMinimum = 0
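
The magnitude arithmetic can be sketched as follows. This is only an illustration, not the library's implementation: `rows_to_generate` is a hypothetical helper, and rounding to the nearest whole row is an assumption.

```python
def rows_to_generate(train_rows: int, magnitude: float) -> int:
    """Hypothetical helper: synthetic rows produced for one table."""
    # Mirrors the documented constraint: exclusiveMinimum = 0.
    if magnitude <= 0:
        raise ValueError("magnitude must be strictly greater than 0")
    # Assumption: the product is rounded to the nearest whole row.
    return round(train_rows * magnitude)


print(rows_to_generate(1000, 0.2))
print(rows_to_generate(1000, 2.0))
```

Here `train_rows` stands for the row count that remains after subsampling and train-test splitting.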

Metrics

Classes:

AggSeqDensityDisclosureParams

Aggregates sequential data and then performs density disclosure on the processed data.

AggSeqHistogramSimilarityParams

Aggregates sequential data and then performs histogram similarity on the processed data.

AggSeqMutualInfoParams

Aggregates sequential data and then performs mutual information similarity on the processed data.

AggSeqPresenceDisclosureParams

Aggregates sequential data and then performs presence disclosure on the processed data.

AggSeqQueryUtilityParams

Aggregates sequential data and then performs query utility on the processed data.

CrossTableMutualInformationSimilarityParams

Pairwise mutual information between pairs of columns from connected tables.

DegreeDistributionSimilarityParams

Measures the similarity in the distribution of the number of connections in one-to-many and many-to-many relationships across real and synthetic data.

DensityDisclosureParams

Estimates the risk of an adversary constructing a mapping from the synthetic data points to the real data points (this concept can also be referred to as “reversibility”).

DistanceClosestRecordParams

Captures whether the synthetic data contains records that are simple copies or minor perturbations of the train data records.

HistogramSimilarityParams

Measures the similarity of the marginal distributions in the real and synthetic data.

IndexMetricParams

Measures the preservation of points of interest in the synthetic data based on a market research approach to identify those.

MutualInformationSimilarityParams

Measures the similarity of the real and synthetic data from an Information Theory point of view.

PredictiveUtilityParams

Runs a predictor on synthetic and real data and measures how close the synthetic score is to the real score.

PresenceDisclosureParams

Measures the certainty with which an adversary could infer whether an arbitrary data point was present in the real data used to train the synthetic data generator.

QueryUtilityParams

Measures the average overlap of the joint distribution of values across three or more columns.

SequentialDiscriminatorParams

Captures whether a sequential classifier is able to distinguish between real and synthetic data.

SequentialSimilarityParams

Applies a subset of the catch22: CAnonical Time-series CHaracteristics properties and compares real to synthetic.

class hazy_configurator.general_params.metrics.agg_seq_density_disclosure_params.AggSeqDensityDisclosureParams

Bases: AggregatedSequentialParams, SaasAggSeqDensityDisclosureParams

Aggregates sequential data and then performs density disclosure on the processed data.

Examples

from hazy_configurator import AggSeqDensityDisclosureParams

# run on "transactions" table only
AggSeqDensityDisclosureParams(table="transactions", seq_id="account_id")
Fields:
field sample_records: int = 10000

Number of records of synthetic data to be sampled when scoring the Density Disclosure metric.

field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field agg_cols: Optional[List[str]] = None

Columns to aggregate on.

field agg_functions: List[AggFunctionUnion] = ['median']

List of functions to use to aggregate each column.

field table: str [Required]

A table must be provided.

class hazy_configurator.general_params.metrics.agg_seq_histogram_similarity_params.AggSeqHistogramSimilarityParams

Bases: AggregatedSequentialParams

Aggregates sequential data and then performs histogram similarity on the processed data.

Examples

from hazy_configurator import AggSeqHistogramSimilarityParams

# run on "transactions" table only
AggSeqHistogramSimilarityParams(table="transactions", seq_id="account_id")
Fields:
field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field agg_cols: Optional[List[str]] = None

Columns to aggregate on.

field agg_functions: List[AggFunctionUnion] = ['median']

List of functions to use to aggregate each column.

field table: str [Required]

A table must be provided.

class hazy_configurator.general_params.metrics.agg_seq_mutual_info_params.AggSeqMutualInfoParams

Bases: AggregatedSequentialParams

Aggregates sequential data and then performs mutual information similarity on the processed data.

Examples

from hazy_configurator import AggSeqMutualInfoParams

# run on "transactions" table only
AggSeqMutualInfoParams(table="transactions", seq_id="account_id")
Fields:
field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field agg_cols: Optional[List[str]] = None

Columns to aggregate on.

field agg_functions: List[AggFunctionUnion] = ['median']

List of functions to use to aggregate each column.

field table: str [Required]

A table must be provided.

class hazy_configurator.general_params.metrics.agg_seq_presence_disclosure_params.AggSeqPresenceDisclosureParams

Bases: AggregatedSequentialParams

Aggregates sequential data and then performs presence disclosure on the processed data.

Examples

from hazy_configurator import AggSeqPresenceDisclosureParams

# run on "transactions" table only
AggSeqPresenceDisclosureParams(table="transactions", seq_id="account_id")
Fields:
field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field agg_cols: Optional[List[str]] = None

Columns to aggregate on.

field agg_functions: List[AggFunctionUnion] = ['median']

List of functions to use to aggregate each column.

field table: str [Required]

A table must be provided.

class hazy_configurator.general_params.metrics.agg_seq_query_utility_params.AggSeqQueryUtilityParams

Bases: AggregatedSequentialParams

Aggregates sequential data and then performs query utility on the processed data.

Examples

from hazy_configurator import AggSeqQueryUtilityParams

# run on "transactions" table only
AggSeqQueryUtilityParams(table="transactions", seq_id="account_id")
Fields:
field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field agg_cols: Optional[List[str]] = None

Columns to aggregate on.

field agg_functions: List[AggFunctionUnion] = ['median']

List of functions to use to aggregate each column.

field table: str [Required]

A table must be provided.

class hazy_configurator.general_params.metrics.cross_table_mutual_information_similarity_params.CrossTableMutualInformationSimilarityParams

Bases: BaseMetricParams

Pairwise mutual information between pairs of columns from connected tables.

This metric is only applicable in a multi-table context and measures how well models capture the relations between tables. It is calculated in the same way as the Mutual Information Similarity metric, with the additional constraint that the two columns in each pair must be in different tables.

Examples

from hazy_configurator import CrossTableMutualInformationSimilarityParams

# cannot select a table with this metric since cross table
CrossTableMutualInformationSimilarityParams()
Fields:
field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.degree_distribution_similarity_params.DegreeDistributionSimilarityParams

Bases: BaseMetricParams

Measures the similarity in the distribution of the number of connections in one-to-many and many-to-many relationships across real and synthetic data.

It produces a histogram for either side of the relationship i.e. A->B and B->A.
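
A minimal sketch of the idea for one direction of a one-to-many relationship, assuming the degree of a parent row is the number of child rows whose foreign key points at it. The helper names and the overlap score are illustrative assumptions, not the library's implementation.

```python
from collections import Counter


def degree_distribution(parent_ids, child_foreign_keys):
    """Normalised histogram of the number of child rows per parent row."""
    children_per_parent = Counter(child_foreign_keys)
    # Parents without any children count as degree 0.
    degrees = Counter(children_per_parent.get(p, 0) for p in parent_ids)
    return {d: n / len(parent_ids) for d, n in degrees.items()}


def distribution_overlap(p, q):
    """Sum of the bin-wise minimum of two normalised histograms."""
    return sum(min(p.get(d, 0.0), q.get(d, 0.0)) for d in set(p) | set(q))


real = degree_distribution([1, 2, 3], [1, 1, 2])   # degrees 2, 1, 0
synth = degree_distribution([1, 2, 3], [1, 2, 3])  # degrees 1, 1, 1
print(distribution_overlap(real, synth))
```

The same two helpers could be applied to the other side of the relationship by swapping the roles of the two tables.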

Examples

from hazy_configurator import DegreeDistributionSimilarityParams

# cannot select a table with this metric since cross table
DegreeDistributionSimilarityParams()
Fields:
field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.density_disclosure_params.DensityDisclosureParams

Bases: BaseMetricParams, SaasDensityDisclosureParams

Estimates the risk of an adversary constructing a mapping from the synthetic data points to the real data points (this concept can also be referred to as “reversibility”).

This estimation is done by counting how many real data points exist in the neighbourhood of each synthetic data point. If there are no real points or if there are many real points in the neighbourhood, then it is not possible for an adversary to construct an unambiguous map from the synthetic data point to a real data point. In the first case, there is no real data point to map to; in the second case, the many alternatives make any attempted mapping ambiguous at best and preserve plausible deniability.
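
The neighbourhood counting can be sketched as below. This is an illustration only: the Euclidean distance, the fixed `radius`, and treating "exactly one real neighbour" as the risky case are all assumptions, and the function names are hypothetical.

```python
import math


def real_neighbour_counts(synth_points, real_points, radius):
    """For each synthetic point, count the real points within `radius`."""
    return [
        sum(1 for r in real_points if math.dist(s, r) <= radius)
        for s in synth_points
    ]


def fraction_unambiguous(synth_points, real_points, radius):
    """Share of synthetic points with exactly one real neighbour.

    Assumption: zero neighbours (nothing to map to) and several neighbours
    (ambiguous mapping) are both treated as safe.
    """
    counts = real_neighbour_counts(synth_points, real_points, radius)
    return sum(1 for c in counts if c == 1) / len(counts)


real = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
synth = [(0.05, 0.0), (5.0, 5.1), (10.0, 10.0)]
print(real_neighbour_counts(synth, real, radius=0.5))
print(fraction_unambiguous(synth, real, radius=0.5))
```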

Examples

from hazy_configurator import DensityDisclosureParams

# run on "customer marketing" table only
DensityDisclosureParams(table="customer_marketing")

# run on all tables
DensityDisclosureParams()
Fields:
field sample_records: int = 10000

Number of records of synthetic data to be sampled when scoring the Density Disclosure metric.

Constraints:
  • minimum = 1

field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.distance_closest_record_params.DistanceClosestRecordParams

Bases: BaseMetricParams, SaasDistanceClosestRecordParams

Captures whether the synthetic data contains records that are simple copies or minor perturbations of the train data records.

Examples

from hazy_configurator import DistanceClosestRecordParams

# run on "customer marketing" table only
DistanceClosestRecordParams(table="customer_marketing")

# run on all tables
DistanceClosestRecordParams()
Fields:
field n_records: Optional[int] = 10000

Number of records to be randomly sampled from both synth_df and test_df.

field columns: Optional[List[str]] = None

Subset of the dataset columns.

field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.histogram_similarity_params.HistogramSimilarityParams

Bases: BaseMetricParams

Measures the similarity of the marginal distributions in the real and synthetic data.

Every column in both datasets is binned and the overlap of the resulting histograms is calculated. The final score is calculated by averaging the overlap across all columns.
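
The binning and overlap steps can be sketched as follows for numeric columns. Equal-width bins over the combined range and the default `n_bins` are assumptions made for illustration; the helper names are hypothetical.

```python
from collections import Counter


def histogram_overlap(real, synth, n_bins=10):
    """Overlap of two normalised histograms for one numeric column."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    # Fall back to a width of 1.0 when the column is constant.
    width = (hi - lo) / n_bins or 1.0

    def normalised_hist(values):
        # The maximum value is clamped into the last bin.
        counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
        return [counts.get(b, 0) / len(values) for b in range(n_bins)]

    h_real, h_synth = normalised_hist(real), normalised_hist(synth)
    return sum(min(r, s) for r, s in zip(h_real, h_synth))


def histogram_similarity(real_table, synth_table, n_bins=10):
    """Average per-column overlap; tables map column name -> list of values."""
    scores = [
        histogram_overlap(real_table[c], synth_table[c], n_bins)
        for c in real_table
    ]
    return sum(scores) / len(scores)


real = {"age": [1, 2, 3, 4], "income": [10, 20, 30, 40]}
synth = {"age": [1, 2, 3, 4], "income": [10, 10, 40, 40]}
print(histogram_similarity(real, synth, n_bins=4))
```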

Examples

from hazy_configurator import HistogramSimilarityParams

# run on "customer marketing" table only
HistogramSimilarityParams(table="customer_marketing")

# run on all tables
HistogramSimilarityParams()
Fields:
field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.index_metric_params.IndexMetricParams

Bases: BaseMetricParams

Measures the preservation of points of interest in the synthetic data based on a market research approach to identify those.

Indexing is a market research approach used to highlight points of interest in a dataset that can drive advertising campaigns. This metric measures how well these points of interest are preserved in the synthetic data. A filter is used to group the data into separate subpopulations. The probability of the subpopulation having a particular attribute is calculated and then divided by the probability of the total population also having that attribute. The similarity between the real and synthetic data is then calculated using the element-wise Jaccard distance (min/max).
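
The index calculation described above can be sketched as below for a single filter column and a single attribute value. The helper names are hypothetical, averaging the element-wise min/max ratios is an assumption, and the sketch assumes the attribute value occurs at least once in the data.

```python
def index_values(rows, filter_col, attr_col, attr_value):
    """Index per subpopulation: P(attr | subpopulation) / P(attr | everyone)."""

    def share(subset):
        return sum(r[attr_col] == attr_value for r in subset) / len(subset)

    overall = share(rows)  # assumed non-zero
    groups = {}
    for r in rows:
        groups.setdefault(r[filter_col], []).append(r)
    return {g: share(members) / overall for g, members in groups.items()}


def index_similarity(real_idx, synth_idx):
    """Mean element-wise min/max ratio over subpopulations present in both."""
    ratios = []
    for g in real_idx.keys() & synth_idx.keys():
        hi = max(real_idx[g], synth_idx[g])
        ratios.append(1.0 if hi == 0 else min(real_idx[g], synth_idx[g]) / hi)
    return sum(ratios) / len(ratios)


real_rows = [
    {"currency_symbol": "EUR", "type": "card"},
    {"currency_symbol": "EUR", "type": "card"},
    {"currency_symbol": "USD", "type": "card"},
    {"currency_symbol": "USD", "type": "cash"},
]
print(index_values(real_rows, "currency_symbol", "type", "card"))
```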

Examples

from hazy_configurator import (
    IndexMetricParams,
    IndexMetricFilter,
    IndexMetricAttribute,
)

# run on "transactions" table,
# filtering the data by the column "currency_symbol", which is not binary,
# and focusing on an attribute named "categorical", to which the columns "type" and "operation" are related (both non-binary)
# table, filters and attributes are required arguments
IndexMetricParams(
    table="transactions",
    filters=IndexMetricFilter(columns=["currency_symbol"], binary_flag=False),
    attributes=[
        IndexMetricAttribute(
            feature_group_name="categorical",
            columns=["type", "operation"],
            binary_flag=False
        )
    ]
)
Fields:
field table: str [Required]

A table must be provided.

field filters: IndexMetricFilter [Required]

The filter used to group the data into subpopulations.

field attributes: List[IndexMetricAttribute] [Required]

These are the attributes of a subpopulation that are being indexed against the total population. For example, if we want to assess a particular group's sentiment towards a brand of product, the column associated with sentiment towards that brand would be provided as the attribute.

class hazy_configurator.general_params.metrics.mutual_information_similarity_params.MutualInformationSimilarityParams

Bases: BaseMetricParams

Measures the similarity of the real and synthetic data from an Information Theory point of view.

First, the column-wise normalised mutual information is calculated for both the real and synthetic datasets. Then, the similarity of these matrices is calculated by taking the average of all off-diagonal elements (since the normalised diagonal elements are 1 in both matrices) using Jaccard distance (min/max).
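
A minimal sketch of the two steps for discrete columns. Normalising by the geometric mean of the two entropies is an assumption (the source does not state which normalisation is used), and the helper names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (natural log) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalised_mi(x, y):
    """MI(x, y) / sqrt(H(x) * H(y)); the normalisation is an assumption."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(x, y)))
    return max(mi, 0.0) / math.sqrt(hx * hy)


def mi_similarity(real, synth):
    """Average min/max ratio over all off-diagonal column pairs."""
    ratios = []
    for a, b in combinations(sorted(real), 2):
        r = normalised_mi(real[a], real[b])
        s = normalised_mi(synth[a], synth[b])
        ratios.append(1.0 if max(r, s) == 0 else min(r, s) / max(r, s))
    return sum(ratios) / len(ratios)


real = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
synth = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [0, 1, 0, 1]}
print(mi_similarity(real, synth))
```

In this toy data the synthetic table breaks the a-b dependency and introduces a b-c dependency that the real table lacks, so only one of the three pairs matches.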

Examples

from hazy_configurator import MutualInformationSimilarityParams

# run on "customer marketing" table only
MutualInformationSimilarityParams(table="customer_marketing")

# run on all tables
MutualInformationSimilarityParams()
Fields:
field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.predictive_utility_params.PredictiveUtilityParams

Bases: BaseMetricParams, SaasPredictiveUtilityParams

Runs a predictor on synthetic and real data and measures how close the synthetic score is to the real score.

A built-in model using the selected modelling technique (see predictor_type) is trained on real data. Typical performance metrics for the selected modelling technique are used to measure its performance. This process is then repeated using synthetic data. The overall score for a given performance metric is a measure of how close the synthetic score is to the real score. To avoid overfitting, the data being used is always split into a training set and a test set; in both cases, the test set always consists of real data. Note: For regression techniques, a lower performance result is better.

Examples

from hazy_configurator import PredictiveUtilityParams, ClassifierType

pred_utility = PredictiveUtilityParams(
    table="customer_marketing",
    label_columns=["segment"],
    predictor_type=ClassifierType.LGBM
)
Fields:
field table: str [Required]

A table must be provided.

field predictor_type: Union[ClassifierType, RegressorType] [Required]

Predictor model.

field label_columns: List[str] [Required]

Columns that will be predicted.

field optimise_predictors: bool = True

When enabled, runs hyper-parameter optimisation for each selected predictor. Due to high numbers of hyper-parameters, this feature for lgbm_classifier and lgbm_regressor can considerably increase training times, particularly when classifying columns of high cardinality. In such cases, it is advisable to consider disabling this feature.

field augment: bool = False

When enabled, combines real and synthetic data for predictive utility, as opposed to using only synthetic data.

field isolate_targets: bool = False

When enabled, ensures all target columns are never used as predictive features during predictive utility.

field predictor_max_rows: Optional[int] = None

Maximum number of rows to use during predictive utility.

Constraints:
  • minimum = 1

class hazy_configurator.general_params.metrics.presence_disclosure_params.PresenceDisclosureParams

Bases: BaseMetricParams, SaasPresenceDisclosureParams

Measures the certainty with which an adversary could infer whether an arbitrary data point was present in the real data used to train the synthetic data generator.

Assume that a hypothetical adversary has access to the full synthetic dataset and a subset of real data points, each of which may belong to either the train set or the test set. The Hamming distance between each real data point and every synthetic data point is calculated; if that distance is below a certain threshold, the adversary concludes that the real data point belongs to the train set rather than the test set. The Presence Disclosure metric score is calculated by averaging over multiple threshold settings.

Sampling a fraction of the synthetic data is allowed due to performance needs.
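
The attack described above can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, the minimum Hamming distance to any synthetic record is used for the decision, and the per-threshold quantity being averaged is taken to be the adversary's accuracy, which the source does not specify.

```python
def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def guessed_in_train(record, synth_rows, threshold):
    """Adversary's guess: 'train' if some synthetic row is close enough."""
    return min(hamming(record, s) for s in synth_rows) < threshold


def attack_accuracy(train_rows, test_rows, synth_rows, thresholds):
    """Adversary accuracy averaged over thresholds (assumed scoring)."""
    labelled = [(r, True) for r in train_rows] + [(r, False) for r in test_rows]
    per_threshold = []
    for t in thresholds:
        correct = sum(
            guessed_in_train(r, synth_rows, t) == is_train
            for r, is_train in labelled
        )
        per_threshold.append(correct / len(labelled))
    return sum(per_threshold) / len(per_threshold)


synth = [("a", "x", 1), ("b", "y", 2)]
train = [("a", "x", 1)]   # reproduced exactly in the synthetic data
test = [("c", "z", 3)]    # differs from every synthetic row in all fields
print(attack_accuracy(train, test, synth, thresholds=[1, 2]))
```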

Examples

from hazy_configurator import PresenceDisclosureParams

# run on "customer_marketing" table only and iterate 3 times over samples of 20% of synthetic data, using 500 source records
PresenceDisclosureParams(table="customer_marketing", n_records=500, synth_magnitude=0.2, iterations=3)

# run on all tables iterating 3 times over samples of 20% of synthetic data, using 500 source records
PresenceDisclosureParams(n_records=500, synth_magnitude=0.2, iterations=3)
Fields:
field n_records: int = 1000

Number of records of source data to be sampled when scoring the Presence Disclosure metric.

field synth_magnitude: float = 1

Fraction of the synthetic data to be sampled when scoring the Presence Disclosure metric.

Constraints:
  • exclusiveMinimum = 0

  • maximum = 1

field iterations: int = 1

Number of iterations when scoring Presence Disclosure metric using only a sample of the synthetic data.

Constraints:
  • minimum = 1

field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.query_utility_params.QueryUtilityParams

Bases: BaseMetricParams

Measures the average overlap of the joint distribution of values across three or more columns.

It calculates the frequency of occurrences when applying high dimensional queries to both the real and synthetic datasets. Similarity between the real and synthetic results is then calculated using the Jaccard distance (min/max). The average is computed over n iterations, with each iteration randomly selecting the number of columns to be included in the joint distribution.
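
A minimal sketch of the procedure, assuming rows are dictionaries keyed by column name and that each "query" is the relative frequency of one combination of values. The helper names, the default `n_iterations`, and the seeding are illustrative assumptions.

```python
import random
from collections import Counter


def joint_frequencies(rows, columns):
    """Relative frequency of each value combination over `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {k: v / len(rows) for k, v in counts.items()}


def min_max_similarity(p, q):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(p) | set(q)
    low = sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    high = sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
    return low / high


def query_utility(real_rows, synth_rows, n_iterations=10, seed=0):
    """Average similarity over randomly chosen sets of three or more columns."""
    rng = random.Random(seed)
    columns = sorted(real_rows[0])
    scores = []
    for _ in range(n_iterations):
        k = rng.randint(min(3, len(columns)), len(columns))
        chosen = rng.sample(columns, k)
        scores.append(
            min_max_similarity(
                joint_frequencies(real_rows, chosen),
                joint_frequencies(synth_rows, chosen),
            )
        )
    return sum(scores) / len(scores)


real = [{"a": 1, "b": "x", "c": True}, {"a": 2, "b": "y", "c": False}]
print(query_utility(real, real))
```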

Examples

from hazy_configurator import QueryUtilityParams

# run on "customer marketing" table only
QueryUtilityParams(table="customer_marketing")

# run on all tables
QueryUtilityParams()
Fields:
field table: Optional[str] = None

If None is provided the metric will run on all tables.

class hazy_configurator.general_params.metrics.sequential_discriminator_params.SequentialDiscriminatorParams

Bases: BaseMetricParams

Captures whether a sequential classifier is able to distinguish between real and synthetic data. The harder it is to distinguish

Examples

from hazy_configurator import SequentialDiscriminatorParams

# run on "transactions" table only; seq_id is a required field
SequentialDiscriminatorParams(table="transactions", seq_id="account_id")

# run on all tables
SequentialDiscriminatorParams(seq_id="account_id")
Fields:
field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field sort_by: List[str] = None

List of columns used to order sequences.

Constraints:
  • minItems = 1

field seq_keys: Optional[List[str]] = None

Columns identified as part of the sequential records.

field table: Optional[str] = None

If None is provided the metric will run on all tables.

field n_bins: Optional[int] = 100

Number of bins to use when creating the histogram.

class hazy_configurator.general_params.metrics.sequential_similarity_params.SequentialSimilarityParams

Bases: BaseMetricParams

Applies a subset of the catch22: CAnonical Time-series CHaracteristics properties and compares real to synthetic.

It is designed to work on fixed frequency sequential data.

Examples

from hazy_configurator import SequentialSimilarityParams

# run on "transactions" table
# where seq_id is the sequential id that will be used to aggregate the data
# and sort_by is a list of columns that has the sequence order (such as date and time columns)
SequentialSimilarityParams(
    seq_id="account_id",
    sort_by=["date", "time"],
    table="transactions",
)
Fields:
field seq_id: str [Required]

ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.

field sort_by: List[str] = None

List of columns used to order sequences.

Constraints:
  • minItems = 1

field table: Optional[str] = None

If None is provided the metric will run on all tables.

field assume_fixed_freq: bool = False

Whether to assume that the sequential data has a fixed frequency.