Evaluation
Evaluation Configuration
- class hazy_configurator.general_params.evaluation_config.EvaluationConfig¶
Bases:
HazyBaseModel
Defines how to run evaluation at the end of a training job.
Examples
from hazy_configurator import (
    EvaluationConfig,
    HistogramSimilarityParams,
    MutualInformationSimilarityParams,
    CrossTableMutualInformationSimilarityParams,
    DegreeDistributionSimilarityParams,
    PresenceDisclosureParams,
    DensityDisclosureParams,
    EvalSampleParams,
)

eval_config = EvaluationConfig(
    metrics=[
        HistogramSimilarityParams(),
        MutualInformationSimilarityParams(),
        CrossTableMutualInformationSimilarityParams(),
        DegreeDistributionSimilarityParams(),
        PresenceDisclosureParams(table="customer_marketing"),
        DensityDisclosureParams(table="customer_marketing"),
    ],
    # generate 20% as many rows as the training data for evaluation
    eval_sample_params=EvalSampleParams(magnitude=0.2),
)
{ ... }
- Fields:
- field metrics: Optional[List[MetricParamsUnion]] = [HistogramSimilarityParams(metric_type=<MetricType.HISTOGRAM_SIMILARITY: 'histogram_similarity'>, table=None), MutualInformationSimilarityParams(metric_type=<MetricType.MUTUAL_INFORMATION_SIMILARITY: 'mutual_information_similarity'>, table=None), CrossTableMutualInformationSimilarityParams(metric_type=<MetricType.CROSS_TABLE_MUTUAL_INFORMATION_SIMILARITY: 'cross_table_mutual_information_similarity'>, table=None), DegreeDistributionSimilarityParams(metric_type=<MetricType.DEGREE_DISTRIBUTION_SIMILARITY: 'degree_distribution_similarity'>, table=None)]¶
A list of metrics and their parameters to run. See Metrics. By default histogram similarity, mutual information, cross table mutual information and degree distribution similarity are run.
- field eval_sample_params: EvalSampleParams = EvalSampleParams(magnitude=1.0)¶
These parameters describe how to generate the data for evaluation.
- class hazy_configurator.general_params.sample_generation_config.EvalSampleParams¶
Bases: SaasEvalSampleParams, BaseSampleParams
Evaluation sample parameters.
These define how data should be generated for evaluation.
- Fields:
- field magnitude: float = 1.0¶
Amount of synthetic data generated as a proportion of the number of rows in the training data. That is, the number of rows after any optional subsampling and train-test-splitting, as specified by the user, has been performed. For example, a value of 1.0 will generate as much data as the training data for every table. A value of 2.0 will generate twice as many rows.
- Constraints:
exclusiveMinimum = 0
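As a rough illustration of how magnitude scales the evaluation sample, the sketch below computes a row count for one table. The helper name eval_rows and the rounding rule are assumptions made for illustration; they are not part of hazy_configurator.

```python
def eval_rows(train_rows: int, magnitude: float = 1.0) -> int:
    """Rows of synthetic data to generate for one table (hypothetical helper)."""
    # Mirrors the documented constraint: exclusiveMinimum = 0.
    if magnitude <= 0:
        raise ValueError("magnitude must be greater than 0")
    # Rounding to the nearest whole row is an assumption of this sketch.
    return round(train_rows * magnitude)


print(eval_rows(10_000, 1.0))  # as many rows as the training data
print(eval_rows(10_000, 2.0))  # twice as many rows
print(eval_rows(10_000, 0.2))  # a fifth of the training rows
```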
Metrics
Classes:
AggSeqDensityDisclosureParams | Aggregates sequential data and then performs density disclosure on the processed data.
AggSeqHistogramSimilarityParams | Aggregates sequential data and then performs histogram similarity on the processed data.
AggSeqMutualInfoParams | Aggregates sequential data and then performs mutual information similarity on the processed data.
AggSeqPresenceDisclosureParams | Aggregates sequential data and then performs presence disclosure on the processed data.
AggSeqQueryUtilityParams | Aggregates sequential data and then performs query utility on the processed data.
CrossTableMutualInformationSimilarityParams | Pairwise mutual information between pairs of columns from connected tables.
DegreeDistributionSimilarityParams | Measures the similarity in the distribution of the number of connections in one-to-many and many-to-many relationships across real and synthetic data.
DensityDisclosureParams | Estimates the risk of an adversary constructing a mapping from the synthetic data points to the real data points (this concept can also be referred to as “reversibility”).
DistanceClosestRecordParams | Captures whether the synthetic data contains records that are simple copies or minor perturbations of the train data records.
HistogramSimilarityParams | Measures the similarity of the marginal distributions in the real and synthetic data.
IndexMetricParams | Measures the preservation of points of interest in the synthetic data based on a market research approach to identify those.
MutualInformationSimilarityParams | Measures the similarity of the real and synthetic data from an Information Theory point of view.
PredictiveUtilityParams | Runs a predictor on synthetic and real data and measures how close the synthetic score is to the real score.
PresenceDisclosureParams | Measures the certainty with which an adversary could infer whether an arbitrary data point was present in the real data used to train the synthetic data generator.
QueryUtilityParams | Measures the average overlap of the joint distribution of values across three or more columns.
SequentialDiscriminatorParams | Captures whether a sequential classifier is able to distinguish between real and synthetic data.
SequentialSimilarityParams | Applies a subset of the catch22: CAnonical Time-series CHaracteristics properties and compares real to synthetic.
- class hazy_configurator.general_params.metrics.agg_seq_density_disclosure_params.AggSeqDensityDisclosureParams¶
Bases: AggregatedSequentialParams, SaasAggSeqDensityDisclosureParams
Aggregates sequential data and then performs density disclosure on the processed data.
Examples
from hazy_configurator import AggSeqDensityDisclosureParams

# run on "transactions" table only
AggSeqDensityDisclosureParams(table="transactions", seq_id="account_id")
{ "metric_type": "aggregated_sequential_density_disclosure", "table": "transactions", "seq_id": "account_id" }
- Fields:
- field sample_records: int = 10000¶
Number of records of synthetic data to be sampled when scoring the Presence Disclosure metric.
- field seq_id: str [Required]¶
ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.
- field agg_functions: List[AggFunctionUnion] = ['median']¶
List of functions to use to aggregate each column.
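The aggregation step shared by the aggregated sequential metrics can be pictured as a group-by on seq_id followed by a per-column aggregate. The sketch below is a minimal plain-Python illustration using the default median; the helper name aggregate_sequences and the row layout are assumptions, and the library's internal implementation may differ.

```python
from collections import defaultdict
from statistics import median


def aggregate_sequences(rows, seq_id, agg=median):
    """Collapse each sequence to one row by aggregating its numeric columns."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, value in row.items():
            if column != seq_id:
                grouped[row[seq_id]][column].append(value)
    return {
        key: {column: agg(values) for column, values in columns.items()}
        for key, columns in grouped.items()
    }


transactions = [
    {"account_id": "a", "amount": 10.0},
    {"account_id": "a", "amount": 30.0},
    {"account_id": "a", "amount": 20.0},
    {"account_id": "b", "amount": 5.0},
]
aggregated = aggregate_sequences(transactions, seq_id="account_id")
print(aggregated)
```

The resulting one-row-per-sequence table is what the downstream metric (density disclosure, histogram similarity, and so on) would then be run on.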
- class hazy_configurator.general_params.metrics.agg_seq_histogram_similarity_params.AggSeqHistogramSimilarityParams¶
Bases:
AggregatedSequentialParams
Aggregates sequential data and then performs histogram similarity on the processed data.
Examples
from hazy_configurator import AggSeqHistogramSimilarityParams

# run on "transactions" table only
AggSeqHistogramSimilarityParams(table="transactions", seq_id="account_id")
{ "metric_type": "aggregated_sequential_histogram_similarity", "table": "transactions", "seq_id": "account_id" }
- Fields:
- field seq_id: str [Required]¶
ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.
- field agg_functions: List[AggFunctionUnion] = ['median']¶
List of functions to use to aggregate each column.
- class hazy_configurator.general_params.metrics.agg_seq_mutual_info_params.AggSeqMutualInfoParams¶
Bases:
AggregatedSequentialParams
Aggregates sequential data and then performs mutual information similarity on the processed data.
Examples
from hazy_configurator import AggSeqMutualInfoParams

# run on "transactions" table only
AggSeqMutualInfoParams(table="transactions", seq_id="account_id")
{ "metric_type": "aggregated_sequential_mutual_information", "table": "transactions", "seq_id": "account_id" }
- Fields:
- field seq_id: str [Required]¶
ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.
- field agg_functions: List[AggFunctionUnion] = ['median']¶
List of functions to use to aggregate each column.
- class hazy_configurator.general_params.metrics.agg_seq_presence_disclosure_params.AggSeqPresenceDisclosureParams¶
Bases:
AggregatedSequentialParams
Aggregates sequential data and then performs presence disclosure on the processed data.
Examples
from hazy_configurator import AggSeqPresenceDisclosureParams

# run on "transactions" table only
AggSeqPresenceDisclosureParams(table="transactions", seq_id="account_id")
{ "metric_type": "aggregated_sequential_presence_disclosure", "table": "transactions", "seq_id": "account_id" }
- Fields:
- field seq_id: str [Required]¶
ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.
- field agg_functions: List[AggFunctionUnion] = ['median']¶
List of functions to use to aggregate each column.
- class hazy_configurator.general_params.metrics.agg_seq_query_utility_params.AggSeqQueryUtilityParams¶
Bases:
AggregatedSequentialParams
Aggregates sequential data and then performs query utility on the processed data.
Examples
from hazy_configurator import AggSeqQueryUtilityParams

# run on "transactions" table only
AggSeqQueryUtilityParams(table="transactions", seq_id="account_id")
{ "metric_type": "aggregated_sequential_query_utility", "table": "transactions", "seq_id": "account_id" }
- Fields:
- field seq_id: str [Required]¶
ID column defining rows which belong to the same sequence, e.g. account_id in a transactions table.
- field agg_functions: List[AggFunctionUnion] = ['median']¶
List of functions to use to aggregate each column.
- class hazy_configurator.general_params.metrics.cross_table_mutual_information_similarity_params.CrossTableMutualInformationSimilarityParams¶
Bases:
BaseMetricParams
Pairwise mutual information between pairs of columns from connected tables.
This metric is only applicable in a multi-table context and measures how well models capture the relations between tables. It is calculated in the same way as the Mutual Information Similarity metric, with the additional constraint that the two columns in each pair must come from different tables.
Examples
from hazy_configurator import CrossTableMutualInformationSimilarityParams

# cannot select a table with this metric since it is cross-table
CrossTableMutualInformationSimilarityParams()
{ "metric_type": "cross_table_mutual_information_similarity" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.degree_distribution_similarity_params.DegreeDistributionSimilarityParams¶
Bases:
BaseMetricParams
Measures the similarity in the distribution of the number of connections in one-to-many and many-to-many relationships across real and synthetic data.
It produces a histogram for either side of the relationship, i.e. A->B and B->A.
Examples
from hazy_configurator import DegreeDistributionSimilarityParams

# cannot select a table with this metric since it is cross-table
DegreeDistributionSimilarityParams()
{ "metric_type": "degree_distribution_similarity" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.density_disclosure_params.DensityDisclosureParams¶
Bases: BaseMetricParams, SaasDensityDisclosureParams
Estimates the risk of an adversary constructing a mapping from the synthetic data points to the real data points (this concept can also be referred to as “reversibility”).
This estimation is done by counting how many real data points exist in the neighbourhood of each synthetic data point. If there are no real points, or if there are many real points, in the neighbourhood, then an adversary cannot construct an unambiguous map from the synthetic data point to a real data point. In the first case, there is no real data point to map to; in the second case, the many alternatives make any attempted mapping ambiguous at best and preserve plausible deniability.
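The neighbourhood counting described above can be sketched for numeric records as follows. The Euclidean distance, the fixed radius and the choice to flag synthetic points with exactly one real neighbour are all assumptions of this illustration; the text does not specify how the library defines the neighbourhood or turns the counts into a score.

```python
from math import dist


def neighbour_counts(real, synth, radius):
    """For each synthetic point, count the real points within `radius` of it."""
    return [sum(dist(s, r) <= radius for r in real) for s in synth]


def unambiguous_fraction(real, synth, radius):
    """Share of synthetic points with exactly one real neighbour (assumed risk case)."""
    counts = neighbour_counts(real, synth, radius)
    return sum(c == 1 for c in counts) / len(counts)


real = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
synth = [(0.05, 0.0), (5.0, 5.1), (9.0, 9.0)]

counts = neighbour_counts(real, synth, radius=0.5)
print(counts)  # many neighbours, a single neighbour, no neighbours
print(unambiguous_fraction(real, synth, radius=0.5))
```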
Examples
from hazy_configurator import DensityDisclosureParams

# run on "customer_marketing" table only
DensityDisclosureParams(table="customer_marketing")

# run on all tables
DensityDisclosureParams()
{ "metric_type": "density_disclosure", "table": "customer_marketing" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.distance_closest_record_params.DistanceClosestRecordParams¶
Bases: BaseMetricParams, SaasDistanceClosestRecordParams
Captures whether the synthetic data contains records that are simple copies or minor perturbations of the train data records.
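A minimal sketch of the idea for numeric records: compute, for every synthetic record, the distance to its nearest train record, then look for zeros (copies) or very small values (minor perturbations). The Euclidean distance and the helper name are assumptions; the library's distance measure is not stated here.

```python
from math import dist


def distance_to_closest_record(train, synth):
    """Distance from each synthetic record to its nearest train record."""
    return [min(dist(s, t) for t in train) for s in synth]


train = [(1.0, 2.0), (3.0, 4.0)]
synth = [(1.0, 2.0), (3.0, 4.5), (10.0, 10.0)]

distances = distance_to_closest_record(train, synth)
exact_copies = sum(d == 0 for d in distances)
print(distances)
print(exact_copies)
```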
Examples
from hazy_configurator import DistanceClosestRecordParams

# run on "customer_marketing" table only
DistanceClosestRecordParams(table="customer_marketing")

# run on all tables
DistanceClosestRecordParams()
{ "metric_type": "distance_closest_record", "table": "customer_marketing" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.histogram_similarity_params.HistogramSimilarityParams¶
Bases:
BaseMetricParams
Measures the similarity of the marginal distributions in the real and synthetic data.
Every column in both datasets is binned and the overlap of the resulting histograms is calculated. The final score is calculated by averaging the overlap across all columns.
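The binning and overlap steps can be sketched for numeric columns as below. Equal-width bins over the combined range and the helper names are assumptions of this illustration; the library's binning strategy and its handling of categorical columns are not described here.

```python
def bin_frequencies(values, lo, hi, bins):
    """Share of values falling in each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # constant column: everything lands in bin 0
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]


def column_overlap(real, synth, bins=10):
    """Histogram overlap of one column: sum of the bin-wise minimum frequencies."""
    lo, hi = min(real + synth), max(real + synth)
    p = bin_frequencies(real, lo, hi, bins)
    q = bin_frequencies(synth, lo, hi, bins)
    return sum(min(a, b) for a, b in zip(p, q))


def histogram_similarity(real_table, synth_table, bins=10):
    """Average the per-column overlap across all columns."""
    overlaps = [column_overlap(real_table[c], synth_table[c], bins) for c in real_table]
    return sum(overlaps) / len(overlaps)


real = {"age": [0, 1, 2, 3], "visits": [1, 1, 2, 2]}
synth = {"age": [0, 0, 3, 3], "visits": [1, 1, 2, 2]}
score = histogram_similarity(real, synth, bins=4)
print(score)
```

In this toy data the "visits" column matches exactly while "age" only half overlaps, so the averaged score sits between the two.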
Examples
from hazy_configurator import HistogramSimilarityParams

# run on "customer_marketing" table only
HistogramSimilarityParams(table="customer_marketing")

# run on all tables
HistogramSimilarityParams()
{ "metric_type": "histogram_similarity", "table": "customer_marketing" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.index_metric_params.IndexMetricParams¶
Bases:
BaseMetricParams
Measures the preservation of points of interest in the synthetic data based on a market research approach to identify those.
Indexing is a market research approach used to highlight points of interest in a dataset that can drive advertising campaigns. This metric measures how well these points of interest are preserved in the synthetic data. A filter is used to group the data into separate subpopulations. The probability of the subpopulation having a particular attribute is calculated and then divided by the probability of the total population also having that attribute. The similarity between the real and synthetic data is then calculated using the element-wise Jaccard distance (min/max).
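The index calculation and the min/max comparison can be sketched as follows for a single filter column and a single attribute value. The helper names, the row layout and the averaging across subpopulations are assumptions of this illustration, and it presumes the attribute value occurs at least once in each dataset.

```python
def index_values(rows, filter_column, attribute_column, attribute_value):
    """Index per subpopulation: P(attribute | group) / P(attribute | everyone)."""
    overall = sum(r[attribute_column] == attribute_value for r in rows) / len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[filter_column], []).append(r)
    return {
        group: (sum(r[attribute_column] == attribute_value for r in members) / len(members)) / overall
        for group, members in groups.items()
    }


def index_similarity(real_index, synth_index):
    """Element-wise min/max ratio, averaged over the real subpopulations."""
    scores = []
    for group in real_index:
        lo, hi = sorted((real_index[group], synth_index.get(group, 0.0)))
        scores.append(1.0 if hi == 0 else lo / hi)
    return sum(scores) / len(scores)


real = [
    {"currency_symbol": "EUR", "type": "credit"},
    {"currency_symbol": "EUR", "type": "credit"},
    {"currency_symbol": "USD", "type": "credit"},
    {"currency_symbol": "USD", "type": "debit"},
]
synth = [
    {"currency_symbol": "EUR", "type": "credit"},
    {"currency_symbol": "EUR", "type": "debit"},
    {"currency_symbol": "USD", "type": "credit"},
    {"currency_symbol": "USD", "type": "debit"},
]

real_index = index_values(real, "currency_symbol", "type", "credit")
synth_index = index_values(synth, "currency_symbol", "type", "credit")
similarity = index_similarity(real_index, synth_index)
print(real_index, synth_index, similarity)
```

An index above 1 means the subpopulation is over-represented for that attribute relative to the whole population; the synthetic data here flattens that signal, which lowers the similarity.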
Examples
# IndexMetricFilter and IndexMetricAttribute are assumed to be importable
# from the top-level package, like the other parameter classes
from hazy_configurator import (
    IndexMetricParams,
    IndexMetricFilter,
    IndexMetricAttribute,
)

# run on "transactions" table,
# filtering the data by the column "currency_symbol", which is not binary,
# and focusing on an attribute named "categorical", to which the columns
# "type" and "operation" are related (both non-binary)
# table, filters and attributes are required arguments
IndexMetricParams(
    table="transactions",
    filters=IndexMetricFilter(columns=["currency_symbol"], binary_flag=False),
    attributes=[
        IndexMetricAttribute(
            feature_group_name="categorical",
            columns=["type", "operation"],
            binary_flag=False,
        )
    ],
)
{ "metric_type": "index_metric", "table": "transactions", "filters": { "columns": ["currency_symbol"], "binary_flag": false }, "attributes": [ { "feature_group_name": "categorical", "columns": ["type", "operation"], "binary_flag": false } ] }
- Fields:
- field filters: IndexMetricFilter [Required]¶
The filter used to group the data into subpopulations.
- field attributes: List[IndexMetricAttribute] [Required]¶
These are the attributes of a subpopulation that are being indexed against the total population. For example, to assess a particular group's sentiment towards a brand of product, the column associated with sentiment towards that brand would be provided as the attribute.
- class hazy_configurator.general_params.metrics.mutual_information_similarity_params.MutualInformationSimilarityParams¶
Bases:
BaseMetricParams
Measures the similarity of the real and synthetic data from an Information Theory point of view.
First, the column-wise normalised mutual information is calculated for both the real and synthetic datasets. Then, the similarity of these matrices is calculated by taking the average of all off-diagonal elements (since the normalised diagonal elements are 1 in both matrices) using Jaccard distance (min/max).
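A plain-Python sketch of these two steps for discrete columns is shown below. Normalising by the geometric mean of the two column entropies is an assumption (the text does not say which normalisation is used), as are the helper names. Because the matrix is symmetric, averaging over the upper triangle equals averaging over all off-diagonal elements.

```python
from collections import Counter
from math import log, sqrt


def entropy(values):
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def normalised_mi(xs, ys):
    """Mutual information scaled to [0, 1] (geometric-mean normalisation assumed)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    mi = max(0.0, hx + hy - entropy(list(zip(xs, ys))))  # clamp rounding noise
    return mi / sqrt(hx * hy)


def mi_similarity(real, synth):
    """Average min/max ratio of the off-diagonal normalised MI values."""
    columns = sorted(real)
    scores = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r = normalised_mi(real[a], real[b])
            s = normalised_mi(synth[a], synth[b])
            scores.append(1.0 if max(r, s) == 0 else min(r, s) / max(r, s))
    return sum(scores) / len(scores)


real = {"segment": ["a", "a", "b", "b"], "region": ["n", "n", "s", "s"], "flag": [0, 1, 0, 1]}
print(normalised_mi(real["segment"], real["region"]))  # fully dependent columns
print(normalised_mi(real["segment"], real["flag"]))    # independent columns
print(mi_similarity(real, real))                       # identical datasets
```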
Examples
from hazy_configurator import MutualInformationSimilarityParams

# run on "customer_marketing" table only
MutualInformationSimilarityParams(table="customer_marketing")

# run on all tables
MutualInformationSimilarityParams()
{ "metric_type": "mutual_information_similarity", "table": "customer_marketing" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.predictive_utility_params.PredictiveUtilityParams¶
Bases: BaseMetricParams, SaasPredictiveUtilityParams
Runs a predictor on synthetic and real data and measures how close the synthetic score is to the real score.
A built-in model using the selected modelling technique (see predictor_type) is trained on real data. Typical performance metrics for the selected modelling technique are used to measure its performance. This process is then repeated using synthetic data. The overall score for a given performance metric is a measure of how close the synthetic score is to the real score. To avoid overfitting, the data is always split into a training set and a test set; the test set always consists of real data, even when the model is trained on synthetic data. Note: For regression techniques, a lower performance result is better.
Examples
from hazy_configurator import PredictiveUtilityParams, ClassifierType

pred_utility = PredictiveUtilityParams(
    table="customer_marketing",
    label_columns=["segment"],
    predictor_type=ClassifierType.LGBM,
)
{ "metric_type": "predictive_utility", "table": "customer_marketing", "label_columns": ["segment"], "predictor_type": "lgbm_classifier" }
- Fields:
- field predictor_type: Union[ClassifierType, RegressorType] [Required]¶
Predictor model.
- field optimise_predictors: bool = True¶
When enabled, runs hyper-parameter optimisation for each selected predictor. Due to high numbers of hyper-parameters, this feature for lgbm_classifier and lgbm_regressor can considerably increase training times, particularly when classifying columns of high cardinality. In such cases, it is advisable to consider disabling this feature.
- field augment: bool = False¶
When enabled, combines real and synthetic data for predictive utility, as opposed to using only synthetic data.
- class hazy_configurator.general_params.metrics.presence_disclosure_params.PresenceDisclosureParams¶
Bases: BaseMetricParams, SaasPresenceDisclosureParams
Measures the certainty with which an adversary could infer whether an arbitrary data point was present in the real data used to train the synthetic data generator.
Assume that a hypothetical adversary has access to the full synthetic dataset and a subset of real data points that can belong to the train set or the test set. The Hamming distance between each real data point and every synthetic data point is calculated, and if that distance is below a certain threshold, the adversary concludes that the real data point belongs to the train set and not to the test set. The Presence Disclosure metric score is calculated by averaging over multiple threshold settings.
Sampling a fraction of the synthetic data is allowed due to performance needs.
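The attack described above can be sketched on categorical records as follows. Scoring the adversary by plain accuracy and averaging that accuracy over the thresholds are assumptions of this illustration; the text does not state how the library converts the adversary's guesses into the final score.

```python
def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def attack_accuracy(train, test, synth, threshold):
    """Adversary guesses 'train' when some synthetic record is closer than threshold."""
    def guess_train(record):
        return min(hamming(record, s) for s in synth) < threshold

    correct = sum(guess_train(r) for r in train) + sum(not guess_train(r) for r in test)
    return correct / (len(train) + len(test))


def presence_disclosure(train, test, synth, thresholds):
    """Average the adversary's accuracy over several threshold settings."""
    return sum(attack_accuracy(train, test, synth, t) for t in thresholds) / len(thresholds)


train = [("a", "x", "1"), ("b", "y", "2")]
test = [("c", "z", "3"), ("a", "y", "3")]
synth = [("a", "x", "1"), ("b", "y", "3")]

score = presence_disclosure(train, test, synth, thresholds=[1, 2, 3])
print(score)
```

An accuracy close to 0.5 on a balanced train/test mix means the adversary does little better than guessing; values well above that indicate the synthetic data leaks membership information.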
Examples
from hazy_configurator import PresenceDisclosureParams

# run on "customer_marketing" table only and iterate 3 times over samples
# of 20% of synthetic data, using 500 source records
PresenceDisclosureParams(table="customer_marketing", n_records=500, synth_magnitude=0.2, iterations=3)

# run on all tables, iterating 3 times over samples of 20% of synthetic data,
# using 500 source records
PresenceDisclosureParams(n_records=500, synth_magnitude=0.2, iterations=3)
{ "metric_type": "presence_disclosure", "table": "customer_marketing", "n_records": 500, "synth_magnitude": 0.2, "iterations": 3 }
- Fields:
metric_type (Literal[
- field n_records: int = 1000¶
Number of records of source data to be sampled when scoring the Presence Disclosure metric.
- field synth_magnitude: float = 1¶
Fraction of the synthetic data to be sampled when scoring the Presence Disclosure metric.
- Constraints:
exclusiveMinimum = 0
maximum = 1
- class hazy_configurator.general_params.metrics.query_utility_params.QueryUtilityParams¶
Bases:
BaseMetricParams
Measures the average overlap of the joint distribution of values across three or more columns.
It calculates the frequency of occurrences when applying high dimensional queries to both the real and synthetic datasets. Similarity between the real and synthetic results is then calculated using the Jaccard distance (min/max). The average is computed over n iterations, with each iteration randomly selecting the number of columns to be included in the joint distribution.
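A minimal sketch of this procedure on categorical rows is shown below. Treating each distinct combination of values as one query, drawing between three and all columns per iteration, and the helper names are assumptions of this illustration; the library's query construction may differ.

```python
import random
from collections import Counter


def joint_frequencies(rows, columns):
    """Relative frequency of each combination of values in the chosen columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: count / len(rows) for key, count in counts.items()}


def query_utility(real, synth, iterations=10, seed=0):
    rng = random.Random(seed)
    columns = sorted(real[0])
    scores = []
    for _ in range(iterations):
        # Randomly pick how many columns (three or more) and which ones.
        chosen = rng.sample(columns, rng.randint(3, len(columns)))
        p = joint_frequencies(real, chosen)
        q = joint_frequencies(synth, chosen)
        keys = set(p) | set(q)
        overlap = sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
        union = sum(max(p.get(k, 0.0), q.get(k, 0.0)) for k in keys)
        scores.append(overlap / union)
    return sum(scores) / len(scores)


real = [
    {"segment": "a", "region": "n", "channel": "web", "plan": "free"},
    {"segment": "a", "region": "s", "channel": "app", "plan": "paid"},
    {"segment": "b", "region": "n", "channel": "web", "plan": "paid"},
]
shifted = [{k: v + "_other" for k, v in row.items()} for row in real]

print(query_utility(real, real))     # identical data
print(query_utility(real, shifted))  # no shared value combinations
```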
Examples
from hazy_configurator import QueryUtilityParams

# run on "customer_marketing" table only
QueryUtilityParams(table="customer_marketing")

# run on all tables
QueryUtilityParams()
{ "metric_type": "query_utility", "table": "customer_marketing" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.sequential_discriminator_params.SequentialDiscriminatorParams¶
Bases:
BaseMetricParams
Captures whether a sequential classifier is able to distinguish between real and synthetic data. The harder it is to distinguish
Examples
from hazy_configurator import SequentialDiscriminatorParams

# run on "customer_marketing" table only
SequentialDiscriminatorParams(table="customer_marketing")

# run on all tables
SequentialDiscriminatorParams()
{ "metric_type": "seq_discriminator", "table": "customer_marketing" }
- Fields:
metric_type (Literal[
- class hazy_configurator.general_params.metrics.sequential_similarity_params.SequentialSimilarityParams¶
Bases:
BaseMetricParams
Applies a subset of the catch22: CAnonical Time-series CHaracteristics properties and compares real to synthetic.
It is designed to work on fixed frequency sequential data.
Examples
from hazy_configurator import SequentialSimilarityParams

# run on "transactions" table
# where seq_id is the sequential id that will be used to aggregate the data
# and sort_by is a list of columns that define the sequence order
# (such as date and time columns)
SequentialSimilarityParams(
    seq_id="account_id",
    sort_by=["date", "time"],
    table="transactions",
)
{ "metric_type": "sequential_similarity", "table": "transactions", "seq_id": "account_id", "sort_by": ["date", "time"] }
- Fields:
metric_type (Literal[