Utility

Utility measures the performance of a downstream model or algorithm when run on the source data versus the synthetic data.

Synthetic data will generally show a drop in utility, although occasionally it can increase utility by chance. The aim of smart synthetic data generation is to minimise this loss in utility.

Note: currently, Hazy only measures utility for predictive analytics use cases. For clustering and other unsupervised learning techniques, you can use similarity as a proxy for utility and/or perform your own bespoke evaluation.

Predictive performance

Hazy selects typical performance metrics for the selected modelling technique and uses built-in models to measure the performance of that technique when trained with the source data. The process is then repeated with the same technique trained on the synthetic data. Normally this involves splitting the data into a training set to train the model and a test set to validate it, in order to avoid overfitting; note that the test set always consists of original data.

The overall score or "match" for a given performance metric is a measure of how close the synthetic score is to the real score:

$$\text{Predictive Utility} = \frac{\text{Performance of model trained on synthetic data}}{\text{Performance of model trained on original data}}$$

These scores are plotted as overlapping squares, with the intersecting area indicating the "match" between the synthetic and real data sets.

The final predictive utility score is then an average of the "match" between the synthetic and real data set scores for each of the performance metrics.
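
As a rough illustration of this calculation (a minimal sketch, not Hazy's internal implementation), the following assumes two pandas DataFrames real and synthetic with identical, already-numeric columns and a categorical target column, and uses a scikit-learn classifier with macro F1 as the performance metric:

```python
# Minimal sketch of the predictive utility ratio (illustrative only).
# Assumes `real` and `synthetic` are pandas DataFrames with the same
# numeric feature columns and a categorical column named "target".
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def predictive_utility(real, synthetic, target="target"):
    # Hold out a test set from the *original* data; it is never used for training.
    train_real, test_real = train_test_split(real, test_size=0.2, random_state=0)
    X_test, y_test = test_real.drop(columns=[target]), test_real[target]

    scores = {}
    for name, train_df in [("real", train_real), ("synthetic", synthetic)]:
        model = RandomForestClassifier(random_state=0)
        model.fit(train_df.drop(columns=[target]), train_df[target])
        scores[name] = f1_score(y_test, model.predict(X_test), average="macro")

    # "Match" for this metric: performance of the model trained on synthetic
    # data divided by performance of the model trained on original data.
    return scores["synthetic"] / scores["real"]
```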

This metric requires both a target column and a machine learning model to be selected. The model can be a classifier or regressor from the selection given below:

  • naive_bayes_gauss or nb - Naive Bayes Classifier
  • logistic_regression or lgr - Logistic Regression
  • k_neighbors or knnc - K-Nearest Neighbours
  • linear_svm or lsvm - Linear Support Vector Machine
  • svm - Support Vector Machine
  • decision_tree or dtc - Decision Tree Classifier
  • random_forest or rfc - Random Forest Classifier
  • lgbm or lgbmc - Light GBM

The default model is lgbm (Light GBM). All models are optimised using a Bayesian optimisation method. Note that some of these methods are faster and more precise than others; Logistic Regression (lgr) and Decision Tree (decision_tree) are among the fastest to run. Note also that some models do not support feature importance.
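
For illustration only, the model keys above map naturally onto scikit-learn estimators (plus LightGBM's scikit-learn wrapper). This mapping is an assumption made for the sketch and may differ from Hazy's internal wiring:

```python
# Illustrative mapping from model keys to estimators (an assumption,
# not Hazy's actual implementation).
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

CLASSIFIERS = {
    "naive_bayes_gauss": GaussianNB,
    "logistic_regression": LogisticRegression,
    "k_neighbors": KNeighborsClassifier,
    "linear_svm": LinearSVC,
    "svm": SVC,
    "decision_tree": DecisionTreeClassifier,
    "random_forest": RandomForestClassifier,
    "lgbm": LGBMClassifier,  # the default
}
```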

A good predictive utility score is generally >= 90%.

Feature importance shift

This metric compares the order of feature importance in the model trained on the original data with that in the model trained on the synthetic data. Most machine learning algorithms can rank the variables in the data by how informative they are for a specific task.

Synthetic data of good quality should preserve the variables' order of importance. In the example below, the Hazy Hub shows the level of importance assigned by the algorithm and how accurately the synthetic data retains that order.

Feature importance shift represented by position, sorted by importance. Ideally the graph should look like a ladder - meaning that no feature has moved.
Feature importance shift represented as a histogram of the difference in importance between source and synthetic.
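
A minimal sketch of how such a rank shift could be computed (an illustrative assumption, not Hazy's exact method): train the same model on the real and synthetic data, rank features by importance in each, and take the per-feature difference in rank.

```python
# Sketch of a feature importance rank comparison (illustrative only).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def importance_rank_shift(real, synthetic, target="target"):
    ranks = {}
    for name, df in [("real", real), ("synthetic", synthetic)]:
        X, y = df.drop(columns=[target]), df[target]
        model = RandomForestClassifier(random_state=0)
        model.fit(X, y)
        # Rank features from most important (rank 1) downwards.
        importances = pd.Series(model.feature_importances_, index=X.columns)
        ranks[name] = importances.rank(ascending=False)
    # Positive shift: the feature is less important in the synthetic model.
    return (ranks["synthetic"] - ranks["real"]).sort_values()
```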

Confusion matrix shift

Confusion matrices illustrate how many true positives, false positives, true negatives and false negatives are made when predicting categorical target variables. Confusion matrix shift illustrates the difference between a confusion matrix produced by a predictive model trained on the synthetic data and that of a model trained on the real data. Darker squares are a sign of higher performance.
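
A minimal sketch of this comparison, assuming two already-trained classifiers and a held-out test set drawn from the original data (the exact normalisation Hazy applies is not documented here):

```python
# Sketch of a confusion matrix shift (illustrative only).
from sklearn.metrics import confusion_matrix


def confusion_matrix_shift(model_real, model_synth, X_test, y_test):
    cm_real = confusion_matrix(y_test, model_real.predict(X_test), normalize="all")
    cm_synth = confusion_matrix(y_test, model_synth.predict(X_test), normalize="all")
    # Element-wise difference between the two normalised matrices; values near
    # zero mean the synthetic model makes the same kinds of predictions and errors.
    return cm_synth - cm_real
```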

Agreement rate

Agreement rate is the frequency with which a predictive model trained on the synthetic data and a predictive model trained on the real data make the same prediction.
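
In sketch form (assuming two trained models and a test set drawn from the original data), this is simply the fraction of rows on which both models output the same prediction:

```python
# Sketch of the agreement rate (illustrative only).
import numpy as np


def agreement_rate(model_real, model_synth, X_test):
    return float(np.mean(model_real.predict(X_test) == model_synth.predict(X_test)))
```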

Query utility

Query utility shares many similarities with the bi-joint distribution similarity metric. It is the average overlap of the joint distribution of values across three or more columns. The average is computed over n iterations, with each iteration randomly selecting the number of columns to include in the joint distribution. In essence, query utility estimates the similarity between high-dimensional joint distributions through random queries applied to both the real and synthetic data sets.

A good query utility score is typically >= 60%.
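
A minimal sketch of this mechanism (an illustrative assumption about the procedure described above, not Hazy's exact implementation), treating all columns as categorical and measuring overlap as the sum of element-wise minima of the two empirical joint distributions:

```python
# Sketch of a query-utility style estimate (illustrative only).
import numpy as np


def query_utility(real, synthetic, n_iterations=50, min_cols=3, seed=0):
    rng = np.random.default_rng(seed)
    overlaps = []
    for _ in range(n_iterations):
        # Randomly pick how many columns (3 or more) to include in this query.
        k = rng.integers(min_cols, len(real.columns) + 1)
        cols = list(rng.choice(real.columns, size=k, replace=False))
        # Empirical joint distributions over the selected columns.
        p = real.value_counts(subset=cols, normalize=True)
        q = synthetic.value_counts(subset=cols, normalize=True)
        # Overlap of two discrete distributions: sum of element-wise minima.
        joined = p.to_frame("p").join(q.to_frame("q"), how="outer").fillna(0)
        overlaps.append(joined.min(axis=1).sum())
    return float(np.mean(overlaps))
```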