Utility measures the performance of a downstream model or algorithm when run on the source data versus the synthetic data.
Synthetic data will generally show a drop in utility, although it can occasionally increase utility by chance. The aim of smart synthetic data generation is to minimise the loss in utility.
Note: currently, Hazy only measures utility for predictive analytics use cases. For clustering and other unsupervised learning techniques, you can use similarity as a proxy for utility and/or perform your own bespoke evaluation.
Hazy selects typical performance metrics for the selected modelling technique and uses built-in models to measure the performance of that technique when trained on the source data. The process is then repeated with the same technique trained on the synthetic data. Normally this involves splitting the data into a training set to train the model and a test set to validate it, in order to avoid overfitting; note that the test set always consists of original data.
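As a rough sketch of this evaluation loop (assuming scikit-learn and LightGBM, a fully encoded classification dataset, and an illustrative target column and metric; this is not Hazy's internal code):

```python
# Minimal train-on-original vs train-on-synthetic evaluation sketch.
# Assumes numeric, fully encoded features and a binary target column.
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def utility_scores(real_df, synth_df, target="label"):
    # Hold out a test set from the *real* data only; both models are
    # validated against original records.
    train_real, test_real = train_test_split(real_df, test_size=0.2, random_state=0)
    X_test, y_test = test_real.drop(columns=[target]), test_real[target]

    def fit_and_score(train_df):
        X, y = train_df.drop(columns=[target]), train_df[target]
        model = LGBMClassifier().fit(X, y)
        return f1_score(y_test, model.predict(X_test))

    # Returns (score trained on real data, score trained on synthetic data).
    return fit_and_score(train_real), fit_and_score(synth_df)
```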
The overall score, or "match", for a given performance metric measures how close the synthetic score is to the real score.
These scores are plotted as overlapping squares, with the intersecting area indicating the "match" between the synthetic and real data sets.
The final predictive utility score is then the average of the "match" values across all of the performance metrics.
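Hazy's exact formula is not given here; purely as an illustration, one simple definition of a match is the relative closeness of the two scores, clipped to [0, 1], averaged across metrics:

```python
# Hypothetical "match" computation; the formula below is an assumption,
# not Hazy's documented implementation.
def match(real_score: float, synthetic_score: float) -> float:
    denom = max(abs(real_score), abs(synthetic_score))
    if denom == 0:
        return 1.0  # both scores are zero, so they agree exactly
    return max(0.0, 1.0 - abs(real_score - synthetic_score) / denom)

# Final predictive utility: mean match across metrics (metric names illustrative).
scores_real = {"accuracy": 0.91, "f1": 0.88, "roc_auc": 0.95}
scores_synth = {"accuracy": 0.89, "f1": 0.84, "roc_auc": 0.93}
predictive_utility = sum(
    match(scores_real[m], scores_synth[m]) for m in scores_real
) / len(scores_real)
```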
This metric requires both a target column and a machine learning model to be selected. The model can be a classifier or regressor from the selection given below:
nb - Naive Bayes Classifier
lgr - Logistic Regression
knnc - K-Nearest Neighbours
lsvm - Linear Support Vector Machine
svm - Support Vector Machine
dtc - Decision Tree Classifier
rfc - Random Forest Classifier
lgbmc - LightGBM
The default model is lightgbm. All models are optimised using a Bayesian optimisation method. Note that some of these methods are faster and more precise than others: Logistic Regression (lgr) and Decision Trees (decision_tree) are among the fastest to run. Note also that some models do not support feature importance.
A good predictive utility score is generally >= 90%.
Feature importance shift
This metric compares the order of feature importance between the model trained on the original data and the model trained on the synthetic data. Most machine learning algorithms can rank the variables in the data by how informative they are for a specific task.
Synthetic data of good quality should preserve the variables' order of importance. In the example below, the Hazy Hub shows the level of importance assigned by the algorithm and how accurately the synthetic data retains that order.
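One simple way to quantify how well the importance order is preserved (illustrative; Hazy's exact shift statistic is not specified here) is the rank correlation between the two models' feature importances:

```python
# Sketch: compare feature importance rankings of a model trained on the
# real data and one trained on the synthetic data.
import numpy as np
from scipy.stats import spearmanr

def importance_rank_agreement(model_real, model_synth):
    # Assumes both models expose feature_importances_ (e.g. tree ensembles)
    # and were trained on the same columns in the same order.
    corr, _ = spearmanr(np.asarray(model_real.feature_importances_),
                        np.asarray(model_synth.feature_importances_))
    return corr  # 1.0 means the importance order is perfectly preserved
```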
Confusion matrix shift
Confusion matrices illustrate how many true positives, false positives, true negatives and false negatives are made when predicting categorical target variables. Confusion matrix shift illustrates the difference between a confusion matrix produced by a predictive model trained on the synthetic data and that of a model trained on the real data. Darker squares are a sign of higher performance.
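A sketch of one way to compute such a shift (the normalisation and difference measure are assumptions, not Hazy's documented implementation):

```python
# Element-wise difference between row-normalised confusion matrices of
# the two models, both evaluated on the same held-out real test set.
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_matrix_shift(y_true, preds_real_model, preds_synth_model):
    # Fix the label set so both matrices have the same shape.
    labels = np.unique(np.concatenate([y_true, preds_real_model, preds_synth_model]))
    cm_real = confusion_matrix(y_true, preds_real_model, labels=labels, normalize="true")
    cm_synth = confusion_matrix(y_true, preds_synth_model, labels=labels, normalize="true")
    return np.abs(cm_real - cm_synth)  # values near zero indicate little shift
```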
Agreement rate

Agreement rate is the frequency with which a predictive model trained on the synthetic data and a predictive model trained on the real data make the same prediction on the same record.
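For classification this reduces to a one-liner (sketch):

```python
import numpy as np

def agreement_rate(preds_real_model, preds_synth_model):
    # Fraction of test records on which both models predict the same class.
    return float(np.mean(np.asarray(preds_real_model) == np.asarray(preds_synth_model)))
```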
Query utility

Query utility shares many similarities with the bi-joint distribution similarity metric. It is the average overlap of the joint distribution of values across three or more columns. The average is computed over n iterations, with each iteration randomly selecting the number of columns to include in the joint distribution. In essence, query utility estimates the similarity between high-dimensional joint distributions through random queries applied to both the real and synthetic data sets.
A good query utility score is typically >= 60%.
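A sketch of this random-query procedure, assuming categorical (or pre-binned) columns and using the overlap of the two empirical joint distributions as the per-query score (the exact overlap measure is an assumption):

```python
import numpy as np
import pandas as pd

def query_utility(real_df, synth_df, n_iterations=50, seed=0):
    # Average joint-distribution overlap over randomly sampled column subsets.
    rng = np.random.default_rng(seed)
    overlaps = []
    for _ in range(n_iterations):
        # Randomly pick how many columns (>= 3) this "query" spans.
        k = int(rng.integers(3, len(real_df.columns) + 1))
        cols = list(rng.choice(real_df.columns, size=k, replace=False))
        p = real_df[cols].value_counts(normalize=True)
        q = synth_df[cols].value_counts(normalize=True)
        aligned = pd.concat([p, q], axis=1).fillna(0.0)
        # Overlap coefficient: sum of element-wise minima (1 = identical).
        overlaps.append(float(np.minimum(aligned.iloc[:, 0], aligned.iloc[:, 1]).sum()))
    return float(np.mean(overlaps))
```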