Similarity

The concept of similarity is essential to synthetic data. The source data has statistical properties, such as distributions of values. How well are these properties and distributions mirrored in the safe synthetic data?

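Hazy measures similarity at three levels: for a single column (marginal distribution), for a matrix of all columns (mutual information) and for a pair of columns (bi-joint distribution).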

Marginal distribution

The marginal distribution shows the distributions of the values in a column for the source data compared with the synthetic data.

The amount of intersection between the source and synthetic histograms indicates how well the statistical properties of the column have been captured.

A good marginal distribution score is typically >= 98%.
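
The intersection described above can be sketched as follows. This is a minimal illustration, not Hazy's implementation: the function name `histogram_intersection` is hypothetical, and it assumes the column holds discrete values (a numeric column would first need to be binned with the same bin edges for both datasets).

```python
from collections import Counter


def histogram_intersection(source, synthetic):
    """Overlap of two normalised histograms of discrete values, in [0, 1].

    Hypothetical helper for illustration; assumes both inputs are non-empty.
    """
    src_counts = Counter(source)
    syn_counts = Counter(synthetic)
    categories = set(src_counts) | set(syn_counts)
    # For each value, keep the smaller of the two relative frequencies.
    return sum(
        min(src_counts[c] / len(source), syn_counts[c] / len(synthetic))
        for c in categories
    )


source = ["a", "a", "b", "c"]
synthetic = ["a", "b", "b", "c"]
score = histogram_intersection(source, synthetic)
print(f"{score:.0%}")
```

Identical histograms give an intersection of 1 (100%), and histograms with no values in common give 0.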

Mutual information

Histogram similarity fails to capture the dependencies between different columns in the data. For that purpose we use the concept of mutual information that measures the co-dependencies, or correlations if the data is numeric, between all pairs of variables. Quantifying this information is an abstract but very powerful concept that allows us to understand the relationship between variables.

The hub displays two heatmaps of the mutual information matrix: one for the source data and one for the synthetic data. To aid visual comparison, both use an optimal leaf ordering of the source columns, which clusters similar columns together. The hub also shows the difference between the two matrices, which is the basis for the final mutual information score.

Hazy's score for mutual information is the ratio between the mutual information summed over all pairs of variables in the source data $x$ and the same sum for the synthetic data $\hat{x}$ (equivalently, the ratio of the two averages over all pairs):

$MI_{score} = \left| \frac{ \sum_{i=1}^{N} \sum_{j=1}^{N} MI(x_{i},x_{j}) } {\sum_{i=1}^{N} \sum_{j=1}^{N} MI(\hat{x}_{i},\hat{x}_{j}) }\right|$

A good mutual information score is typically >= 70%.
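
The formula above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not Hazy's implementation: the function names are hypothetical, columns are assumed to be discrete (continuous columns would need binning first), and mutual information is estimated from empirical frequencies using the natural logarithm. The choice of logarithm base cancels out in the ratio.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def mi_matrix(columns):
    """Mutual information for every ordered pair of columns, keyed by name."""
    names = list(columns)
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in product(names, repeat=2)
    }


def mi_score(source, synthetic):
    """Ratio of summed mutual information, source over synthetic, as in the formula."""
    src_total = sum(mi_matrix(source).values())
    syn_total = sum(mi_matrix(synthetic).values())
    if syn_total == 0:
        raise ValueError("synthetic data carries no mutual information")
    return abs(src_total / syn_total)


# Hypothetical toy tables: both preserve the same dependency between columns.
source = {"age_band": [0, 0, 1, 1], "owns_home": [0, 0, 1, 1]}
synthetic = {"age_band": [0, 1, 0, 1], "owns_home": [0, 1, 0, 1]}
score = mi_score(source, synthetic)
```

In this toy example both tables carry the same total mutual information, so the ratio is 1. Note that, taken literally, the ratio as written is not capped at 1: it exceeds 1 when the synthetic data carries less mutual information than the source.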

Bi-joint distribution

The bi-joint distribution shows the overlap of the joint distribution of the values across two columns. In this plot, each distinct value in the first selected column has a marginal distribution showing the overlap (by colour) of values of the second column for the source data compared with the synthetic data.

For example, the main marginal distribution plot for a column shows the overall similarity of all its values in the source and the synthetic data. The bi-joint distribution for two columns instead segments the marginal distribution of one column by each distinct value of the other, and shows whether the segmented distributions still match.

A good bi-joint distribution score is typically >= 90%.
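
The segmentation described above can be sketched as follows. This is a rough illustration only: the function names are hypothetical, both columns are assumed to be discrete, and how Hazy aggregates the per-segment overlaps into a single score is not stated here, so the sketch stops at the per-segment values.

```python
from collections import Counter, defaultdict


def overlap(source, synthetic):
    """Histogram intersection of two lists of discrete values, in [0, 1]."""
    if not source or not synthetic:
        return 0.0  # nothing to compare for this segment
    src, syn = Counter(source), Counter(synthetic)
    return sum(
        min(src[v] / len(source), syn[v] / len(synthetic))
        for v in set(src) | set(syn)
    )


def segmented_overlap(src_first, src_second, syn_first, syn_second):
    """For each distinct value of the first column in the source data,
    compare the distribution of the second column in source vs synthetic."""
    src_segments, syn_segments = defaultdict(list), defaultdict(list)
    for a, b in zip(src_first, src_second):
        src_segments[a].append(b)
    for a, b in zip(syn_first, syn_second):
        syn_segments[a].append(b)
    return {
        a: overlap(values, syn_segments.get(a, []))
        for a, values in src_segments.items()
    }


# Hypothetical toy columns: the "y" segment is reproduced less faithfully.
result = segmented_overlap(
    ["x", "x", "y", "y"], [1, 2, 1, 1],  # source: first column, second column
    ["x", "x", "y", "y"], [1, 2, 1, 2],  # synthetic: first column, second column
)
```

Here the two columns could each have a perfect marginal distribution score in isolation while the segment for `"y"` still shows a mismatch, which is the kind of discrepancy the bi-joint view is meant to expose.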