Sequential quality

If you are dealing with sequential data (i.e. data that has a time dependency), such as bank transactions, these temporal dependencies must be preserved in the synthetic data as well. For instance, in healthcare the order of exams and treatments must be preserved: chemotherapy treatments must follow x-rays, CT scans and other medical analyses in a specific order and with specific timing. The synthetic data should preserve this temporal pattern and also replicate the frequency of events, costs, and outcomes. For this reason, specific metrics are used to assess the quality of sequential data.

Autocorrelation

To capture temporal correlations, the metric of choice is autocorrelation with a variable lag parameter. Autocorrelation measures how events at time $t$ are related to events at time $t + k$, where $k$ is the lag parameter.

The autocorrelation of a sequence $x_i$ is given by:

$$AC = \frac{\sum_{i=1}^{n-k} (x_{i} - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{n} (x_{i} - \bar{x})^2}$$

where $\bar{x}$ is the mean of $x$. We assume events occur at a fixed rate, but this restriction does not affect the generality of the concept.
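The autocorrelation formula can be sketched in a few lines of plain Python. This is only an illustration of the definition, not the implementation behind the metric; the function name `autocorrelation` is hypothetical, and the `smooth` and `steps` parameters are not modelled here.

```python
def autocorrelation(x, k):
    """Lag-k autocorrelation of a numeric sequence, following the formula above."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(x)")
    mean = sum(x) / n
    deviations = [value - mean for value in x]
    # Denominator: sum of squared deviations over the whole sequence.
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        return float("nan")  # constant sequence: autocorrelation is undefined
    # Numerator: products of deviations that are k steps apart.
    numerator = sum(deviations[i] * deviations[i + k] for i in range(n - k))
    return numerator / denominator
```

With lag 0 the numerator equals the denominator, so the result is 1 for any non-constant sequence. Comparing the values obtained on real and synthetic sequences for the same lags gives a rough idea of how well the temporal structure is preserved.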

If the events are categorical instead of numeric (for instance medical exam records), the same concept still applies but we use Mutual Information instead.
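How Mutual Information is applied to categorical sequences is not detailed here. One plausible reading, shown below purely as an assumption, is to measure the mutual information between the event at a given step and the event `k` steps later, estimated from empirical pair counts. The function name `lagged_mutual_information` is hypothetical.

```python
import math
from collections import Counter


def lagged_mutual_information(seq, k):
    """Empirical mutual information (in nats) between seq[i] and seq[i + k]."""
    if not 1 <= k < len(seq):
        raise ValueError("lag k must satisfy 1 <= k < len(seq)")
    # Pair every event with the event k steps later (zip stops at the shorter input).
    pairs = list(zip(seq, seq[k:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # Sum p(a, b) * log(p(a, b) / (p(a) * p(b))) over the observed pairs.
    return sum(
        (count / n) * math.log(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )
```

A value of 0 means the lagged events look independent in the sample; larger values indicate a stronger dependency between an event and the one that follows it `k` steps later.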

The metric has the following parameters:

  • smooth=10 - the smoothing parameter, applied over a window of previous time steps
  • steps=5 - the number of previous steps taken into account when measuring the autocorrelation

Histogram overlap

This measures the degree of overlap between the distributions of real and synthetic data aggregated over several time periods:

  • Dayofweek - day of week cycle
  • Week - aggregation over weeks
  • MonthYear - month of the year cycle
  • Month - aggregation over months
  • Day - aggregation over days of the month

Two aggregation functions are used:

  • Sum of values in the period
  • Number of transactions in the period
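The exact overlap measure is not specified here, so the sketch below rests on one common assumption: both aggregated histograms are normalised to sum to 1 and compared with histogram intersection (the sum of bin-wise minima). All names (`period_histogram`, `histogram_overlap`, `day_of_week`) are hypothetical, and amounts are assumed to be non-negative.

```python
from collections import defaultdict
from datetime import date


def period_histogram(transactions, period_key, agg="sum"):
    """Normalised histogram of (date, amount) pairs, grouped by period_key(date)."""
    totals = defaultdict(float)
    for day, amount in transactions:
        # agg="sum" adds up the values; anything else counts the transactions.
        totals[period_key(day)] += amount if agg == "sum" else 1.0
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("histogram needs a positive total")
    return {key: value / grand_total for key, value in totals.items()}


def histogram_overlap(real, synth, period_key, agg="sum"):
    """Histogram intersection in [0, 1]; 1 means identical normalised histograms."""
    p = period_histogram(real, period_key, agg)
    q = period_histogram(synth, period_key, agg)
    return sum(min(p.get(key, 0.0), q.get(key, 0.0)) for key in set(p) | set(q))


def day_of_week(day):
    return day.weekday()  # 0 = Monday ... 6 = Sunday


real = [(date(2024, 1, 1), 10.0), (date(2024, 1, 2), 10.0)]
synth = [(date(2024, 1, 1), 10.0), (date(2024, 1, 3), 10.0)]
print(histogram_overlap(real, synth, day_of_week))  # prints 0.5
```

The other periods can be handled by swapping the key function, for example `lambda d: d.month` for the month-of-year cycle or `lambda d: d.day` for the day of the month.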

Recommendation

This metric aims to capture the relations between users and the products they transact with. It creates a pairwise distance matrix between products, based on the amounts of money spent and how frequently they are spent.

This metric has several parameters (with defaults):

  • metric='cityblock' - used to evaluate pairwise distances
  • N_group_column - number of most common merchant groups to include (default is all)
  • use_log=False - take the log of values before evaluating the pairwise distances
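As a rough illustration of a pairwise product distance matrix, the sketch below represents each product by the total amount each user spent on it and compares products with the cityblock (Manhattan, L1) distance. The feature construction, the use of `log1p` for the log transform (chosen here to avoid taking the log of zero), and the name `product_distance_matrix` are all assumptions made for this example. How the real and synthetic matrices are then compared is not covered.

```python
import math
from collections import defaultdict


def product_distance_matrix(transactions, use_log=False):
    """Cityblock distances between products, from (user, product, amount) triples.

    Returns (products, matrix), where matrix[i][j] is the distance between
    the per-user spend vectors of products[i] and products[j].
    """
    spend = defaultdict(float)
    users, products = set(), set()
    for user, product, amount in transactions:
        spend[(user, product)] += amount
        users.add(user)
        products.add(product)
    users, products = sorted(users), sorted(products)

    def spend_vector(product):
        values = [spend.get((user, product), 0.0) for user in users]
        return [math.log1p(v) for v in values] if use_log else values

    vectors = [spend_vector(product) for product in products]
    # Cityblock distance: sum of absolute coordinate differences.
    matrix = [
        [sum(abs(a - b) for a, b in zip(vi, vj)) for vj in vectors]
        for vi in vectors
    ]
    return products, matrix
```

A frequency-based variant would follow the same pattern, accumulating a count of 1 per transaction instead of the amount.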

Cumulative sum

This metric compares the cumulative sums over different merchants. Let $x_t^n$ be the original transactions over a period with time steps $t = 0, ..., T$ and $n = 0, ..., N$ different merchants/products. The metric is defined as:

$$\text{Cumulative} = \frac{\sum_{n=0}^{N} \sum_{t=0}^{T} x_t^n}{\sum_{n=0}^{N} \sum_{t=0}^{T} \hat{x}_t^n}$$

where $\hat{x}_t^n$ are the synthetic sequences.
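The ratio above can be computed directly once the transactions are grouped per merchant. The sketch below assumes each dataset is a mapping from merchant/product to its list of amounts over time; this layout and the name `cumulative_ratio` are hypothetical.

```python
def cumulative_ratio(real, synth):
    """Ratio of total original amounts to total synthetic amounts.

    real, synth: dict mapping each merchant/product to a list of amounts
    over the time steps of the period.
    """
    total_real = sum(sum(sequence) for sequence in real.values())
    total_synth = sum(sum(sequence) for sequence in synth.values())
    if total_synth == 0:
        raise ValueError("synthetic total is zero, the ratio is undefined")
    return total_real / total_synth
```

A value close to 1 means that the overall amounts in the original and synthetic data agree; values above 1 mean the synthetic data underestimates the total, and values below 1 mean it overestimates it.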