If you are dealing with sequential data (i.e., data with a time dependency) such as bank transactions, these temporal dependencies must be preserved in the synthetic data as well. For instance, in healthcare the order of exams and treatments must be preserved: chemotherapy treatments must follow x-rays, CT scans, and other medical analyses in a specific order and with specific timing. The synthetic data should preserve this temporal pattern and replicate the frequency of events, costs, and outcomes. Specific metrics are therefore used to assess the quality of sequential data.
To capture temporal correlations, the metric of choice is autocorrelation with a variable lag parameter. Autocorrelation measures how events at time $t$ are related to events at time $t - \tau$, where $\tau$ is a lag parameter.
The autocorrelation of a sequence $x_1, \dots, x_T$ is given by:

$$R(\tau) = \frac{\sum_{t=\tau+1}^{T} (x_t - \mu)(x_{t-\tau} - \mu)}{\sum_{t=1}^{T} (x_t - \mu)^2}$$

where $\mu$ is the mean of $x$. We assume events occur at a fixed rate, but this restriction does not affect the generality of the concept.
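As an illustration, lagged autocorrelation of a numeric sequence can be sketched in a few lines of plain Python. This is a minimal sketch that assumes the usual normalized sample estimator (lagged covariance divided by the total squared deviation from the mean); the function name `autocorrelation` and the example series are hypothetical, and the smoothing step controlled by `smooth` is not modelled here.

```python
from statistics import fmean


def autocorrelation(x, lag):
    """Sample autocorrelation of a numeric sequence at a given lag (assumed estimator)."""
    if not 0 <= lag < len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mu = fmean(x)
    # Denominator: total squared deviation from the mean.
    denom = sum((v - mu) ** 2 for v in x)
    if denom == 0:
        return 0.0  # constant sequence; returning 0 is a convention assumed here
    # Numerator: products of deviations at time t and time t - lag.
    num = sum((x[t] - mu) * (x[t - lag] - mu) for t in range(lag, len(x)))
    return num / denom


# Hypothetical series with a repeating pattern, evaluated at lags 1..5.
series = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2]
acf = [autocorrelation(series, lag) for lag in range(1, 6)]
```

Computing the same list of values for the real and the synthetic sequences, and comparing them lag by lag, gives a simple view of how well the temporal correlations are preserved.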
If the events are categorical instead of numeric (for instance medical exam records), the same concept still applies but we use Mutual Information instead.
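For categorical events, the analogous quantity can be sketched as the mutual information between the event at time t and the event `lag` steps earlier. This is a minimal sketch, assuming empirical frequencies as probability estimates and natural logarithms; the function name `lagged_mutual_information` and the exam labels are hypothetical.

```python
from collections import Counter
from math import log


def lagged_mutual_information(events, lag):
    """Mutual information (in nats) between events[t - lag] and events[t]."""
    if not 0 < lag < len(events):
        raise ValueError("lag must satisfy 0 < lag < len(events)")
    # Pair each event with the event observed `lag` steps before it.
    pairs = list(zip(events[:-lag], events[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(a for a, _ in pairs)
    present = Counter(b for _, b in pairs)
    # MI = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) * p(b))).
    return sum(
        (c / n) * log(c * n / (past[a] * present[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical exam record in which the two exams strictly alternate.
exams = ["xray", "ct"] * 10
mi = lagged_mutual_information(exams, lag=1)
```

A strictly alternating record yields a high value because the previous event fully determines the next one, while a record of identical events yields zero.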
The metric has the following parameters:
- `smooth=10`: the smoothing parameter over the window of previous time steps
- `steps=5`: the number of previous steps taken into account to measure the autocorrelation
This metric measures the degree of overlap between the distributions of real and synthetic data, aggregated over several time periods:
- Dayofweek - day of week cycle
- Week - aggregation over weeks
- MonthYear - month of the year cycle
- Month - aggregation over months
- Day - aggregation over days of the month
Two aggregation functions are used:
- Sum of values in the period
- Number of transactions in the period
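The aggregation and comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: transactions are `(date, amount)` pairs with non-negative amounts, and overlap is measured as the histogram intersection of the two normalized aggregates, which is one possible choice and not necessarily the one used by the metric. The names `aggregate`, `overlap`, and `PERIODS` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Period keys matching the aggregation levels listed above.
PERIODS = {
    "Dayofweek": lambda d: d.weekday(),
    "Week": lambda d: tuple(d.isocalendar())[:2],  # (ISO year, ISO week)
    "MonthYear": lambda d: d.month,
    "Month": lambda d: (d.year, d.month),
    "Day": lambda d: d.day,
}


def aggregate(transactions, key, how="sum"):
    """Aggregate (date, amount) pairs per period: sum of values or number of transactions."""
    totals = defaultdict(float)
    for day, amount in transactions:
        totals[key(day)] += amount if how == "sum" else 1.0
    return dict(totals)


def overlap(real, synth):
    """Histogram intersection of two aggregates after normalizing each to sum to 1."""
    real_total = sum(real.values())
    synth_total = sum(synth.values())
    if real_total == 0 or synth_total == 0:
        return 0.0
    keys = set(real) | set(synth)
    return sum(
        min(real.get(k, 0.0) / real_total, synth.get(k, 0.0) / synth_total)
        for k in keys
    )


# Hypothetical data: 2024-01-01 is a Monday.
real = [(date(2024, 1, 1), 10.0), (date(2024, 1, 2), 30.0)]
synth = [(date(2024, 1, 1), 12.0), (date(2024, 1, 2), 28.0)]

scores = {
    (name, how): overlap(aggregate(real, key, how), aggregate(synth, key, how))
    for name, key in PERIODS.items()
    for how in ("sum", "count")
}
```

A score of 1 means the two aggregated distributions coincide for that period and aggregation function, and 0 means they share no mass.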
This metric aims to capture the relations between users and the products they transact. It creates a pairwise distance matrix between products based on patterns in the amounts and frequencies of money spent.
This metric has several parameters (with defaults):
- `metric='cityblock'`: the distance used to evaluate pairwise distances
- `N_group_column`: number of most common merchant groups to include (default is all)
- `use_log=False`: take the log of values before evaluating the pairwise distances
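To make the idea of a product distance matrix concrete, here is a minimal sketch. It assumes each product is described by the total amount each user spent on it, uses the cityblock (Manhattan) distance between those vectors, and applies `log1p` rather than a plain log so that zero spend stays defined. The function `product_distance_matrix` and its arguments `n_groups` and `use_log` are hypothetical stand-ins for the parameters above, not the metric's actual implementation.

```python
from collections import Counter, defaultdict
from math import log1p


def product_distance_matrix(transactions, n_groups=None, use_log=False):
    """Cityblock distances between products from a list of (user, product, amount)."""
    # Keep only the n_groups most frequent products (all of them when None).
    counts = Counter(product for _, product, _ in transactions)
    products = sorted(p for p, _ in counts.most_common(n_groups))
    users = sorted({user for user, _, _ in transactions})

    # Total amount spent by each user on each product.
    spend = defaultdict(float)
    for user, product, amount in transactions:
        spend[(product, user)] += amount

    def vector(product):
        row = [spend.get((product, u), 0.0) for u in users]
        return [log1p(v) for v in row] if use_log else row

    vectors = {p: vector(p) for p in products}
    # Cityblock distance: sum of absolute coordinate differences.
    matrix = [
        [sum(abs(a - b) for a, b in zip(vectors[p], vectors[q])) for q in products]
        for p in products
    ]
    return products, matrix


# Hypothetical transactions: (user, product, amount).
transactions = [
    ("u1", "grocer", 10.0),
    ("u2", "grocer", 5.0),
    ("u1", "cinema", 4.0),
]
products, matrix = product_distance_matrix(transactions)
```

Building one such matrix from the real data and one from the synthetic data, then comparing them entry by entry, indicates whether the relations between products are preserved.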
This metric compares the cumulative sums over different merchants. Given a set of transactions of original over a period with and different merchants/products. It is defined as:
where are the synthetic sequences.