Sequential data

This document explains how Hazy trains and generates models for sequential data and how it evaluates and displays the quality of the synthetic data.

For this demo we will use a dataset of bank transactions released by a Czech bank for a machine learning competition in 1999. It consists of a little over 1 million transactions with 6 attributes for a cohort of 4,500 bank customers.

As the generator, we will use our proprietary model Synthpop_sequential – an enhanced version of Synthpop designed to handle sequential data.

0. What is sequential data and why is it harder to model?

Sequential data is any type of data with dependencies between rows. Examples include:

  • Time-series (weather conditions of a specific location, stock market prices, etc)
  • Bank transactions over a specific period
  • Database with hierarchical structure (shopping basket: user -> order -> order items)

The components of sequential data

Sequential data is harder to model since:

  • Sequences have a variable length.
  • We need to capture dependencies not only across columns but also across rows (normally time).
  • This means that the synthetic data should capture trends, seasonality effects, etc., that may exist in the data.
  • Apart from that, the synthetic data should also preserve temporal correlations: for instance, when a user buys item X on a given day, they are likely to buy item Y the next day. These correlations may be very long range, which makes them hard to reproduce in the synthetic version.
  • We need to preserve cumulative effects (total monthly or yearly expenditure on given items), as illustrated in the sketch below.
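
To make these requirements concrete, here is a small sketch (a toy example with made-up values, using the same kind of columns as the transaction data in this demo) showing variable-length per-user sequences and a monthly cumulative total, both of which the synthetic data must reproduce.

import pandas as pd

# Toy transaction table with account_id, date and amount columns -- the
# values are purely illustrative.
toy = pd.DataFrame({
    'account_id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['1997-01-03', '1997-01-20', '1997-02-11',
                            '1997-01-07', '1997-03-02']),
    'amount': [120.0, 55.5, 300.0, 42.0, 980.0],
})

# Variable-length sequences: one sequence per account, with different lengths.
print(toy.groupby('account_id').size())

# Cumulative effects: monthly expenditure per account, which the synthetic
# data should also preserve.
print(toy.groupby(['account_id', pd.Grouper(key='date', freq='1M')])['amount'].sum())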

How we model sequential data

There are two major ways to model sequential data: the step sequencer, which generates the sequence step by step, and the full sequencer, which models the full length of the sequence at once, as shown in the next diagram. In this work we will use the full sequencer.

Diagram: the step sequencer vs. the full sequencer
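
The following pseudocode-style sketch contrasts the two approaches. It is illustrative only, not the Hazy implementation; step_model and full_model are hypothetical objects with the interfaces shown.

# Illustrative sketch only -- not the Hazy implementation.
# `step_model` and `full_model` are hypothetical models.

def generate_step_sequencer(step_model, length, seed_row):
    # Generate one row at a time, conditioning each new row on the history so far.
    sequence = [seed_row]
    for _ in range(length - 1):
        sequence.append(step_model.sample(history=sequence))
    return sequence

def generate_full_sequencer(full_model, length, attributes):
    # Generate the whole sequence in one shot, conditioned on sequence-level
    # attributes (e.g. cluster and Length).
    return full_model.sample(length=length, attributes=attributes)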

1. Imports and data preparation

First we do the imports necessary to train the model and evaluate the synthetic data:

import pandas as pd
import numpy as np
import time
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from synthpop import Synthpop
from sklearn.cluster import KMeans
from harmonise import harmonise_dfs
from sklearn.preprocessing import LabelEncoder
from hazy_public.evaluation.similarity.hist_sim import HistSim
from hazy_public.evaluation.similarity.mi_sim import MISim
from hazy_public.evaluation.utility.pred_utility import PredUtility
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger().setLevel(logging.INFO)

Then we load the metrics: the temporal Histogram similarity, Autocorrelation, MerchantHistogramSimilarity and Catch22 metrics.

The Histogram similarity measures the overlap of distributions between original and synthetic data at different levels of time aggregation.
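
To illustrate the idea (a minimal sketch, not the actual HistSim implementation), the overlap of two normalised histograms over a shared set of bins can be computed as the sum of the bin-wise minima:

import numpy as np

def histogram_overlap(real, synth, bins=100):
    # Overlap of two normalised histograms on shared bins: 1.0 for identical
    # distributions, 0.0 for disjoint ones. Sketch only -- the Hazy metrics
    # apply their own binning and time aggregation.
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    h_real, _ = np.histogram(real, bins=edges)
    h_synth, _ = np.histogram(synth, bins=edges)
    h_real = h_real / h_real.sum()
    h_synth = h_synth / h_synth.sum()
    return float(np.minimum(h_real, h_synth).sum())

# Two samples from the same log-normal distribution overlap strongly.
rng = np.random.default_rng(0)
print(histogram_overlap(rng.lognormal(size=10_000), rng.lognormal(size=10_000)))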

The Catch22 library captures a series of metrics that are important for comparing pairs of time series, like autocorrelation and time dependencies. The final score is the mean of all the metrics.

from evaluation.histogram.histogram import Histogram
from evaluation.autocorrelation.autocorrelation import Autocorrelation
from ml_applied.banking.transactions.evaluation.similarity import (
    MerchantHistogramSimilarity,
    merchant_histogram_aggregator,
    get_visuals_merchant_histogram_metric)
from ml_core.evaluation.similarity.catch22 import Catch22Evaluation

We also use some helper functions to deal with data.

def logx(x):
    # signed log transform: sign(x) * log(1 + |x|), compresses heavy-tailed values
    return np.sign(x)*np.log1p(x*np.sign(x))

def logsum(x):
    # signed log transform of the sum of x
    x = np.sum(x)
    return np.sign(x)*np.log1p(x*np.sign(x))

def log_inv(x):
    # inverse of logx: sign(x) * (exp(|x|) - 1)
    return np.sign(x)*np.expm1(x*np.sign(x))

def cluster_data(data, length, col, n_clusters=10):
    # Cluster users (values of `col`, e.g. account_id) by their expenditure
    # patterns and attach the cluster label to every transaction row.
    data = data.copy()
    data['Day'] = data.date.dt.day

    # signed log-sum of amounts per user and merchant category (k_symbol)
    tableu = pd.pivot_table(data, values='amount', index=col,
                            columns=['k_symbol'], aggfunc=logsum).fillna(0)

    # maximum monthly expenditure per user
    table = (
        (
            data.groupby(by=[col, pd.Grouper(freq='1M', key='date')])['amount']
            .agg('sum')
            .unstack(0)
        )
        .resample('1M')
        .max()
    ).T.fillna(0)

    table = table.join(tableu, on=col)

    # KMeans on the per-user feature table
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(table)
    table['labels'] = kmeans.labels_

    # expand each user's cluster label to one entry per transaction row
    cluster = []
    for i in range(len(table)):
        cluster += [int(kmeans.labels_[i])] * length[i]

    data['cluster'] = cluster

    return data, cluster, table.reset_index()
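
As a quick sanity check of these helpers (a small example added here, not part of the original pipeline), logx compresses the heavy-tailed amounts and log_inv recovers them:

x = np.array([-1500.0, 0.0, 250.0, 76500.0])
y = logx(x)                        # signed log transform of the amounts
print(np.allclose(log_inv(y), x))  # True: log_inv undoes logx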

First we read the data (only 100k rows for simplicity) and do some simple manipulations and conversions:

data = pd.read_csv(
    '~/Projects/Berka/data_berka/trans.asc',
    delimiter=";",
    nrows=1e5)

data.date = pd.to_datetime(data.date, format="%y%m%d")
print(data.shape)
data['k_symbol'] = data['k_symbol'].astype('O').fillna('Other')
data.describe()
       trans_id      account_id     amount         balance        account
count  1.000000e+05  100000.000000  100000.000000  100000.000000  2.555100e+04
mean   1.354408e+06  2934.848810    6325.518617    35524.758018   4.599072e+07
std    1.245272e+06  2509.861075    9807.325344    19921.223279   3.012205e+07
min    2.760000e+02  2.000000       0.000000       -5446.600000   0.000000e+00
25%    4.237768e+05  1146.000000    140.400000     22807.750000   1.868500e+07
50%    8.532365e+05  2425.000000    2240.000000    30597.200000   4.601617e+07
75%    2.267847e+06  3683.000000    7567.000000    43923.125000   7.155672e+07
max    3.682933e+06  11359.000000   76500.000000   193909.900000  9.998564e+07

We add a column Length to the dataframe with the sequence length for each of the users:

# sequence length per account: `length` has one entry per account, while
# `ll` repeats that length for every row belonging to the account
group = data.groupby('account_id')
length = []
ll = []
for g in group.groups.items():
    l = len(g[1])
    length.append(l)
    ll += [l] * l
data['Length'] = ll

Insert a cluster column, cluster, so that the model captures different behaviours based on expenditure patterns - here we use 5 clusters:

data, cluster, _ = cluster_data(data, n_clusters=5, col='account_id', length=length)

Add a moving average column, balance_ma, to better condition the sequential model:

# group by account and calendar month, then take a 10-step rolling mean of
# the balance within each group
grouper = data.groupby(['account_id', pd.Grouper(key='date', freq='1M')]).ngroup().values
temp = []
for k, v in data.groupby(grouper):
    temp += list(v['balance'].rolling(10).mean().fillna(v['balance'].mean()).values)
data['balance_ma'] = temp

Do some plots. First, the transaction activity per merchant category, k_symbol. We can see that the amount spent varies substantially and there are seasonality and trend effects that the synthetic data should preserve:

data.iloc[:10000].groupby(['date','k_symbol'])['amount'].size().unstack().plot(figsize=(10,5))
plt.ylim(0,100)
plt.title('Original')

In order to train the model we need to specify the visit order, the fit methods and the data types.

dt={'date': 'int', 
    'cluster':'int',
    'Length':'int',
    'type': 'category',
    'balance_ma':'float',
    'amount': 'float',
    'k_symbol':'category',
    'balance': 'float', 
   }

visit={'cluster':'sample',
       'Length':'cart',
       'date':'perturb',
       'type':'cart',
       'k_symbol':'cart',
       'balance_ma':'empty',
       'amount':'cart',
       'balance':'cart',
      }

The Synthpop model is instantiated with the following parameters:

  • method is the method used to model each column. Methods available are:
    decision trees (cart), sample, empty and xgboost
  • visit_sequence is the order in which the variables are visited (it has a large impact on the quality of the data). Note that the first variable has to be numeric and use the sample or empty method.
  • one_hot_cat_cols controls whether one-hot encoding is used for categorical variables. Using it generates higher quality data but it takes longer to train the model.
  • label is the column used to condition the model - in this case the column cluster
  • time_col is the column with information about the timestamp.
  • attributes is a list of variables used as sequence-level attributes
  • skip_columns is a list of columns to be ignored during training

model = Synthpop(method=list(visit.values()), 
                 visit_sequence = list(visit.keys()), 
                 sequential=True, 
                 one_hot_cat_cols=True, 
                 label = 'cluster', 
                 time_col='date',
                 attributes=['cluster','Length'],
                 autoregressive=['amount'],
                 skip_columns=[],
                 dtypes=dt
                )

Then we train the model:

start = time.time()
model.fit(data[['account_id']+list(visit.keys())] )
end = time.time()
print('time to train %s seconds'% np.round((end - start),3))

train_cluster
train_Length
train_date
train_type
train_k_symbol
train_balance_ma
train_amount
train_balance
time to train 6.451 seconds

As we can see, it takes only 6.4 s to train, even with one_hot_cat_cols set to True. Note that this model scales linearly with the number of rows. Now we generate new data (1,000 new sequences, which corresponds to roughly 88k rows):

synth_df = model.generate(1000)
synth_df.shape
generate_cluster
generate_Length
generate_date
generate_type
generate_k_symbol
generate_balance_ma
generate_amount
generate_balance

(87854, 9)
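
As a quick additional check (a sketch that assumes the generated frame keeps an account_id column identifying each synthetic sequence), we can compare the sequence-length distributions of the original and synthetic data:

# Compare the distribution of sequence lengths per account.
orig_lengths = data.groupby('account_id').size()
synth_lengths = synth_df.groupby('account_id').size()

print(orig_lengths.describe())
print(synth_lengths.describe())

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist([orig_lengths, synth_lengths], bins=30, density=True,
        label=['original', 'synthetic'])
ax.set_xlabel('transactions per account')
ax.legend()
plt.title('Sequence length distribution')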

2. Metrics

Now that we have generated the data, we will assess its quality.

2.1 Histogram and Mutual information

First we evaluate the usual histogram similarity and mutual information, as if it were tabular data.

def evaluate(df, df_synth):
    h = HistSim(n_bins=100)
    mi = MISim(n_bins=100)
    hs = h.score(df,df_synth)
    mis = mi.score(df,df_synth)
    print('Histogram similarity = ', np.round(hs['hist']['score'],3), '\n'
          'Mutual information   = ', np.round(mis['mi']['score'],3))

evaluate(data, synth_df)
Histogram similarity =  0.862
Mutual information   =  0.487

These numbers are reasonable, but not very meaningful on their own, as they fail to capture temporal dependencies.

The amounts spent on transactions follow a log-normal distribution that should be preserved by the synthetic data, as shown in the next figure.
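
A plot of this kind can be produced with a short snippet like the one below (a sketch added here, reusing the logx helper defined above and assuming synth_df keeps the amount column):

# Overlay the signed-log-transformed amount distributions.
fig, ax = plt.subplots(figsize=(8, 4))
sns.histplot(logx(data['amount']), stat='density', bins=60,
             color='C0', alpha=0.4, label='original', ax=ax)
sns.histplot(logx(synth_df['amount']), stat='density', bins=60,
             color='C1', alpha=0.4, label='synthetic', ax=ax)
ax.set_xlabel('logx(amount)')
ax.legend()
plt.title('Transaction amounts (log scale)')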

2.2.1 Autocorrelation

The first metric that should be preserved in the synthetic data is autocorrelation, i.e. how events co-depend on each other over time. We have a metric Autocorrelation that captures that. We need to specify a few parameters, like the aggregation column aggvalue and the number of top merchants N_group_column:

auto_corr = Autocorrelation(aggvalue='amount',group_column='k_symbol',smooth=5)
score_ac, details_ac = auto_corr.score(data,synth_df,synth_df,N_group_column=4)
print(score_ac.score)
0.891

2.2.2 Temporal histograms

To capture temporal dependencies we use histogram similarity between aggregated data. This is quantified by calling the following metric. It returns a dictionary with the histogram metrics for each merchant category (k_symbol), aggregation time (Dayofweek, Week, Month) and aggregation function (sum and num, i.e. counts).

hist_time = Histogram(
                group_column='k_symbol',
                aggvalue='amount',
                group_key='account_id',
                date='date',
                time_aggregation=None,
                use_log=False,
                N_group_column = 10
            )
score, details = hist_time.score(df=data, synth_df=synth_df, test_df=synth_df)
print('total score = ', score.score)
print(details['histogram'])
total score =  0.6719111111111112
{'Other': {'Dayofweek': {'num': 0.989, 'sum': 0.981}, 'Week': {'num': 0.734, 'sum': 0.646}, 'Month': {'num': 0.915, 'sum': 0.903}, 'MonthYear': {'num': 0.939, 'sum': 0.925}, 'Day': {'num': 0.78, 'sum': 0.616}},
 'UROK': {'Dayofweek': {'num': 0.921, 'sum': 0.803}, 'Week': {'num': 0.506, 'sum': 0.476}, 'Month': {'num': 0.822, 'sum': 0.657}, 'MonthYear': {'num': 0.867, 'sum': 0.694}, 'Day': {'num': 0.346, 'sum': 0.415}},
 'SLUZBY': {'Dayofweek': {'num': 0.868, 'sum': 0.703}, 'Week': {'num': 0.333, 'sum': 0.398}, 'Month': {'num': 0.577, 'sum': 0.642}, 'MonthYear': {'num': 0.713, 'sum': 0.674}, 'Day': {'num': 0.248, 'sum': 0.318}},
 'SIPO': {'Dayofweek': {'num': 0.963, 'sum': 0.954}, 'Week': {'num': 0.656, 'sum': 0.66}, 'Month': {'num': 0.817, 'sum': 0.809}, 'MonthYear': {'num': 0.932, 'sum': 0.843}, 'Day': {'num': 0.736, 'sum': 0.659}},
 ' ': {'Dayofweek': {'num': 0.965, 'sum': 0.939}, 'Week': {'num': 0.651, 'sum': 0.657}, 'Month': {'num': 0.841, 'sum': 0.827}, 'MonthYear': {'num': 0.912, 'sum': 0.875}, 'Day': {'num': 0.696, 'sum': 0.626}},
 'DUCHOD': {'Dayofweek': {'num': 0.975, 'sum': 0.944}, 'Week': {'num': 0.642, 'sum': 0.661}, 'Month': {'num': 0.929, 'sum': 0.853}, 'MonthYear': {'num': 0.881, 'sum': 0.886}, 'Day': {'num': 0.528, 'sum': 0.584}},
 'POJISTNE': {'Dayofweek': {'num': 0.956, 'sum': 0.833}, 'Week': {'num': 0.674, 'sum': 0.6}, 'Month': {'num': 0.833, 'sum': 0.767}, 'MonthYear': {'num': 0.917, 'sum': 0.844}, 'Day': {'num': 0.639, 'sum': 0.551}},
 'UVER': {'Dayofweek': {'num': 0.749, 'sum': 0.67}, 'Week': {'num': 0.459, 'sum': 0.46}, 'Month': {'num': 0.644, 'sum': 0.628}, 'MonthYear': {'num': 0.765, 'sum': 0.732}, 'Day': {'num': 0.163, 'sum': 0.216}},
 'SANKC. UROK': {'Dayofweek': {'num': 0.521, 'sum': 0.413}, 'Week': {'num': 0.301, 'sum': 0.003}, 'Month': {'num': 0.562, 'sum': 0.065}, 'MonthYear': {'num': 0.588, 'sum': 0.11}, 'Day': {'num': 0.34, 'sum': 0.159}}}

2.2.3 Calculate and display the visuals of sequential histogram

This section creates some visuals to display the sequential histogram similarity.

We cast some columns to categorical to perform the aggregations:

for d in [data, synth_df]:
    d['account_id'] = d['account_id'].astype('category')
    d['k_symbol'] = d['k_symbol'].astype('category')
    d['date']=pd.to_datetime(d['date'],errors='coerce')

First we create the aggregations. As strategy we can select one of the following options:

  • "D" - day non-cyclical,
  • "W" - week non-cyclical,
  • "M" - month non-cyclical,
  • "DayOfWeek" - day of the week,
  • "DayOfMonth" - day of the month,
  • "MonthOfYear" - month of the year
aggregation = "sum"
strategy = "DayOfWeek"
# strategy = "M"
separate_sign = False

agg_dfs, synth_agg_dfs = [
    merchant_histogram_aggregator(
        _df,
        transaction_time="date",
        transaction_amount="amount",
        merchant_id="k_symbol",
        strategy=strategy,
        aggregation=aggregation,
        separate_sign=separate_sign,
    )
    for _df in [data[cols],synth_df[cols]]
]
print("Aggregation completed")

Then we harmonise the dataframes:

if separate_sign:
    _dfs = harmonise_dfs(agg_dfs + synth_agg_dfs)
    agg_dfs, synth_agg_dfs = _dfs[:2], _dfs[2:]
else:
    _dfs = harmonise_dfs([agg_dfs] + [synth_agg_dfs])
    agg_dfs, synth_agg_dfs = [_dfs[0]], [_dfs[1]]
print("Harmonisations completed")

Display the visuals; this creates interactive plots where we can compare the temporal histograms by merchant category.

merchant_hist = MerchantHistogramSimilarity()
score, details = merchant_hist.score(agg_dfs, synth_agg_dfs)
visuals = get_visuals_merchant_histogram_metric(merchant_hist, score, details)

display(visuals)

Note that for categories with fewer transactions the agreement is worse.
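
A quick way to see which categories are sparsely populated (a small check added here) is to count transactions per merchant category in both datasets:

# Transaction counts per merchant category; rare categories are harder to match.
print(data['k_symbol'].value_counts())
print(synth_df['k_symbol'].value_counts())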

2.2.4 Catch22 score

This section evaluates some time-series metrics using Catch22, which captures a series of metrics that are important for comparing pairs of time series, like autocorrelation and time dependencies. The final score is the mean of all the metrics. It assumes the data (bank transactions) is projected onto a time series using a 1-day resampling.

accounts = data["account_id"].unique()

def acct_to_balance_ts(acct_no, df_raw):
    df_acct0 = df_raw[df_raw["account_id"] == acct_no]
    tx0 = df_acct0[["amount", "balance","date"]] \
       .groupby("date").agg(sum) \
       .sort_values("date")

    balance_ts0 = (tx0[["amount"]]).resample("1D").ffill()
    
    balance_ts0 = balance_ts0 - np.min(balance_ts0)
    balance_ts0 = balance_ts0 / np.max(balance_ts0)
    
    return balance_ts0
# pandas Series with a date index
series = [acct_to_balance_ts(acct, data) for acct in accounts[:100]]

# raw numpy values
ts = [s.values.squeeze() for s in series]
fig, axs = plt.subplots(len(accounts[:10]), sharex=True)
for index, acct in enumerate(accounts[:10]):
    ts_values = acct_to_balance_ts(accounts[index], data)
    axs[index].plot(ts_values)
    axs[index].set_ylabel(f"Acct {index}")
plt.suptitle("Money spent timeseries")  
plt.show()

Catch22 score between original data accounts

catch22 = Catch22Evaluation()
ts_sz = len(ts)
result = np.eye(ts_sz)
for i in range(10):  # limit to the first 10 accounts; use range(ts_sz) for all
    for j in range(i):
        # get similarity between ts i and j
        score = catch22.score(pd.DataFrame(ts[i]), pd.DataFrame(ts[j]), 0, 0)[0].score
        result[i,j] = score
        result[j,i] = score     
fig = plt.figure(figsize=(15, 8))
plt.title("Account balance timeseries similarity")
plt.imshow(result[:10,:10])
plt.colorbar()

3. Train and generate data using the docker image

This model can also be run from a Docker image. The only difference is that we have to provide the input parameters as a dictionary named params_synthpop.

organisation = ""
generator_name = ""
with open('/Users/armandovieira/hazy_hub.json') as f:  
    api_key = json.load(f)['staging_api_key']
    
hub_url = "https://staging-hub.hazy.network/"

tabular_image = 'project/synthpop-generic:20210706T053518' # synthpop docker image
hub_url = hub_url.rstrip("/")
generator_uri = f"{organisation}/{generator_name}"

#  working directory
work_dir = Path.cwd() / "output"
work_dir.mkdir(exist_ok=True, parents=True) 

# Logging
logging.basicConfig(level=logging.INFO)
hub = HazyHub(hub_url, api_key)
synth = hub.synthesiser(image_name=tabular_image, work_dir=work_dir)
train_params = dict(
    # Model Settings
    epsilon = None,
    max_cat = 100,
    
    # Handler Settings
    custom_handlers = [],
    automatic_handlers = {
        "extractors": [{"type": "determined"}], 
        "ignore_thresh": 0.1, 
        "ignore": list(df.columns), #cols_date + cols_date1 #
    },

    params_synthpop = {
        "sequential": True,
        "time_col": 'date',
        "one_hot_cat_cols": True,
        "perturb": None,
        "fix_length": False,
        "label": 'cluster',
        "attributes": ['cluster', 'Length'],
        "autoregressive": ['amount'],
        "skip_columns": [],
        "visit": visit,
        "dtypes": dt,
    },
    
    # Stacking
    stacked_by=None,
    sort_by=None,
    stack_relative=False,
    disable_presence_disclosure_metric=False,
    disable_density_disclosure_metric=True,
    
    # Evaluation
    evaluate=False, 
    train_test_split=True,
    label_columns= ["k_symbol"], # choosed a the last column. Change accordingly. list(loaded_dtypes.keys())[-1]
    predictors=["lgbm"], # which prediction model is used to access the utility of the data
    evaluation_exclude_columns = [],
    optimise_predictors = False,
    predictor_max_rows = 10**5,
    
    # Name of the model to be saved
    model_output="adult.hmf",
    
    # how many records sample generates
    sample_generate_params= {
        "params": {"num_records": 100},
        "implementation_override": False
    },

    # how many records are used for the evaluation metrics
    evaluation_generate_params= {
        "params": {"num_records":  3000},
        "implementation_override": False
    },
)
%%time
hazy_model = synth.train(
    source_data=data,
    dtypes=dt,
    **train_params)