Sequential

This is the initial v1.0 release of the sequential data synthesiser. It implements an initial feature set covering sequential data synthesis.

This first release ensures that cross-row correlations and statistics, as well as the multiplicity distribution, are preserved. The synthesiser can be used with time-indexed sequences as well as ordered sequences.

Further features are planned for upcoming releases.

Features

  • Support for varied-length sequential data

    Sequences within a sequential dataset tend to have different lengths. This release captures and preserves the length (multiplicity) distribution, so it accepts both fixed-length and varied-length sequential datasets.

  • Captures cross-row correlations in source data

    Steps within a sequence are often correlated with each other (for example, buying a coffee after buying lunch). This release provides a generative model that can capture such local cross-step correlations between elements of the same sequence.

  • Support for the existing handler pipeline as used with tabular synthesisers

    To avoid reducing coverage, the sequential synthesiser does not restrict the use of any handler provided by the tabular+ synthesiser.
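As a rough illustration of the first two features, the sketch below computes the multiplicity (sequence length) distribution and counts adjacent-step pairs within each sequence. The toy data and all function names are hypothetical and are not part of the synthesiser's API; the sketch only shows the kind of statistics this release aims to preserve.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (sequence id, event) rows, already in order.
rows = [
    ("a", "lunch"), ("a", "coffee"),
    ("b", "lunch"), ("b", "coffee"), ("b", "taxi"),
    ("c", "taxi"),
]

def group_sequences(rows):
    """Group events by sequence id, keeping row order within each sequence."""
    sequences = defaultdict(list)
    for seq_id, event in rows:
        sequences[seq_id].append(event)
    return dict(sequences)

def multiplicity_distribution(sequences):
    """Map each sequence length to the number of sequences with that length."""
    return Counter(len(events) for events in sequences.values())

def adjacent_pair_counts(sequences):
    """Count (previous step, next step) pairs within each sequence."""
    pairs = Counter()
    for events in sequences.values():
        pairs.update(zip(events, events[1:]))
    return pairs

sequences = group_sequences(rows)
lengths = multiplicity_distribution(sequences)
pairs = adjacent_pair_counts(sequences)
```

Comparing `lengths` and `pairs` between source and synthetic data is one simple way to check whether sequence lengths and local step-to-step patterns look similar.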

Future release features

  • Sequential metrics, including multiplicity distributions and aggregate scores

    Currently, only training and generation have been updated; evaluation still contains only the metrics included in the tabular+ release. However, several sequential-specific metrics have been implemented and will be integrated in an upcoming release. These include:

    • multiplicity-related metrics
    • aggregated statistics metrics
  • Rule-based sequential columns, such as balance columns and separate positive and negative amount columns

    Some sequential features, such as a balance column that updates at each transaction, or separate amount columns (positive and negative amounts), are fairly common in financial services. Capabilities to model and specify these behaviours are currently being developed to allow flexible configuration of sequential datasets.

  • Speed/performance improvements

    While a first attempt at run-time and memory improvements has been made for this initial release, further optimisation work will follow to make the workflow easier and faster.
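To make the rule-based columns mentioned above concrete, here is a minimal sketch of the relationships such columns follow in source data: a balance that is the previous balance plus the signed amount, and a signed amount split into separate positive and negative columns. The function names and the toy amounts are hypothetical. This only illustrates the data pattern; it does not describe how the planned feature will be configured or implemented.

```python
from itertools import accumulate

def running_balance(amounts, opening_balance=0):
    """Balance after each transaction: previous balance plus the signed amount."""
    # accumulate(..., initial=...) also emits the opening balance, so drop it.
    return list(accumulate(amounts, initial=opening_balance))[1:]

def split_amount(amount):
    """Split a signed amount into (positive part, magnitude of negative part)."""
    return (amount, 0) if amount >= 0 else (0, -amount)

def balance_is_consistent(balances, amounts, opening_balance=0):
    """Check that a balance column follows the running-balance rule."""
    return balances == running_balance(amounts, opening_balance)

# Hypothetical transactions in integer minor units (e.g. pence).
amounts = [100, -30, -20, 50]
balances = running_balance(amounts, opening_balance=10)
split_columns = [split_amount(a) for a in amounts]
```

A synthesiser that generates the balance and amount columns independently could break this rule, which is why such columns benefit from being specified explicitly.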

Training parameters

Parameter Type Required Default Description
Data parameters
input_path string Yes

Docker path to the source data CSV file

dtypes_path string Yes

Docker path to JSON file containing source data feature dtypes

custom_handlers list Yes []

List of handlers to use. Each handler is a dictionary.

automatic_handlers dict Yes None

Specifies the ordered list of handler-extractors used to automatically detect specific handlers and add them to the manually specified set of handlers.

Sequential parameters
seq_id string Yes

ID column to use to identify each unique sequence.

seq_id_params dict Yes

Parameters specifying how to generate the specified sequence ID. Must match the settings for the id handler.

sort_by list Yes []

List of columns used to sort the sequences. If no columns are provided, the sequences are assumed to be already in order.

n_steps int Yes 1

Number of steps within each sequence to predict at every call.

window_size int Yes 5

Number of previous elements in the sequence to condition on. This allows future steps to depend on the past window_size steps.

assume_fixed_frequency boolean Yes False

Assume that the time series is the result of sampling at regular intervals.
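The interaction between sort_by, window_size and n_steps can be pictured with the sketch below. It is one plausible reading of the descriptions above, not the synthesiser's actual implementation: the function names are hypothetical, and how the start of a sequence (where fewer than window_size previous steps exist) is really handled is an assumption.

```python
def sort_sequence(rows, sort_by):
    """Order the rows of one sequence by the given columns.

    With an empty sort_by, the rows are assumed to be in order already.
    """
    if not sort_by:
        return list(rows)
    return sorted(rows, key=lambda row: tuple(row[col] for col in sort_by))

def windows(steps, window_size=5, n_steps=1):
    """Yield (context, target) pairs for one ordered sequence.

    context: up to window_size steps preceding the target
    target:  the next n_steps steps (the last target may be shorter)
    """
    for start in range(0, len(steps), n_steps):
        context = steps[max(0, start - window_size):start]
        target = steps[start:start + n_steps]
        yield context, target

# Hypothetical rows for a single sequence, out of order on column "t".
rows = [{"t": 3, "v": "c"}, {"t": 1, "v": "a"}, {"t": 2, "v": "b"}]
ordered = [row["v"] for row in sort_sequence(rows, sort_by=["t"])]

pairs = list(windows([1, 2, 3, 4, 5, 6, 7], window_size=3, n_steps=2))
```

In this reading, a larger window_size lets each predicted step depend on more history, while a larger n_steps produces more steps per call.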

Generator parameters
epsilon float Yes

Privacy parameter. The smaller the value of epsilon, the higher the degree of privacy in the synthetic data. Epsilon typically lies within the range 1e-3 < epsilon < 1e6 and tends to be specified in orders of magnitude.

n_bins int Yes

Number of bins that continuous data will be discretised into.

drop_empty_bins boolean Yes False

When set to True, ensures that no empty bin is used.

bin_strategy string Yes uniform

Controls the binning method used. Select one of the following:

  • uniform: Splits the continuous range into bins of equal size.
  • quantile: Splits the continuous range into quantiles, so that each bin covers an equal portion of the overall distribution. This behaves better for skewed distributions.
  • max_quantile: Similar to quantile, but ensures that each bin covers at most 10% of the overall data range, to avoid bins that are too large.

max_cat int Yes 250

Maximum amount of categorical information to preserve in each categorical column.

n_parents int Yes 3

Maximum number of parents the Bayesian network will use for any node within the network. This has a significant impact on memory usage.

network_sample_rows int Yes None

Controls the number of rows to sample and use for the network building step.

sample_parents int Yes None

Controls the number of parents to consider for each node during the network building step.

default_strategy string Yes uniform

Controls the behaviour of the generative model when an unknown combination of parent values is encountered. Choose from:

  • marginal: Randomly sample following the marginal distribution of the column being generated.
  • uniform: Randomly sample a valid value using a uniform distribution.

skew_threshold float Yes None

Skewness value at which a column is considered skewed.

single_threshold float Yes auto

Frequency above which single numerical values are treated as separate categories.

split_time boolean Yes False

Controls whether to model date / datetime columns as separate components (year, week, etc.) or as a single value.
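To illustrate why bin_strategy matters for skewed columns, the sketch below compares the uniform and quantile strategies on toy data using only the Python standard library. It is a simplified illustration under stated assumptions: the function names are hypothetical, the exact quantile method used by the synthesiser is not documented here, and max_quantile is not covered.

```python
import statistics
from bisect import bisect_right

def uniform_edges(values, n_bins):
    """Bin edges that split [min, max] into n_bins bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def quantile_edges(values, n_bins):
    """Bin edges placed at quantiles, so bins hold similar numbers of values."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]

def bin_counts(values, edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Search only the interior edges so the maximum lands in the last bin.
        idx = bisect_right(edges, v, 1, len(edges) - 1) - 1
        counts[idx] += 1
    return counts

# Hypothetical skewed column: most values are small, one is very large.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

uniform_counts = bin_counts(values, uniform_edges(values, 3))
quantile_counts = bin_counts(values, quantile_edges(values, 3))
```

On this toy column the uniform strategy leaves the middle bin empty and puts nine of the ten values in the first bin, while the quantile strategy spreads the values almost evenly. Empty bins of this kind are the case that drop_empty_bins addresses.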

Evaluation parameters
disable_presence_disclosure_metric boolean Yes False

Disables computation of the presence disclosure metric.

disable_density_disclosure_metric boolean Yes True

Disables computation of the density disclosure metric.

evaluate_on_original list Yes []

List of metrics to evaluate against the raw schema rather than the clean version. Select metrics from the following list:

  • hist: Histogram / marginal distribution similarity.
  • mi: Mutual Information Similarity.
  • pred_utility: Predictive Utility.
  • query: Query Utility.
  • density_disclosure: Density Disclosure.
  • presence_disclosure: Presence Disclosure.
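Pulling the table together, the sketch below collects the training parameters into a single Python dictionary using the documented defaults, with a small check of the enumerated options. How parameters are actually supplied to the synthesiser (file format, CLI, API) is not stated here, so the dictionary form is an assumption. The paths, column names, and the epsilon and n_bins values are hypothetical placeholders, and check_params is an illustrative helper rather than part of the product.

```python
# Allowed values as listed in the parameter descriptions above.
BIN_STRATEGIES = {"uniform", "quantile", "max_quantile"}
DEFAULT_STRATEGIES = {"marginal", "uniform"}
METRICS = {
    "hist", "mi", "pred_utility", "query",
    "density_disclosure", "presence_disclosure",
}

training_params = {
    # Data parameters (paths are hypothetical placeholders)
    "input_path": "/data/source.csv",
    "dtypes_path": "/data/dtypes.json",
    "custom_handlers": [],
    "automatic_handlers": None,
    # Sequential parameters (column names are hypothetical)
    "seq_id": "account_id",
    "seq_id_params": {},  # must match the id handler settings; left empty here
    "sort_by": ["timestamp"],
    "n_steps": 1,
    "window_size": 5,
    "assume_fixed_frequency": False,
    # Generator parameters (epsilon and n_bins have no documented default)
    "epsilon": 1.0,
    "n_bins": 10,
    "drop_empty_bins": False,
    "bin_strategy": "uniform",
    "max_cat": 250,
    "n_parents": 3,
    "network_sample_rows": None,
    "sample_parents": None,
    "default_strategy": "uniform",
    "skew_threshold": None,
    "single_threshold": "auto",
    "split_time": False,
    # Evaluation parameters
    "disable_presence_disclosure_metric": False,
    "disable_density_disclosure_metric": True,
    "evaluate_on_original": ["hist", "mi"],
}

def check_params(params):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    if params["bin_strategy"] not in BIN_STRATEGIES:
        problems.append(f"unknown bin_strategy: {params['bin_strategy']}")
    if params["default_strategy"] not in DEFAULT_STRATEGIES:
        problems.append(f"unknown default_strategy: {params['default_strategy']}")
    unknown = sorted(set(params["evaluate_on_original"]) - METRICS)
    if unknown:
        problems.append(f"unknown metrics in evaluate_on_original: {unknown}")
    return problems

problems = check_params(training_params)
```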

Generation parameters

Same as Tabular+