Tabular+

Hazy’s Tabular+ synthesiser is for non-sequential data structured in rows and columns. Its main difference from the Tabular synthesiser is that it handles stacked rows, i.e. records composed of multiple rows. A good example is a banking dataset in which several customers share the same account; in this case a record is an account, not a customer. The synthesiser lets a user specify which set of columns identifies a record, ensuring that the modelling is done correctly.
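
For instance, in stacked form a single account record may span several customer rows. The snippet below is a hypothetical toy frame for illustration, not Hazy's API:

import pandas as pd

# Hypothetical stacked banking data: each account (the record) spans
# multiple rows, one per customer sharing the account.
df = pd.DataFrame({
    "account_id": [101, 101, 102, 103, 103, 103],
    "client_id": [1, 2, 3, 4, 5, 6],
    "balance": [2500.0, 2500.0, 430.0, 90.0, 90.0, 90.0],
})

# With stacked_by=["account_id"], rows sharing an account_id form one
# record: 3 records here rather than 6 independent rows.
print(df.groupby("account_id").size())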

Additionally, this model handles custom fields such as name, location, ID and SSN; a comprehensive set of available handlers is detailed below.

The following data types can be handled by the Tabular+ synthesiser:

  • Integers
  • Floats
  • Datetimes
  • Booleans
  • Categoricals
  • Strings (when provided to a handler)

Training parameters

The table below shows the parameters available during a Tabular+ synthesiser training step:

Parameter Type Required Default Description
input_path string True Docker path for source data csv file
dtypes_path string False null Docker path to JSON file containing source data feature dtypes
epsilon float True Privacy parameter. The smaller the value of epsilon, the higher the degree of privacy in the synthetic data. Epsilon typically lies within the range of 0.0001 < epsilon < 1000
n_bins int True Number of bins that continuous data will be discretised into
max_cat int False 250 Maximum number of categories to preserve in each categorical column.
n_parents int False 3 Number of parent columns each column may be conditioned on during modelling
skew_threshold float False null Skewness value above which a column is considered skewed
split_time boolean False False Controls whether to model date / datetime columns as separate components (year, week, etc.) or as a single value.
custom_handlers list False null List of handlers to use. Each handler is a dictionary.
stacked_by list False null List of columns used to identify a record when unstacking the data
sort_by list False null Only works when stacked_by has been provided. List of columns with which to sort the stacked data
stack_relative boolean False null Only works when stacked_by has been provided. When set to True, includes and models the relative difference between rows of the same record
read_n_rows int False null Number of rows to sample during training - mostly for quick testing
evaluate boolean False False Run evaluation on synthetic data
evaluation_generate_params dict False null Parameters used for evaluation
disable_presence_disclosure_metric boolean False False Disables computation of the presence disclosure metric.
disable_density_disclosure_metric boolean False False Disables computation of the density disclosure metric.
sample_generate_params dict True Parameters used for sample data
label_columns list False null Columns that predictors will act on to assess utility performance (comma separated if multiple)
predictors list False null Predictors to use for each label column during evaluation (comma separated if multiple)
predictor_max_rows int False null Maximum number of rows to use in the predictive utility metric
optimise_predictors boolean False True When set to True, runs hyper-parameter optimisation when computing the predictive utility metric.
train_test_split boolean False False Perform a train/test split during evaluation
evaluation_exclude_columns list False null Columns to exclude from evaluation (comma separated if multiple)

An example JSON file for training looks as follows:

{
	"action": "train",
	"params": {
		"input_path": "/mount/berka-customers.csv",
		"dtypes_path": "/mount/dtypes.json",
		"epsilon": null,
		"n_bins": 100,
		"max_cat": 250,
		"n_parents": 2,
		"skew_threshold": 10,
		"custom_handlers": [
			{"type": "id", "params": {"target": "id", "id_type": "numerical", "id_settings": {"length": 9}}},
			{"type": "formula", "params": {
				"target": "D",
				"formula": "a+b+c",
				"column_map": {
					"a": "A",
					"b": "B",
					"c": "C"
				}
			}}
		],
		"automatic_handlers": {
			"extractors": [{"type": "determined"}, {"type": "conditioned", "params": {"min_n": 100, "min_frac": 0.01}}],
			"ignore_thresh": 0.1,
			"ignore": []
		},
		"stacked_by": ["id"],
		"sort_by": null,
		"stack_relative": false,
		"evaluate": true,
		"train_test_split": true,
		"label_columns": ["target_col"],
		"predictors": ["decision_tree"],
		"evaluation_exclude_columns": ["id"],
		"model_output": "model.hmf",
		"sample_generate_params": {
			"params": {"num_records": 25},
			"implementation_override": false
		},
		"evaluation_generate_params": {
			"params": {"num_records": 1000},
			"implementation_override": false
		}
	}
}

Note that num_records under evaluation_generate_params is shown as a fixed value here; in practice it should typically match the number of records in the source data (for example, the number of unique id values).
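
The file referenced by dtypes_path maps column names to their dtypes. As a rough sketch of how such a file might be produced, assuming a plain column-name to dtype-string mapping (the exact schema may differ; check the dtypes documentation for your release):

import json

# Hypothetical dtypes mapping -- assumes a plain column -> dtype-string
# schema; the exact format may differ per release.
dtypes = {
    "account_id": "int",
    "balance": "float",
    "created_at": "datetime",
    "status": "category",
}

with open("/mount/dtypes.json", "w") as f:
    json.dump(dtypes, f, indent=2)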

Generation parameters

The table below shows the parameters available when generating with the Tabular+ synthesiser:

Parameter Type Required Default Description
output string True Output path for generated data
model string True Input path for trained model
num_records int True Number of records to generate
development_only boolean False False Enable development mode

An example JSON file for generation looks as follows:

{
	"action": "generate",
	"params": {
		"output": "/mount/synth_data.csv",
		"model": "/mount/model.hmf",
		"num_records": 10000,
		"development_only": false
	}
}

Handlers

Custom Handler: custom_handlers
id

Allows the generation of ID columns. An example config JSON can be found below.

{
	"type": "id",
 	"params": {
		"id_type": "numerical",
		"id_settings": {"length": 8}
	}
}

Interface

  • id_type [string | REQUIRED]: type of ID column; choose from [numerical, uuid, ssn].

  • id_settings [dict]: settings for the specified id_type

    • numerical: integer ids
      • length [integer]: length of integer ids to generate.
    • ssn: social security numbers
      • locales [List[string]]: list of locales to generate social security numbers from - choose from Faker's set of locales
    • uuid: standard UUIDs.
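
As an illustration of what the numerical and uuid settings produce, here is a minimal sketch of fixed-length integer ID and UUID generation (illustrative only, not the handler's actual implementation):

import random
import uuid

def numerical_id(length: int) -> str:
    # Fixed-length integer ID; first digit is non-zero so length is preserved.
    return str(random.randint(10 ** (length - 1), 10 ** length - 1))

print(numerical_id(8))    # e.g. "48209317"
print(str(uuid.uuid4()))  # a standard UUID, as per id_type "uuid"
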
location

Allows the modelling of location information. An example config JSON can be found below.

{
	"type": "location",
	"params": {
		"locale": "en_US",
		"postcode": "postcode",
		"country_cd": "country_cd",
		"state_cd": "state_cd",
		"state_name": "state_name",
		"county": "county",
		"city": "city",
		"street_address": "street_address",
		"num_clusters": 1000,
		"knn": 50
	}
}

Interface

  • locale [string | REQUIRED]: country code for which to generate location information. Select from: [en_US]

  • postcode [string | REQUIRED]: name of the postcode column to handle

  • country_cd [string]: name of the country code column to handle.

  • state_cd [string]: name of the state code column to handle.

  • state_name [string]: name of the state name column to handle.

  • county [string]: name of the county column to handle.

  • city [string]: name of the city column to handle.

  • street_address [string]: name of the street address column to handle.

  • num_clusters [integer]: number of clusters into which to group the publicly available data. NB: the clustering algorithm is only trained on public data, not on the data provided by the pipeline

  • knn [integer]: number of publicly available addresses to compare a point against during clustering.
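
As a rough, hypothetical sketch of the idea behind these two settings: public addresses are first grouped into num_clusters clusters, and points are then compared against their knn nearest public addresses. The snippet below uses scikit-learn purely for illustration; it is not Hazy's implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# Hypothetical public address coordinates (lat, lon) -- stand-ins for the
# publicly available data the handler clusters.
rng = np.random.default_rng(0)
public_coords = rng.uniform([25.0, -125.0], [49.0, -67.0], size=(10_000, 2))

# Group the *public* data into num_clusters clusters (1000 in the example).
kmeans = KMeans(n_clusters=1000, n_init=10).fit(public_coords)

# Compare a point against its knn nearest public addresses (knn=50).
nn = NearestNeighbors(n_neighbors=50).fit(public_coords)
distances, indices = nn.kneighbors([[40.7, -74.0]])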

person

Allows the modelling of first names, last names and full names. An example config JSON can be found below.

{
	"type": "person",
	"params": {
		"first_name": "first_name",
		"last_name": "last_name",
		"title": "title",
		"gender": "gender",
		"full_name": "full_name",
		"locales": ["en_GB", "en_US"]
	}
}

Interface

  • first_name [string]: first name column name.

  • last_name [string]: last name column name.

  • title [string]: title/honorific column name.

  • gender [string]: gender column name.

  • full_name [string]: full name column name.

  • locales [List[string]]: list of locales to generate a person's identity from - choose from Faker's set of locales
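
Since locales come from Faker's locale set, the snippet below shows how to preview what a given pair of locales produces (illustrative only; the handler's internals may differ):

from faker import Faker

# Faker accepts a list of locales and samples across them.
fake = Faker(["en_GB", "en_US"])

print(fake.first_name(), fake.last_name())
print(fake.prefix())  # titles such as "Mr." or "Dr."
print(fake.name())    # full names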

symbol

Allows support for numerical columns with a symbol leading or trailing the numerical value, such as 10% or £250.

{
	"type": "symbol",
 	"params": {
		 "target": "rate",
		 "symbol": "%"
 	}
}

Interface

  • target [string | REQUIRED]: Name of the target column.

  • symbol [string | REQUIRED]: symbol or pattern to strip away.
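
Conceptually, the handler strips the symbol so the column can be modelled numerically and re-attaches it at generation time. A minimal pandas sketch of that round trip, under that assumed behaviour:

import pandas as pd

rates = pd.Series(["3.5%", "12%", "7.25%"])

# Strip the "%" symbol so the values can be modelled numerically...
numeric = rates.str.rstrip("%").astype(float)

# ...then re-attach it to the generated values.
synthetic = numeric.map(lambda v: f"{v}%")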

sample

Allows non-supported columns to be preserved by randomly sampling values from the source dataset.

{
	"type": "sample",
 	"params": {
 		"target": "foo",
		"preserve_dist": True
	}
}

Interface

  • target [string | REQUIRED]: name of the column to sample from.

  • preserve_dist [boolean]: when set to true, preserves the initial column distribution.
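
A numpy sketch of the difference preserve_dist makes, under the assumed semantics described above:

import numpy as np

source = np.array(["a", "a", "a", "b"])
values, counts = np.unique(source, return_counts=True)

# preserve_dist true: sample with the source frequencies (75% "a", 25% "b").
weighted = np.random.choice(values, size=10, p=counts / counts.sum())

# preserve_dist false: sample the distinct values uniformly.
uniform = np.random.choice(values, size=10)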

bounded

Allows a column to be specified as bounded either by other columns or by a fixed value.

{
	"type": "bounded",
	"params": {
		"target": "rent",
		"upper": {"type": "column", "value": "income"},
		"lower": {"type": "static", "value": 500}
	}
}

Interface

  • target [string | REQUIRED] : name of bounded column.

  • upper/lower [dict]: upper / lower bound settings.

    • type [string]: column if bound is another column, static if it is a fixed value
    • value [float|string | REQUIRED]: column name or static value of the upper / lower bound.

NB: at least one bound must be specified
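
The config above corresponds, conceptually, to clipping rent between a static floor of 500 and the income column. A pandas sketch of the constraint under those assumed semantics:

import pandas as pd

df = pd.DataFrame({"income": [3000, 1200, 5000], "rent": [3500, 400, 1800]})

# Lower bound is static (500); upper bound is another column ("income").
df["rent"] = df["rent"].clip(lower=500, upper=df["income"])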

conditioned

Allows values within a target column to be pre-determined whenever a condition is met.

{
	"type": "conditioned",
 	"params": {
 		"target": "joint_income",
		"condition_map": {
			"status": {
				"single": null,
				"divorced": null
				},
			"emp_title": {"Unemployed": null}
		}
 	}
}

Interface

  • target [string | REQUIRED] : name of the target column.

  • condition_map [dict | REQUIRED]: condition_map following the condition_column -> condition -> value structure.
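
In the example above, joint_income is forced to null whenever status is single or divorced, or emp_title is Unemployed. A pandas sketch of applying such a map, under those assumed semantics:

import pandas as pd

df = pd.DataFrame({
    "status": ["single", "married", "divorced"],
    "emp_title": ["Engineer", "Unemployed", "Nurse"],
    "joint_income": [52000.0, 48000.0, 61000.0],
})

condition_map = {
    "status": {"single": None, "divorced": None},
    "emp_title": {"Unemployed": None},
}

# Wherever a condition column takes a mapped value, overwrite the target.
for col, mapping in condition_map.items():
    for condition_value, target_value in mapping.items():
        df.loc[df[col] == condition_value, "joint_income"] = target_value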

determined

Allows a column to be specified as entirely determined by another column. For example, country is entirely determined by city.

{
	"type": "determined",
	"params": {
		"target": "country",
		"condition_column": "city",
		"condition_map": {
			"Paris": "France",
			"London": "United Kingdom",
			"New York": "United States"
		}
	}
}

Interface

  • target [string | REQUIRED]: name of the target column.

  • condition_column [string | REQUIRED]: name of the condition column.

  • condition_map [dict | REQUIRED]: condition_map following the condition -> value structure.
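
A condition_map of this kind can be checked, or derived, directly from the data. The sketch below (illustrative, not the extractor's implementation) verifies that city fully determines country and builds the mapping:

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "New York"],
    "country": ["France", "United Kingdom", "France", "United States"],
})

# "determined" holds when each condition value maps to exactly one target value.
assert (df.groupby("city")["country"].nunique() == 1).all()

condition_map = df.drop_duplicates().set_index("city")["country"].to_dict()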

formula

Allows a column to be modelled as a function of other columns using a formula.

{
	"type": "formula",
	"params": {
		"target": "remaining_income",
		"formula": "a-(b+c)",
		"column_map": {"a": "income", "b": "groceries", "c": "rent"},
		"condition": "employment_status == 'employed'",
		"model_error": true
	}
}

Interface

  • target [string | REQUIRED]: name of the column to compute using a formula
  • formula [string | REQUIRED]: formula to apply to the column. The available syntax can be found here.
  • column_map [dict | REQUIRED]: mapping between variables defined in the formula and the columns of the provided data.
  • condition [string]: query condition to use when applying the formula to only the subset of the data satisfying the specified condition. Please follow pandas' query syntax to define a condition.
  • model_error [boolean]: when set to True, models potential errors between the provided formula and the encountered data.
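
Conceptually, the handler evaluates the formula over the mapped columns, restricted to rows matching the condition. A pandas sketch of the example config above, under those assumed semantics:

import pandas as pd

df = pd.DataFrame({
    "income": [3000.0, 2500.0],
    "groceries": [300.0, 280.0],
    "rent": [900.0, 1100.0],
    "employment_status": ["employed", "retired"],
})

# The condition string follows pandas' query syntax.
mask = df.index.isin(df.query("employment_status == 'employed'").index)

# a - (b + c) with column_map {"a": "income", "b": "groceries", "c": "rent"}.
df.loc[mask, "remaining_income"] = df["income"] - (df["groceries"] + df["rent"])
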
Automatic Handlers: automatic_handlers

Specifies an ordered list of handler extractors used to automatically detect handlers and add them to the manually specified set.

Extractors

conditioned

Automatically extracts conditioned handlers.

Interface

  • min_n [integer]: minimum number of records for a condition-value mapping to be considered valid

  • min_frac [float]: minimum number of records, as a proportion of the initial data, for a condition-value mapping to be considered valid.

determined

Automatically extracts determined handlers.

ignore & ignore_thresh

To avoid over-detecting handlers, the following two settings can be used:

  • ignore [List[string]]: list of columns to ignore during extraction.
  • ignore_thresh [float]: threshold on cardinality, as a proportion of the data size, above which columns are ignored during handler extraction.
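
For example, with ignore_thresh set to 0.1, any column whose cardinality exceeds 10% of the row count is skipped during extraction. A sketch of that filter, under the assumed semantics (the helper below is hypothetical, not part of the product API):

import pandas as pd

def extractable_columns(df: pd.DataFrame, ignore: list, ignore_thresh: float) -> list:
    # Skip explicitly ignored columns and high-cardinality columns.
    return [
        col for col in df.columns
        if col not in ignore and df[col].nunique() / len(df) <= ignore_thresh
    ]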

Client library

Assuming we have a synthesiser object, the train method can be used to create a model and return a model object:

model = synthesiser.train(
	epsilon=None,
	n_parents=2,
	n_bins=100,
	source_data=source_df,
	dtypes=dtypes_dict,
	model_output="model.hmf",
	stack_col="account_id",
	id_cols=["client_id", "disp_id", "account_id"],
	evaluate=True,
	train_test_split=True,
	label_columns=["age", "gender"],
	predictors=["lgbm", "decision_tree"]
	sample_generate_params={
		"params": {"num_records": 25},
		"implementation_override": False
	},
	evaluation_generate_params={
		"params": {"num_records": 25},
		"implementation_override": True
	},
	evaluation_exclude_columns=["client_id", "disp_id", "account_id"]
)

With the model object we can then use the generate method to produce some synthetic data:

synth_df = model.generate(
	num_records=1000,
	output="synth_data.csv"
)