Conditional tabular

Hazy’s conditional tabular synthesiser is for non-sequential data structured in rows and columns. It supports reporting use cases where a certain set of columns needs to be preserved as-is. A good example is generating a table of financial data for each business unit, where each business unit is a row with some attributes: in this instance, only the financial data needs to be generated, conditioned on the business unit attributes. Like the tabular synthesiser, this synthesiser will not handle rules within a dataset. For example, given a dataset with two features, age and pension eligibility, the synthesiser will not know with absolute certainty that age >= 65 always signals that pension eligibility is True (see the Stacked Tabular synthesiser for datasets with rules). The following data types can be handled by the conditional tabular synthesiser:

  • Integers
  • Floats
  • Datetimes
  • Booleans
  • Categoricals
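
To make the conditioning idea concrete, the toy sketch below (plain pandas and NumPy, not the Hazy API, and with hypothetical column names) shows what it means for condition columns to pass through unchanged while the remaining columns are replaced with modelled values:

import numpy as np
import pandas as pd

# Toy illustration only, not the Hazy API: condition columns are copied
# through verbatim, while the remaining columns are replaced with values
# drawn from a model fitted to the source data (here, a crude per-column
# Gaussian stand-in for the real synthesiser).
source = pd.DataFrame({
	"business_unit": ["retail", "wholesale", "online"],  # condition column
	"region": ["EMEA", "EMEA", "APAC"],                  # condition column
	"revenue": [1.2e6, 3.4e6, 0.8e6],                    # to be synthesised
	"headcount": [120, 340, 45],                         # to be synthesised
})
condition_columns = ["business_unit", "region"]

rng = np.random.default_rng(0)
synthetic = source[condition_columns].copy()  # preserved as-is
for col in source.columns.difference(condition_columns):
	synthetic[col] = rng.normal(source[col].mean(), source[col].std(), len(source))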

Training parameters

The table below shows the parameters for training a conditional tabular synthesiser:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| epsilon | float | Yes | - | Privacy parameter. The smaller the value of epsilon, the higher the degree of privacy in the synthetic data. Epsilon typically lies in the range 0.0001 < epsilon < 1000 |
| n_bins | int | Yes | - | Number of bins that continuous data will be discretised into |
| n_parents | int | No | 3 | Number of parents that each node will have in the network model |
| input_path | string | Yes | - | Docker path for the source data CSV file |
| dtypes_path | string | No | null | Docker path to a JSON file containing the source data feature dtypes |
| condition_columns | list | Yes | - | List of columns to condition on |
| evaluate | boolean | No | false | Run evaluation on the synthetic data |
| evaluation_generate_params | dict | No | null | Parameters used for evaluation, e.g. {"params": {"n_duplicates": 1}, "implementation_override": true}, where n_duplicates determines the number of table versions to be generated for evaluation and implementation_override overrides the n_duplicates parameter and evaluates on the same number of versions as the source dataset |
| sample_generate_params | dict | Yes | - | Parameters used for sample data, e.g. {"params": {"n_duplicates": 1}, "implementation_override": false}, where n_duplicates determines the number of table versions to be generated for the sample data and implementation_override overrides the n_duplicates parameter and generates a sample with the same number of versions as the source dataset |
| evaluation_exclude_columns | list | No | null | Columns to exclude from evaluation (comma separated if multiple) |
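
The dtypes file referenced by dtypes_path describes the dtype of each source column. Its exact schema is not reproduced here; the sketch below assumes a flat mapping from column name to dtype string, with date and district_id taken from the Berka example that follows and balance and n_transactions as hypothetical columns:

import json

# Assumed schema: a flat mapping from column name to dtype string.
# Consult the Hazy documentation for the authoritative format.
dtypes = {
	"date": "datetime",
	"district_id": "category",
	"balance": "float",        # hypothetical column
	"n_transactions": "int",   # hypothetical column
}

with open("/mount/dtypes.json", "w") as f:
	json.dump(dtypes, f, indent=2)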

An example JSON file for training looks as follows:

{
	"action": "train",
	"params": {
		"n_parents": 2,
		"epsilon": null,
		"n_bins": 100,
		"input_path": "/mount/berka-reporting.csv",
		"dtypes_path": "/mount/dtypes.json",
		"condition_columns": ["date", "district_id"],
		"evaluate": true,
		"evaluation_exclude_columns": [],
		"development_only": false,
		"train_test_split": false,
		"model_output": "/mount/model.hmf",
		"sample_generate_params": {
			"params": {
				"n_duplicates": 1
			},
			"implementation_override": false
		},
		"evaluation_generate_params": {
			"params": {
				"n_duplicates": 1
			},
			"implementation_override": true
		}
	}
}

Generation parameters

The table below shows the parameters for generating with the conditional tabular synthesiser:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| output | string | Yes | - | Output path for the generated data |
| model | string | Yes | - | Input path for the trained model |
| n_duplicates | int | No | 1 | Number of copies of the initial table to generate |
| development_only | boolean | No | false | Enable development mode |

An example JSON file for generation looks as follows:

{
	"action": "generate",
	"params": {
		"output": "/mount/synth_data.csv",
		"model": "/mount/model.hmf",
		"n_duplicates": 1,
		"development_only": false
	}
}
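
Once generation has finished, the output is a plain CSV file and can be inspected with standard tooling, for example pandas:

import pandas as pd

# Load the generated output from the path used in the example above.
synth = pd.read_csv("/mount/synth_data.csv")
print(synth.head())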

Client library

Assuming we have a synthesiser object, the train method can be used to create a model and return a model object:

model = synthesiser.train(
	epsilon=None,
	n_parents=2,
	n_bins=100,
	source_data=source_df,
	dtypes=dtypes_dict,
	model_output="model.hmf",
	condition_columns=["date", "district_id"],
	evaluate=True,
	train_test_split=True,
	sample_generate_params={
		"params": {"num_records": 25},
		"implementation_override": False
	},
	evaluation_generate_params={
		"params": {"num_records": 25},
		"implementation_override": True
	},
	evaluation_exclude_columns=[]
)

With the model object we can then use the generate method to produce some synthetic data:

synth_df = model.generate(
	num_records=1000,
	output="synth_data.csv"
)
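
As a quick sanity check (a sketch assuming source_df and synth_df are pandas DataFrames and the model was trained with the condition columns from the example above), you can confirm that every condition-column combination in the synthetic data also appears in the source:

# The conditioned columns are preserved rather than synthesised, so every
# combination in the output should originate from the source data.
condition_columns = ["date", "district_id"]
source_keys = set(map(tuple, source_df[condition_columns].values))
synth_keys = set(map(tuple, synth_df[condition_columns].values))
assert synth_keys <= source_keys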