Python implementation of synthpop

Published on 31 Jan 2020.


Source code and example usage are available in the project repository. Install using pip:

pip install py-synthpop

Reimplementing synthpop in Python

At Hazy, we create smart synthetic data using a range of synthetic data generation models. One of those models is synthpop, a tool for producing synthetic versions of microdata containing confidential information, where the synthetic data is safe to be released to users for exploratory analysis.

synthpop was originally released as an R package. Because Python is the most popular language for machine learning, and because our data scientists prototype their experiments in Python, we decided to reimplement it in Python using scikit-learn, numpy and pandas.

As a result, you can now generate synthetic data with synthpop from Python.


Install the Python reimplementation of synthpop using pip:

pip install py-synthpop


Initialise a synthpop instance:

from synthpop import Synthpop  # import path assumed; see the README

spop = Synthpop()

Fit the model, passing in a pandas.DataFrame:

my_data_frame = ...  # source data as a pandas DataFrame
my_data_types = None  # optionally specify corresponding data types
spop.fit(my_data_frame, dtypes=my_data_types)

Generate a synthetic dataframe:

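num_rows = None  # optionally specify the number of rows to be generated
# generate() is assumed to take the row count as its first argument; check the README
synthetic_data_frame = spop.generate(num_rows)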

More information

The README has more usage examples. See also the original synthpop docs.

The key objective of generating synthetic data is to replace sensitive original values with synthetic ones while causing minimal distortion of the statistical information contained in the dataset. Variables, which can be categorical or continuous, are synthesised one by one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data, using either parametric models or classification and regression trees (CART).
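
The sequential modelling idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the py-synthpop implementation: it handles only two numeric columns, and it swaps the full classification and regression tree model for a depth-one regression tree (a single split). The function names `best_split` and `synthesise` and the example records are hypothetical.

```python
import random


def best_split(xs, ys):
    """Return the threshold on xs that minimises the squared error of ys.

    This is a depth-one regression tree (a "stump"), standing in for a
    full CART model. Returns None when xs has a single distinct value.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    # Splitting at the largest value would leave the right side empty.
    candidates = sorted(set(xs))[:-1]
    best_threshold, best_cost = None, float("inf")
    for threshold in candidates:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


def synthesise(rows, first, second, num_rows, seed=0):
    """Synthesise two numeric columns one after the other."""
    rng = random.Random(seed)
    xs = [row[first] for row in rows]
    ys = [row[second] for row in rows]

    # Step 1: the first variable has no predictors, so resample its
    # observed values.
    syn_xs = [rng.choice(xs) for _ in range(num_rows)]

    # Step 2: fit the stump on the original data, then draw each synthetic
    # value from the observed values that fall in the same leaf.
    threshold = best_split(xs, ys)
    if threshold is None:  # constant predictor: a single leaf
        syn_ys = [rng.choice(ys) for _ in syn_xs]
    else:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        syn_ys = [rng.choice(left if x <= threshold else right) for x in syn_xs]

    return [{first: x, second: y} for x, y in zip(syn_xs, syn_ys)]


# Hypothetical example records.
original = [
    {"age": 23, "income": 21000},
    {"age": 25, "income": 24000},
    {"age": 31, "income": 26000},
    {"age": 47, "income": 52000},
    {"age": 52, "income": 58000},
    {"age": 58, "income": 55000},
]
synthetic = synthesise(original, "age", "income", num_rows=10)
```

Because each synthetic income is drawn from the leaf that matches the synthetic age, the age-income relationship is roughly preserved, while a synthetic row need not coincide with any single original record. A deeper tree would capture finer structure at the cost of reproducing the original records more closely.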

