Python implementation of synthpop

By Georgi Ganev on 31 Jan 2020.

tl;dr

Source code and example usage on github.com/hazy/synthpop. Install using pip:

pip install py-synthpop

Reimplementing synthpop in Python

At Hazy, we create smart synthetic data using a range of synthetic data generation models. One of those models is synthpop, a tool for producing synthetic versions of microdata containing confidential information, where the synthetic data is safe to be released to users for exploratory analysis.

synthpop was originally released as an R package. Because Python is one of the most popular languages for machine learning, and because our data scientists prototype their experiments in it, we decided to reimplement synthpop in Python using scikit-learn, numpy and pandas.

As a result, you can now generate synthetic data with synthpop directly from Python.

Install

Install the Python reimplementation of synthpop using pip:

pip install py-synthpop

Use

Initialise a synthpop instance:

from synthpop import Synthpop  # assumes Synthpop is exposed at the package top level
spop = Synthpop()

Fit the model, passing in the source data as a pandas.DataFrame and, optionally, its data types:

my_data_frame = ...  # source data as a pandas DataFrame
my_data_types = None  # optionally specify the corresponding data types
spop.fit(my_data_frame, dtypes=my_data_types)

Generate a synthetic dataframe:

num_rows = None  # optionally specify the number of rows to generate
synthetic_df = spop.generate(k=num_rows)

More information

The README has more usage examples. See also the original synthpop docs.

The key objective of generating synthetic data is to replace sensitive original values with synthetic ones while causing minimal distortion of the statistical information contained in the dataset. Variables, which can be categorical or continuous, are synthesised one by one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data, using either parametric models or classification and regression trees (CART).
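
To make the sequential idea concrete, here is a minimal sketch of CART-style synthesis written directly against scikit-learn. It is not the py-synthpop implementation: the function name synthesise_sequentially and its parameters are hypothetical, it handles numeric columns only, and it assumes the columns are synthesised in the order they appear in the dataframe. The first column is resampled from its observed values; each later column is modelled with a regression tree on the columns before it, and every synthetic value is drawn from the original values that fall in the same leaf.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def synthesise_sequentially(df, n_rows, min_samples_leaf=5, seed=0):
    """Toy sequential CART synthesis for numeric columns (illustrative only)."""
    rng = np.random.default_rng(seed)
    columns = list(df.columns)
    synth = pd.DataFrame(index=range(n_rows))

    # First variable: there are no predictors yet, so resample observed values.
    first = columns[0]
    synth[first] = rng.choice(df[first].to_numpy(), size=n_rows, replace=True)

    # Each later variable is modelled on the variables synthesised before it.
    for i, col in enumerate(columns[1:], start=1):
        predictors = columns[:i]
        tree = DecisionTreeRegressor(
            min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(df[predictors].to_numpy(), df[col].to_numpy())

        # Leaf reached by every original row and every synthetic row.
        real_leaves = tree.apply(df[predictors].to_numpy())
        synth_leaves = tree.apply(synth[predictors].to_numpy())

        # Draw each synthetic value from the original values in the same leaf,
        # which approximates sampling from the fitted conditional distribution.
        real_values = df[col].to_numpy()
        donors = {
            leaf: real_values[real_leaves == leaf]
            for leaf in np.unique(real_leaves)
        }
        synth[col] = [rng.choice(donors[leaf]) for leaf in synth_leaves]

    return synth


# Example usage on made-up data: y depends strongly on x.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=500)
original = pd.DataFrame(
    {"x": x, "y": 2 * x + data_rng.normal(scale=0.1, size=500)}
)
synthetic = synthesise_sequentially(original, n_rows=300)
```

Sampling from leaf donors rather than predicting the leaf mean is what keeps the spread of each variable; predicting the mean would collapse the conditional distribution to a single value per leaf. A categorical column would need a classification tree instead, which this sketch leaves out.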

