Python implementation of synthpop
tl;dr
Source code and example usage on github.com/hazy/synthpop.
Install using pip:
pip install py-synthpop
Reimplementing synthpop in Python
At Hazy, we create smart synthetic data using a range of synthetic data generation models. One of those models is synthpop, a tool for producing synthetic versions of microdata that contain confidential information, so that the synthetic data can safely be released to users for exploratory analysis.
Synthpop was originally released as an R package. Since Python is the most popular language for machine learning, and our data scientists prototype their experiments in Python, we decided to reimplement it in Python using scikit-learn, numpy and pandas.
As a result, you can now create synthpop data using Python.
Install
Install the Python reimplementation of synthpop using pip:
pip install py-synthpop
Use
Initialise a synthpop instance:
from synthpop import Synthpop  # import path assumed; check the README if it differs
spop = Synthpop()
Fit the model, passing in a pandas.DataFrame:
my_data_frame = ...  # placeholder: your source data as a pandas DataFrame
my_data_types = None  # optionally specify the corresponding data types
spop.fit(my_data_frame, dtypes=my_data_types)
Generate a synthetic dataframe:
num_rows = None  # optionally specify the number of rows to be generated
synth_df = spop.generate(k=num_rows)  # keep the returned synthetic dataframe
More information
The README has more usage examples. See also the original synthpop docs.
The key objective of generating synthetic data is to replace sensitive original values with synthetic ones while causing minimal distortion of the statistical information contained in the dataset. Variables, which can be categorical or continuous, are synthesised one by one using sequential modelling: each variable is modelled conditionally on the variables synthesised before it. Replacements are generated by drawing from conditional distributions fitted to the original data, using either parametric models or classification and regression tree (CART) models.
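To make the sequential idea concrete, here is a minimal sketch of tree-based sequential synthesis. It is not the py-synthpop implementation. It assumes an all-numeric dataframe and uses scikit-learn's DecisionTreeRegressor. The function name synthesise_sequentially, its parameters and the toy age/income data are made up for illustration. The first column is resampled from its observed values. Each later column is fitted with a tree on the preceding columns, and each synthetic row draws its value from the original values that fall in the same leaf.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def synthesise_sequentially(df, n_rows, min_samples_leaf=5, seed=0):
    """Toy sequential tree-based synthesis for an all-numeric DataFrame.

    Hypothetical helper for illustration only; not part of py-synthpop.
    """
    rng = np.random.default_rng(seed)
    columns = list(df.columns)
    synth = pd.DataFrame(index=range(n_rows))

    # The first variable has no predictors: resample its observed values.
    first = columns[0]
    synth[first] = rng.choice(df[first].to_numpy(), size=n_rows, replace=True)

    for i, col in enumerate(columns[1:], start=1):
        predictors = columns[:i]
        x_orig = df[predictors].to_numpy()
        y_orig = df[col].to_numpy()

        # Fit a regression tree to the original data.
        tree = DecisionTreeRegressor(
            min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(x_orig, y_orig)

        # Find the leaf each original row and each synthetic row lands in.
        orig_leaves = tree.apply(x_orig)
        synth_leaves = tree.apply(synth[predictors].to_numpy())

        # Draw every synthetic value from the original values in the same
        # leaf, i.e. from an empirical conditional distribution.
        donors = {leaf: y_orig[orig_leaves == leaf] for leaf in np.unique(orig_leaves)}
        synth[col] = [rng.choice(donors[leaf]) for leaf in synth_leaves]

    return synth


# Toy source data (made up for this example).
data_rng = np.random.default_rng(42)
age = data_rng.integers(18, 80, size=500)
income = 1000 + 50 * age + data_rng.normal(0, 200, size=500)
original = pd.DataFrame({"age": age, "income": income})

synthetic = synthesise_sequentially(original, n_rows=200)
```

Because each value is drawn from leaf donors, every synthetic value in this sketch also occurs somewhere in the original column, but the combinations of values across columns are new. Categorical variables would need a classification tree and are left out here to keep the example short.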