Synthetic data marketplace

Building on the synthetic data library completed in release 4.0.0, which made it possible to share synthetic datasets with a set of consumers, this release goes a step further and allows data custodians to share differentially private models with consumers. Consumers can view the models available to them and generate their own data for download, scaling up the amount of data or changing the random seed as desired.

Upload data sources

Data sources can now be set to allow user uploads in the Hub (using I/O: "Input + Upload"). Users can then drag and drop data source files into the product, which can make simple installations quicker to get going for users who may not have access to cloud storage accounts or databases. Data source setup remains restricted, however, so general users cannot set up data sources themselves.

RNN sequential model (beta)

A new proprietary RNN-based model has been developed for sequential data. The RNN pipeline is an autoregressive model that generates sequences conditioned on static attributes. See SDK usage.

Known limitations

RNN is not currently differentially private, but we’re looking to add that in the future.

New DP generative models AIM and MST

Two new differentially private generative models have been introduced:

  • MST is an algorithm introduced in 2021 that discretises the data and fits a differentially private graphical model to low-dimensional marginals, which allows for efficient data generation. See SDK usage.
  • AIM (adaptive and iterative mechanism) is an algorithm for differentially private synthetic data. It relies on a graphical model approach to select a workload, defined as a set of queries, to approximate. See SDK usage.
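To make "discretising the data" and "low-dimensional marginals" concrete, here is a minimal sketch of measuring a one-way marginal with Gaussian noise. It illustrates the general idea only and is not the product's implementation or SDK: the function names, the bin edges, and the choice of a zCDP budget `rho` are all assumptions made for this example.

```python
import math
import random


def discretise(values, edges):
    """Map each numeric value to a bin index; `edges` are sorted lower bounds of bins 1..n."""
    # A value's bin index is the number of edges at or below it.
    return [sum(1 for e in edges if v >= e) for v in values]


def noisy_marginal(bins, n_bins, rho, rng):
    """Count records per bin, then add Gaussian noise to every count.

    Adding or removing one record changes a single count by 1, so the
    L2 sensitivity is 1 and sigma = 1 / sqrt(2 * rho) gives rho-zCDP.
    """
    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1
    sigma = 1.0 / math.sqrt(2.0 * rho)
    return [c + rng.gauss(0.0, sigma) for c in counts]


# Hypothetical "age" column split into three bins: <30, 30-49, >=50.
ages = [23, 35, 41, 58, 62, 29]
age_bins = discretise(ages, edges=[30, 50])
measurement = noisy_marginal(age_bins, n_bins=3, rho=0.5, rng=random.Random(0))
```

A graphical-model-based generator would then be fitted to a collection of such noisy measurements rather than to the raw data; that fitting step is beyond this sketch.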

Known limitations

Currently, these models can only be used with a single table. This will be addressed in a future release.


Regex sampler: non-weighted sampling

Previously, when the regex sampler came across a | indicating a logical OR operation, it chose between the two sides of the OR with probability weighted by each side's cardinality, i.e. the number of unique possible samples.

For example, the regex r"([a-f]{1}|X)" samples the values a, b, c, d, e, f and X.

The previous behaviour now corresponds to weighted_sampling=True. In this case we'd sample from the discrete distribution:

a = 1/7
b = 1/7
c = 1/7
d = 1/7
e = 1/7
f = 1/7
X = 1/7

With weighted_sampling=False, each side of the OR is chosen with equal probability, so we sample from the distribution:

a = 1/12
b = 1/12
c = 1/12
d = 1/12
e = 1/12
f = 1/12
X = 1/2

so X is now much more likely to be sampled.
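The two distributions above can be reproduced with a short calculation. This is a sketch of the arithmetic only, not the regex sampler's code: `alternation_distribution` is a hypothetical helper, and each side of the OR is written out by hand as a list of the strings it can produce.

```python
from fractions import Fraction


def alternation_distribution(branches, weighted_sampling):
    """Return the probability of each string produced by an OR of `branches`.

    `branches` is a list of lists; each inner list holds the unique strings
    that one side of the `|` can produce.
    """
    total = sum(len(branch) for branch in branches)
    dist = {}
    for branch in branches:
        if weighted_sampling:
            # Side chosen in proportion to its cardinality.
            branch_prob = Fraction(len(branch), total)
        else:
            # Every side is equally likely, whatever its cardinality.
            branch_prob = Fraction(1, len(branches))
        for s in branch:
            # Within a side, each string is equally likely.
            dist[s] = dist.get(s, Fraction(0)) + branch_prob / len(branch)
    return dist


# The two sides of r"([a-f]{1}|X)".
branches = [list("abcdef"), ["X"]]
weighted = alternation_distribution(branches, weighted_sampling=True)
unweighted = alternation_distribution(branches, weighted_sampling=False)

print(weighted["a"], weighted["X"])      # 1/7 1/7
print(unweighted["a"], unweighted["X"])  # 1/12 1/2
```

Using exact fractions rather than floats makes it easy to check that each distribution sums to 1.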

Improved data source bootstrap

By supplying a data_sources.json file to the bootstrap, an admin can configure the Hub to have access to pre-defined data sources. See the documentation for env var setup and file format.