Data science on test data: an outdated heresy

tl;dr

From a data science point of view you always want to work on the real data. In the real world, that’s not always possible. Working with smart synthetic data is now a viable alternative, particularly for more exploratory data science.

Outlining the heresy

If you ask a data scientist today whether they need real data to do their work, the majority will unwaveringly answer yes, they do. Data science is the science (and art) of extracting insight from data. How on earth can you do that if you don't have the data in the first place?

This is a pretty common-sense orthodoxy. However, it comes with a pretty major drawback: getting the data. Either the data doesn't exist, or the people who control it won't (or aren't allowed to) give it to you.

This isn't a transitory problem. In the days of GDPR, every data set that hits disk is a headache. Getting access to data inside an organisation is frustrating. Sharing it outside is often impossible.

Of course, for software development this is a solved problem. Developers started off working with real data. Cue leaking your users table. So nowadays development uses test data and, as long as the unit tests pass, who's complaining? The temptation is to apply this paradigm to data science. Dev and test use fake data, so why can't we?

The obvious answer -- hence the common-sense orthodoxy -- is that fake data is useless when you're trying to train a model on it. You can't take a DDL schema, run some random number generators and expect anything other than garbage in, garbage out.

So data science on test data is a non-starter, right?

Well, wrong. There's a new type of test data -- smart, statistically representative synthetic data -- that can realistically be used for data science. This post outlines what it is and when it can and can't be used.

The case for real data: credit risk

Let's set the case out for the current orthodoxy. At Hazy we focus on financial services, so I'll take an example from an area I know: credit risk.

Credit risk models help banks decide whether, and at what price, to lend money. It's seriously core business. Basis point improvements in credit risk assessment directly impact the bank's bottom line. As a result, data scientists are fighting for fractions of a percent in performance improvements.

Now, imagine a vendor like Hazy turns up and says, hey, we have magic technology that can redact / anonymise / generate / clone / mimic your data. Any loss or variance in signal from the transformation or cloning process is going to clobber any improvements in modelling that might come from working with it.

This is why real world data science needs real data.

The case for test data: smart, statistically representative synthetic data

Let's set out the counter-argument for the heresy: doing data science on test data. Full disclosure: I'm about to describe what Hazy do. This is our core product, our raison d'être.

The problem with test data is the loss or variance in signal. As said above, you can't take a DDL schema, run some random number generators and expect anything other than garbage in, garbage out.

The thing is, there are more sophisticated ways of generating artificial data, e.g.:

  • Synthpop generates synthetic values by drawing from conditional distributions fitted to the original data, using parametric models or classification and regression trees (CART)
  • TGAN uses generative adversarial networks (GANs) to synthesise tabular data

Rather than generating test data out of thin air and business rules, these smart, sample-based methods create synthetic data that's not only structurally but also statistically equivalent to the source data.
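The details vary between methods, but the sample-based idea can be sketched in a few lines. Below is a minimal, synthpop-flavoured illustration in Python: each column is modelled conditionally on the columns synthesised before it using a decision tree, and synthetic values are drawn from the matching leaves. The DataFrame, the numeric-only columns and the tree settings are illustrative assumptions, not any particular product's pipeline.

```python
# Minimal sketch of sample-based synthesis (synthpop-flavoured, heavily simplified).
# Assumes a pandas DataFrame `real` with numeric columns; the column order and the
# DecisionTreeRegressor settings are illustrative, not a production pipeline.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def synthesise(real: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    synth = pd.DataFrame(index=range(len(real)))

    # First column: resample from its marginal distribution.
    synth[cols[0]] = rng.choice(real[cols[0]].to_numpy(), size=len(real))

    # Remaining columns: fit each one conditionally on the columns already
    # synthesised, then draw values from the real records in the matching leaf.
    for i, col in enumerate(cols[1:], start=1):
        tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=seed)
        tree.fit(real[cols[:i]], real[col])
        leaf_real = tree.apply(real[cols[:i]])
        leaf_synth = tree.apply(synth[cols[:i]])
        values = []
        for leaf in leaf_synth:
            pool = real[col].to_numpy()[leaf_real == leaf]
            values.append(rng.choice(pool))
        synth[col] = values
    return synth
```

Real generators (GAN-based or otherwise) are far more sophisticated, but the principle is the same: the synthetic values are drawn from models fitted to the source data, not from random number generators and business rules.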

To take a simplified example, consider a single column, such as age in a table of US census data. If you plot frequency against value, you'll see a distribution that looks something like this:

[Figure: Example distribution of age values]

If you were then to generate statistically representative synthetic data, it would be made up of artificial values that also follow roughly the same distribution:

[Figure: Overlay of synthetic and real age values, showing statistical similarity]
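As a rough sketch of how you might reproduce this single-column case in Python: fit the empirical age distribution with a kernel density estimate, sample synthetic ages from it, and overlay the two histograms. The adult.csv file, the column name and the age bounds are assumptions for illustration.

```python
# Fit the marginal age distribution, sample synthetic ages from it, and
# overlay the two histograms. File name, column name and bounds are illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

real_age = pd.read_csv("adult.csv")["age"].to_numpy()

kde = gaussian_kde(real_age)                 # estimate the marginal distribution
synthetic_age = kde.resample(len(real_age), seed=0).round().clip(17, 90).flatten()

bins = np.arange(15, 95, 5)
plt.hist(real_age, bins=bins, alpha=0.5, label="real")
plt.hist(synthetic_age, bins=bins, alpha=0.5, label="synthetic")
plt.xlabel("age")
plt.ylabel("frequency")
plt.legend()
plt.show()
```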

You can extend this approach from matching the distribution of values in single columns to matching the correlation of values between multiple columns. For example, age tends to be correlated with income, age-income may in turn be correlated with gender and location, and so on (into as many dimensions as you have columns in related tables).

[Figure: Modelling correlations across multiple dimensions]

If you can create synthetic data that maintains all these statistical properties, then by definition you preserve the insight that your model is looking for.
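A quick, hedged way to check this in practice is to compare the pairwise correlation matrices of the real and synthetic tables. The `real` and `synth` DataFrames below are assumed to have matching numeric columns, for example from the earlier sketch.

```python
# Compare pairwise correlations between the real and synthetic frames and report
# the worst drift. Assumes `real` and `synth` share the same numeric columns.
import numpy as np

def max_correlation_drift(real, synth) -> float:
    diff = (real.corr() - synth[real.columns].corr()).abs()
    return float(np.nanmax(diff.to_numpy()))

# e.g. flag the clone for review if max_correlation_drift(real, synth) > 0.05
```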

For example, it may turn out that 45-55 year old middle-income females living in Idaho never default on their loans. This may be the nugget of insight that you need your credit risk model to learn.

If the sample-based synthetic data preserves all of the statistics and correlations, then your model can be trained just as well on the synthetic data as on the real data.
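One common way to test that claim is "train on synthetic, test on real": fit the same model once on real training data and once on synthetic data, then score both against a held-out slice of real data. The feature and target column names below are illustrative, not a real credit-risk schema.

```python
# Train-on-synthetic / test-on-real comparison. `real` and `synth` are assumed
# DataFrames; the feature and target columns are illustrative, not a real schema.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

features = ["age", "income"]        # hypothetical credit-risk features
target = "defaulted"                # hypothetical binary label

real_train, real_test = train_test_split(real, test_size=0.3, random_state=0)

def auc_when_trained_on(train_frame) -> float:
    model = LogisticRegression(max_iter=1000)
    model.fit(train_frame[features], train_frame[target])
    probs = model.predict_proba(real_test[features])[:, 1]
    return roc_auc_score(real_test[target], probs)

print("trained on real:     ", auc_when_trained_on(real_train))
print("trained on synthetic:", auc_when_trained_on(synth))
```

If the two scores are close, the synthetic data has preserved the signal the model needs; if they diverge, it hasn't.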

Realistic limitations of smart synthetic data

We've seen above that real data science needs signal in the data and that smart synthetic data tries to preserve it. The thing is, the case for real data is right: there is some loss of, or variance in, the signal when working with synthetic data.

Even if the loss or variance in statistical equivalence shrinks as the "smartness" of the synthesisation increases, there will always be some loss or variance. If a credit risk model is trying to improve by a basis point -- a hundredth of a percent -- then it's no good if the quality of the data has already dropped by that basis point.

One of the other angles that feeds into this is privacy. At Hazy, for example, in order to make sure that the synthetic data we generate is safe, we train our generative models to be differentially private. In practice, this means introducing a little bit of noise to the statistical properties that we aim to maintain, in order to make sure they're resilient to the removal of any individual records in the source data.

This addition of noise deliberately increases the variance in statistical similarity, which obviously has a knock-on impact for more sensitive modelling techniques.
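To make the trade-off concrete, here is a toy illustration of the standard Laplace mechanism applied to a histogram. It is not Hazy's actual training procedure, just the simplest way to see how the privacy budget (epsilon) trades noise against fidelity.

```python
# Toy Laplace mechanism on a histogram: smaller epsilon means more noise and
# therefore more variance in the statistics the synthetic data is fitted to.
import numpy as np

def dp_histogram(values, bins, epsilon: float, seed: int = 0):
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(values, bins=bins)
    # Adding or removing one record changes each count by at most 1, so a
    # sensitivity of 1 and Laplace noise of scale 1/epsilon gives
    # epsilon-differential privacy for the released counts.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0, None), edges
```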

So, given this, when can you use synthetic data?

Realistic scenarios when you can use synthetic data for data science

Well, the first thing to do is to step back and survey what data science in the real world really involves. Because, in reality, optimising a model to deliver basis point improvement on the final training run is just the tip of the iceberg.

Real world data science involves a series of practical steps, for example:

  • Acquisition
  • Cleansing
  • Normalisation
  • Aggregation
  • Feature engineering
  • Analysis and modelling
  • Experimentation and comparison of different techniques
  • Configuration and parameter tuning
  • Peer review and replication
  • Productionisation

The majority of this activity can be done just as well on smart synthetic data as it can be on the real data.

  1. Synthetic data looks and feels just like the real data, so you can "eyeball" it, drop columns, coerce values and engineer features just as well as with the real data.
  2. Whilst the statistical properties of the synthetic data may vary slightly from the real data, this variance (a) should be relatively minor and (b) is obviously consistent across experiments using the same synthetic data. This means that it's perfectly valid to compare algorithms and models trained on the same synthetic data when experimenting and evaluating techniques.
  3. Safe synthetic data also has the happy property that it (or its generator) can be stored alongside a code experiment, without worrying about GDPR compliance, right to be forgotten, etc. This makes it much more suitable than real data for experiment verifiability and as an explainability ingredient.
  4. Using synthetic data generators for repeatability and verifiability, as opposed to a static data set, also helps verify that the results for a particular model or algorithm are repeatable for different, statistically representative data sets -- i.e. that your model hasn't somehow over-fitted to the properties of a single data set (see the sketch below).
[Figure: Using synthetic data for feature engineering and experimentation]
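To illustrate points 2 and 4, here's a hedged sketch of the repeatability check: draw several synthetic data sets from the same seeded generator and confirm that the comparison between candidate models is stable across draws. It reuses the hypothetical `synthesise` helper and illustrative columns from the earlier sketches.

```python
# Compare two candidate models across several independent synthetic draws.
# `synthesise` and `real` are the hypothetical helpers from the earlier sketches.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(),
}
features, target = ["age", "income"], "defaulted"   # illustrative columns

for seed in range(5):                                # five independent synthetic draws
    synth = synthesise(real, seed=seed)
    scores = {
        name: cross_val_score(model, synth[features], synth[target],
                              cv=5, scoring="roc_auc").mean()
        for name, model in candidates.items()
    }
    print(seed, scores)   # the ranking of the models should not flip between seeds
```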

From a data science point of view, the current orthodoxy is that you always want to work with the real data. However, in the real world, that’s not always possible or desirable for data access and compliance reasons.

Synthetic data has now evolved to a point where it's smart enough that you can genuinely do data science on it. It's not suitable for everything -- you need real data to wring the last basis point improvement out of a credit risk model -- but it is now a viable alternative for the more exploratory parts of data science work.

This isn't the end of the story. Synthetic data has different affordances to real data. You can dynamically generate it. Cold store it. Generate 1000x more of it. These affordances unlock new architectures and workflows that will profoundly change the way organisations work with data.

