About data

Hazy generates sample-based synthetic data, built on a statistical analysis of a source dataset. This source data must be structured numerical and categorical information in tabular / relational form, and can be ingested from a CSV file or from an SQL database.

Hazy has automatic support for single- and multi-column distributions, referential integrity, composite values and aggregated or flattened sequential data, plus bespoke support for more advanced sequential data and conditional generation.

Sampling

Hazy generates sample-based synthetic data. This means that Hazy's synthetic data is not just based on a schema or generated out of thin air: it is based on a statistical analysis of a source dataset.

Ingesting this source data is the start of the Hazy process: the source data is ingested, analysed, and a generator is then trained to learn all of its patterns and correlations. This breaks down into three stages:

  • Data ingestion
  • Data processing / cleaning
  • Generator training

The process starts by ingesting the source data. Data processing then applies all of the transformations required before the data can be used to train Hazy generators: normalisation, binarisation, one-hot encoding, logarithm scaling, and so on. Finally, the generator is trained to learn the complex relations between different columns in the data (distributions, correlations, etc.). This generator is a serialisable object containing compressed representations of the dependencies between the variables. For PrivBayes, the generator consists of a Bayesian network; for Synthpop, it is a collection of decision trees.
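As a rough illustration of the data processing stage, here is a minimal sketch of the kinds of transformations listed above, written with generic pandas / NumPy equivalents. The column names and values are invented, and Hazy's internal pipeline is not exposed in this form.

    import numpy as np
    import pandas as pd

    # Invented example data; Hazy's internal pipeline is not exposed like this.
    df = pd.DataFrame({
        "age": [23, 45, 31, 62],
        "income": [18_000, 52_000, 34_000, 91_000],
        "segment": ["retail", "premium", "retail", "premium"],
    })

    # Normalisation: rescale a numeric column to the [0, 1] range.
    age = df["age"]
    df["age_norm"] = (age - age.min()) / (age.max() - age.min())

    # Logarithm scaling: compress heavy-tailed values such as income.
    df["income_log"] = np.log1p(df["income"])

    # One-hot encoding: expand a categorical column into indicator columns.
    df = pd.get_dummies(df, columns=["segment"])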

Synthetic data is then generated by sampling from the generator and propagating the signal through the network, for example, generating values for a column by sampling from its distribution estimate:

Data points are generated by sampling from the estimated distribution.
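For instance, a categorical column can be synthesised by estimating its distribution from observed frequencies and then sampling from that estimate. The sketch below uses an invented account-type column; it shows the general idea rather than Hazy's actual sampling code.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=0)

    # Invented source column; in practice this comes from the ingested data.
    source = pd.Series(["current", "savings", "current", "current", "isa"])

    # Estimate the column's distribution from observed frequencies...
    dist = source.value_counts(normalize=True)

    # ...then synthesise new values by sampling from that estimate.
    synthetic = rng.choice(dist.index.to_numpy(), size=10, p=dist.to_numpy())
    print(synthetic)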

The synthetic data is guaranteed to preserve the statistical properties of the source data whilst still being safe - a concept that is detailed in the differential privacy section.

Value types

Hazy supports numerical and categorical value types:

  • numerical: integers, floats, decimals and timestamps
  • categorical: types that can be mapped to a set of categories, including strings, enums, locations, dates, etc.

Numerical types

Hazy can handle all common numeric types found in structured data. This includes:

  • integers
  • floating point numbers
  • fixed-precision numbers / decimals
  • timestamps

Categorical types

Hazy can handle explicit categorical types (like enums) and will attempt to map string / varchar types to categories, so that they can be processed statistically.

Explicit categorical types:

  • enums
  • dates
  • postcodes, cities, etc.

Types that can be mapped to categorical:

  • string / varchar

This works by extracting the set of unique values from a string column and treating each distinct value as a category. Note that this technique is only useful for text columns that have a manageable number of distinct values / categories. See the limitation on max cardinality below.
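A minimal sketch of this mapping, using pandas' built-in category type (the values are invented):

    import pandas as pd

    # Invented string column with a manageable number of distinct values.
    col = pd.Series(["gold", "silver", "gold", "bronze", "silver"])

    # Extract the unique values and treat each distinct value as a category.
    categorical = col.astype("category")
    print(categorical.cat.categories.tolist())  # ['bronze', 'gold', 'silver']
    print(categorical.cat.codes.tolist())       # [1, 2, 1, 0, 2]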

Unsupported types

Hazy does not support non-categorical string values.

Hazy does not currently support string columns that contain a large number of distinct values and / or unstructured text. This can affect data sets that, for example, contain a large number of distinct names in a string column.

Where data contains unsupported columns or values, such as integer IDs or text, these can be dropped or passed through at the validation and subsetting stage.

Features

Hazy can model and preserve the distributions in structured numeric and categorical data. In addition, Hazy supports the following features:

Multi-column distributions

Hazy models not only the distributions of single columns but also the relationships between columns.
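One way to picture the difference: sampling each column independently would lose any correlation between them, whereas estimating the joint distribution over several columns preserves it. The sketch below uses invented data and a simple frequency-based joint estimate; it illustrates the idea, not the Bayesian network or decision tree models Hazy actually trains.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=0)

    # Invented data with a correlation between region and product.
    df = pd.DataFrame({
        "region": ["north", "north", "south", "south", "south"],
        "product": ["loan", "loan", "card", "card", "loan"],
    })

    # Estimate the joint distribution over both columns together.
    joint = df.value_counts(normalize=True)

    # Sample (region, product) pairs jointly, preserving their relationship.
    idx = rng.choice(len(joint), size=8, p=joint.to_numpy())
    samples = pd.DataFrame(list(joint.index[idx]), columns=joint.index.names)
    print(samples)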

Pass-through

If a column contains data that Hazy cannot model, such as free (non-categorical) text, then the generator can be configured to pass through the original values from the source data directly into the synthetic twin. This has a number of implications:

  1. Size of generated data. If a column has been marked for pass-through, then the generator is hard-limited to generate the same number of rows as the source data.

  2. Privacy is no longer guaranteed. Including original source data in the synthetic version has obvious privacy implications, including the fact that the pass-through data must be stored in the model file.
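A minimal sketch of what pass-through means in practice, with invented data and placeholder values standing in for the synthesised columns:

    import pandas as pd

    # Invented source table: "notes" is free text that cannot be modelled.
    source = pd.DataFrame({
        "notes": ["called re: mortgage", "complaint logged", "new customer"],
        "balance": [1_200, 310, 4_500],
    })

    synthetic = pd.DataFrame({
        "notes": source["notes"],        # passed through verbatim
        "balance": [1_150, 420, 4_980],  # placeholder for synthesised output
    })

    # Because "notes" is copied row-for-row, the synthetic table is limited
    # to the same number of rows as the source, and it still contains the
    # original (potentially private) text.
    assert len(synthetic) == len(source)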

Referential integrity

Hazy models can cope with data extracted from multiple tables that require some level of referential integrity to be preserved.
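For example, foreign keys in a synthetic child table should only reference primary keys that exist in the synthetic parent table. A minimal sketch with invented tables:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=0)

    # Invented synthetic parent table with primary keys.
    customers = pd.DataFrame({"customer_id": [101, 102, 103]})

    # Child rows draw their foreign keys from the parent's primary keys,
    # so every synthetic transaction points at a valid synthetic customer.
    transactions = pd.DataFrame({
        "txn_id": range(6),
        "customer_id": rng.choice(customers["customer_id"].to_numpy(), size=6),
    })

    # Referential integrity check: no orphaned foreign keys.
    assert transactions["customer_id"].isin(customers["customer_id"]).all()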

Composite values

Hazy synthesisers can be designed to understand and support composite values, i.e. fields whose content is computed from the values of other fields. This also helps with composite IDs, for example fields that contain an ID with a prefix, such as a PAN.
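A small sketch of a composite field built from other fields; the prefixes and account parts are invented:

    import pandas as pd

    # Invented component fields.
    df = pd.DataFrame({
        "issuer_prefix": ["4929", "5181"],
        "account_part": ["183762001", "559104772"],
    })

    # The composite field is computed from the other fields rather than
    # modelled directly, e.g. an ID built from a prefix plus an account part.
    df["pan"] = df["issuer_prefix"] + df["account_part"]
    print(df["pan"].tolist())  # ['4929183762001', '5181559104772']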

Sequential data

Sequential data, also called time series or transactional data, is structured data with correlations between rows as well as columns, for instance bank transactions. Sequential data is challenging to synthesise, as the relationships between rows are often implicit and can have long causation chains (for instance, spending patterns over the Christmas period can be related to spending patterns that occurred in summer). The distributions underlying trends in the data can also be dynamic, which makes them hard to predict and generate. For example, when a life event occurs, like marriage or a change in employment status, the patterns of bank transactions may change considerably.

Hazy has a number of different approaches to modelling and generating sequential data. These include bootstrapping, windowing, temporal Generative Adversarial Networks (GANs), auto-encoders and sequential Synthpop.

Due to these complexities, our sequential data support is currently bespoke and configured manually on a case-by-case, data-set-by-data-set basis.

In addition, sequential data can often be handled with workarounds, including flattening and aggregation.
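For instance, row-per-transaction data can be aggregated into one summary row per customer, which a non-sequential generator can then model. A minimal sketch with invented transactions:

    import pandas as pd

    # Invented row-per-transaction sequential data.
    txns = pd.DataFrame({
        "customer_id": [1, 1, 1, 2, 2],
        "amount": [9.99, 42.00, 3.50, 120.00, 60.00],
    })

    # Aggregation: collapse each customer's sequence into one row of
    # summary statistics, removing the row-to-row correlations.
    flat = txns.groupby("customer_id")["amount"].agg(
        n_txns="count", total="sum", mean_amount="mean"
    ).reset_index()
    print(flat)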

Please contact us if you would like more information and support.

Conditional generation

Hazy also has bespoke support for conditional generation: using our deep learning algorithms to apply statistical control and rebalancing to the output synthetic data set. This can help address the difficulty of maintaining quality for imbalanced data / data sets containing low-frequency signals, for instance in fraud detection and cyber-security.

Please contact us for more information.

Limitations

Max cardinality

The approach of mapping string values to categories works well only when there is a relatively small number of distinct values / categories. Performance will vary depending on the algorithms used.
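A simple pre-flight check along these lines can flag string columns whose cardinality is too high to treat as categorical; the threshold below is invented and would need tuning per algorithm:

    import pandas as pd

    MAX_CARDINALITY = 1_000  # invented threshold; tune per algorithm

    def categorical_candidates(df: pd.DataFrame) -> list[str]:
        """String columns with few enough distinct values to map to
        categories; the rest need dropping or pass-through."""
        return [
            col for col in df.select_dtypes(include="object").columns
            if df[col].nunique() <= MAX_CARDINALITY
        ]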

Text / non-categorical strings

As discussed above, Hazy is a tool for handling structured numerical and categorical data. It does not currently support unstructured text or non-categorical string values.

We do have tooling for converting text into structured entities, and there are ongoing efforts to extend our support for unstructured text. If this is important to you, please let us know.

Imbalanced data / outlier detection

In order to make privacy guarantees, Hazy applies "noise" to the data, calibrated to an ε differential privacy level, and can "cut out" outlier data points when applying disclosure risk thresholds. As a result, imbalanced data sets with low-frequency signals, and downstream modelling and analysis tasks that aim to detect outliers, can suffer from a loss of utility.
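To see why, consider the standard Laplace mechanism for differentially private counts (a generic illustration, not Hazy's implementation): for a count query with sensitivity 1, noise of scale 1/ε is added to each count, and rare categories can be swamped by it.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Invented category counts with one rare class (e.g. fraud cases).
    counts = np.array([9_940.0, 50.0, 10.0])

    # Laplace mechanism: for count queries (sensitivity 1), add noise of
    # scale 1/epsilon. Smaller epsilon = stronger privacy = more noise.
    epsilon = 0.1
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

    # The rare count can change drastically relative to its true value,
    # which is why low-frequency signals may lose utility.
    print(np.round(noisy, 1))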

This can, in some cases, be mitigated by algorithm selection and the use of conditional generation to amplify low-frequency signals. If you need help with this, please contact us.

Dynamic distributions

Sequential data with dynamic distributions can be challenging to estimate. We have a number of approaches to address this, but they need to be reviewed on a case-by-case basis.