About data
Hazy generates sample-based synthetic data, derived from a statistical analysis of a source dataset. This source data must be structured numerical and categorical information in tabular / relational form. It can be ingested from a CSV file or from an SQL database.
Hazy has automatic support for single and multi-column distributions, referential integrity, composite values and aggregated or flattened sequential data. Hazy has bespoke support for more advanced sequential data and conditional generation.
Sampling¶
Hazy generates sample-based synthetic data. This means that Hazy's synthetic data is not just based on a schema or generated out of thin air. It's based on the statistical analysis of a source data set.
Ingesting this source data is the start of the Hazy process: the data is analysed and a generator is then trained to learn all of its patterns and correlations. There are three stages:
- Data ingestion
- Data processing / cleaning
- Generator training
The process starts by ingesting the source data. Data processing then deals with all the transformations required before the data is used to train Hazy generators: normalisation, binarisation, one-hot encoding, logarithmic scaling, etc. The generator is trained to learn the complex relationships between the different columns in the data (distributions, correlations, etc.). This generator is a [serialisable object](/docs/outputs/#generatorobjects), containing compressed representations of the dependencies between the variables. For PrivBayes, the generator consists of a Bayesian network; for Synthpop, it is a collection of decision trees.
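As an illustration of the processing stage, the sketch below applies two of the transformations mentioned above, min-max normalisation and one-hot encoding, to made-up columns. Hazy's internal pipeline is not exposed in this form; the functions and values are purely illustrative.

```python
# Illustrative preprocessing sketch; not Hazy's actual pipeline.

def normalise(values):
    """Scale numeric values into the [0, 1] range (min-max normalisation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical numeric and categorical source columns.
balances = [100.0, 250.0, 400.0]
segments = ["retail", "premium", "retail"]

scaled = normalise(balances)   # [0.0, 0.5, 1.0]
encoded = one_hot(segments)    # categories: premium=[1,0], retail=[0,1]
```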
Synthetic data is then generated by sampling from the generator and propagating the signal through the network, for example generating values for a column by sampling from its distribution estimate.
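The sampling step can be illustrated with a toy categorical column: estimate the category frequencies from the source data, then draw synthetic values from that estimate. The column, seed and sample size below are made up, and the real generators model far richer structure than a single marginal distribution.

```python
import random
from collections import Counter

# Hypothetical source column: 70% retail, 30% premium.
source_column = ["retail"] * 7 + ["premium"] * 3

# "Training": estimate category frequencies from the source data.
counts = Counter(source_column)
categories = list(counts)
weights = [counts[c] / len(source_column) for c in categories]

# "Generation": sample new synthetic values from the estimate.
rng = random.Random(42)
synthetic_column = rng.choices(categories, weights=weights, k=1000)
```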
The synthetic data is guaranteed to preserve the statistical properties of the source data whilst still being safe, a concept that is detailed in the differential privacy section.
Value types¶
Hazy supports numerical and categorical value types:
numerical
: integers, floats, decimals and timestamps

categorical
: types that can be mapped to a set of categories, including strings, enums, locations, dates, etc.
Numerical types¶
Hazy can handle all common numeric types found in structured data. This includes:
- integers
- floating point numbers
- fixed precision numerical columns / decimals
- timestamps
Categorical types¶
Hazy can handle explicit categorical types (like enums) and will attempt to map string / varchar types to categories, so that it can process them statistically.
Explicit categorical types:
- enums
- dates
- postcodes, cities, etc.
Types that can be mapped to categorical:
- string / varchar
This works by extracting the set of unique values from a string column and treating each distinct value as a category. Note that this technique is only useful for text columns that have a manageable number of distinct values / categories. See the limitation on max cardinality below.
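A minimal sketch of this mapping is below. The cardinality threshold of 100 is illustrative, not Hazy's actual limit, and the column values are made up.

```python
# Illustrative threshold only; not Hazy's real cardinality limit.
MAX_CARDINALITY = 100

def to_categories(column):
    """Treat each distinct string as a category, if cardinality allows."""
    distinct = set(column)
    if len(distinct) > MAX_CARDINALITY:
        raise ValueError(
            f"{len(distinct)} distinct values exceeds the max cardinality"
        )
    return sorted(distinct)

# A string column with a manageable number of distinct values.
cities = ["London", "Paris", "London", "Berlin"]
categories = to_categories(cities)   # ['Berlin', 'London', 'Paris']
```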
Unsupported types¶
Hazy does not support non-categorical string values.
Hazy does not currently support string columns that contain a large number of distinct values and / or unstructured text. This can affect data sets that, for example, contain a large number of distinct names in a string column.
Where data contains unsupported columns or values, such as integer IDs or text, these can be dropped or passed through at the validation and subsetting stage.
Features¶
Hazy can model and preserve the distributions in structured numeric and categorical data. In addition, Hazy supports the following features:
Multi-column distributions¶
Hazy models not only the distributions of single columns but also the relationships between columns.
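The difference is easiest to see with joint frequencies. The toy sketch below, with made-up columns, estimates a two-column (joint) distribution, which captures co-occurrence patterns that two independent single-column distributions would miss.

```python
from collections import Counter

# Hypothetical columns with a relationship between them.
segment = ["retail", "retail", "premium", "premium"]
region  = ["north",  "north",  "south",   "north"]

# Joint frequencies capture the relationship between the columns...
joint = Counter(zip(segment, region))

# ...which the product of the two marginals alone would miss.
p_joint = {pair: n / len(segment) for pair, n in joint.items()}
# p_joint[("retail", "north")] == 0.5, and ("retail", "south")
# never occurs, so it has no probability mass at all.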
Passthrough¶
If a column contains data that Hazy cannot model, such as free (non-categorical) text, then the generator can be configured to pass through the original values from the source data directly into the synthetic twin. This has a number of implications:

Size of generated data. If a column has been marked for passthrough, then the generator is hard-limited to generating the same number of rows as the source data.

Privacy is no longer guaranteed. Including original source data in the synthetic version has obvious privacy implications, including the fact that the passthrough data must be included in the model file.
Referential integrity¶
Hazy models can cope with data extracted from multiple tables that require some level of referential integrity to be preserved.
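The invariant being preserved can be sketched as follows: every foreign key in a generated child table must reference a primary key that actually exists in the generated parent table. The table and column names below are made up for illustration.

```python
import random

rng = random.Random(0)

# A synthetic "parent" table of customers with primary keys.
customers = [{"customer_id": i} for i in range(1, 6)]
valid_ids = [c["customer_id"] for c in customers]

# A synthetic "child" table whose foreign keys are drawn only from
# the parent's primary keys, preserving referential integrity.
transactions = [
    {"tx_id": t, "customer_id": rng.choice(valid_ids)}
    for t in range(100)
]
```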
Composite values¶
Hazy synthesisers can be designed to understand and support composite values, i.e. fields whose content is computed from the values of other fields. This also helps with composite IDs, for example fields that contain an ID with a prefix, like a PAN number.
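A toy example of a composite-value rule is shown below: a hypothetical PAN-style ID built from a fixed prefix and a generated account number. The format is illustrative, not a real PAN scheme.

```python
# Illustrative composite-ID rule: prefix + zero-padded number.
def make_composite_id(prefix, number, width=10):
    return f"{prefix}{number:0{width}d}"

# Hypothetical generated account numbers.
account_numbers = [42, 777, 123456]
ids = [make_composite_id("PAN-", n) for n in account_numbers]
# e.g. 'PAN-0000000042'
```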
Sequential data¶
Sequential data, also called time series or transactional data, is structured data with correlations between rows as well as columns, for instance bank transactions. Sequential data is challenging to synthesise, as the relationships between rows are often implicit and can have long causation chains (for instance, spending patterns over the Christmas period can be related to spending patterns that occurred in summer). The underlying distributions of trends in the data can also be dynamic, which makes them hard to predict and generate. For example, when a life event occurs, like marriage or a change in employment status, the patterns of bank transactions may change considerably.
Hazy has a number of different approaches to modelling and generating sequential data. These include bootstrapping, windowing, temporal Generative Adversarial Networks (GANs), autoencoders and sequential Synthpop.
Due to these complexities, our sequential data support is currently bespoke and configured manually on a case-by-case, data-set-by-data-set basis.
In addition, sequential data can often be handled with workarounds, including flattening and aggregation.
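For example, the aggregation workaround can collapse per-transaction rows into one row per customer, producing a flat table that a non-sequential model can then be trained on. The columns below are illustrative.

```python
from collections import defaultdict

# Hypothetical sequential (per-transaction) source rows.
transactions = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 5.0},
]

# Aggregate the rows down to one summary row per customer.
totals = defaultdict(lambda: {"count": 0, "total": 0.0})
for tx in transactions:
    agg = totals[tx["customer"]]
    agg["count"] += 1
    agg["total"] += tx["amount"]

# The flattened table a non-sequential generator can learn from.
flattened = [
    {"customer": c, "tx_count": a["count"], "tx_total": a["total"]}
    for c, a in sorted(totals.items())
]
```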
Please contact us if you would like more information and support.
Conditional generation¶
Hazy also has bespoke support for conditional generation: using our deep learning algorithms to apply statistical control and rebalancing to the output synthetic data set. This can help address limitations in maintaining quality for imbalanced data / data sets containing low-frequency signals, for instance in fraud detection and cybersecurity.
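The rebalancing idea can be illustrated with a toy oversampling sketch. This is not Hazy's deep-learning approach, just the effect conditional generation aims for: controlling the fraction of a rare label in the output.

```python
import random

rng = random.Random(1)

# Hypothetical imbalanced label column: ~1% fraud.
source = ["genuine"] * 990 + ["fraud"] * 10

def conditional_sample(data, label, target_fraction, k):
    """Sample k values so that `label` makes up `target_fraction`."""
    positives = [x for x in data if x == label]
    negatives = [x for x in data if x != label]
    n_pos = int(k * target_fraction)
    return (rng.choices(positives, k=n_pos)
            + rng.choices(negatives, k=k - n_pos))

# Generate a 50/50 rebalanced sample of 200 rows.
balanced = conditional_sample(source, "fraud", 0.5, 200)
```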
Please contact us for more information.
Limitations¶
Max cardinality¶
The approach of mapping string values to categories works well when there are a relatively small number of distinct values / categories. Performance will vary depending on the algorithms used.
Text / noncategorical strings¶
As discussed above, Hazy is a tool for handling structured numerical and categorical data. It does not currently support unstructured text or noncategorical string values.
We do have tooling for converting text into structured entities, and there are ongoing efforts to extend our support for unstructured text. If this is important to you, please let us know.
Imbalanced data / outlier detection¶
In order to make privacy guarantees, Hazy applies "noise" to the data, calibrated to the ε differential privacy level, and can "cut out" outlier data points when applying disclosure risk thresholds. As a result, imbalanced data sets with low-frequency signals, and downstream modelling and analysis tasks that aim to detect outliers, can suffer from a loss of utility.
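As a toy illustration of the noise trade-off, the classic Laplace mechanism adds noise with scale sensitivity / ε to a query result: a smaller ε means more noise and stronger privacy, but less accuracy. All values below are made up, and Hazy's implementation differs in detail.

```python
import math
import random

# Toy Laplace mechanism, for illustration only.
def laplace_noise(rng, scale):
    # Inverse-CDF sampling of the Laplace distribution.
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

rng = random.Random(7)
true_count = 42        # a hypothetical query result
epsilon = 1.0          # privacy budget: smaller = more noise
sensitivity = 1.0      # a count changes by at most 1 per individual

noisy_count = true_count + laplace_noise(rng, sensitivity / epsilon)
```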
This can, in some cases, be mitigated by algorithm selection and the use of conditional generation to amplify low-frequency signals. If you need help with this, please contact us.
Dynamic distributions¶
Sequential data with dynamic distributions can be challenging to estimate. We have a number of approaches to address this but they need to be reviewed on a case by case basis.