Use cases

Hazy is enterprise software that runs next to your data, as part of your data platform or data services. It typically sits alongside test data provisioning and data transformation tools, and extends test data provisioning to data science use cases.

For example, a simplified view of your data architecture might look like this, with test data used for dev and test but data science and analytics still requiring live data:

With Hazy, this architecture expands to allow exploratory data science and model training to use synthetic data:

This reduces risk and increases agility by providing a safe synthetic data provisioning solution to unlock data science in less trusted environments, including:

  • on-premise data labs
  • proprietary vendor platforms or services
  • public cloud platforms with their advanced AI/ML ecosystems, tooling and APIs

Below we diagram and describe how Hazy typically fits into enterprise data architecture for common use cases / deployment scenarios.

Scenario 1 -- Pathway to cloud

The key pattern with Hazy is to train on synthetic data and evaluate on real data. This fits well with moving workloads to the cloud: you can train models in the cloud, using all the tools, APIs and resources available there, and then bring the trained model back on-premise to run inference against real data.

In addition, when using Hazy with the cloud, you don't need to transfer any data. Instead, you can just transfer a synthetic data generator.

For example, imagine a typical classification use case, such as fraud detection:

In this case, you have a stream of highly sensitive, live transactions that you want to classify as positive (fraud) or negative (not fraud). With Hazy, you can batch up the new transactions, for example on a daily basis, and keep re-training a synthetic data generator in an on-premise environment so that it learns the properties and patterns of the data.

You can then automatically push the updated generator into a less trusted cloud environment. This generator object is small and safe to transfer (it's differentially private and doesn't contain any actual data). It can be used in the cloud environment to generate synthetic data that is statistically representative of the live data. You can then train your fraud detection model in the cloud on the synthetic data, for example using a cloud ML pipeline or specialist vendor capabilities. When ready, your trained fraud detection model can be transferred back to your live data environment and used to classify transactions.

This shows how Hazy's "train on synthetic data, evaluate on real data" pattern can be used to move data science workloads to the cloud.
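
The pattern can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library and makes several assumptions: the transactions are simulated, `fit_generator` and `sample` are hypothetical stand-ins (simple per-class Gaussians) rather than Hazy's generator, and the fraud model is a basic nearest-centroid classifier. The toy generator carries no privacy guarantee.

```python
import random
import statistics

FEATURES = (0, 1)  # column indexes of amount and hour


def make_transactions(n, seed):
    """Toy stand-in for live data: rows of (amount, hour, is_fraud)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.3:  # fraud: large amounts, early hours
            rows.append((rng.gauss(900.0, 150.0), rng.gauss(3.0, 1.5), 1))
        else:  # legitimate: smaller amounts, daytime
            rows.append((rng.gauss(120.0, 60.0), rng.gauss(14.0, 3.0), 0))
    return rows


def column(rows, i, label=None):
    """Values of column i, optionally restricted to one class label."""
    return [r[i] for r in rows if label is None or r[2] == label]


def fit_generator(rows):
    """Hypothetical toy generator: class weight plus per-class Gaussians."""
    return {
        label: {
            "weight": len(column(rows, 0, label)) / len(rows),
            "mean": [statistics.fmean(column(rows, i, label)) for i in FEATURES],
            "stdev": [statistics.stdev(column(rows, i, label)) for i in FEATURES],
        }
        for label in (0, 1)
    }


def sample(generator, n, seed):
    """Draw n synthetic rows from the fitted parameters."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < generator[1]["weight"] else 0
        p = generator[label]
        amount, hour = (rng.gauss(m, s) for m, s in zip(p["mean"], p["stdev"]))
        rows.append((amount, hour, label))
    return rows


def train_classifier(rows):
    """Nearest-centroid classifier on features scaled by their overall stdev."""
    scale = [statistics.stdev(column(rows, i)) for i in FEATURES]
    centroids = {
        label: [statistics.fmean(column(rows, i, label)) / scale[i] for i in FEATURES]
        for label in (0, 1)
    }
    return scale, centroids


def predict(model, row):
    scale, centroids = model
    point = [row[i] / scale[i] for i in FEATURES]
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(point, centroids[label])),
    )


def accuracy(model, rows):
    return sum(predict(model, r) == r[2] for r in rows) / len(rows)


# Trusted environment: fit the generator on real data.
real_train = make_transactions(2000, seed=1)
real_holdout = make_transactions(1000, seed=2)
generator = fit_generator(real_train)

# Less trusted environment: sample synthetic data and train on it.
synthetic = sample(generator, 2000, seed=3)
model = train_classifier(synthetic)

# Back in the trusted environment: evaluate on real data.
tstr_accuracy = accuracy(model, real_holdout)
baseline_accuracy = accuracy(train_classifier(real_train), real_holdout)
print(f"trained on synthetic: {tstr_accuracy:.3f}")
print(f"trained on real:      {baseline_accuracy:.3f}")
```

Comparing the two scores gives a simple utility check: if the model trained on synthetic data performs close to the one trained on real data, the synthetic data has preserved the signal the model needs.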

Scenario 2 -- Synthetic twin

Hazy allows you to create a synthetic twin of your data platform.

For example, take a simplified view of a typical data platform that uses Teradata to provide a SQL query interface for a data lab environment:

With Hazy installed between your warehouse and your data lab, you can run a SQL query on your Teradata installation, use the result set as the source data to train a generator, and return a synthetic copy of the results:
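
The query-then-synthesise flow can be illustrated with a small sketch. It rests on several assumptions: an in-memory SQLite database stands in for Teradata, the `accounts` table and its values are made up, and `synthesise_query` is a hypothetical helper that fits a deliberately naive per-column model. It treats columns independently, so it does not preserve correlations the way a real generator would.

```python
import random
import sqlite3
import statistics


def synthesise_query(conn, sql, n_rows, seed=0):
    """Run sql, fit a toy per-column model on the result set and
    return (column_names, synthetic_rows) with the same columns."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if not rows:
        raise ValueError("query returned no rows to learn from")

    rng = random.Random(seed)
    samplers = []
    for i in range(len(columns)):
        values = [row[i] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: sample from a fitted Gaussian.
            mean, stdev = statistics.fmean(values), statistics.pstdev(values)
            samplers.append(lambda m=mean, s=stdev: rng.gauss(m, s))
        else:
            # Categorical column: sample categories at their observed frequency.
            samplers.append(lambda vals=values: rng.choice(vals))

    synthetic_rows = [tuple(draw() for draw in samplers) for _ in range(n_rows)]
    return columns, synthetic_rows


# Stand-in warehouse with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (segment TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("retail", 1200.0), ("retail", 800.0), ("business", 15000.0),
     ("business", 22000.0), ("retail", 450.0)],
)

columns, synthetic_rows = synthesise_query(
    conn, "SELECT segment, balance FROM accounts WHERE balance > 0", n_rows=10
)
print(columns)
for row in synthetic_rows[:3]:
    print(row)
```

The caller gets back rows with the same column names as the query, so downstream code written against the real result set can run unchanged on the synthetic one.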

This allows Hazy to be used as a systemic solution to create a synthetic twin of your data platform.

Scenario 3 -- Iterative data science

One of the key principles behind Hazy is the split between the generator training environment and the data science environment:

This allows the core Hazy training software to be installed in a privileged environment with access to the source data, and the Hazy client software to be installed in the data scientist's environment.

This client is an open-source Python library that can be freely installed in any data science project. It allows the data scientist to load and use generator models, apply transformations and calculate metrics, all from within a Python virtualenv or Jupyter Notebook.

Hazy provides safe synthetic data that can be used for data science and analytics workloads.

Because Hazy data is private and synthetic, it can be used freely, eliminating the layers of control blocking data access and innovation and increasing data agility.

Model training

The primary use case for Hazy data is as a drop-in replacement for real data in model training. Rather than preparing, training and running models on real data (with all of the associated governance, security, privacy and compliance requirements), models can be:

  • prepared and trained on synthetic data
  • run against the real data

This enables new workflows that can significantly improve data agility: validating and testing models and algorithms quickly, preparing and training models in the cloud, and harnessing external vendors and APIs. None of these needs to meet the typical governance or security requirements that surround working with real data.

Supported modelling techniques

Hazy data works for most common machine learning models and algorithms, including:

Classification

  • Naive Bayes
  • Logistic Regression
  • K Nearest Neighbours
  • Support Vector Machines (with a variety of kernels)
  • Decision Tree
  • Random Forest
  • Light Gradient Boosted Machine (LGBM)
  • Extreme Gradient Boosting (XGBoost)

Regression

  • Linear Regression (including different regularisations)
  • K Nearest Neighbours
  • Decision Tree
  • Random Forest
  • Light Gradient Boosted Machine (LGBM)
  • Extreme Gradient Boosting (XGBoost)

Limitations

Hazy data may not currently be a good fit for tasks that involve:

Limited support

  • Outlier detection
  • Highly imbalanced data
  • Highly dynamic sequential data
  • Unstructured text (or non-categorical string values)

However, we do have capabilities that address each of these, so if you're unsure about use case support, please get in touch and ask us.

Supervised learning / predictive analytics

We support supervised learning and predictive analytics, with built-in utility assessments for common techniques.

Unsupervised / clustering

We support unsupervised learning, clustering and segmentation. For these use cases, we measure data quality using similarity and query-based utility metrics.
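
To make these two kinds of metric concrete, here is a minimal sketch. It is illustrative only and is not Hazy's implementation: `histogram_similarity` (histogram intersection) and `query_utility` (agreement on the share of rows matching a filter) are hypothetical helpers, and the data is simulated.

```python
import random


def histogram_similarity(real, synthetic, bins=10):
    """Histogram intersection in [0, 1]; higher means more similar distributions."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal

    def normalised_histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return sum(
        min(a, b)
        for a, b in zip(normalised_histogram(real), normalised_histogram(synthetic))
    )


def query_utility(real, synthetic, predicate):
    """1 minus the absolute gap in the share of rows matching the same filter."""
    real_share = sum(1 for v in real if predicate(v)) / len(real)
    synthetic_share = sum(1 for v in synthetic if predicate(v)) / len(synthetic)
    return 1.0 - abs(real_share - synthetic_share)


def over_60(value):
    return value > 60.0


rng = random.Random(42)
real = [rng.gauss(50.0, 10.0) for _ in range(5000)]
good_synthetic = [rng.gauss(50.0, 10.0) for _ in range(5000)]
poor_synthetic = [rng.gauss(70.0, 10.0) for _ in range(5000)]  # shifted on purpose

for name, data in (("good", good_synthetic), ("poor", poor_synthetic)):
    print(
        f"{name}: similarity={histogram_similarity(real, data):.2f} "
        f"query utility={query_utility(real, data, over_60):.2f}"
    )
```

Both scores need no target label, which is why metrics of this kind suit clustering and segmentation, where there is no single predictive task to score against.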

Safe data provisioning

Hazy is a safe data provisioning tool. It can be installed as part of your data infrastructure or service layer and used to provision safe data sets into less trusted environments.

Synthetic twin

Test data is a well-established approach, typically used for development and test environments. Using test data reduces the amount of live, sensitive or personal data in circulation, and with it the attack surface.

Hazy's smart synthetic data extends this paradigm beyond dev and test, into data science, analytics and innovation environments. It allows an enterprise to create a synthetic twin of the data platform underpinning their data science and analytics activity.

For example, a typical data platform may have a variety of input sources feeding a data cleansing and normalisation pipeline, which in turn loads a data warehouse or lake that underpins a query or provisioning interface.

With this setup, you can either split and synthesise the input sources, so that equivalent synthetic data streams into a "synthetic twin" data warehouse, or synthesise the normalised warehouse tables. In either case, the result is a systematic architectural solution to reducing data risk and increasing data agility that can be used across the enterprise.

Pathway to cloud

Enterprises are often blocked from moving data science and analytics workloads to the cloud by data security and governance restrictions.

Hazy generators can be trained on-premises and then copied to the cloud. This allows cloud resources and APIs to be harnessed without any real data leaving the premises.
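
The idea of shipping a generator rather than data can be sketched as follows. This is a toy illustration under stated assumptions: the values are made up, the "generator" is just a fitted mean and standard deviation, and JSON is an arbitrary choice of transfer format. Unlike a differentially private generator, these fitted parameters come with no formal privacy guarantee.

```python
import json
import random
import statistics

# On-premises: fit a toy generator on sensitive values (made-up stand-in data).
sensitive_balances = [1200.5, 830.25, 15000.0, 22010.75, 455.0]
generator = {
    "kind": "gaussian",
    "mean": statistics.fmean(sensitive_balances),
    "stdev": statistics.stdev(sensitive_balances),
}
payload = json.dumps(generator)  # only this string is copied to the cloud

# Cloud: rebuild the generator from the payload and sample from it.
loaded = json.loads(payload)
rng = random.Random(0)
synthetic_balances = [
    rng.gauss(loaded["mean"], loaded["stdev"]) for _ in range(1000)
]
print(f"payload: {len(payload)} bytes, generated {len(synthetic_balances)} values")
```

The payload stays the same size however many source rows were used for fitting, and the cloud side can generate as many synthetic rows as it needs.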

Data labs & exploration

Data science is an exploratory activity based on research and experimentation.

Data scientists and machine learning engineers often don't know whether a particular approach or technique will work before they try it, and whether it works often depends on details of the data that can't be known without access to it.

Even if a modelling technique (such as outlier detection) may perform better when trained on real data, it can be validated and explored on synthetic data. This can help make the case for expediting or approving a data access request and allows the scientist to "fail fast" without waiting on data access.