Distributed architecture at Hazy

Hazy is the leading synthetic data platform, serving industries including financial services, telecommunications and healthcare. With the rise in data regulation and governance, the migration towards cloud applications, and the increasing need to deploy cost-effective synthetic data solutions at scale, Hazy has developed a Kubernetes-based distributed architecture deployment to meet these challenges.

In this article, we will describe how this new deployment unlocks further opportunities for synthetic data pipeline integrations. Specifically, we will cover:

  • A quick recap on how the Hazy platform works
  • Hazy's distributed architecture
  • Benefits to customers

A quick recap on how the Hazy Synthetic Data platform works

Before delving further, here's a quick recap on how Hazy works:

Hazy is a synthetic data platform that allows its users to create synthetic data from a centralised Hub. Users can connect their sensitive source data to the platform and trigger synthetic data pipelines with just a few clicks. If you haven't had a chance yet, try out our free demo platform. The synthetic data pipeline has two key tasks that run on the Hazy platform:

  1. Training: Hazy synthesisers learn the patterns in the sensitive input data and output a privacy-preserving generative model to be used in the following step.
  2. Generation: the generative model is used to produce synthetic data that shares the statistical properties of the underlying input data without divulging any of its confidential content.
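
To make these two steps concrete, the sketch below shows what such a pipeline looks like in outline. Note that the hazy_client module, the HazyHub class and every method and parameter shown are hypothetical names invented purely for illustration; they are not Hazy's actual client API.

```python
# Hypothetical sketch of the two-step pipeline. Module, class and
# method names are invented for illustration, not Hazy's actual API.
from hazy_client import HazyHub  # hypothetical client library

hub = HazyHub(url="https://hub.internal.example.com")

# 1. Training: fit a privacy-preserving generative model on the
#    sensitive source data.
model = hub.train(source="postgresql://db.internal/sensitive")

# 2. Generation: sample synthetic records that share the source data's
#    statistical properties without exposing any real records.
synthetic = hub.generate(model=model, num_records=100_000)
synthetic.save("s3://example-bucket/synthetic/output.csv")
```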

Hazy's user interface allows for easy creation of synthetic data from complex multi-tabular source data. Currently, Hazy is deployed as a single Docker container instance and, because it deals with sensitive source data, is generally installed inside our customers' on-premise or cloud environments.

In Hazy's single container deployment, Training and Generation tasks run as separate Unix processes on the same host machine. Although this deployment has been very successful at creating robust pipelines for several of our large-scale enterprise clients (including Wells Fargo, Nationwide Building Society and Vodafone), there are some factors to consider when running every Hazy task on the same host:

Security

Access credentials to sensitive input locations are shared between the Hub and the in-process Hazy synthesiser tasks. The single container deployment of Hazy remains safe to use, as the number of ways an attacker could compromise the Hub (also known as its attack surface) is significantly limited. However, because credentials are shared between the Hub and the synthesisers that access sensitive source data, the potential severity of a successful attack is considerable. Fortunately, Hazy's distributed architecture greatly reduces this risk, as we will discuss in the next section.

Resource constraints

Depending on the shape and size of the source data, Hazy's Training and Generation tasks can be resource-expensive to run. In the single container deployment, concurrent Training/Generation tasks run as Unix processes that all share the resources of the single host they run on.

As customers need to run more concurrent Hazy tasks, the single container deployment permits only vertical scaling (increasing a single host's resources), since all tasks must run on the same host. Broadly speaking, vertical scaling is less effective than horizontal scaling, because there are limits to the size of hardware available.

These resource constraints limit the number of concurrent tasks the single container deployment of Hazy can run. It is worth noting that this has not been a significant blocker for several of our customers, who run Hazy without requiring concurrent usage, but it is a potential limitation of the single container deployment.

Inelastic deployment

Generally speaking, Hazy's Training and Generation tasks are long-running and triggered on demand, often infrequently. The Hub, on the other hand, is a server: it must always be switched on and ready to accept requests for triggering synthetic data pipelines.

As mentioned above, a consequence of sharing the host's resources between the Hub and the Hazy synthesiser tasks is that there may be long periods during which no Training/Generation task is running. Since the host must nevertheless have enough resources to run these resource-expensive tasks at any time, resources are frequently overallocated.

Note: it is possible to mitigate this issue by running the Hub only temporarily for configuration, exporting the Training/Generation configurations from the Hub, and running the Hazy synthesisers manually with our Python client. As you will see below, this workaround is less user-friendly than our distributed architecture deployment.

Hazy's distributed architecture

Hazy's distributed architecture deployment runs on Kubernetes, an industry-standard container orchestrator, and allows Training/Generation tasks to be run on different hosts. The following illustrates how the distributed architecture works:

[Figure: Hazy Synthetic Data platform distributed architecture]

In a distributed architecture deployment of Hazy, the Hub runs as a Kubernetes Deployment, while Hazy synthesisers run Training/Generation tasks as dispatched Kubernetes Jobs (rather than as Unix processes, as in the single container deployment).

These Kubernetes Jobs run in separate containers and, crucially, on (potentially) different Nodes (machines) for the duration of the Training/Generation computation. Kubernetes can then be configured to scale in the nodes running the Hazy synthesisers once they have no more tasks running.

Note: unlike in the single container deployment, the Training and Generation tasks in the distributed architecture run in separate containers on potentially different hosts, so the sensitive input data and synthetic output data must be stored in cloud object stores or external relational databases.
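
To illustrate the dispatch mechanism, here is a minimal sketch of how a Hub-like service could launch a Training task as a Kubernetes Job using the official Kubernetes Python client. The namespace, image name and arguments are illustrative assumptions, not Hazy's actual configuration.

```python
# Sketch: dispatching a Training task as a Kubernetes Job from inside
# the cluster. Namespace, image and arguments are illustrative.
from kubernetes import client, config

config.load_incluster_config()  # the Hub itself runs inside the cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="hazy-training-job", namespace="hazy"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="synthesiser",
                    image="registry.example.com/hazy/synthesiser:latest",
                    args=["train", "--config", "s3://example-bucket/configs/train.json"],
                )],
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="hazy", body=job)
```

The Job runs to completion in its own container, on whichever node the scheduler selects, and the Hub only needs permission to create Jobs rather than direct access to the data itself.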

Benefits to customers

The distributed architecture deployment has several advantages:

Security: reduced privileged access across the system

Access credentials can be configured so that they are only attached to Kubernetes Jobs running under dedicated Service Accounts that are granted access rights to sensitive input data locations. For example, in AWS, the IAM Roles for Service Accounts (IRSA) mechanism makes it possible to provide access credentials for sensitive input data only to Hazy synthesisers running Training tasks.
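
As a sketch of what this looks like on AWS EKS: the Service Account below is annotated with an IAM role that grants read access to the sensitive input location, so only Jobs running under this Service Account receive those credentials. The names and role ARN are illustrative, not taken from a real deployment.

```python
# Sketch (AWS EKS, IRSA): a dedicated Service Account whose IAM role
# grants access to the sensitive input data. Names/ARN are illustrative.
from kubernetes import client, config

config.load_kube_config()

service_account = client.V1ServiceAccount(
    metadata=client.V1ObjectMeta(
        name="hazy-trainer",
        namespace="hazy",
        annotations={
            # IRSA: bind this Service Account to an IAM role with read
            # access to the sensitive input location.
            "eks.amazonaws.com/role-arn":
                "arn:aws:iam::123456789012:role/hazy-training-reader",
        },
    )
)
client.CoreV1Api().create_namespaced_service_account(
    namespace="hazy", body=service_account
)

# A Training Job then opts into these credentials by setting
# service_account_name="hazy-trainer" in its pod spec; no other
# workload, including the Hub, ever holds them.
```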

Following the Principle of Least Privilege, this reduces the risk of credential compromise, since credentials are no longer shared across services (as they are between the Hub and Hazy synthesisers in the single container deployment). The severity of a credential compromise attack is therefore more limited than in the single container, single host deployment: were the Hub to be compromised, it would not be possible to retrieve access credentials to the sensitive data locations, as these are not known by the Hub.

Multiple security contexts: safely share cross-border models and data

Hazy's Training synthesisers require access to secure environments. As noted in the PCI DSS Cloud Computing Guidelines, network segmentation between different contexts provides the advised security isolation for secure systems. Hazy's distributed architecture makes it easy to segregate the Training task, which directly accesses sensitive data, into its own security context.

Here is a concrete example that illustrates how multiple contexts unlock the potential of synthetic data:

A multinational enterprise has offices in Country A and Country B, which have different regulations governing sensitive personal data. We can therefore treat these countries as different security contexts.

Until now, Country A has been unable to access the sensitive data from Country B due to Country B's jurisdictional regulations. However, this data would be useful to Country A for product development, so Country A would like a synthetic version of Country B's sensitive data.

Through the distributed architecture, it is possible to trigger a Training task from Country A's context into Country B's security context. This Training task outputs the privacy-preserving generative model to a storage location accessible from Country A's security context. Country A can then run a Generation task to create synthetic data modelled on Country B's sensitive source data, unlocking data sharing between the two security contexts.
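
One way to picture this flow, assuming each country's environment is a separate Kubernetes cluster reachable through its own kubeconfig context, is sketched below. The context names, image and storage paths are illustrative assumptions.

```python
# Sketch: orchestrating a cross-context pipeline via per-cluster
# kubeconfig contexts. Names, image and paths are illustrative.
from kubernetes import client, config

def make_job(name: str, args: list) -> client.V1Job:
    # Minimal Job wrapping the synthesiser image (illustrative).
    return client.V1Job(
        metadata=client.V1ObjectMeta(name=name, namespace="hazy"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="synthesiser",
                        image="registry.example.com/hazy/synthesiser:latest",
                        args=args,
                    )],
                )
            )
        ),
    )

def batch_api(kube_context: str) -> client.BatchV1Api:
    # Each security context maps to its own cluster / kubeconfig context.
    return client.BatchV1Api(config.new_client_from_config(context=kube_context))

# 1. Train inside Country B's context, next to the sensitive data; the
#    model is written to storage readable from Country A's context.
batch_api("country-b").create_namespaced_job(
    namespace="hazy",
    body=make_job("train-b", ["train", "--model-out", "s3://shared/models/b.model"]),
)

# 2. Generate inside Country A's context from the exported model;
#    Country B's raw data never leaves its own security context.
batch_api("country-a").create_namespaced_job(
    namespace="hazy",
    body=make_job("generate-a", ["generate", "--model", "s3://shared/models/b.model"]),
)
```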

Although this could be achieved with the standalone single container deployment by importing the generative model, the distributed architecture allows for better auditability and traceability of the jobs triggered in each security context, which is necessary for large companies' governance processes, as well as a more seamless user experience. We will discuss running Hazy in multiple security contexts further in a future blog post, so keep your eyes peeled.
[Figure: Distributed architecture in multiple security contexts]

Elastic scalability: instant flexibility

Through Kubernetes' Taints and Tolerations, Node Selector, and Limits and Requests mechanisms, combined with node autoscaling technologies such as Cluster Autoscaler or AWS's Karpenter, it is possible to restrict specific Kubernetes Jobs to dedicated, autoscaled nodes. For example, the distributed architecture deployment makes it possible to run:

  1. An always-available, low-specification node for the Hub server;
  2. High-specification nodes for Training/Generation jobs, scaled out on demand and scaled back in once the tasks have completed.

Please refer to this page in our documentation for details on this setup.
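
As a rough sketch of the node-pinning mechanics, the pod spec below combines a node selector, a matching toleration and explicit resource requests so that a synthesiser Job can only land on a dedicated high-specification node pool. The label and taint keys and the resource figures are illustrative, not values from our documentation.

```python
# Sketch: pinning a synthesiser Job to a dedicated, autoscaled
# high-spec node pool. Label/taint keys and figures are illustrative.
from kubernetes import client

pod_spec = client.V1PodSpec(
    restart_policy="Never",
    # Only schedule onto nodes labelled for synthesiser workloads...
    node_selector={"example.com/pool": "hazy-synthesiser"},
    # ...and tolerate the taint that keeps other workloads off them.
    tolerations=[client.V1Toleration(
        key="example.com/dedicated",
        operator="Equal",
        value="hazy-synthesiser",
        effect="NoSchedule",
    )],
    containers=[client.V1Container(
        name="synthesiser",
        image="registry.example.com/hazy/synthesiser:latest",
        resources=client.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "32Gi"},  # drives node scale-out
            limits={"cpu": "8", "memory": "32Gi"},
        ),
    )],
)
```

With a node autoscaler in place, a pending Job requesting these resources triggers the provisioning of a matching node, which is removed again once its tasks have drained.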

This capability allows orders of magnitude more tasks to run concurrently than we have seen at any one time at customer sites on the single container platform, as nodes can be scaled out on demand to run these resource-intensive jobs.

Once these jobs are complete, the nodes can be scaled back in, reducing the platform's operational costs and potential overallocation of resources. As an example, deploying the distributed architecture instead of our single container architecture in our SaaS platform has reduced our internal cloud computing costs by 50%.

Kubernetes: high availability at all times

The distributed architecture is installed via a Helm chart and runs on Kubernetes. One of Kubernetes' key advantages is the high availability it provides as standard: it can be configured to manage the lifecycle of any process running across its cluster, restarting failed workloads automatically, which is key to keeping production systems operational and robust.

During the design of the distributed architecture, alternatives to Kubernetes, including AWS Fargate, were considered. However, we chose Kubernetes because it is the industry standard in container orchestration, is agnostic to cloud provider (unlike AWS Fargate, which only runs on AWS) and offers more granular autoscaling configuration than fully managed orchestration tools like Fargate. Furthermore, Kubernetes has become standard in MLOps, with powerful tools such as Seldon, Kubeflow and Pachyderm building on top of it.

This allows Hazy admins to manage the scaling of the different Training/Generation tasks at a more granular level, which, as described above, has significant proven advantages.

Note: it is worth mentioning that the distributed architecture deployment is not intended as a replacement for the single container deployment. Running Kubernetes carries significant operational costs, as well as a significant up-front educational cost. The single container deployment may be more suitable for smaller-scale projects, or where clients do not want to manage a Kubernetes-run application.

To conclude

Hazy's single container deployment provides a simple way to produce synthetic data easily. With Hazy's new distributed architecture deployment, presented in this article, firms can benefit from:

  • Reduced costs through less overallocation of resources, especially relevant given the recent rise in generative model training costs,
  • Increased scalability and collaboration, as several synthetic data pipelines can be triggered concurrently,
  • Higher availability, as the solution runs on Kubernetes with its high availability guarantees,
  • Improved security, by reducing the number of services holding credentials for accessing sensitive source data.

It is worth mentioning that for some customers - those with smaller, less complex use cases - the single container deployment may be sufficient. However, the distributed architecture deployment is a better fit for larger scale deployments of Hazy and may unlock use cases that were previously limited, especially in multi-zonal environments.

As with all software engineering design, the right decision depends on the specific domain, so it is worth assessing which of the single container and distributed architecture deployments best suits your current needs. For advice or more information, get in touch with our technical team.

