The modern enterprise use case for synthetic data remains a high priority for many organisations looking to stay both compliant and competitive in today’s data-centric economy. Hazy provides a best-in-class approach to building scalable synthetic data solutions that keep your most sensitive data safe while unlocking and democratising its value for a wide range of contexts and consumers.
In this article, we’ll outline how our platform is designed from the ground up to meet the many challenges of deploying to enterprise architectures and deliver on the promise of privacy-preserving synthetic data for complex infrastructure topologies—namely, our “multi zone” approach.
Crossing the gap: Enterprise environments and their security asymmetry
Network ingress and egress for production data systems are typically limited to only the most highly credentialed services and operations-focused individuals (for "break glass" scenarios). Enterprise architects often follow best practices and heuristics such as the principle of least privilege to ensure "access is provided only to information and resources that are necessary for its legitimate purpose", maintaining a strong posture for information security and access control. This is driven in part by the increasing regulation and penalties associated with failure to demonstrate robust safeguards against prospective data leaks and breaches. Legislation (e.g. GDPR, HIPAA) and public interest pose both a financial and reputational risk if companies are found to be lacking in these areas.
Coupled with the segmentation of logical and sometimes physical infrastructure for the most confidential of assets, this has the implicit effect of creating an access and security asymmetry between the storage environments of sensitive data-at-rest and the compute environments for processing or analysing this data in the wider organisation. Internal teams may have a compelling business case for access to specific tables or features of production data, but its acquisition and any enabling orchestration process remain intractable due to the (by-design) barriers put in place by their company's own architecture.
Synthetic data can bridge this gap. It makes demonstrably useful yet provably safe data, derived from assets in the secure environment, available in lower-security contexts: vastly increasing the potential audience and consumers without leaking sensitive attributes and properties. You can read more about how our industry-leading privacy-preserving synthetic data algorithms work in our technical white papers, other blog posts, and our published research articles.
However, training a suitable synthetic data model that meets the technical definitions of privacy is only part of the story. Helping an enterprise cross the asymmetry gap calls for a deployment and design strategy which fits the security requirements of each customer and their often multi-faceted architecture.
Multi-zone: A truly enterprise-native synthetic data platform
At Hazy, deploying to the enterprise isn't just an implementation detail. We consider success at the highest organisational scales to be one of our fundamental strengths. Accordingly, our design and engineering teams treat "enterprise-native" as a guiding principle in our daily and long-term decision making across all aspects of the Hazy product.
We’ve thought deeply as a company about the concept of “multi zone”—how Hazy can be deployed for customers who have multiple security contexts they must traverse to build a synthetic data function that works for their architecture and organisation. Simply providing the building blocks to construct pipelines isn’t enough to deliver value if the generated synthetic data were to remain siloed and inaccessible to the relevant internal parties.
In addressing the asymmetry gap, our platform offers a few key enabling features:
Modular, flexible deployment options
No two enterprise ecosystems are alike, and each of our component services was designed with flexibility and configurability in mind from the outset. We support both standalone and distributed architectures, either of which can underpin a multi-zone strategy for network access. This multi-Hub deployment paradigm means that only the trained Hazy Model needs to traverse the gap between production and analytics or development environments, with no loss of functionality from a synthetic data usage perspective.
Many enterprises make extensive use of message-oriented middleware platforms, and Hazy's distributed architecture options utilise brokers and queues to dispatch task and job information between compute environments. This is useful not only for elastically scheduling resources as and when required, but also for passing information securely between different enterprise zones. Message queues can be suitably instrumented for audit purposes, and secured accordingly to meet the standards of the architecture.
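To make the dispatch pattern concrete, here is a minimal sketch of what a job message travelling between zones might look like. The field names, queue stand-in, and envelope structure are illustrative assumptions, not Hazy's actual wire format; a real deployment would publish to a managed broker (e.g. RabbitMQ or a cloud-native queue) rather than an in-memory list.

```python
import json
import uuid
from datetime import datetime, timezone

# Illustrative sketch only: field names and the in-memory "queue" are
# assumptions for demonstration, not Hazy's actual message schema.

def build_job_message(job_type: str, model_name: str, target_zone: str) -> str:
    """Serialise a job descriptor for transport over a message queue."""
    envelope = {
        "job_id": str(uuid.uuid4()),
        "job_type": job_type,          # e.g. "train" or "generate"
        "model_name": model_name,
        "target_zone": target_zone,    # which security zone should run this
        "submitted_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(envelope)

# In-memory stand-in for a broker queue, for illustration only.
queue: list[str] = []
queue.append(build_job_message("train", "customers_v1", "secure-prod"))

msg = json.loads(queue.pop())
print(msg["job_type"], msg["target_zone"])  # train secure-prod
```

Because the envelope is plain JSON carrying only job metadata, it can be logged for audit and inspected at zone boundaries without exposing any source data.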
Vendor-native identity & access management
Hazy is a specialist in synthetic data systems, so we took an early decision to avoid reinventing the wheel when it comes to core functions like identity management and access control. We leverage the IAM functionality of the major cloud vendors out of the box to ensure a deployed Hazy system complies with the highest standards of information security. This includes turnkey support for role-to-service account mapping, identity federation, and vendor-native secrets management across each of our applications. Integration with these vendor services sits atop our core authentication and authorisation stack, provided by the open source identity provider Keycloak. By relying on the best tools for the task, our platform allows enterprises the flexibility to meet the established standards in their pre-existing infrastructure.
Synthetic data federation and separation of concerns
At the core of the process, the Hazy Model File (.hmf) is our serialisation standard for storing trained models. This format comprises all the requisite information for a synthetic dataset of unbounded size to be generated, without the underlying sensitive information from the parent source dataset. Once trained, this model is a truly portable artefact: it can be moved from a secure environment to a less secure one (e.g. a model registry available to an internal analytics team), effectively unlocking the parent dataset for consumption by a less- or non-credentialed audience without the risks associated with using the parent dataset directly. This pattern shares some similarities with federated learning approaches, which decentralise training across multiple local datasets.
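The portable-artefact pattern can be sketched with a deliberately simplified toy example. The real .hmf format and Hazy's training algorithms are far more sophisticated than this; the toy "model" below just captures column-level ranges to illustrate the key property: the serialised artefact carries no source rows, so moving it across a security boundary does not move the sensitive data.

```python
import json
import random

# Toy sketch of the pattern, NOT Hazy's actual model format or algorithm:
# the artefact stores only aggregate statistics, never raw rows.

def train_toy_model(rows: list[dict]) -> dict:
    ages = [r["age"] for r in rows]
    return {"columns": {"age": {"min": min(ages), "max": max(ages)}}}

def generate(model: dict, n: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    lo = model["columns"]["age"]["min"]
    hi = model["columns"]["age"]["max"]
    return [{"age": rng.randint(lo, hi)} for _ in range(n)]

# Secure zone: train on sensitive data, serialise the artefact.
artefact = json.dumps(train_toy_model([{"age": 34}, {"age": 51}, {"age": 29}]))

# Lower-security zone: load the artefact and generate as many rows as needed.
synthetic = generate(json.loads(artefact), n=5)
print(len(synthetic))  # 5
```

The separation of concerns falls out naturally: training requires access to the secure zone, while generation requires only the artefact.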
A trained and released Hazy Model can be queried at a cadence most suitable to your team’s requirements: generating data to a shared repository at regular intervals or entirely on-demand. Our SDK enables more sophisticated pipelines to be constructed wherever data is required, without the friction of ad-hoc privilege escalation and data egress.
Similarly, the Hazy Hub ships with a fully functional API in addition to its rich dashboard interface. This service is responsible for the provisioning and management of synthetic data training and generation jobs, and the API allows for seamless programmatic scheduling of pipelines and tasks. It also provides a secure interface for designing network segmentation configurations that maintain security boundaries in high-trust environments. For example, teams can schedule training jobs remotely using only tightly scoped API credentials for the Hazy Hub, without ever gaining privileged access to the underlying production system. This follows the aforementioned least privilege pattern without any degradation of functionality of the synthetic data platform.
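As a sketch of what programmatic scheduling against such an API could look like, the snippet below builds an authenticated HTTP request using only the Python standard library. The endpoint path, payload fields, and bearer-token header are hypothetical assumptions for illustration, not the documented Hazy Hub API.

```python
import json
import urllib.request

# Hypothetical sketch: endpoint path, payload fields, and auth scheme are
# illustrative assumptions, not the documented Hazy Hub API.

def build_generation_request(hub_url: str, token: str, model_id: str,
                             rows: int) -> urllib.request.Request:
    """Construct an authenticated request to schedule a generation job."""
    payload = json.dumps({"model_id": model_id, "rows": rows}).encode()
    return urllib.request.Request(
        url=f"{hub_url}/api/jobs/generate",      # illustrative endpoint
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # tightly scoped credential
            "Content-Type": "application/json",
        },
    )

req = build_generation_request(
    "https://hub.internal.example", "scoped-token", "customers_v1", 10_000
)
print(req.get_method(), req.full_url)
```

The caller needs only a narrowly scoped token for the Hub itself, never credentials for the production data systems behind it.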
Graduate your production data to synthetic
We believe complexity often can’t be abstracted away, so the best approach is to provide flexible tools that seamlessly integrate with the demands of enterprise architectures and give organisations the power of synthetic data without compromising on their established engineering standards and practices.