Safe data at scale with Snowflake and Hazy
Hazy's native Snowflake connector enables customers to quickly read from and write to their Snowflake data lakes, drastically speeding up the time to generate synthetic data. Machine Learning Engineer Ed dives into this feature and how it is benefiting our customers.
Data lakes in modern data architecture
With growing reliance on data across all industries and rapidly increasing quantities of data gathered every day, effective data storage solutions have become a necessity for many businesses.
Data lakes are typically used by organisations as a central location to store unprocessed data to be later used by downstream applications or data warehouses.
Although data lake technologies such as Snowflake are now a crucial component of modern data architectures, numerous challenges remain around utilising the data stored within them.
Namely, some organisations struggle to fully leverage their data because the source datasets are often highly confidential and subject to internal security and external compliance controls.
With Hazy, organisations can now integrate synthetic data generation directly with their Snowflake databases, providing a more seamless way to create safe, realistic synthetic data that can be shared more easily across the organisation.
Benefits for Snowflake & Hazy customers
Snowflake and Hazy enable you to create safe, representative machine learning data. Hazy's native Snowflake integration enables:
- Security: more secure data transfer.
- Speed: faster data transfer, with reads from and writes to Snowflake in minutes.
- Quality: synthetic versions of your data that can be shared safely while still retaining the quality of the original.
- Scale: fast in-database subsetting built on Snowflake's native query engine, speeding up the synthetic data generation process.
Accelerate your downstream projects
With safe, synthetic datasets created, Snowflake customers can speed up downstream projects including:
- Adoption of AI
- Testing at scale
- Analytics and BI
How it works
As with our other supported database connectors, Microsoft SQL Server and IBM Db2, the user can create a data source on the Hazy Hub and specify Snowflake as the connector type. Once selected, the user provides the relevant connection credentials and indicates whether the Snowflake database is an input holding the real source data to read from, or an output location for the synthetic data generated by Hazy.
With this new data source, users can then proceed through the Hazy Hub, following the standard configuration, training and generation workflow.
At the training stage, source data is read from the specified Snowflake data source and used to train a generative model. Once trained, this model can generate synthetic data, which is then written back into Snowflake if the user has configured it as an output.
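For readers who want to picture the underlying read/write pattern, here is a minimal sketch using the open-source snowflake-connector-python package. This is not Hazy's internal code: the account, credentials and table names are hypothetical placeholders, the generation step is only stubbed out, and in practice the Hazy Hub handles the connection for you once the data source is configured.

```python
# A minimal sketch of reading source data from Snowflake and writing synthetic
# data back, using the open-source snowflake-connector-python package.
# Account, credentials and table names are hypothetical; in the Hazy platform
# this connection is configured through the Hub rather than written by hand.
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account="my_org-my_account",   # hypothetical account identifier
    user="HAZY_SERVICE_USER",
    password="********",
    warehouse="HAZY_WH",
    database="SOURCE_DB",
    schema="PUBLIC",
)

# Read the real source table into a DataFrame for model training.
cursor = conn.cursor()
cursor.execute("SELECT * FROM CUSTOMERS")
source_df = cursor.fetch_pandas_all()

# Placeholder: training the generative model and sampling synthetic rows
# happens inside the Hazy platform. Here we only reuse the source schema.
synthetic_df = source_df.head(0)

# Write the generated table back to Snowflake as the configured output.
write_pandas(conn, synthetic_df, table_name="CUSTOMERS_SYNTHETIC",
             auto_create_table=True)

conn.close()
```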
Finally, after the synthetic data has been written back into Snowflake, it can be used as a safe drop-in replacement for the original source data across a variety of use cases.
For instance, the synthetic data could be:
- used to interface with a Snowflake Native App on the Snowflake Marketplace;
- securely shared with individuals inside and outside your organisation using Snowflake Data Exchange (see the sketch after this list);
- integrated with your Snowflake Dashboard to extract meaningful insight from a safe and representative artificial version of your original data.
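To make the sharing use case a little more concrete, the sketch below shows how a synthetic table could be shared with another Snowflake account using Snowflake's secure data sharing commands, which also underpin Data Exchange listings. It is only an illustration: the share, database, table and consumer account names are hypothetical, and creating shares requires the appropriate privileges in your account.

```python
# A hedged sketch of sharing a synthetic table with another Snowflake account
# via Snowflake secure data sharing. Share, database, schema, table and
# consumer account names are all hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_org-my_account",
    user="DATA_ADMIN",
    password="********",
    role="ACCOUNTADMIN",  # creating shares requires appropriate privileges
)

share_statements = [
    "CREATE SHARE SYNTHETIC_CUSTOMERS_SHARE",
    "GRANT USAGE ON DATABASE SYNTH_DB TO SHARE SYNTHETIC_CUSTOMERS_SHARE",
    "GRANT USAGE ON SCHEMA SYNTH_DB.PUBLIC TO SHARE SYNTHETIC_CUSTOMERS_SHARE",
    "GRANT SELECT ON TABLE SYNTH_DB.PUBLIC.CUSTOMERS_SYNTHETIC "
    "TO SHARE SYNTHETIC_CUSTOMERS_SHARE",
    # The consumer account identifier below is a placeholder.
    "ALTER SHARE SYNTHETIC_CUSTOMERS_SHARE ADD ACCOUNTS = PARTNER_ORG.PARTNER_ACCOUNT",
]

cursor = conn.cursor()
for statement in share_statements:
    cursor.execute(statement)

conn.close()
```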
See the Snowflake native integration in action in the video below.
Database subsetting
For cases where very large datasets are stored in a Snowflake database, Hazy offers a database subsetting feature which simplifies handling this data. Database subsetting reduces large, often unwieldy datasets to a more manageable subset of the original data. Working with a smaller subset of the source data greatly accelerates the training and generation process and saves heavily on compute resources.
In the multi-table setting it is crucial that the relationships between tables are preserved, as discussed in our multi-table synthetic data blog post. While it may appear straightforward to simply select a fraction of records from each table individually, treating the tables as independent is a naïve approach that will inevitably lose valuable cross-table information and potentially break referential integrity constraints.
To avoid these issues, Hazy's database subsetting feature lets the user select a target table to reduce in size and intelligently adjusts the size of all related tables. The feature is highly configurable in the Hazy Hub, allowing the user to subset data based on a percentage of rows, or on a set of specified filters that only preserve rows meeting a certain condition. In addition to Snowflake databases, it is also available for all other Hazy data connectors.
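As a simplified illustration of the idea, and not Hazy's actual subsetting algorithm, the sketch below samples a hypothetical target table in-database with Snowflake's SAMPLE clause and then restricts a related child table to the rows that reference the sampled subset, so that foreign keys in the reduced data still resolve. All table, column and connection names are placeholders.

```python
# A simplified, hypothetical illustration of referential-integrity-preserving
# subsetting inside Snowflake. This is not Hazy's actual algorithm; table,
# column and connection names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_org-my_account",
    user="HAZY_SERVICE_USER",
    password="********",
    warehouse="HAZY_WH",
    database="SOURCE_DB",
    schema="PUBLIC",
)
cursor = conn.cursor()

# 1. Sample roughly 10% of rows from the chosen target table, in-database,
#    using Snowflake's SAMPLE clause.
cursor.execute("""
    CREATE OR REPLACE TEMPORARY TABLE CUSTOMERS_SUBSET AS
    SELECT * FROM CUSTOMERS SAMPLE (10)
""")

# 2. Keep only the child rows whose foreign key points at a sampled customer,
#    so referential integrity is preserved in the subset.
cursor.execute("""
    CREATE OR REPLACE TEMPORARY TABLE ORDERS_SUBSET AS
    SELECT o.*
    FROM ORDERS o
    JOIN CUSTOMERS_SUBSET c ON o.CUSTOMER_ID = c.CUSTOMER_ID
""")

# The *_SUBSET tables can now serve as a smaller, consistent training input.
conn.close()
```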
Check out how the Hazy platform handles large, complex datasets with database subsetting in Snowflake.