Introducing out-of-core sampling: expanding possibilities for large datasets

We are excited to announce a major update to our synthetic data platform: support for out-of-core sampling. This new feature handles datasets on disk or in cloud storage that are larger than the available memory.

With our optimised database subsetting pipeline, you can now leverage the power of Dask to efficiently read and sample down large datasets. This functionality ensures the preservation of referential integrity and crucial insights, even in datasets that were previously too large to be effectively utilised.
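To give a feel for the out-of-core idea, here is a minimal sketch using only the Python standard library. It is not Hazy's pipeline and does not use Dask: it simply streams rows one at a time and keeps each with a fixed probability, so memory use scales with the sample rather than with the full file. The function name `sample_rows` and the generated data are hypothetical.

```python
import csv
import io
import random


def sample_rows(lines, fraction, seed=0):
    """Stream CSV lines and keep each data row with probability `fraction`."""
    rng = random.Random(seed)
    reader = csv.DictReader(lines)
    # Rows are consumed one at a time, so only the kept rows are held in memory.
    return [row for row in reader if rng.random() < fraction]


# Hypothetical stand-in for a large file on disk or in cloud storage.
source = io.StringIO(
    "id,amount\n" + "".join(f"{i},{i * 10}\n" for i in range(1000))
)
sample = sample_rows(source, fraction=0.1)
```

A library such as Dask applies the same principle at scale by splitting the data into partitions and processing them in parallel.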

What does this mean for you?

By downsizing your dataset, you can retain its most critical aspects and progress with a more manageable dataset. This allows you to run on lower-specification hardware and saves valuable time. Hazy has a range of parameters that can be used to tune the similarity/privacy trade-off of the synthetic data. This feature lets you iterate through different parameters and settings more quickly, allowing for faster development of synthetic datasets.

Database subsetting in detail

Preserving the relationships between tables is crucial in multi-table settings, as we highlighted in our previous blog post on multi-table synthetic data. Simply selecting a fraction of records from each table independently can lead to the loss of valuable cross-table information and potentially break referential integrity constraints.

Hazy's database subsetting feature offers a smart solution. Users can select a target table to reduce in size, and the platform intelligently adjusts the size of all related tables. This ensures that referential integrity is maintained and valuable information is not lost. Our database subsetting feature is highly configurable in the Hazy Hub. Users can choose to subset data based on a percentage of rows or apply specific filters to preserve only the rows that meet certain conditions.
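The idea of shrinking a target table and letting related tables follow can be sketched in a few lines of plain Python. This is an illustration under assumed names, not Hazy's implementation: the tables (`customers`, `orders`, `items`), their keys and the function `subset_database` are all hypothetical.

```python
import random

# Hypothetical three-table schema: customers -> orders -> items.
customers = [{"customer_id": i} for i in range(1, 11)]
orders = [{"order_id": 100 + i, "customer_id": 1 + i % 10} for i in range(30)]
items = [{"item_id": 1000 + i, "order_id": 100 + i % 30} for i in range(90)]


def subset_database(customers, orders, items, fraction, seed=0):
    """Sample the target table, then keep only the child rows that reference it."""
    rng = random.Random(seed)
    keep_n = max(1, round(len(customers) * fraction))
    kept_customers = rng.sample(customers, keep_n)

    # Follow foreign keys downwards so no child row points at a dropped parent.
    customer_ids = {c["customer_id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in customer_ids]
    order_ids = {o["order_id"] for o in kept_orders}
    kept_items = [i for i in items if i["order_id"] in order_ids]
    return kept_customers, kept_orders, kept_items


sub_customers, sub_orders, sub_items = subset_database(
    customers, orders, items, fraction=0.3
)
```

Sampling each table independently at 30% would instead leave many orders pointing at customers that no longer exist, which is exactly the broken referential integrity described above.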

In cases where very large-scale datasets are stored in a database, our synthetic data platform offers in-database subsetting. This feature computes a full database subset within the database itself. Only the pre-computed subset is downloaded for training, which reduces data-transfer time and costs. For datasets that are on disk or in cloud storage, Hazy is able to sample the data down in memory before training begins.
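As a rough illustration of in-database subsetting, the sketch below uses Python's built-in sqlite3 module as a stand-in for a production database. The schema, the `region` filter and the `*_subset` table names are assumptions made for the example and do not reflect how Hazy names or builds its subsets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id)
);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, "north" if i % 2 else "south") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, 1 + i % 100) for i in range(1, 501)],
)

# The subset is computed by the database engine itself: filter the target
# table, then keep only the orders that reference a surviving customer.
conn.executescript("""
CREATE TABLE customers_subset AS
    SELECT * FROM customers WHERE region = 'north';
CREATE TABLE orders_subset AS
    SELECT o.* FROM orders AS o
    JOIN customers_subset AS c ON c.customer_id = o.customer_id;
""")

# Only the pre-computed subset is transferred to the client for training.
subset_customers = conn.execute("SELECT * FROM customers_subset").fetchall()
subset_orders = conn.execute("SELECT * FROM orders_subset").fetchall()
```

Because the filtering and joining happen where the data lives, the client only ever receives the rows it will train on.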

To learn more about database subsetting using Dask, please check out our documentation. We are thrilled to bring you this game-changing update that will greatly enhance your experience with our synthetic data platform. Stay tuned for more exciting features and advancements.
