Our customers want to scale up production of synthetic data, and to do so they want to enable business users, as well as data scientists and engineers, to generate it. The team has invested a lot of effort in improving the usability and intuitiveness of our product to democratise the production of synthetic data.
1. Advanced automation
One of our focus areas was the configuration of synthetic data. As the scale of the datasets our customers handle increases, more columns and more tables mean a linear increase in the amount of configuration required.
This made the configuration step a time-consuming and tedious part of the process for users, who had to configure columns one by one. It also increased the risk of manual error, especially for users handling unfamiliar tables and large volumes of data.
To counter this, we have introduced enhanced, automated datatype detection. For frequently used datatypes, we can go deeper and detect format strings (for date/time columns) and more granular, country-specific formats (credit card numbers and IBANs, to name a few). The platform can also analyse the data to determine the relationships between tables: we can detect which columns are primary or foreign keys, and which columns should form a composite key.
This enhanced automation works no matter which data source a customer uses. All of this automation results in a much more streamlined process, less work in the configuration stage, and the ability for non-technical users to synthesise data.
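To give a flavour of how relationship detection can work, here is a minimal sketch of two common heuristics: a column is a primary-key candidate if its values are complete and unique, and a child column is a foreign-key candidate if every value appears in the parent's key column. The function names and sample tables are invented for illustration; they are not Hazy's actual implementation.

```python
# Hypothetical sketch of key-detection heuristics; names and data are invented.

def is_candidate_primary_key(rows, column):
    """A column is a primary-key candidate if every value is present and unique."""
    values = [row[column] for row in rows]
    return all(v is not None for v in values) and len(values) == len(set(values))

def is_candidate_foreign_key(child_rows, child_col, parent_rows, parent_col):
    """A child column is a foreign-key candidate if all of its values
    appear in the parent table's key column."""
    parent_values = {row[parent_col] for row in parent_rows}
    return all(row[child_col] in parent_values for row in child_rows)

customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 3}]

print(is_candidate_primary_key(customers, "id"))                        # True
print(is_candidate_foreign_key(orders, "customer_id", customers, "id"))  # True
```

A production system would combine checks like these with datatype and format analysis before suggesting a schema to the user.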
2. Flexible and elastic cloud support
We have recently developed our infrastructure and cloud integrations. The Hazy platform now runs natively on Kubernetes clusters and is quick and easy to install into customer environments using Helm charts. This offers some key benefits to customers:
- Reduced cost: By integrating Kubernetes with auto scaling functions, our software can spin up appropriate resources on demand, and shut them off when not needed. This greatly reduces costs for customers by reducing unnecessary compute resources.
- Increased stability: Kubernetes also brings benefits to stability and monitoring, allowing us to integrate with our customers’ broad ecosystem of infrastructure management software. This lets operators and administrators of the Hazy platform understand, debug and secure their environments more effectively.
- Simple cost allocation: Compute costs can be allocated to individual business units by running training and generation within their own Kubernetes clusters, all communicating with a central deployment of the Hazy Hub.
- Simplified infrastructure administration: Integrating Hazy into new environments is handled through simple Helm commands.
3. Improved multi-table and sequential capabilities
Across industries, organisations structure their data into interconnected tables, allowing for seamless access, analysis, and manipulation of information. By segmenting information into multiple tables, each with its specific purpose and attributes, businesses can achieve a higher level of granularity and precision in data representation.
Synthesising multi-table data is complex and resource-intensive; however, Hazy’s automated inference of primary and foreign keys, including composite keys, makes this process painless.
Hazy has developed a differentially private, multi-table approach capable of synthesising arbitrary sets of related tables, which caters to most enterprise synthetic data use cases. The software also supports sequential and reference tables. The team has strengthened these capabilities by introducing a new GAN-based model for handling sequential data. The model can now capture seasonality, trends over time, and activities spread across multiple time periods. For our customers in the finance sector, this is useful for analysing fraudulent activities or data breaches.
For a technical deep-dive into this capability, have a read of our paper presented at ICLR 2023 on many-to-many datasets via random graph generation.
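To make the temporal properties concrete, here is a toy illustration (not Hazy's model) of the kinds of structure a sequential model must learn to reproduce: an upward trend plus a repeating seasonal pattern, with some noise. All names and parameters are invented for the example.

```python
import math
import random

# Toy sequential data: linear trend + sinusoidal seasonality + Gaussian noise.
# This illustrates what a sequential model must capture; it is not Hazy's model.
def synthetic_series(n_steps, period=12, trend=0.5, amplitude=10.0, seed=0):
    rng = random.Random(seed)
    return [
        trend * t                                      # long-term trend
        + amplitude * math.sin(2 * math.pi * t / period)  # seasonal cycle
        + rng.gauss(0, 1)                              # noise
        for t in range(n_steps)
    ]

series = synthetic_series(48)  # four full seasonal periods
```

A good sequential generator should produce data whose trend and seasonal profile match the real series, not just its per-step marginal distribution.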
4. Memory optimisation
As customers work with significant increases in data volume, a host of complexities arise that can degrade system performance. To get ahead of the curve, Hazy has rolled out a number of features dedicated to optimising data processing and retrieval times.
Notably, we have been enhancing our model training process to cleverly reuse any work that has been carried out in previous training runs instead of starting from scratch. This feature will be expanded to manage any failed training runs by leveraging work already done in previous runs and building on it after corrective actions have been taken.
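The idea of reusing prior work can be sketched as checkpoint-based resumption: record how far a run got, and start subsequent runs from that point rather than from scratch. The API below is invented for illustration and is not Hazy's actual training code.

```python
import json
import os
import tempfile

# Minimal sketch (invented API) of resuming training from a prior run's
# checkpoint instead of starting from scratch.
def train(total_epochs, checkpoint_path):
    # Resume from the last completed epoch if a checkpoint exists.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["completed_epochs"]
    for epoch in range(start, total_epochs):
        # ... one epoch of model training would happen here ...
        with open(checkpoint_path, "w") as f:
            json.dump({"completed_epochs": epoch + 1}, f)
    return total_epochs - start  # epochs actually run in this invocation

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
fresh_epochs = train(3, path)    # fresh run: trains 3 epochs
resumed_epochs = train(5, path)  # resumed run: trains only the remaining 2
```

The same mechanism covers failed runs: because the checkpoint records the last completed epoch, a retried run after corrective action picks up where the failure occurred.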
Alongside the increase in the amount of data, we have noted that repeat training with different model parameters has been a common occurrence in our customers’ workflows.
To speed up this time-consuming process, we have introduced a data subsetting mechanism that allows users to train on a smaller amount of data, offering a faster feedback loop whilst maintaining a representative view of the full data. With these added features, customers can save up to 25% of their training time, while preserving the referential integrity of the data and therefore its utility.
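One way to subset multi-table data without breaking referential integrity is to sample parent rows first, then keep only the child rows whose foreign keys point at a sampled parent. The sketch below (invented names, not Hazy's implementation) shows the principle on a customers/orders pair.

```python
import random

# Hedged sketch of referentially consistent subsetting: sample parents,
# then filter children by foreign key. Names and data are illustrative only.
def subset(parents, children, fk, pk="id", fraction=0.5, seed=0):
    rng = random.Random(seed)
    sampled_parents = rng.sample(parents, max(1, int(len(parents) * fraction)))
    kept_keys = {p[pk] for p in sampled_parents}
    sampled_children = [c for c in children if c[fk] in kept_keys]
    return sampled_parents, sampled_children

customers = [{"id": i} for i in range(10)]
orders = [{"order_id": j, "customer_id": j % 10} for j in range(30)]

sub_customers, sub_orders = subset(customers, orders, fk="customer_id")
```

Every order in the subset still references a customer in the subset, so joins and key constraints behave exactly as they do on the full data.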
These were some of the product highlights from the first quarter of 2023 at Hazy. Stay tuned to see the next batch of new features later in the summer.