4.0.0

Features

Synthetic data library

This creates a formal boundary between synthetic data producers and consumers within an organisation. Once data has been vetted, it can be published to the synthetic data library, where consumers can inspect its schema and properties before downloading it. Roles have been set up in Keycloak to define users who are only allowed to view synthetic data and have no access to the source data.

Hub redesign

The Hub has been redesigned with a new dashboard as its homepage, giving quick access to different areas of the product. A side panel allows easy navigation through projects and the various stages of the synthetic data workflow, and the pages have been reorganised to simplify the experience.

Model import/export via API

Models can now be exported from the Hub using the API and then uploaded to a different Hub, enabling truly air-gapped deployments.
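
As a rough sketch of the round trip only: the routes, host names and tokens below are hypothetical placeholders, and the actual endpoints and authentication scheme are documented in the Hub API reference.

import requests

# Hypothetical endpoints and tokens; consult the Hub API reference
# for the real routes and auth scheme.
SOURCE_HUB = "https://hub-internal.example.com/api"
TARGET_HUB = "https://hub-airgapped.example.com/api"

# Export the trained model from the source Hub as a binary artifact.
resp = requests.get(
    f"{SOURCE_HUB}/models/my-model-id/export",
    headers={"Authorization": "Bearer <source-token>"},
)
resp.raise_for_status()
with open("my-model.hazy", "wb") as f:
    f.write(resp.content)

# Transfer the file across the air gap, then upload it to the target Hub.
with open("my-model.hazy", "rb") as f:
    resp = requests.post(
        f"{TARGET_HUB}/models/import",
        headers={"Authorization": "Bearer <target-token>"},
        files={"model": f},
    )
resp.raise_for_status()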

Split ID

Split ID allows users to retain a subset of strings when training. An example of its use can be found here.
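
As a purely conceptual illustration, not the product API: splitting a structured ID lets a fixed part be retained verbatim while the variable part is synthesised.

import random

# Conceptual sketch only: retain one part of a string ID and replace
# the rest. The actual Split ID configuration lives in
# hazy_configurator; see the linked example.
def split_id(value: str) -> tuple[str, str]:
    prefix, _, rest = value.partition("-")
    return prefix, rest

original = "CUST-829134"
prefix, _ = split_id(original)                      # "CUST" is retained
synthetic = f"{prefix}-{random.randint(100000, 999999)}"
print(synthetic)                                    # e.g. "CUST-417265"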

Out of core sampling

This update introduces support for out-of-core sampling, offering a solution for handling larger-than-memory datasets. The database subsetting pipeline can now leverage Dask to efficiently read in and sample down large datasets, upholding referential integrity and preserving crucial insights in datasets that were previously too large to be used effectively. By downsizing, these aspects are retained while working with a more manageable dataset. For more information about database subsetting using Dask, please see here.
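
As a rough illustration of the Dask pattern (the paths are placeholders, and the real subsetting pipeline additionally preserves referential integrity across related tables), reading a larger-than-memory table lazily and sampling it down might look like:

import dask.dataframe as dd

# Read the table lazily in partitions rather than into memory at once.
transactions = dd.read_parquet("s3://source-bucket/transactions/")

# Sample 10% of rows; computation stays out of core until written out.
sample = transactions.sample(frac=0.1, random_state=42)

# Write the downsized dataset back out, partition by partition.
sample.to_parquet("s3://working-bucket/transactions-sample/")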

Support for big three public cloud storage providers

There is now full support within the product for the three major cloud storage providers: AWS S3, Google Cloud Storage and Azure Blob Storage. This includes reading source data and writing synthetic data. All internal cloud storage used by the Hub, including models and logs, can now be configured to use any of the three providers. See storage for more information on setup.
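
For illustration only, the three providers are commonly addressed via their fsspec URL schemes; the bucket and account names below are placeholders, and the Hub's own configuration is covered on the storage page.

import pandas as pd

# fsspec-style URLs for the three providers (requires s3fs, gcsfs or
# adlfs respectively; names are placeholders).
df = pd.read_csv("s3://my-bucket/source/customers.csv")    # AWS S3
df = pd.read_csv("gs://my-bucket/source/customers.csv")    # Google Cloud Storage
df = pd.read_csv(
    "abfs://container@account.dfs.core.windows.net/source/customers.csv"
)                                                          # Azure Blob Storage

# Writing synthetic data back out works the same way.
df.to_csv("s3://my-bucket/synthetic/customers.csv", index=False)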

Breaking changes

Hazy Helm refactored

Previously repeated configuration has been moved into the global state to create a single source of truth, and a simpler install via helm install hazy is now supported.

EntityTypes removed

Based on feedback, entity types are no longer supported in their previous format; this includes CombinationType, LocationType and PersonType. There is now a 1:1 mapping between columns and types, whereas previously multiple columns could be represented by a single type. While that made sense for Python developers familiar with object-oriented programming, the configurations looked very different from the database schemas they were representing. This change brings the hazy_configurator package closer to other database schema representations.

Combination type replaced

CombinationType behaviour has been moved into CategoryType.

# Previously
DataSchema(
    dtypes=[
        CombinationType(cols=["col1", "col2"]),
    ],
)

# Now
DataSchema(
    dtypes=[
        CategoryType(col="col1", entity_id=1),
        CategoryType(col="col2", entity_id=1),
    ],
    entities=[CombinationEntity(entity_id=1)],
)

Person type replaced

PersonType behaviour has been moved into a set of sub-types, e.g. NameType and EmailType.

# Previously
DataSchema(
    dtypes=[
        PersonType(
            parts=[
                PersonPart(
                    col="First Name",
                    type=PersonPartType.FIRST_NAME,
                ),
                PersonPart(
                    col="Surname",
                    type=PersonPartType.LAST_NAME,
                ),
                PersonPart(
                    col="Email",
                    type=PersonPartType.EMAIL,
                )
            ],
            locales=[PersonLocales.en_US],
        )
    ],
)

# Now
DataSchema(
    dtypes=[
        NameType(col="First Name", type=NameTypes.FIRST_NAME, entity_id=1),
        NameType(col="Surname", type=NameTypes.LAST_NAME, entity_id=1),
        EmailType(col="Email", entity_id=1),
    ],
    entities=[
        PersonEntity(entity_id=1, locales=[PersonLocales.en_US])
    ],
)

Location type replaced

LocationType behaviour has been moved into a set of sub-types, e.g. DistrictType, CountryType and PostcodeType.

# Previously
DataSchema(
    dtypes=[
        LocationType(
            parts=[
                LocationPart(
                    col="zipcode",
                    type=LocationPartType.POSTCODE,
                ),
                LocationPart(
                    col="street address",
                    type=[LocationPartType.STREET_NUMBER, LocationPartType.STREET],
                    format_string="{street_number} | {street}",
                )
            ],
            locales=[GeoLocales.en_US],
            mismatch=LocationTypeMismatch.RANDOM,
            num_clusters=500,
        )
    ],
)

# Now
DataSchema(
    dtypes=[
        PostcodeType(col="zipcode", entity_id=1),
        CustomAddressType(col="street address", format_string="{street_number} | {street}", entity_id=1),
    ],
    entities=[
        LocationEntity(
            entity_id=1,
            locales=[GeoLocales.en_US],
            mismatch=LocationTypeMismatch.RANDOM,
            num_clusters=500,
        ),
    ],
)