4.0.0
Features¶
Synthetic data library¶
This creates a formal boundary between synthetic data producers and consumers within an organisation. Once data has been vetted, it can be published to the synthetic data library. Data in the library can be inspected, including its schema and properties, before being downloaded by consumers. Roles have been set up in Keycloak to facilitate defining users who are only allowed to view synthetic data and have no access to the source data.
Hub redesign¶
The Hub has been redesigned with a new dashboard as its homepage to allow quick access into different areas of the product. A side-panel allows easy navigation through projects and various stages of the synthetic data workflow. The pages have been reorganised to simplify the experience.
Model import/export via API¶
Models can now be exported from the Hub using the API. They can then be uploaded to a different Hub to allow truly air-gapped deployments.
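The routes themselves aren't shown in these notes, so the following is a purely hypothetical sketch of the round trip (host names, endpoint paths and the bearer-token auth scheme are all assumptions, not the documented Hub API). Building the requests is separated from sending them so the shape of the flow is visible:

```python
import urllib.request


def export_request(hub_url: str, model_id: str, token: str) -> urllib.request.Request:
    """Build a GET request to download a trained model from the source Hub
    (endpoint path is hypothetical)."""
    return urllib.request.Request(
        f"{hub_url}/api/models/{model_id}/export",
        headers={"Authorization": f"Bearer {token}"},
    )


def import_request(hub_url: str, artifact: bytes, token: str) -> urllib.request.Request:
    """Build a POST request to upload an exported model to a second,
    air-gapped Hub (endpoint path is hypothetical)."""
    return urllib.request.Request(
        f"{hub_url}/api/models/import",
        data=artifact,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


# The exported artifact is carried between the two Hubs out of band
# (e.g. on physical media), which is what makes a fully air-gapped
# target deployment possible:
# req = export_request("https://hub-a.example.com", "model-123", "TOKEN")
# with urllib.request.urlopen(req) as resp:
#     artifact = resp.read()
```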
Split ID¶
Allows users to retain a subset of an ID string when training. An example of its use can be found here.
Out of core sampling¶
This update introduces support for out-of-core sampling, offering a solution for handling larger-than-memory datasets. Through our database subsetting pipeline, it is now possible to leverage Dask to efficiently read and sample down large datasets. This upholds referential integrity and preserves crucial insights in datasets that were previously too large to be used effectively. By downsizing, we retain these aspects while working with a more manageable dataset. For more information about database subsetting using Dask, please see here.
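The pipeline itself uses Dask, which is not assumed below; as a minimal, self-contained illustration of the core idea (drawing a uniform sample from data that never fits in memory), here is reservoir sampling over a row stream in plain Python:

```python
import csv
import random
from typing import Iterable, List


def reservoir_sample(rows: Iterable[list], k: int, seed: int = 0) -> List[list]:
    """Keep a uniform random sample of k rows from a stream of unknown
    length, using O(k) memory regardless of how large the input is."""
    rng = random.Random(seed)
    reservoir: List[list] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the reservoir with the first k rows
        else:
            # Replace an existing entry with probability k / (i + 1),
            # which keeps every row seen so far equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Usage: stream a large CSV without ever loading it into memory.
# with open("orders.csv", newline="") as f:
#     sample = reservoir_sample(csv.reader(f), k=10_000)
```

Note that this sketch only samples a single stream of rows uniformly; preserving referential integrity across related tables, as the subsetting pipeline does, additionally requires filtering child tables down to the sampled parent keys.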
Support for big three public cloud storage providers¶
There is now full support within the product for the three major cloud storage providers: AWS S3, Google Cloud Storage and Azure Blob Storage. This includes reading source data and writing synthetic data. All internal cloud storage used by the Hub, including models and logs, can now be configured to use any of the big three providers. See storage for more information on setup.
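As an illustration only (Hazy's actual configuration keys are covered in the storage docs), object paths for the three providers commonly follow these fsspec-style URI schemes; the bucket and container names here are placeholders:

```text
s3://my-bucket/data/source.csv        # AWS S3
gs://my-bucket/data/source.csv        # Google Cloud Storage
abfs://my-container/data/source.csv   # Azure Blob Storage (account configured separately)
```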
Breaking changes¶
Hazy Helm refactored¶
Previously repeated configuration has been moved to the global state to create a single source of truth. A simpler installation via `helm install hazy` is now supported.
EntityTypes removed¶
Based on feedback, entity types are no longer supported in their current format; this includes `CombinationType`, `LocationType` and `PersonType`. There is now a 1:1 mapping between columns and types, whereas previously multiple columns could be represented by a single type. While that made sense for Python developers familiar with object-oriented programming, configurations looked very different from the database schemas they were representing. This change brings the `hazy_configurator` package closer to other database schema representations.
Combination type replaced¶
`CombinationType` behaviour has been moved into `CategoryType`.
```python
# Previously
DataSchema(
    dtypes=[
        CombinationType(cols=["col1", "col2"]),
    ],
)

# Now
DataSchema(
    dtypes=[
        CategoryType(col="col1", entity_id=1),
        CategoryType(col="col2", entity_id=1),
    ],
    entities=[CombinationEntity(entity_id=1)],
)
```
Person type replaced¶
`PersonType` behaviour has been moved into a set of sub-types, e.g. `NameType` and `EmailType`.
```python
# Previously
DataSchema(
    dtypes=[
        PersonType(
            parts=[
                PersonPart(
                    col="First Name",
                    type=PersonPartType.FIRST_NAME,
                ),
                PersonPart(
                    col="Surname",
                    type=PersonPartType.LAST_NAME,
                ),
                PersonPart(
                    col="Email",
                    type=PersonPartType.EMAIL,
                ),
            ],
            locales=[PersonLocales.en_US],
        )
    ],
)

# Now
DataSchema(
    dtypes=[
        NameType(col="First Name", type=NameTypes.FIRST_NAME, entity_id=1),
        NameType(col="Surname", type=NameTypes.LAST_NAME, entity_id=1),
        EmailType(col="Email", entity_id=1),
    ],
    entities=[
        PersonEntity(entity_id=1, locales=[PersonLocales.en_US]),
    ],
)
```
Location type replaced¶
`LocationType` behaviour has been moved into a set of sub-types, e.g. `DistrictType`, `CountryType` and `PostcodeType`.
```python
# Previously
DataSchema(
    dtypes=[
        LocationType(
            parts=[
                LocationPart(
                    col="zipcode",
                    type=LocationPartType.POSTCODE,
                ),
                LocationPart(
                    col="street address",
                    type=[LocationPartType.STREET_NUMBER, LocationPartType.STREET],
                    format_string="{street_number} | {street}",
                ),
            ],
            locales=[GeoLocales.en_US],
            mismatch=LocationTypeMismatch.RANDOM,
            num_clusters=500,
        )
    ],
)

# Now
DataSchema(
    dtypes=[
        PostcodeType(col="zipcode", entity_id=1),
        CustomAddressType(
            col="street address",
            format_string="{street_number} | {street}",
            entity_id=1,
        ),
    ],
    entities=[
        LocationEntity(
            entity_id=1,
            locales=[GeoLocales.en_US],
            mismatch=LocationTypeMismatch.RANDOM,
            num_clusters=500,
        ),
    ],
)
```