Entities

Entities are used to define entity level parameters that are shared across all columns belonging to the entity.

Combination Entity

class hazy_configurator.data_schema.combination_entity.CombinationEntity

Bases: BaseEntity

A combination entity is used to define entity level parameters that are shared across all columns belonging to the entity.

Fields:
field entity_type: Literal['combination'] = 'combination'
field entity_id: int [Required]

The entity ID that this entity represents. This must match the entity IDs provided with hazy data types that belong to this entity.

Constraints:
  • minimum = 0
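
The matching rule for entity IDs can be illustrated with a minimal sketch. It does not use hazy_configurator: `EntitySketch` and `group_columns` are hypothetical stand-ins that only mirror the documented `entity_type` and `entity_id` fields and the `minimum = 0` constraint, and the column names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EntitySketch:
    """Hypothetical stand-in for an entity; not the hazy_configurator class."""

    entity_type: str
    entity_id: int

    def __post_init__(self) -> None:
        # Mirrors the documented constraint: minimum = 0.
        if self.entity_id < 0:
            raise ValueError("entity_id must be >= 0")


def group_columns(
    entities: List[EntitySketch], column_entity_ids: Dict[str, int]
) -> Dict[int, List[str]]:
    """Group column names by the entity ID they reference.

    Raises KeyError if a column points at an entity ID that was never declared.
    """
    groups: Dict[int, List[str]] = {e.entity_id: [] for e in entities}
    for column, entity_id in column_entity_ids.items():
        if entity_id not in groups:
            raise KeyError(f"column {column!r} references unknown entity {entity_id}")
        groups[entity_id].append(column)
    return groups


entities = [EntitySketch("person", 0), EntitySketch("location", 1)]
columns = {"first_name": 0, "last_name": 0, "postcode": 1}
groups = group_columns(entities, columns)
```

In this sketch, columns that carry the same entity ID end up in the same group, which is the sense in which entity-level parameters are shared across them.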

Location Entity

class hazy_configurator.data_schema.location_entity.LocationEntity

Bases: BaseEntity

A location entity is used to define entity level parameters that are shared across all columns belonging to the entity.

Fields:
field entity_type: Literal['location'] = 'location'
field locales: List[GeoLocales] = [<GeoLocales.en_GB: 'en_GB'>, <GeoLocales.en_US: 'en_US'>]

Locales used for generating location components.

field mismatch: LocationTypeMismatch = LocationTypeMismatch.RANDOM

When synthesizing data, the algorithm reproduces the geographic distribution of the source data. To learn that distribution, it groups records in the source data into predetermined clusters. Some records will not match a cluster, either because the postcode is new or because it was mistyped. This setting decides how to handle those mismatched addresses. The options are: “drop”, i.e. ignore the address; “approximate”, i.e. find the closest matching address in the public database; “random”, i.e. pick a random cluster.
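
The three options can be sketched roughly as follows. This is an illustration only, not Hazy's implementation: the `PUBLIC_CLUSTERS` lookup and the `assign_cluster` helper are hypothetical, and the use of string similarity for "approximate" is an assumption, since the source does not say how closeness is measured.

```python
import difflib
import random
from typing import Dict, Optional

# Hypothetical public lookup: postcode -> cluster id (not Hazy's real data).
PUBLIC_CLUSTERS: Dict[str, int] = {"AB1 2CD": 0, "AB1 2CE": 0, "ZZ9 9ZZ": 1}


def assign_cluster(postcode: str, mismatch: str, rng: random.Random) -> Optional[int]:
    """Return a cluster id for a source postcode, or None if the record is dropped."""
    if postcode in PUBLIC_CLUSTERS:
        return PUBLIC_CLUSTERS[postcode]
    if mismatch == "drop":
        # Ignore the mismatched address entirely.
        return None
    if mismatch == "approximate":
        # Assumption: "closest" is approximated here by string similarity.
        closest = difflib.get_close_matches(
            postcode, list(PUBLIC_CLUSTERS), n=1, cutoff=0.0
        )
        return PUBLIC_CLUSTERS[closest[0]]
    if mismatch == "random":
        # Pick any known cluster uniformly at random.
        return rng.choice(sorted(set(PUBLIC_CLUSTERS.values())))
    raise ValueError(f"unknown mismatch option: {mismatch!r}")


rng = random.Random(0)
dropped = assign_cluster("AB1 2CX", "drop", rng)
approximated = assign_cluster("AB1 2CX", "approximate", rng)
randomised = assign_cluster("AB1 2CX", "random", rng)
```

Postcodes that already match the lookup bypass the `mismatch` setting altogether; only unmatched records are affected by it.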

field num_clusters: int = 500

When synthesizing data, the algorithm reproduces the geographic distribution of the source data. It does this by grouping addresses in the source data into clusters and learning the distribution of addresses between the different clusters. The synthesized records reproduce the distribution of addresses between the clusters. When assigning an address to a synthesized record, the address is chosen at random from the publicly available addresses within that cluster. This setting controls the number of clusters into which the addresses within a locale are grouped. Note: the clustering algorithm is trained on public data and not on the data provided to the pipeline.

Constraints:
  • exclusiveMinimum = 0
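
The cluster-based procedure described for `num_clusters` can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual pipeline: the cluster assignments are taken as given, and `PUBLIC_ADDRESSES`, `learn_cluster_weights` and `sample_addresses` are hypothetical names.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical public addresses grouped by cluster id.
PUBLIC_ADDRESSES: Dict[int, List[str]] = {
    0: ["1 High St", "2 High St"],
    1: ["5 Mill Lane"],
}


def learn_cluster_weights(source_clusters: List[int]) -> Dict[int, float]:
    """Learn the share of source addresses that falls in each cluster."""
    counts = Counter(source_clusters)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def sample_addresses(
    weights: Dict[int, float], n: int, rng: random.Random
) -> List[str]:
    """Pick a cluster per record by learned weight, then a random public address in it."""
    clusters = list(weights)
    picked = rng.choices(clusters, weights=[weights[c] for c in clusters], k=n)
    return [rng.choice(PUBLIC_ADDRESSES[c]) for c in picked]


# Three source addresses fell in cluster 0 and one in cluster 1.
weights = learn_cluster_weights([0, 0, 0, 1])
synthetic = sample_addresses(weights, 5, random.Random(0))
```

With more clusters, each cluster covers a smaller area, so the synthetic addresses follow the source geography more closely; with fewer clusters, the match is coarser.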

field territory_modelling: LocationTerritoryModellingType = LocationTerritoryModellingType.ASSET_SAMPLING

How locations of lower specificity than post/zip code, i.e. country, state, district and city, are modelled. ‘combination’ means sample from combinations of the source country/state/district provided. This preserves source distributions and allows locations outside of Hazy’s known locales. ‘asset_sampling’ means sample from Hazy location assets using the provided locales.
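
The difference between the two modes can be sketched as below. This is an illustration only: `ASSETS` is a hypothetical stand-in for Hazy's location assets, `sample_territory` is an invented helper, and the territories are reduced to (country, state, city) tuples for brevity.

```python
import random
from typing import Dict, List, Tuple

Territory = Tuple[str, str, str]  # (country, state, city)

# Hypothetical stand-in for location assets, keyed by locale.
ASSETS: Dict[str, List[Territory]] = {
    "en_GB": [("UK", "England", "Leeds"), ("UK", "Scotland", "Perth")],
    "en_US": [("US", "Ohio", "Dayton")],
}


def sample_territory(
    mode: str,
    source_rows: List[Territory],
    locales: List[str],
    rng: random.Random,
) -> Territory:
    """Sample one territory under the given modelling mode."""
    if mode == "combination":
        # Drawing whole source rows keeps the source combinations and their
        # frequencies, even for places outside the known locales.
        return rng.choice(source_rows)
    if mode == "asset_sampling":
        # Draw from the asset lists of the configured locales only.
        pool = [t for locale in locales for t in ASSETS[locale]]
        return rng.choice(pool)
    raise ValueError(f"unknown mode: {mode!r}")


rng = random.Random(0)
source = [("FR", "Bretagne", "Rennes")]
from_source = sample_territory("combination", source, ["en_GB"], rng)
from_assets = sample_territory("asset_sampling", source, ["en_GB"], rng)
```

Here the ‘combination’ draw can return a French territory that no configured locale covers, whereas the ‘asset_sampling’ draw is limited to the en_GB entries.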

get_custom_configs(entity_dtypes: List[BaseType]) → List[Dict[str, Dict[Literal['pattern'], str]]]
field entity_id: int [Required]

The entity ID that this entity represents. This must match the entity IDs provided with hazy data types that belong to this entity.

Constraints:
  • minimum = 0

Person Entity

class hazy_configurator.data_schema.person_entity.PersonEntity

Bases: BaseEntity

A person entity is used to define entity level parameters that are shared across all columns belonging to the entity.

Fields:
field entity_type: Literal['person'] = 'person'
field locales: List[PersonLocales] = [<PersonLocales.en_GB: 'en_GB'>, <PersonLocales.en_US: 'en_US'>]

Locales used for generating person components.

field entity_id: int [Required]

The entity ID that this entity represents. This must match the entity IDs provided with hazy data types that belong to this entity.

Constraints:
  • minimum = 0