Data Schema

The schema defines the data structure and column features.

Data Schema

class hazy_configurator.data_schema.data_schema.DataSchema

Bases: HazyBaseModel

Describes the data schema.

Tables are defined through the tables field. Any special rules which cannot be defined by the table data types should be defined using custom_handlers.

Fields:
field custom_handlers: List[CustomHandlerConfig] = []

Extra processing steps to apply. These should only be required for advanced features; for most use cases, defining data types is enough. See Custom Handlers. If entering custom handlers in the UI, use their JSON format, for example:

[
    {
        "type": "age",
        "age_column": "Age",
        "dob_column": "DoB",
        "ref_date": "2022-12-25"
    }
]
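If the handler list is assembled programmatically before being pasted into the UI, the JSON can be produced with Python's standard library. This is a minimal sketch: the field names are copied from the example above, and the variable names are illustrative only.

```python
import json

# Field names mirror the JSON example above; the values are illustrative.
age_handler = {
    "type": "age",
    "age_column": "Age",
    "dob_column": "DoB",
    "ref_date": "2022-12-25",
}

# Custom handlers are supplied as a JSON list of handler objects.
custom_handlers_json = json.dumps([age_handler], indent=4)
print(custom_handlers_json)
```

Using json.dumps guarantees straight double quotes and valid JSON, which hand-typed text copied from a word processor often does not.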

field tables: List[HazyTable] [Required]

List of tables which make up the data schema. See Data Table.

field automatic_handlers: Optional[AutomaticHandlerExtractorConfig] = None

Types of rules to check for in the data which will then be enforced in the generated data.

field entities: List[Union[PersonEntity, LocationEntity, CombinationEntity]] = []

Entity level settings.

property multi_config: MultiTableConfig

Builds structured object of links between tables and runs validation to check for configuration errors.

Returns:

Representation of structural links between tables.

Return type:

MultiTableConfig

get_table(name: str) → Optional[HazyTable]

Get a specific table from the schema. Search by table name.

Parameters:

name (TableName) – Table name to search for.

Returns:

Table from the data schema. Returns None if not found.

Return type:

Optional[HazyTable]

get_dtype(table: str, col: str) → Optional[HazyDataTypeUnion]

Get the dtype config from the schema. Search by table and column.

Parameters:
  • table (TableName) – Name of the table.

  • col (ColumnName) – Name of the column.

Returns:

The dtype config for the given column. Returns None if the table or column is not found.

Return type:

Optional[HazyDataTypeUnion]
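Both lookup methods return None rather than raising when nothing matches, so callers need to handle the missing case. The sketch below illustrates that contract with hypothetical stand-in classes (SketchTable and SketchSchema are not part of hazy_configurator, and the real HazyTable and dtype objects carry far more information).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchTable:
    # Hypothetical stand-in for HazyTable: a name plus a column -> dtype mapping.
    name: str
    dtypes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SketchSchema:
    # Hypothetical stand-in for DataSchema, holding only the tables field.
    tables: List[SketchTable]

    def get_table(self, name: str) -> Optional[SketchTable]:
        # Return the first table whose name matches, otherwise None.
        return next((t for t in self.tables if t.name == name), None)

    def get_dtype(self, table: str, col: str) -> Optional[str]:
        # None when either the table or the column is missing.
        found = self.get_table(table)
        if found is None:
            return None
        return found.dtypes.get(col)


schema = SketchSchema(tables=[SketchTable("customers", {"Age": "int"})])

dtype = schema.get_dtype("customers", "Age")
if dtype is None:
    print("column not configured")
else:
    print(f"Age is configured as {dtype}")
```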

normalise() → Tuple[DataSchema, List[Union[DenormalItem, SingleColumnNormaliserConfig]]]

If normalisations have been defined in the configuration, this function returns a new DataSchema object with the tables normalised and a list of the normalisation configs.

Automatic Handlers

class hazy_configurator.data_schema.automatic_handlers.ConditionedExtractorConfig

Bases: BaseExtractorConfig

Finds cases where only certain values in one column are allowed as a result of values in other columns.

Either min_n or min_prop must be provided.

Fields:
field type: Literal['conditioned'] = 'conditioned'
field min_n: Optional[int] = 10

Minimum number of examples to determine something as conditioned.

Constraints:
  • exclusiveMinimum = 0

field min_prop: Optional[float] = None

Minimum proportion of examples in a column to determine something as conditioned.

Constraints:
  • minimum = 0.0

  • maximum = 1.0

field condition_numerical: bool = True

Whether or not to allow conditioning on numerical features when trying to extract this rule.

field target_numerical: bool = True

Whether or not to allow numerical features as target columns when trying to extract this rule.
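To make the idea of a conditioned rule concrete, here is a rough sketch of how such a rule could be detected for a single pair of columns. It is not the library's extraction algorithm. The function name is hypothetical, and the way min_n and min_prop are applied (as minimum support per condition value, with both enforced when both are set) is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Set, Tuple


def allowed_values(
    rows: List[Tuple[Hashable, Hashable]],
    min_n: Optional[int] = 10,
    min_prop: Optional[float] = None,
) -> Dict[Hashable, Set[Hashable]]:
    """Map each condition value to the set of target values seen with it.

    Assumption: a condition value is kept only if it has enough supporting
    rows, measured by min_n (a count) and/or min_prop (a share of all rows).
    """
    if min_n is None and min_prop is None:
        raise ValueError("Either min_n or min_prop must be provided.")
    seen: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    counts: Dict[Hashable, int] = defaultdict(int)
    for condition, target in rows:
        seen[condition].add(target)
        counts[condition] += 1
    total = len(rows)
    rules: Dict[Hashable, Set[Hashable]] = {}
    for condition, n in counts.items():
        if min_n is not None and n < min_n:
            continue  # too few examples to trust the rule
        if min_prop is not None and n / total < min_prop:
            continue  # too small a share of the column
        rules[condition] = seen[condition]
    return rules


# (country, currency) pairs; "DE" appears only once, so it lacks support.
rows = [("UK", "GBP")] * 3 + [("FR", "EUR")] * 3 + [("DE", "EUR")]
rules = allowed_values(rows, min_n=3)
print(rules)
```

In this toy data the country column conditions the currency column: once a country is known, only one currency is allowed.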

class hazy_configurator.data_schema.automatic_handlers.DeterminedExtractorConfig

Bases: BaseExtractorConfig

Finds cases where one column is entirely determined by another column.

When a column is found to be determined by another, it does not need to be statistically modelled and can instead be generated entirely from the other column.

Fields:
field type: Literal['determined'] = 'determined'
field max_error_proportion: float = 0.0

The maximum proportion of values that may violate the determined rule. Below this threshold the column is still considered determined. The default of 0.0 means the target column must be entirely determined.

Constraints:
  • minimum = 0.0

  • maximum = 1.0
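The effect of max_error_proportion can be illustrated with a small sketch. It is not the library's implementation: the function name is hypothetical, errors are counted as rows that disagree with the most common target for their source value, and treating the threshold as inclusive is an assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple


def is_determined(
    rows: List[Tuple[Hashable, Hashable]],
    max_error_proportion: float = 0.0,
) -> bool:
    """Check whether the target is a function of the source, up to a tolerance."""
    if not rows:
        return True
    by_source: Dict[Hashable, Counter] = defaultdict(Counter)
    for source, target in rows:
        by_source[source][target] += 1
    # Rows that disagree with the most common target for their source value.
    errors = sum(
        sum(counter.values()) - counter.most_common(1)[0][1]
        for counter in by_source.values()
    )
    # Assumption: the threshold is inclusive.
    return errors / len(rows) <= max_error_proportion


# (postcode, city) pairs; the noisy set contains one misspelt city.
clean = [("SW1A", "London")] * 9 + [("M1", "Manchester")] * 10
noisy = clean + [("SW1A", "Londn")]

print(is_determined(clean))
print(is_determined(noisy))
print(is_determined(noisy, max_error_proportion=0.1))
```

With the default of 0.0 a single typo is enough to break the rule, so a small tolerance may be worth setting for columns known to contain data-entry errors.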

class hazy_configurator.data_schema.automatic_handlers.AutomaticHandlerExtractorConfig

Bases: HazyBaseModel

Fields:
field extractors: List[Union[ConditionedExtractorConfig, DeterminedExtractorConfig]] [Required]

Types of rules to check for in the data which will then be enforced in the generated data.

field ignore: Dict[str, List[str]] = {}

Mapping from table names to lists of columns to ignore for rule extraction.

field ignore_target: Dict[str, List[str]] = {}

Mapping from table names to lists of columns to ignore as targets for rule extraction.
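As a rough illustration of how these fields fit together, the sketch below builds a plain dictionary using the field names and type literals documented on this page. It assumes the serialized form of the config mirrors the field names one to one, which this page does not confirm, and the table and column names are hypothetical.

```python
import json

# Field names and "type" literals are taken from this page; the table and
# column names ("customers", "customer_id", "notes") are hypothetical.
automatic_handlers = {
    "extractors": [
        {"type": "conditioned", "min_n": 10},
        {"type": "determined", "max_error_proportion": 0.0},
    ],
    # Columns excluded from rule extraction altogether.
    "ignore": {"customers": ["customer_id"]},
    # Columns that may condition other columns but are never rule targets.
    "ignore_target": {"customers": ["notes"]},
}

print(json.dumps(automatic_handlers, indent=4))
```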

field max_uniqueness_proportion: float = 0.9

Maximum uniqueness rate at which columns will be checked for the existence of rules, where uniqueness rate = cardinality / number of rows.

Constraints:
  • minimum = 0.0

  • maximum = 1.0
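The uniqueness rate is simple to compute directly, which helps when deciding whether the default of 0.9 suits a dataset. This sketch uses hypothetical helper names and assumes the threshold is inclusive.

```python
from typing import Hashable, Sequence


def uniqueness_rate(values: Sequence[Hashable]) -> float:
    # cardinality / number of rows, as defined above
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def should_check(
    values: Sequence[Hashable], max_uniqueness_proportion: float = 0.9
) -> bool:
    # Assumption: columns exactly at the threshold are still checked.
    return uniqueness_rate(values) <= max_uniqueness_proportion


ids = list(range(100))        # every value distinct
country = ["UK", "FR"] * 50   # two distinct values across 100 rows

print(uniqueness_rate(ids), should_check(ids))
print(uniqueness_rate(country), should_check(country))
```

Near-unique columns such as identifiers are skipped because almost any rule would hold trivially for them.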