Data Connectors

Data Sources

These refer to data sources that have been set up in advance on the Hazy Hub; this section only applies if you are using the SynthAPI. If you have access to a Hub, we encourage reading and writing via data sources, as this avoids storing credentials in multiple places. Hazy stores credentials encrypted with AES-256-GCM, normally on the client's servers.

class hazy_configurator.base.secret_data.SecretDataSource

Bases: VersionedHazyBaseModel

Used to reference encrypted data source credentials stored by Hazy.

Example

from hazy_configurator import SecretDataSource, SensitiveDataSourceType

data_source = SecretDataSource(
    id="eee3537f-9ea5-4e8a-af03-af6526fef730",
    name="Input bucket 0",
    source_type=SensitiveDataSourceType.S3
)
Fields:
field id: UUID [Required]

UUID of connection/data source.

field name: Optional[str] = None

Human readable label for the connection/data source.

field source_type: Optional[SensitiveDataSourceType] = None

Type of the data source (e.g. S3).

field io: Optional[DataSourceIO] = None

Whether the data source is used for input or output.

Data Input/Output

For a Training Configuration you will need to define data_input. Similarly, for a Generation Configuration you will need to define data_output.

In the case of the SynthAPI, we recommend pointing to your pre-configured Data Sources using a SynthAPI Data Location.

When using SynthDocker you won't have access to data sources and can use a SynthDocker Data Location instead.
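
A minimal sketch of the two halves (paths and table names are illustrative; other required parameters are elided with ...):

from hazy_configurator import (
    TrainingConfig,
    GenerationConfig,
    DataLocationInput,
    DataLocationOutput,
)

TrainingConfig(
    data_input=[DataLocationInput(name="accounts", location="/data/accounts.csv")],
    ...
)

GenerationConfig(
    data_output=[DataLocationOutput(name="accounts", location="/synth/accounts.csv")],
    ...
)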

class hazy_configurator.data_schema.data_location.DataLocationInput

Bases: DataLocation

Specifies the location to read source data from for training.

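Example

A minimal sketch reading a single table from a local CSV file (the path and table name are illustrative):

from hazy_configurator import DataLocationInput

data_input = [
    DataLocationInput(
        name="transactions",  # must match a table defined in the DataSchema
        location="/data/input/transactions.csv",  # local path; an S3 path or database read object also works
    )
]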
Fields:
  • location (Union[hazy_configurator.data_schema.data_location.database.DatabaseReadTableConfig, hazy_configurator.data_schema.data_location.path.PathReadTableConfig, hazy_configurator.data_schema.sql_class.SQLConnectorItemRead, str])

  • name (str)

field location: Union[DatabaseReadTableConfig, PathReadTableConfig, SQLConnectorItemRead, str] [Required]

Can be a local path, an S3 path or a database read connection object. If a file path is provided, it must have one of the extensions ['csv', 'csv.gz', 'parquet', 'avro'].

field name: str [Required]

This should match up to a table defined in the DataSchema.

class hazy_configurator.data_schema.data_location.DataLocationOutput

Bases: DataLocation

Specifies the location to write synthetic data to after generation.

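Example

A minimal sketch writing a single synthetic table to S3 (the bucket and table name are illustrative):

from hazy_configurator import DataLocationOutput

data_output = [
    DataLocationOutput(
        name="transactions",  # must match a table defined in the DataSchema
        location="s3://my-bucket/synth/transactions.parquet",  # s3://, gs:// or a local path
    )
]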
Fields:
  • location (Union[hazy_configurator.data_schema.data_location.database.DatabaseWriteTableConfig, hazy_configurator.data_schema.data_location.path.PathWriteTableConfig, hazy_configurator.data_schema.sql_class.SQLConnectorItemWrite, str])

  • name (str)

field location: Union[DatabaseWriteTableConfig, PathWriteTableConfig, SQLConnectorItemWrite, str] [Required]

Tables can be stored in S3 or GCS by providing a path as a string prefixed by s3:// or gs:// respectively, or on the local file system by providing a normal file path as a string. Tables can also be stored in supported databases by providing a SQLConnectorItemWrite object or a DatabaseWriteTableConfig object. If a file path is provided, it must have one of the extensions: ['csv', 'csv.gz', 'parquet', 'avro'].

field name: str [Required]

This should match up to a table defined in the DataSchema.

SynthAPI Data Location

class hazy_configurator.data_schema.data_location.path.PathReadTableConfig

Bases: ReadTableConfig

Specify a single table to read from a cloud/local storage location set up as a data source.

Examples

from uuid import UUID
from hazy_configurator import (
    PathReadTableConfig,
    DataLocationInput,
    DataSchema,
    TabularTable,
    TrainingConfig,
    SecretDataSource,
)

source_id = UUID("76131994-1844-4542-9837-57ca3846ff60")
TrainingConfig(
    data_input=[
        DataLocationInput(
            name="bar",
            location=PathReadTableConfig(connection=source_id, rel_path="foo/bar.csv"),
        )
    ],
    data_sources=[SecretDataSource(id=source_id)],
    data_schema=DataSchema(
        tables=[
            TabularTable(name="bar", dtypes=[...]),
        ],
    ),
    ...
)
Fields:
field rel_path: str [Required]

Relative path to the data inside the referenced S3/GCS/Blob storage location.

field connection: str [Required]

Connection name or data source ID.

class hazy_configurator.data_schema.data_location.path.PathWriteTableConfig

Bases: PathReadTableConfig

Specify a single table to write to a cloud/local storage location set up as a data source.

Examples

from uuid import UUID
from hazy_configurator import (
    PathWriteTableConfig,
    DataLocationOutput,
    GenerationConfig,
    SecretDataSource,
)

source_id = UUID("06fb3915-ecdc-45c0-abac-31b1e7c307ba")
GenerationConfig(
    data_output=[
        DataLocationOutput(
            name="bar",
            location=PathWriteTableConfig(connection=source_id, rel_path="foo/bar.csv"),
        )
    ],
    data_sources=[SecretDataSource(id=source_id)],
    ...,
)
Fields:
field rel_path: str [Required]

Relative path to the data inside the referenced S3/GCS/Blob storage location.

field connection: str [Required]

Connection name or data source ID.

class hazy_configurator.data_schema.data_location.database.DatabaseReadTableConfig

Bases: ReadTableConfig

Specify a single table to read from a database connection. Currently used by the UI.

Examples

from uuid import UUID
from hazy_configurator import (
    DatabaseReadTableConfig,
    DataLocationInput,
    DataSchema,
    TabularTable,
    TrainingConfig,
    SecretDataSource,
)

source_id = UUID("76131994-1844-4542-9837-57ca3846ff60")
TrainingConfig(
    data_input=[
        DataLocationInput(
            name="banking",
            location=DatabaseReadTableConfig(connection=source_id, schema_name="dbo", table="banking"),
        )
    ],
    data_sources=[SecretDataSource(id=source_id)],
    data_schema=DataSchema(
        tables=[TabularTable(name="banking", dtypes=[...])],
    ),
    ...
)
Fields:
field schema_name: str [Required]

Schema name.

field table: str [Required]

Table name.

parameterize()

Used for converting to SQLConnectorItemRead.

field connection: str [Required]

Connection name or data source ID.

class hazy_configurator.data_schema.data_location.database.DatabaseWriteTableConfig

Bases: DatabaseReadTableConfig

Specify a single table to write to in a database. Currently used by the UI.

Examples

from uuid import UUID
from hazy_configurator import (
    DatabaseWriteTableConfig,
    DataLocationOutput,
    GenerationConfig,
    SecretDataSource,
    DatabaseTableExistMode,
)

source_id = UUID("4a8da7b7-637e-49bb-b157-21ee58abe848")
GenerationConfig(
    data_output=[
        DataLocationOutput(
            name="banking",
            location=DatabaseWriteTableConfig(
                connection=source_id,
                schema_name="dbo",
                table="banking",
                if_exists=DatabaseTableExistMode.REPLACE
            ),
        )
    ],
    data_sources=[SecretDataSource(id=source_id)],
    ...
)
Fields:
field if_exists: DatabaseTableExistMode = DatabaseTableExistMode.FAIL

How to handle writing if a table already exists with the same name in the database.

field index: bool = False

Whether or not to store the index of the synthesised data frame.

field index_label: Optional[str] = None

Name of the column in the database where the data frame index should be stored. If None, the name of the data frame index is used. Only effective if index is True.

field fail_safe_path: Optional[Union[PathWriteTableConfig, str]] = None

Path to store the synthesised table in the event of a failure during database connection or writing.

field debug_write: bool = False

Whether or not to debug data when writing to SQL, i.e. fall back to writing data in batches, or row by row if necessary.

field error_dir: Optional[Union[PathWriteTableConfig, str]] = None

Location of the directory where an error log and invalid data should be stored when debugging.

parameterize()

Used for converting to SQLConnectorItemWrite.

field schema_name: str [Required]

Schema name.

field table: str [Required]

Table name.

field connection: str [Required]

Connection name or data source ID.

SynthDocker Data Location

Standard Python strings can be used for file paths and cloud storage paths. If you wish to read from a database, use SQLConnectorItemRead; if you wish to write to a database, use SQLConnectorItemWrite.

class hazy_configurator.data_schema.sql_class.SQLStringParameterItem

String SQL connection parameter.

If an environment variable is provided, its value will be read as a string and used.

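Example

A short sketch of the two source modes (the host name, variable name and password are illustrative):

import os
from hazy_configurator import SQLStringParameterItem, ValueType

# Literal (default): the value is used as-is.
host = SQLStringParameterItem(value="db.internal.example.com")

# Environment variable: value names a set variable that holds the real parameter.
os.environ["DB_PASSWORD"] = "s3cret"  # set here for illustration only
password = SQLStringParameterItem(value="DB_PASSWORD", source=ValueType.ENV)

assert host.get_value() == "db.internal.example.com"
assert password.get_value() == "s3cret"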
Fields:
field value: Optional[str] = None

Either the literal value or the name of an environment variable.

get_value() Optional[Union[str, int]]

Retrieve the literal or environment variable value.

field source: ValueType = ValueType.LITERAL

If literal (default), the value argument must correspond to the actual intended value of the parameter. If an environment variable, the value argument must be the name of a set environment variable which holds the intended value of the parameter; this is the recommended approach for sensitive information such as usernames and passwords.

class hazy_configurator.data_schema.sql_class.SQLIntParameterItem

Integer SQL connection parameter.

If an environment variable is provided, its value will be read as an integer and used.

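Example

A short sketch, assuming get_value() casts the environment variable to an integer as described above (the variable name and port are illustrative):

import os
from hazy_configurator import SQLIntParameterItem, ValueType

os.environ["DB_PORT"] = "50000"  # set here for illustration only
port = SQLIntParameterItem(value="DB_PORT", source=ValueType.ENV)
assert port.get_value() == 50000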
Fields:
field value: Optional[Union[str, int]] = None

Either the literal value or the name of an environment variable.

get_value() Optional[Union[str, int]]

Retrieve the literal or environment variable value.

field source: ValueType = ValueType.LITERAL

If literal (default), the value argument must correspond to the actual intended value of the parameter. If an environment variable, the value argument must be the name of a set environment variable which holds the intended value of the parameter; this is the recommended approach for sensitive information such as usernames and passwords.

class hazy_configurator.data_schema.sql_class.KerberosItem

Kerberos authentication details.

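Example

A minimal sketch mirroring how KerberosItem is used in the connector examples below (the environment variable names are illustrative):

from hazy_configurator import KerberosItem, SensitiveSQLParameterItem, ValueType

kerberos_auth = KerberosItem(
    username=SensitiveSQLParameterItem(source=ValueType.ENV, value="KERBEROS_USERNAME"),
    password=SensitiveSQLParameterItem(source=ValueType.ENV, value="KERBEROS_PASSWORD"),
)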
Fields:
field username: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Kerberos username.

field password: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Kerberos password.

class hazy_configurator.data_schema.sql_class.DatabaseConnectionConfig

Database connection credentials and settings required for connecting to a relational database. This is used by (i) child read/write classes, (ii) the data_sources.json setup.

Examples:

  • Connection Config to a SQL Server database.

from hazy_configurator import (
    DatabaseConnectionConfig,
    SQLStringParameterItem,
    SQLIntParameterItem,
    DatabaseDriverType,
    ValueType,
)

DatabaseConnectionConfig(
    drivername=DatabaseDriverType.MSSQL,
    host=SQLStringParameterItem(value="mssql.database.url"),
    port=SQLIntParameterItem(value=3000),
    database=SQLStringParameterItem(value="DBName"),
    username=SQLStringParameterItem(value="USER_ENV", source=ValueType.ENV),
    password=SQLStringParameterItem(value="PASSWORD_ENV", source=ValueType.ENV),
)
Fields:
field drivername: DatabaseDriverType = DatabaseDriverType.MSSQL

Type of database driver to use for connection.

field username: Optional[Union[SensitiveSQLParameterItem, SQLStringParameterItem]] = None

Database username.

field password: Optional[Union[SensitiveSQLParameterItem, SQLStringParameterItem]] = None

Database password.

field host: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Database server host name or IP address.

field port: Optional[Union[SensitiveSQLParameterItem, SQLIntParameterItem]] = None

Database port.

field database: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Database name.

field params: Optional[Dict[str, str]] = None

Additional connection parameters for establishing a database connection. For IBM Db2, this may include CLI/ODBC configuration keywords. Hazy synthesiser containers come with pre-installed drivers for both Microsoft SQL Server and IBM Db2, meaning that this parameter should not include a driver key.

field kerberos_auth: Optional[KerberosItem] = None

Kerberos authentication credentials consisting of "username" and "password" keys.

field tls_cert: Optional[SQLStringParameterItem] = None

TLS certificate key.

field service_account_key_json: Optional[Dict] = None

Service account key (as a parsed JSON dictionary) for connecting to Google Cloud data stores.

field echo: bool = False

Whether or not to log all SQL statements. See the SQLAlchemy documentation for more on this setting.

field hide_parameters: bool = True

Whether or not to display SQL query parameters in log output. See the SQLAlchemy documentation for more on this setting.

get_sensitive_fields()
property connection: DatabaseConnectionConfig
class hazy_configurator.data_schema.sql_class.SQLConnectorItemRead

Database connection credentials and settings for reading data from tables in a relational database.

For additional information on the query and index_col parameters, please see the official Pandas documentation for the read_sql function (noting that query corresponds to the sql parameter).

Examples:

  • Reading from a SQL Server database using Kerberos authentication.

from hazy_configurator import (
    SQLConnectorItemRead,
    SensitiveSQLParameterItem,
    KerberosItem,
    DatabaseDriverType,
    ValueType,
)

SQLConnectorItemRead(
    drivername=DatabaseDriverType.MSSQL,
    host=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_URL"),
    port=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_PORT"),
    database=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_NAME"),
    kerberos_auth=KerberosItem(
        username=SensitiveSQLParameterItem(source=ValueType.ENV, value="KERBEROS_USERNAME"),
        password=SensitiveSQLParameterItem(source=ValueType.ENV, value="KERBEROS_PASSWORD")
    ),
    query="SELECT Column1, Column2 FROM MyTable",
    chunksize=2_000
)
  • Reading from an IBM Db2 database with a username and password, providing additional connection parameters.

from hazy_configurator import (
    SQLConnectorItemRead,
    SensitiveSQLParameterItem,
    DatabaseDriverType,
    ValueType,
)

SQLConnectorItemRead(
    drivername=DatabaseDriverType.DB2,
    host=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_URL"),
    username=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_USERNAME"),
    password=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_PASSWORD"),
    database=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_NAME"),
    params=dict(
        ProgramName="hazy_synth",
        StmtConcentrator="WITHLITERALS"
    ),
    query="SELECT Column1, Column2 FROM MyTable",
    chunksize=2_000
)
Fields:
field query: str [Required]

SQL query to execute on the database in order to fetch data for a single table. Alternatively, providing a table name will select all columns from that table. The schema_name argument is ignored if a query is provided here, as a schema may be included as part of the query instead.

field index_col: Optional[str] = None

Column from the database table to set as the Pandas data frame index.

property connection: DatabaseConnectionConfig
get_schema() Optional[str]
get_sensitive_fields()
field schema_name: Optional[SQLStringParameterItem] = None

Database schema (if supported by the database management system).

field chunksize: Optional[int] = None

The number of records to be included in a single chunk during a database read or write. This parameter is useful for optimizing read/write performance, as well as dealing with large scale data where writing many rows in a single operation may be infeasible due to database memory limits on multi-row inserts.

field drivername: DatabaseDriverType = DatabaseDriverType.MSSQL

Type of database driver to use for connection.

field username: Optional[Union[SensitiveSQLParameterItem, SQLStringParameterItem]] = None

Database username.

field password: Optional[Union[SensitiveSQLParameterItem, SQLStringParameterItem]] = None

Database password.

field host: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Database server host name or IP address.

field port: Optional[Union[SensitiveSQLParameterItem, SQLIntParameterItem]] = None

Database port.

field database: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Database name.

field params: Optional[Dict[str, str]] = None

Additional connection parameters for establishing a database connection. For IBM Db2, this may include CLI/ODBC configuration keywords. Hazy synthesiser containers come with pre-installed drivers for both Microsoft SQL Server and IBM Db2, meaning that this parameter should not include a driver key.

field kerberos_auth: Optional[KerberosItem] = None

Kerberos authentication credentials consisting of "username" and "password" keys.

field tls_cert: Optional[SQLStringParameterItem] = None

TLS certificate key.

field service_account_key_json: Optional[Dict] = None

Service account key (as a parsed JSON dictionary) for connecting to Google Cloud data stores.

field echo: bool = False

Whether or not to log all SQL statements. See the SQLAlchemy documentation for more on this setting.

field hide_parameters: bool = True

Whether or not to display SQL query parameters in log output. See the SQLAlchemy documentation for more on this setting.

class hazy_configurator.data_schema.sql_class.SQLConnectorItemWrite

Database connection credentials and settings for writing data to tables in a relational database.

For additional information on the if_exists, index, and index_label parameters, please see the official Pandas documentation for the to_sql function.

Examples:

  • Writing to a SQL Server database using Kerberos authentication.

from hazy_configurator import (
    SQLConnectorItemWrite,
    SensitiveSQLParameterItem,
    KerberosItem,
    DatabaseDriverType,
    DatabaseTableExistMode,
    ValueType,
)

SQLConnectorItemWrite(
    drivername=DatabaseDriverType.MSSQL,
    host=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_URL"),
    port=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_PORT"),
    database=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_NAME"),
    kerberos_auth=KerberosItem(
        username=SensitiveSQLParameterItem(source=ValueType.ENV, value="KERBEROS_USERNAME"),
        password=SensitiveSQLParameterItem(source=ValueType.ENV, value="KERBEROS_PASSWORD")
    ),
    table="MySynthTable",
    if_exists=DatabaseTableExistMode.APPEND,
    index=True,
    fail_safe_path="/path/to/my_synth_table.csv"
)
  • Writing to an IBM Db2 database with a username and password, providing additional connection parameters.

from hazy_configurator import (
    SQLConnectorItemWrite,
    SensitiveSQLParameterItem,
    DatabaseDriverType,
    DatabaseTableExistMode,
    ValueType,
)

SQLConnectorItemWrite(
    drivername=DatabaseDriverType.DB2,
    host=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_URL"),
    username=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_USERNAME"),
    password=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_PASSWORD"),
    database=SensitiveSQLParameterItem(source=ValueType.ENV, value="DB_NAME"),
    params=dict(
        ProgramName="hazy_synth",
        StmtConcentrator="WITHLITERALS"
    ),
    table="MySynthTable",
    if_exists=DatabaseTableExistMode.APPEND,
    index=True,
    fail_safe_path="/path/to/my_synth_table.csv"
)
Fields:
field table: str [Required]

Name of the database table to write synthesised data to. For best compatibility, abide by the naming conventions of the respective DBMS, e.g. the object naming rules for Db2 when using the "db2" drivername.

field if_exists: DatabaseTableExistMode = DatabaseTableExistMode.FAIL

Specifies how to handle writing if a table already exists with the same name in the database. Fail raises a ValueError. Replace drops the existing table before inserting new values (data types for the new table will be automatically determined). Append inserts new values into the existing table or creates a new table if it doesn’t exist (data types will be automatically determined if the table does not exist).

field index: bool = False

Whether or not to store the index of the synthesised data frame as a column in the database. When False, the index column will not be written to the table.

field index_label: Optional[str] = None

Name of the column in the database where the data frame index should be stored. If None, the name of the data frame index is used. Only effective if index is True.

field fail_safe_path: Optional[Union[PathWriteTableConfig, str]] = None

Path to store the synthesised table in the event of a failure during database connection or writing. If None, the table will not be saved. If a path is provided for one table, a path must also be provided for all other tables. Supported file formats are csv, csv.gz, parquet and avro. The path can point either to a directory or to an S3 bucket. If using the Client Library, the path to disk can be local; however, if running via the Docker CLI, the path must be on an external volume mounted into the Docker file system so that it remains accessible once the synth completes.

field debug_write: bool = False

Whether or not to debug data when writing to SQL, i.e. fall back to writing data in batches, or row by row if necessary.

field error_dir: Optional[Union[PathWriteTableConfig, str]] = None

Location of the directory where an error log and invalid data should be stored when debugging.

field n_splits: int = 50

Number of times the data should be divided when attempting to write to an SQL database in batches (if debug_write is True).

property connection: DatabaseConnectionConfig
get_schema() Optional[str]
get_sensitive_fields()
field schema_name: Optional[SQLStringParameterItem] = None

Database schema (if supported by the database management system).

field chunksize: Optional[int] = None

The number of records to be included in a single chunk during a database read or write. This parameter is useful for optimizing read/write performance, as well as dealing with large scale data where writing many rows in a single operation may be infeasible due to database memory limits on multi-row inserts.

field drivername: DatabaseDriverType = DatabaseDriverType.MSSQL

Type of database driver to use for connection.

field username: Optional[Union[SensitiveSQLParameterItem, SQLStringParameterItem]] = None

Database username.

field password: Optional[Union[SensitiveSQLParameterItem, SQLStringParameterItem]] = None

Database password.

field host: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Database server host name or IP address.

field port: Optional[Union[SensitiveSQLParameterItem, SQLIntParameterItem]] = None

Database port.

field database: Union[SensitiveSQLParameterItem, SQLStringParameterItem] [Required]

Database name.

field params: Optional[Dict[str, str]] = None

Additional connection parameters for establishing a database connection. For IBM Db2, this may include CLI/ODBC configuration keywords. Hazy synthesiser containers come with pre-installed drivers for both Microsoft SQL Server and IBM Db2, meaning that this parameter should not include a driver key.

field kerberos_auth: Optional[KerberosItem] = None

Kerberos authentication credentials consisting of "username" and "password" keys.

field tls_cert: Optional[SQLStringParameterItem] = None

TLS certificate key.

field service_account_key_json: Optional[Dict] = None

Service account key (as a parsed JSON dictionary) for connecting to Google Cloud data stores.

field echo: bool = False

Whether or not to log all SQL statements. See the SQLAlchemy documentation for more on this setting.

field hide_parameters: bool = True

Whether or not to display SQL query parameters in log output. See the SQLAlchemy documentation for more on this setting.