Data sources

Hazy synthesisers support reading and writing data using either file or database storage.

Data sources can be added on the project settings tab. Input data sources can also be set up during the configuration flow.

File storage

  • Supported mediums: Local disk, Amazon S3, Google Cloud Storage, Azure Blob Storage.
  • Supported data formats: CSV, Parquet, Avro.

The data format is inferred from the file extension; the supported extensions are .csv, .csv.gz, .parquet and .avro.
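
As a rough sketch (this is not Hazy's actual code), extension-based inference might look like the following, checking longer extensions first so that .csv.gz is not mistaken for .csv:

    # Illustrative sketch of extension-based format inference; not Hazy's code.
    SUPPORTED_EXTENSIONS = {
        ".csv": "csv",
        ".csv.gz": "csv (gzip-compressed)",
        ".parquet": "parquet",
        ".avro": "avro",
    }

    def infer_format(path: str) -> str:
        # Check longer extensions first so ".csv.gz" wins over ".csv".
        for ext in sorted(SUPPORTED_EXTENSIONS, key=len, reverse=True):
            if path.endswith(ext):
                return SUPPORTED_EXTENSIONS[ext]
        raise ValueError(f"unsupported file extension: {path}")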

All of the above formats are supported for both local disk and Amazon S3 storage.

Database storage

  • Supported databases: Microsoft SQL Server, IBM Db2 for Linux/Unix & Windows (LUW) and z/OS, Snowflake, Databricks, PostgreSQL, Oracle Database, Google BigQuery.

The Hub supports connecting to a database both to read training data and to write synthesised data.

Note: Training data read from a database is never written to disk at any point during the training or generation process.

Authentication

The following database authentication methods are supported:

  • Microsoft SQL Server

    • Server authentication (username/password)
    • Kerberos authentication
    • Windows authentication (trusted connection)
  • IBM Db2 for Linux, Unix & Windows (LUW) and z/OS

    • Server authentication (username/password) with TLS
  • Snowflake, PostgreSQL, Oracle Database

    • Server authentication (username/password)
  • Databricks

    • Token authentication for a given http_path and hostname.
  • Google BigQuery

    • Service account key (JSON file) with access to the relevant BigQuery project (see the sketch after this list)
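
The sketch below illustrates two of these methods using the raw client libraries (google-cloud-bigquery and pyodbc); every hostname, file name and database name is hypothetical, and Hazy's own configuration fields may differ:

    # Google BigQuery: authenticate with a service account key (JSON file).
    from google.cloud import bigquery
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file("hazy-sa-key.json")
    bq_client = bigquery.Client(credentials=creds, project=creds.project_id)

    # Microsoft SQL Server: Windows authentication (trusted connection), so no
    # username/password appears in the connection string.
    import pyodbc

    mssql_conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=sqlserver.example.com;DATABASE=train_db;"
        "Trusted_Connection=yes;"
    )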

Data types

Db2

Supported data types:
  • Strings: CHAR, VARCHAR, GRAPHIC, VARGRAPHIC, CLOB, DBCLOB

  • Integers/Boolean: BOOLEAN, SMALLINT, INT, BIGINT

    The supported upper limit for BIGINT is 9223372036854774784, which is 1023 less than the limit officially stated by IBM. This is because values pass through a double-precision floating point representation during synthesis, and 9223372036854774784 (2^63 − 1024) is the largest integer below 2^63 that a double can represent exactly; see the worked example after this list.

  • Binary precision decimal numbers: REAL, DOUBLE

  • Fixed precision decimal numbers: DECIMAL

    Due to conversion to and from a binary floating point representation during synthesis, the effective lower and upper bounds of values inserted into DECIMAL columns in Db2 are restricted to double-precision floating point limits.

  • Date/time: DATE, TIME, TIMESTAMP

    • Can read TIMESTAMP values up to precision 12 (picosecond level)
    • Can synthesise TIMESTAMP values up to precision 3 (millisecond level)
  • Other: XML
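
Both of the caveats above (the BIGINT upper limit and the DECIMAL bounds) stem from the same round trip through IEEE 754 double precision, as this worked example shows:

    # BIGINT's nominal maximum, 2**63 - 1, does not survive a round trip
    # through a 64-bit float: it rounds up to 2**63, which is out of range.
    print(int(float(2**63 - 1)))     # 9223372036854775808 == 2**63

    # The largest integer below 2**63 that a double can represent exactly is
    # 2**63 - 1024, which matches the supported upper limit stated above.
    print(int(float(2**63 - 1024)))  # 9223372036854774784, round-trips exactly
    assert int(float(2**63 - 1024)) == 9223372036854774784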

Note: While the above types are supported for read and write, some types, such as XML, cannot be synthesised.

Unsupported data types:
  • Binary objects: BINARY, VARBINARY, BLOB
  • Decimal floating points: DECFLOAT

Additional notes

Databricks

Currently, Hazy only supports the INPUT I/O type for Databricks connections. If you would like to write out to Databricks, you can stage the synthesised data in intermediate object storage (e.g. S3, GCS) and load it with Databricks' COPY INTO command, provided you have configured object store access from within your Databricks SQL warehouse appropriately. See Tutorial: Configure S3 access with instance profile.
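
As a sketch of that workaround (all names below are hypothetical), synthesised Parquet files staged on S3 could be loaded with COPY INTO via the databricks-sql-connector package:

    from databricks import sql  # pip install databricks-sql-connector

    # Hypothetical hostname, http_path, token, table and bucket throughout.
    with sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abcdef1234567890",
        access_token="dapi...",
    ) as conn, conn.cursor() as cursor:
        cursor.execute("""
            COPY INTO main.synth.transactions
            FROM 's3://my-bucket/hazy-output/transactions/'
            FILEFORMAT = PARQUET
        """)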

Fail-safe mechanism

The synthesiser wraps each write operation in a database transaction so that, if the write fails for any single table, all of the databases and tables are rolled back to their original state.
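
This is the standard all-or-nothing transactional write pattern, sketched below with SQLite purely for illustration (it is not Hazy's internal implementation):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
    """)

    batches = {
        "customers": [(1,), (2,)],
        "orders": [(10, 1), (11, None)],  # None violates NOT NULL: this write fails
    }

    try:
        with conn:  # one transaction: commit on success, roll back on any error
            for table, rows in batches.items():
                marks = ",".join("?" * len(rows[0]))
                conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    except sqlite3.IntegrityError:
        pass

    # Every table is rolled back, not just the one whose insert failed.
    print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # prints 0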

If an issue arises causing a write failure, the synthesiser can be configured to write tables to a fail-safe file storage location.

Because each transaction in a write operation performs large multi-row insertions, the database transaction log may fill up faster than expected. For Db2, if you encounter issues related to the size of the transaction log during training, please see the official IBM documentation for instructions on increasing the database transaction log size.

How-Tos: See the Using a database for storage how-to for a demonstration of using a database as a storage medium.

Storage combinations

There is no requirement that source data and synthesised data be of the same format or stored in the same type of storage.

For example, you may have source data consisting of four input tables:

  • one stored in CSV format on local disk,
  • one stored in Parquet format on Amazon S3,
  • one stored in an IBM Db2 database running locally,
  • one stored in a remote Microsoft SQL Server database.

The synthesised tables may also be stored in different formats and storage types.
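
Written out as locations (purely illustrative and hypothetical, not Hazy's configuration schema), that four-table example might look like:

    sources = {
        "customers":    "file:///data/customers.csv",              # CSV, local disk
        "transactions": "s3://my-bucket/transactions.parquet",     # Parquet, Amazon S3
        "accounts":     "db2://localhost:50000/TRAINDB",           # local IBM Db2
        "branches":     "mssql://sqlserver.example.com/train_db",  # remote SQL Server
    }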