Using a database for storage

This tutorial demonstrates how a database can be used as storage for Hazy synthesisers.

We will be looking at how to:

  • train a generator model on data stored in database tables,
  • generate synthetic data and write it back into new tables in another database.

Getting started

This tutorial makes use of a generator trained on a small dataset consisting of a single table named children that holds information such as the name, age and height of children. The tutorial demonstrates the usage of both IBM Db2 and Microsoft SQL Server databases.

If you would like to run the code in this tutorial for yourself, please follow the setup instructions for Db2 and SQL Server first.

Database credentials

The database instances created by the aforementioned setup files will have the following database credentials:

  • Database name: Master
  • Database server port:
    • 1433 for SQL Server
    • 50000 for Db2
  • Username:
    • SA for SQL Server
    • db2inst1 for Db2
  • Password: Hazy123!

Additionally, you will need the IP address of the database server host. For the demonstration in this tutorial, the host can be obtained by executing the following command, which sets an environment variable DB_HOST holding the host address.

<CONTAINER> should be replaced with hazy-tutorial-mssql if using SQL Server, or hazy-tutorial-db2 if using Db2.

export DB_HOST=$(docker inspect <CONTAINER> | grep -w "IPAddress" | awk '{ print $2 }' | head -n 1 | cut -d "," -f1)
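Before training, you may want to confirm that the variable exported above actually resolved to a non-empty address. A minimal sketch of such a check (require_env is a hypothetical helper for this tutorial, not part of the Hazy SDK):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, raising if it is missing or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Example: confirm the host resolved by `docker inspect` was exported.
# host = require_env("DB_HOST")
```

Failing fast here gives a clearer error than a connection timeout later on.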

Specifying input locations

DataLocationInput is used to specify the location from which data for a single table is read. To use database tables as an input location, we provide database credentials and table information via SQLConnectorItemRead.

Database connection credentials are provided as environment variables, which are passed to the Docker container for evaluation. These should be supplied to the synthesiser's train and generate methods.

database_credentials = {
    "DB_HOST": "...",
    "DB_PORT": "...",
    "DB_USER": "...",
    "DB_PASSWORD": "...",
    "DB_DATABASE": "...",
}

data_input = [
    DataLocationInput(
        name="children",
        location=SQLConnectorItemRead(
            query="children",
            drivername="db2",
            host=SQLStringParameterItem(
                value="DB_HOST",
                source="env_var",
            ),
            port=SQLIntParameterItem(
                value="DB_PORT",
                source="env_var",
            ),
            username=SQLStringParameterItem(
                value="DB_USER",
                source="env_var",
            ),
            password=SQLStringParameterItem(
                value="DB_PASSWORD",
                source="env_var",
            ),
            database=SQLStringParameterItem(
                value="DB_DATABASE",
                source="env_var",
            ),
            params=dict(
                ProgramName="hazy_synth_read",
            ),
        ),
    )
]

The params argument is optional and allows for additional connection parameters to be provided to the connection engine. These parameters are specific to the type of database being used — in this case, ProgramName is used to improve the ease of log monitoring by identifying the Hazy synthesiser to the Db2 server as a client application.
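While the exact handling of params is internal to the connector, extra connection parameters of this kind are conventionally appended to a database connection URL as a query string. As an illustration of that convention only (not the Hazy implementation), using just the standard library:

```python
from urllib.parse import urlencode

def build_url(drivername, host, port, database, params=None):
    """Assemble a SQLAlchemy-style connection URL, appending any extra
    connection parameters as a query string. Credentials omitted for brevity."""
    url = f"{drivername}://{host}:{port}/{database}"
    if params:
        url += "?" + urlencode(params)
    return url

print(build_url("db2", "172.17.0.2", 50000, "Master",
                params={"ProgramName": "hazy_synth_read"}))
# → db2://172.17.0.2:50000/Master?ProgramName=hazy_synth_read
```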

Training a generator model

We now train a new generator model by providing a TrainingConfig which includes the data schema from the Getting Started page and the input locations that we defined in data_input, along with our database credentials as environment variables.

synth = SynthDocker(image="docker_image:tag")

synth.train(
    cfg=TrainingConfig(
        model_output="children.hmf",
        data_input=data_input,
        data_schema=DataSchema(...),  # schema from multi table tutorial
    ),
    env=database_credentials,
)

Note: docker_image:tag must be changed to a valid Hazy Synthesiser image.

Specifying output locations

DataLocationOutput is used to specify the location to which a synthesised table is written. To use database tables as an output location, we provide database credentials and table information via SQLConnectorItemWrite.

Similarly to DataLocationInput, database credentials are provided as environment variables.

When writing an output table to a database, an if_exists argument can be provided to specify the action to take if the table already exists in the database. In this case, we use "append", which appends rows to the existing tables, creating any that do not already exist.
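This "append" behaviour corresponds to the familiar create-if-missing-then-insert pattern in SQL tooling. A minimal sqlite3 illustration of the semantics (a stand-in for demonstration, not the Hazy connector itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def append_rows(conn, table, rows):
    """Insert rows into `table`, creating the table first if it does not exist."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (name TEXT, age INTEGER, height REAL)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)

# The first write creates the table; a second write appends to it.
append_rows(conn, "children_synth", [("Ada", 9, 132.5)])
append_rows(conn, "children_synth", [("Ben", 7, 121.0)])
count = conn.execute("SELECT COUNT(*) FROM children_synth").fetchone()[0]
print(count)  # → 2
```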

The fail-safe mechanism is configured by providing a fail_safe_path for each table. In this case we specify fail/children_synth.csv as the fail-safe path for data to be stored if the database write fails.
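If the database write does fail, the synthetic rows land in the fail-safe CSV instead, from where they can be inspected or bulk-loaded later. A sketch of reading such a file with the standard library, assuming a plain CSV with a header row (the sample file here is created purely for illustration):

```python
import csv
from pathlib import Path

def load_failsafe(path):
    """Read rows from a fail-safe CSV, returning a list of dicts (one per row)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Hypothetical fail-safe file, as would be written after a failed database write.
Path("fail").mkdir(exist_ok=True)
Path("fail/children_synth.csv").write_text("name,age,height\nAda,9,132.5\n")

rows = load_failsafe("fail/children_synth.csv")
print(rows)  # → [{'name': 'Ada', 'age': '9', 'height': '132.5'}]
```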

database_credentials = {
    "DB_HOST": "...",
    "DB_PORT": "...",
    "DB_USER": "...",
    "DB_PASSWORD": "...",
    "DB_DATABASE": "...",
}

data_output = [
    DataLocationOutput(
        name="children",
        location=SQLConnectorItemWrite(
            table="children_synth",
            drivername="db2",
            host=SQLStringParameterItem(
                value="DB_HOST",
                source="env_var",
            ),
            port=SQLIntParameterItem(
                value="DB_PORT",
                source="env_var",
            ),
            username=SQLStringParameterItem(
                value="DB_USER",
                source="env_var",
            ),
            password=SQLStringParameterItem(
                value="DB_PASSWORD",
                source="env_var",
            ),
            database=SQLStringParameterItem(
                value="DB_DATABASE",
                source="env_var",
            ),
            params=dict(
                ProgramName="hazy_synth_write",
            ),
            if_exists="append",
            fail_safe_path="fail/children_synth.csv",
        ),
    )
]

Synthesising data from a trained generator model

We now fetch the trained generator model and synthesise data by providing a GenerationConfig with the database output locations we defined in data_output, along with our database credentials as environment variables.

synth.generate(
    cfg=GenerationConfig(
        model="children.hmf",
        data_output=data_output,
    ),
    env=database_credentials,
)

Once generation is complete, you should be able to see synthetic data written to the new children_synth table in the target database.