Using a database for storage
This tutorial demonstrates how a database can be used as storage for Hazy synthesisers.
We will be looking at how to:
- train a generator model on data stored in database tables,
- generate synthetic data and write it back into new tables in another database.
Getting started
This tutorial makes use of a generator trained on a small dataset consisting of a single table named children, which holds information such as each child's name, age and height. It demonstrates usage with both IBM Db2 and Microsoft SQL Server databases.
If you would like to run the code for yourself, please follow the setup instructions for Db2 and SQL Server before continuing.
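As an illustration, a throwaway SQL Server instance matching the credentials listed below can be started directly from Microsoft's public image (a minimal sketch; the official setup instructions may configure the container differently):
docker run -d --name hazy-tutorial-mssql \
  -e "ACCEPT_EULA=Y" \
  -e "SA_PASSWORD=Hazy123!" \
  -p 1433:1433 \
  mcr.microsoft.com/mssql/server:2019-latest
The Db2 container (hazy-tutorial-db2) is set up analogously from IBM's public Db2 image; see the linked setup instructions for the exact steps.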
Database credentials
The database instances created by the aforementioned setup files will have the following database credentials:
- Database name: Master
- Database server port: 1433 for SQL Server, 50000 for Db2
- Username: SA for SQL Server, db2inst1 for Db2
- Password: Hazy123!
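For orientation, these credentials correspond to SQLAlchemy-style connection URLs of roughly the following shape (an illustration only; the Hazy connectors assemble the connection from the individual parameters shown later, so you never build these strings yourself):
mssql+pyodbc://SA:Hazy123!@<DB_HOST>:1433/Master?driver=<ODBC driver name>
db2://db2inst1:Hazy123!@<DB_HOST>:50000/<DATABASE>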
Additionally, you will need the database server's host IP address. For the demonstration in this tutorial, the host can be obtained by executing the following command, which sets an environment variable DB_HOST holding the address. Replace <CONTAINER> with hazy-tutorial-mssql if using SQL Server, or hazy-tutorial-db2 if using Db2.
export DB_HOST=$(docker inspect <CONTAINER> | grep -w "IPAddress" | awk '{ print $2 }' | head -n 1 | cut -d "," -f1 | tr -d '"')
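You can verify that the variable now holds a bare IP address before continuing:
echo "$DB_HOST"  # e.g. 172.17.0.2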
Specifying input locations
DataLocationInput is used to specify the location from which data for a single table is read. To use a database table as an input location, provide the database credentials and table information using SQLConnectorItemRead.
Database connection credentials are supplied as environment variables, which are passed to the Docker container and evaluated there. These should be provided to the synthesiser train and generate methods.
For Db2:
database_credentials = {
"DB_HOST": "...",
"DB_PORT": "...",
"DB_USER": "...",
"DB_PASSWORD": "...",
"DB_DATABASE": "...",
}
data_input = [
DataLocationInput(
name="children",
location=SQLConnectorItemRead(
query="children",
drivername="db2",
host=SQLStringParameterItem(
value="DB_HOST",
source="env_var",
),
port=SQLIntParameterItem(
value="DB_PORT",
source="env_var",
),
username=SQLStringParameterItem(
value="DB_USER",
source="env_var",
),
password=SQLStringParameterItem(
value="DB_PASSWORD",
source="env_var",
),
database=SQLStringParameterItem(
value="DB_DATABASE",
source="env_var",
),
params=dict(
ProgramName="hazy_synth_read",
),
),
)
]
For SQL Server:
database_credentials = {
"DB_HOST": "...",
"DB_PORT": "...",
"DB_USER": "...",
"DB_PASSWORD": "...",
"DB_DATABASE": "...",
"DB_SCHEMA": "..." # normally defaults to dbo for msft sql server
}
data_input = [
DataLocationInput(
name="children",
location=SQLConnectorItemRead(
query="children",
drivername="mssql+pyodbc",
host=SQLStringParameterItem(
value="DB_HOST",
source="env_var",
),
port=SQLIntParameterItem(
value="DB_PORT",
source="env_var",
),
username=SQLStringParameterItem(
value="DB_USER",
source="env_var",
),
password=SQLStringParameterItem(
value="DB_PASSWORD",
source="env_var",
),
database=SQLStringParameterItem(
value="DB_DATABASE",
source="env_var",
),
schema_name=SQLStringParameterItem(
value="DB_SCHEMA",
source="env_var",
),
)
)
]
The params argument is optional and allows additional connection parameters to be passed to the connection engine. These parameters are specific to the type of database being used; here, ProgramName identifies the Hazy synthesiser to the Db2 server as a client application, which makes server logs easier to monitor.
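As a hypothetical illustration for the SQL Server connector (this assumes params entries are forwarded to the underlying connection URL, which the Hazy documentation for your version should confirm), the same mechanism could be used to pin a specific ODBC driver:
params=dict(
    # Hypothetical: select the ODBC driver used by mssql+pyodbc;
    # this assumes params entries are forwarded to the connection URL.
    driver="ODBC Driver 17 for SQL Server",
),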
Training a generator model
We now train a new generator model by providing a TrainingConfig which includes the data schema from the Getting Started page and the input locations defined in data_input, along with our database credentials as environment variables.
synth = SynthDocker(image="docker_image:tag")
synth.train(
cfg=TrainingConfig(
model_output="children.hmf",
data_input=data_input,
data_schema=DataSchema(...), # schema from multi table tutorial
),
env=database_credentials,
)
Note: docker_image:tag must be changed to a valid Hazy Synthesiser image.
Specifying output locations
DataLocationOutput is used to specify the location to which a synthesised table is written. To use a database table as an output location, provide the database credentials and table information using SQLConnectorItemWrite.
As with DataLocationInput, database credentials are provided as environment variables.
When writing an output table to a database, an if_exists argument can be provided to specify the action to take if the table already exists in the database. Here we use "append", which appends rows to existing tables and creates any tables that do not already exist.
The fail-safe mechanism is configured by providing a fail_safe_path for each table. Here we specify fail/children_synth.csv as the fail-safe path where data is stored if the database write fails.
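If a write does fail, the fallback file is plain CSV and can be inspected or re-loaded with standard tooling, for example:
import pandas as pd

# Rows that could not be written to the database
fallback = pd.read_csv("fail/children_synth.csv")
print(fallback.head())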
For Db2:
database_credentials = {
"DB_HOST": "...",
"DB_PORT": "...",
"DB_USER": "...",
"DB_PASSWORD": "...",
"DB_DATABASE": "...",
}
data_output = [
DataLocationOutput(
name="children",
location=SQLConnectorItemWrite(
table="children_synth",
drivername="db2",
host=SQLStringParameterItem(
value="DB_HOST",
source="env_var",
),
port=SQLIntParameterItem(
value="DB_PORT",
source="env_var",
),
username=SQLStringParameterItem(
value="DB_USER",
source="env_var",
),
password=SQLStringParameterItem(
value="DB_PASSWORD",
source="env_var",
),
database=SQLStringParameterItem(
value="DB_DATABASE",
source="env_var",
),
params=dict(
ProgramName="hazy_synth_write",
),
if_exists="append",
fail_safe_path="fail/children_synth.csv",
),
)
]
For SQL Server:
database_credentials = {
"DB_HOST": "...",
"DB_PORT": "...",
"DB_USER": "...",
"DB_PASSWORD": "...",
"DB_DATABASE": "...",
"DB_SCHEMA": "..." # normally defaults to dbo for msft sql server
}
data_output = [
DataLocationOutput(
name="children",
location=SQLConnectorItemWrite(
table="children_synth",
drivername="mssql+pyodbc",
host=SQLStringParameterItem(
value="DB_HOST",
source="env_var",
),
port=SQLIntParameterItem(
value="DB_PORT",
source="env_var",
),
username=SQLStringParameterItem(
value="DB_USER",
source="env_var",
),
password=SQLStringParameterItem(
value="DB_PASSWORD",
source="env_var",
),
database=SQLStringParameterItem(
value="DB_DATABASE",
source="env_var",
),
schema_name=SQLStringParameterItem(
value="DB_SCHEMA",
source="env_var",
),
if_exists="append",
fail_safe_path="fail/children_synth.csv",
),
)
]
Synthesising data from a trained generator model
We now fetch the trained generator model and synthesise data by providing a GenerationConfig with our database output locations, along with our database credentials as environment variables.
synth.generate(
cfg=GenerationConfig(
model="children.hmf",
data_output=data_output,
),
env=database_credentials,
)
Once generation is complete, you should see the synthetic data written to the new children_synth table in the target database.
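For example, a quick preview query confirms the write (Db2 and SQL Server syntax shown; run it with whichever SQL client you prefer):
-- Db2
SELECT * FROM children_synth FETCH FIRST 5 ROWS ONLY;

-- SQL Server (children_synth lands in the dbo schema by default)
SELECT TOP 5 * FROM dbo.children_synth;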