Using cloud object stores

This tutorial demonstrates how cloud storage can be used as data input/output for Hazy synthesisers via the Python SDK.

Hazy currently supports the following storage backends:

  • Local file system where the Hub is provisioned
  • AWS S3
  • GCP Cloud Storage
  • Azure Blob Storage

We will look at how to:

  • train a generator model on data stored in cloud storage
  • store the resulting Hazy model in cloud storage
  • generate synthetic data and write it back to cloud storage

Python SDK for cloud storage

Getting started

This tutorial makes use of a generator trained on the MovieLens dataset, which consists of user, movie, and ratings tables.

We will use the data schema introduced in the Setting up a multi table synthesiser tutorial; please refer to that tutorial if you are not familiar with configuring a multi table synthesiser.

Each bucket is assumed to have the following structure:

hazy-s3-demo
├── datasets               # holds the MovieLens training data
│   └── movielens
│       ├── movie.csv
│       ├── ratings.csv
│       └── user.csv
├── models                 # holds trained Hazy model `.hmf` files
└── synth                  # holds synthesised data
    └── movielens

Note: The hazy-s3-demo bucket is only a placeholder, not a real cloud bucket that you can connect to.

Example with AWS S3

Setting AWS access credentials

To provide the synthesiser access to S3 storage, an AWS access key ID and secret access key must be set.

These are passed into the synthesiser container during training and generation as environment variables with the expected names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

You may specify these credentials as strings directly in your Python code:

env = {
    "AWS_ACCESS_KEY_ID": "my-access-key-id",
    "AWS_SECRET_ACCESS_KEY": "my-secret-access-key",
}

However, for sensitive information such as access keys, it is usually preferable to define environment variables on your own machine and read them into your Python code with os.environ:

import os

env = {
    "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
    "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
}

Specifying input locations

DataLocationInput specifies the location from which to read the data for a single table.

We specify paths to the S3 objects holding the training data by prefixing them with s3://:

USER = "user"
MOVIE = "movie"
RATINGS = "ratings"

data_input = [
    DataLocationInput(
        name=table,
        location=f"s3://hazy-s3-demo/datasets/movielens/{table}.csv",
    )
    for table in (USER, MOVIE, RATINGS)
]
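The s3:// prefix maps onto a bucket name and an object key. As a quick sanity check (using only the standard library, not the Hazy SDK), the paths above can be split into those components:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into its (bucket, key) components."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// URI, got {uri!r}")
    # netloc is the bucket; the path (minus its leading slash) is the key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://hazy-s3-demo/datasets/movielens/user.csv")
# bucket is "hazy-s3-demo"; key is "datasets/movielens/user.csv"
```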

Training and storing a generator model

We now train a new generator model by providing a TrainingConfig that includes the data schema from the multi table tutorial, the input locations defined in data_input, and the AWS access credentials specified in env.

We will store the trained generator in our S3 bucket at s3://hazy-s3-demo/models/movielens.hmf.

synth = SynthDocker(image="docker_image:tag")

synth.train(
    cfg=TrainingConfig(
        model_output="s3://hazy-s3-demo/models/movielens.hmf",
        data_input=data_input,
        data_schema=DataSchema(...),  # schema from multi table tutorial
    ),
    env=env,
)

Once training is complete, s3://hazy-s3-demo/models/movielens.hmf should now be available in your S3 bucket.

Fetching a trained generator model and synthesising data

The trained model can be retrieved by setting model to s3://hazy-s3-demo/models/movielens.hmf in GenerationConfig.

We can specify our desired synthetic data output locations as S3 paths.

Note: Environment variables must be provided to both the train and generate methods.

In this example we use the same set of environment variables for training and generation, but this is not required.

synth.generate(
    cfg=GenerationConfig(
        model="s3://hazy-s3-demo/models/movielens.hmf",
        data_output=[
            DataLocationOutput(
                name=USER,
                location=f"s3://hazy-s3-demo/synth/movielens/{USER}.csv",
            ),
            DataLocationOutput(
                name=MOVIE,
                location=f"s3://hazy-s3-demo/synth/movielens/{MOVIE}.csv",
            ),
            DataLocationOutput(
                name=RATINGS,
                location=f"s3://hazy-s3-demo/synth/movielens/{RATINGS}.csv",
            ),
        ],
    ),
    env=env,
)

Once generation is complete, you should be able to see the output CSV files in s3://hazy-s3-demo/synth/movielens/.
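The three DataLocationOutput entries in the generate call above follow a single pattern, so they can also be built with a comprehension, mirroring data_input. The sketch below uses a NamedTuple stand-in for DataLocationOutput purely so that it runs on its own; in real code, import the class from the Hazy SDK instead:

```python
from typing import NamedTuple


class DataLocationOutput(NamedTuple):
    # Stand-in with the two fields used in this tutorial; the real class
    # comes from the Hazy SDK.
    name: str
    location: str


SYNTH_PREFIX = "s3://hazy-s3-demo/synth/movielens"

data_output = [
    DataLocationOutput(name=table, location=f"{SYNTH_PREFIX}/{table}.csv")
    for table in ("user", "movie", "ratings")
]
```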

Other cloud storage providers

For other providers, refer to your cloud provider's documentation and supply the appropriate environment variables for accessing your buckets. Hazy uses the standard environment variables of each cloud storage provider.
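As an illustration only, the env dictionaries for GCP and Azure might look like the following. The variable names are assumptions based on each provider's common conventions; confirm the exact names your Hazy deployment expects before relying on them:

```python
# Assumed, conventional variable names; verify against your Hazy
# deployment's documentation before use.
gcp_env = {
    # GCP tools conventionally read a path to a service-account JSON key.
    "GOOGLE_APPLICATION_CREDENTIALS": "/path/to/service-account.json",
}

azure_env = {
    # Azure Storage SDKs conventionally accept a connection string.
    "AZURE_STORAGE_CONNECTION_STRING": "<your-connection-string>",
}
```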