Manual Synth Installation

The Hazy Synthesiser container can run in isolation inside your secure environment, next to your production data, where it trains Generator Models. These Generator Models can then be moved out of the secure environment and uploaded to the Hub, which sits in the wider network where it is accessible to Data Scientists and other data pipelines.

The Synthesiser should be run as a batch job, either via cron or as part of an orchestration service.
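When training is scheduled from cron, a long run can still be active when the next one is due. The following is a minimal sketch of a wrapper that skips a run if the previous one has not finished. The function name, the lock directory, the script path and the schedule in the crontab comment are all illustrative assumptions, not part of the Hazy tooling.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if no earlier run holds the lock.
# Usage: run_exclusive <lock-dir> <command> [args...]
run_exclusive() {
    lock_dir=$1
    shift
    # mkdir is atomic: it fails if the lock directory already exists.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 75
    fi
    status=0
    "$@" || status=$?
    # Release the lock whether the command succeeded or failed.
    rmdir "$lock_dir"
    return "$status"
}

# Example crontab entry (assumed script path), starting training nightly at 02:00:
#   0 2 * * * /opt/hazy/train.sh >> /var/log/hazy-train.log 2>&1
```

The script named in the crontab entry would call run_exclusive with the docker run command used for training as its arguments.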

The following assumes you are working on an x86-64 Linux server, as specified in the requirements.

Training

The Hazy Synthesiser requires close integration with the host filesystem (or Kubernetes Volumes) to function.

The inputs to the Synthesiser are:

  1. Training parameters.

    These are specific to the Synthesiser version, your data and your environment. They specify the origin of the source data (for example, a filesystem path to a data file), the file path of the final Generator Model file, and various Synthesiser-specific training and evaluation parameters that determine the quality and privacy of the Generator Model. Please refer to the Synthesiser documentation for details.

  2. Source data.

    Hazy can work from a file-based snapshot of your source data or by connecting to a database.

Running Standalone

To train via Docker directly, first write a parameters file on the host filesystem. The examples below assume that the file is saved to train/config/params.json. The parameter file format is documented in the Synthesisers section, under each synthesiser.

All paths in the parameters file refer to files inside the container, so they must be consistent with the volumes mounted into the container at runtime.

# or sudo docker...
$ docker run \
    --rm \
    -v "$(pwd)/train/config":/mnt/config:ro,Z \
    -v "$(pwd)/train/data":/mnt/data:ro,Z \
    -v "$(pwd)/train/output":/mnt/output:Z \
    --tmpfs /tmp:rw,noexec \
    --security-opt=no-new-privileges \
    --user $(id -u):$(id -g) \
    registry.northwindtraders.com/hazy/multi-table:4.0.0 run --parameters /mnt/config/params.json

Configure the container to run as the current user, via the --user argument, so that the trained Generator Models have the correct ownership.
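Because the container only sees what is mounted into it, and writes its output as the user given by --user, it can help to check the host side of the mounts before starting a run. The sketch below assumes the train/config, train/data and train/output layout used in the training command above. The function name is illustrative, and the check does not inspect the contents of params.json.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the host directories mounted into the container.
# Usage: check_train_layout <base-dir>   (for example: check_train_layout "$(pwd)/train")
check_train_layout() {
    base=$1
    # The parameters file must exist and be readable by the current user.
    [ -r "$base/config/params.json" ] || {
        echo "missing or unreadable: $base/config/params.json" >&2
        return 1
    }
    # The source data directory is mounted read-only, so it only has to exist.
    [ -d "$base/data" ] || {
        echo "missing directory: $base/data" >&2
        return 1
    }
    # The output directory must be writable by the user passed to --user.
    [ -d "$base/output" ] && [ -w "$base/output" ] || {
        echo "output directory missing or not writable: $base/output" >&2
        return 1
    }
    return 0
}
```

Calling check_train_layout "$(pwd)/train" before docker run returns a non-zero status and prints the first problem found.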

Scaling

Hazy Synthesisers do not currently support horizontal scaling: you cannot parallelise the training of a single model across multiple machines.

To improve training speeds, you have two options:

  1. To reduce the training time of a single model, run the training on a machine with a faster processor and more RAM (to avoid swapping). Moving from a VM to a bare-metal system may also help by removing "noisy neighbour" effects, where other VMs running on the same host steal CPU time.

  2. When training multiple models with differing quality, privacy or source data parameters, Hazy’s commercial license allows you to run as many Synthesiser instances as you need. So, rather than training each model variation sequentially, you can train in batches of n, where n is the number of servers you allocate. Note: running the training on multiple VMs hosted on the same bare-metal server does not significantly improve performance, because the VMs contend for the same resources.

For the latter case, we suggest using a queue-based solution to pack the workload efficiently across the available machines. The Hazy distributed architecture offers some options for the automated setup of such a system.
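As a rough illustration of the queue-based idea, the sketch below uses a directory as the queue: each pending job is one parameters file, and a worker claims a job by renaming it. It assumes that all workers share one filesystem on which rename is atomic, which may not hold for every network filesystem. The directory names and the function are hypothetical and are not part of the Hazy distributed architecture.

```shell
#!/bin/sh
# Hypothetical file-based work queue.
#   <queue>/pending  holds one parameters file per model still to be trained
#   <queue>/claimed  holds files that a worker has taken
# Usage: claim_next_job <queue-dir>   (prints the path of the claimed file)
claim_next_job() {
    queue=$1
    for job in "$queue"/pending/*.json; do
        # If the glob matched nothing, the literal pattern is returned: skip it.
        [ -e "$job" ] || continue
        target="$queue/claimed/$(basename "$job")"
        # If another worker renamed the file first, mv fails and we try the next one.
        if mv "$job" "$target" 2>/dev/null; then
            echo "$target"
            return 0
        fi
    done
    # Nothing left to claim.
    return 1
}
```

Each server would then loop on claim_next_job and start one training container per claimed parameters file, so a faster machine naturally picks up more jobs than a slower one.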

Generation

To generate data, first write a parameters file on the host filesystem. The parameter file format is documented in the Synthesisers section, under each synthesiser. The example below assumes that the file is saved to generate/config/params.json.

# or sudo docker...
$ docker run \
    --rm \
    -v "$(pwd)/generate/config/params.json":/mnt/config/params.json:ro,Z \
    -v "$(pwd)/train/output/model.hazymodel":/mnt/config/model.hazymodel:ro,Z \
    -v "$(pwd)/generate/output":/mnt/output:Z \
    --tmpfs /tmp:rw,noexec \
    --security-opt=no-new-privileges \
    --user $(id -u):$(id -g) \
    registry.northwindtraders.com/hazy/multi-table:4.0.0 run --parameters /mnt/config/params.json