Synthesiser Installation

The Hazy Synthesiser container runs within your secure environment, next to your production data, and trains Generator Models. These Generator Models are then moved out of the secure environment and uploaded to the Hub. The Hub sits within the wider network where it is accessible to Data Scientists or other data pipelines.

The Synthesiser should be run as a batch job, via cron or as part of an orchestration service.
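For example, a crontab entry for a weekly retraining run might look like the following (the wrapper script path is hypothetical - it would contain the docker or podman invocation shown later in this document):

```
# m  h  dom mon dow  command
  0  1  *   *   0    /opt/hazy/train-synthesiser.sh >> /var/log/hazy-train.log 2>&1
```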

The following assumes you are working on an x86-64 Linux server as specified in the requirements.

Training

The Hazy Synthesiser requires close integration with the host filesystem (or Kubernetes Volumes) in order to function.

The inputs to the Synthesiser are:

  1. Training parameters.

    These will be specific to the Synthesiser version, your data and your environment. They specify the origin of the source data (e.g. a filesystem path to a data file), the file path of the final Generator Model file, and various Synthesiser-specific training and evaluation parameters that determine the quality and privacy of the Generator Model. Please refer to the Synthesiser documentation for details.

  2. Source data.

    Hazy can work from a file-based snapshot of your source data or by connecting directly to a database.

Running Standalone

To train via Docker or Podman directly, first you must write some parameter files on the host filesystem:

$ mkdir -p $(pwd)/train/{config,data,output}

$ cat <<EOF > $(pwd)/train/config/params.json
{
    "action": "train",
    "params": {
        "epsilon": 1.0,
        "n_bins": 3,
        "input_path": "/mnt/data/credit-risk.csv",
        "dtypes_path": "/mnt/config/dtypes.json",
        "model_output": "/mnt/output/model.hazymodel",
        "evaluate": true,
        "train_test_split": false,
        "label_columns": ["Risk"],
        "predictors": ["lgbm"],
        "sample_generate_params": {
            "params": {
                "num_rows": 25
            },
            "implementation_override": false
        },
        "evaluation_generate_params": {
            "params": {
                "num_rows": 1000
            },
            "implementation_override": true
        },
        "evaluation_exclude_columns": ["Age"],
        "development_only": false
    }
}
EOF

$ cat <<EOF > $(pwd)/train/config/dtypes.json
{
    "Age": "int64",
    "Sex": "category",
    "Job": "category",
    "Housing": "category",
    "Saving accounts": "category",
    "Checking account": "category",
    "Credit amount": "int64",
    "Duration": "int64",
    "Purpose": "category",
    "Risk": "category"
}
EOF
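Before launching a run, it can be worth checking that every column in the CSV header has an entry in dtypes.json. A minimal standalone sketch of such a pre-flight check (not part of the Hazy tooling):

```python
import csv
import json

def check_dtypes(csv_path, dtypes_path):
    """Return CSV header columns that have no entry in the dtypes file."""
    with open(dtypes_path) as f:
        dtypes = json.load(f)
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return [col for col in header if col not in dtypes]

# Usage (paths relative to the training directory created above):
#   missing = check_dtypes("train/data/credit-risk.csv", "train/config/dtypes.json")
#   if missing:
#       raise SystemExit(f"columns missing from dtypes.json: {missing}")
```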

The source data should be placed in train/data/credit-risk.csv.

All paths in the parameters refer to files within the container, so they must be consistent with the volumes mounted into the container at runtime.

Docker

# or sudo docker...
$ docker run \
    --rm \
    -v $(pwd)/train/config:/mnt/config:ro,Z \
    -v $(pwd)/train/data:/mnt/data:ro,Z \
    -v $(pwd)/train/output:/mnt/output:Z \
    --tmpfs /tmp:rw,noexec \
    --security-opt=no-new-privileges \
    --user $(id -u):$(id -g) \
    hazy/tabular-synthesiser:latest run --parameters /mnt/config/params.json

Configure the container to run as the current user, via the --user argument, so that the trained Generator Models have the correct ownership.

Podman

$ podman run \
    --rm \
    -v $(pwd)/train/config:/mnt/config:ro,Z \
    -v $(pwd)/train/data:/mnt/data:ro,Z \
    -v $(pwd)/train/output:/mnt/output:Z \
    --tmpfs /tmp:rw,noexec \
    --security-opt=no-new-privileges \
    hazy/tabular-synthesiser:latest run --parameters /mnt/config/params.json

Under rootless Podman the container user is mapped to the invoking user automatically, so the --user argument used with Docker is not required for the output files to have the correct ownership.

Running under Kubernetes

Training of Generator Models can be accomplished under Kubernetes using CronJobs.

These use the Cron format string to specify when the job should run. The following simple example defines a retraining job to run every Sunday morning at 1am.
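The five fields of the schedule string read, left to right:

```
#  ┌───────────── minute (0)
#  │ ┌─────────── hour (1)
#  │ │ ┌───────── day of month (* = any)
#  │ │ │ ┌─────── month (* = any)
#  │ │ │ │ ┌───── day of week (0 = Sunday)
   0 1 * * 0
```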

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: hazy-synthesiser
  name: hazy-synthesiser-data
  namespace: hazy
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Gi

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: hazy-synthesiser
  name: hazy-synthesiser-output
  namespace: hazy
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 256Gi

---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: hazy-synthesiser
  name: hazy-synthesiser-config
  namespace: hazy
data:
  params.json: |
    {
        "action": "train",
        "params": {
            "epsilon": 1.0,
            "n_bins": 3,
            "input_path": "/mnt/data/credit-risk.csv",
            "dtypes_path": "/mnt/config/dtypes.json",
            "model_output": "/mnt/output/model.hazymodel",
            "evaluate": true,
            "train_test_split": false,
            "label_columns": ["Risk"],
            "predictors": ["lgbm"],
            "sample_generate_params": {
                "params": {
                    "num_rows": 25
                },
                "implementation_override": false
            },
            "evaluation_generate_params": {
                "params": {
                    "num_rows": 1000
                },
                "implementation_override": true
            },
            "evaluation_exclude_columns": ["Age"],
            "development_only": false
        }
    }
  dtypes.json: |
    {
        "Age": "int64",
        "Sex": "category",
        "Job": "category",
        "Housing": "category",
        "Saving accounts": "category",
        "Checking account": "category",
        "Credit amount": "int64",
        "Duration": "int64",
        "Purpose": "category",
        "Risk": "category"
    }

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hazy-synthesiser-train
  namespace: hazy
spec:
  schedule: "0 1 * * 0"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: hazy-tabular-synthesiser
              image: registry.northwindtraders.com/hazy/tabular-synthesiser:latest
              imagePullPolicy: Always
              args:
                - run
                - --parameters
                - /mnt/config/params.json

              volumeMounts:
                - name: synthesiser-config
                  mountPath: /mnt/config
                  readOnly: true
                - name: synthesiser-data
                  mountPath: /mnt/data
                  readOnly: true
                - name: synthesiser-output
                  mountPath: /mnt/output

              securityContext:
                runAsNonRoot: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]

          volumes:
            - name: synthesiser-config
              configMap:
                name: hazy-synthesiser-config
                items:
                  - key: "params.json"
                    path: "params.json"
                  - key: "dtypes.json"
                    path: "dtypes.json"

            - name: synthesiser-data
              persistentVolumeClaim:
                claimName: hazy-synthesiser-data

            - name: synthesiser-output
              persistentVolumeClaim:
                claimName: hazy-synthesiser-output
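
The manifests can then be applied, and a one-off run triggered for testing without waiting for the schedule (assuming they are saved as hazy-synthesiser.yaml):

```
$ kubectl apply -f hazy-synthesiser.yaml

# Trigger a manual run and follow its logs
$ kubectl create job --from=cronjob/hazy-synthesiser-train hazy-train-manual -n hazy
$ kubectl logs -f job/hazy-train-manual -n hazy
```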

Scaling

Currently Hazy Synthesisers do not support horizontal scaling -- that is, you cannot parallelise the training of a single model across multiple machines.

To improve training speeds, you have two options:

For single-model training, running on a machine with a faster processor, ideally with more RAM (to avoid swapping), will reduce training times. Moving to a bare-metal system rather than a VM may also help by eliminating "noisy neighbour" problems - CPU steal from other VMs running on the same host.

When training multiple models with differing quality, privacy or source data parameters, Hazy's commercial license allows you to run as many Synthesiser instances as you need. So rather than training each model variation sequentially, you could train in batches of n where n is the number of servers you allocate. Note that running the training on multiple VMs hosted on the same bare-metal server will not result in significant performance improvements due to resource contention.

For the latter case we suggest using a queue-based solution to pack the workload efficiently across the available machines.
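As an illustrative sketch of the packing idea (standalone, not Hazy tooling - in practice each worker would dispatch its job to a separate machine, e.g. over SSH), a shared queue drained by one worker per server keeps every machine busy until the batch is done:

```python
import queue
import threading

def run_batches(param_files, n_servers, runner):
    """Drain a queue of training jobs with one worker per server.

    `runner` is whatever launches a single training run for a parameter
    file (e.g. an ssh + docker invocation); each worker pulls the next
    job as soon as it finishes, so no machine sits idle.
    """
    jobs = queue.Queue()
    for path in param_files:
        jobs.put(path)

    def worker():
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return
            runner(path)

    threads = [threading.Thread(target=worker) for _ in range(n_servers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```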

Generation

To generate synthetic data from a trained Generator Model, write a generation parameter file:

$ mkdir -p $(pwd)/generate/{config,output}

$ cat <<EOF > $(pwd)/generate/config/params.json
{
    "action": "generate",
    "params": {
        "output": "/mnt/output/synth_data.csv",
        "model": "/mnt/config/model.hazymodel",
        "num_rows": 10000
    }
}
EOF

Docker

# or sudo docker...
$ docker run \
    --rm \
    -v $(pwd)/generate/config/params.json:/mnt/config/params.json:ro,Z \
    -v $(pwd)/train/output/model.hazymodel:/mnt/config/model.hazymodel:ro,Z \
    -v $(pwd)/generate/output:/mnt/output:Z \
    --tmpfs /tmp:rw,noexec \
    --security-opt=no-new-privileges \
    --user $(id -u):$(id -g) \
    hazy/tabular-synthesiser:latest run --parameters /mnt/config/params.json

Podman

$ podman run \
    --rm \
    -v $(pwd)/generate/config/params.json:/mnt/config/params.json:ro,Z \
    -v $(pwd)/train/output/model.hazymodel:/mnt/config/model.hazymodel:ro,Z \
    -v $(pwd)/generate/output:/mnt/output:Z \
    --tmpfs /tmp:rw,noexec \
    --security-opt=no-new-privileges \
    --user $(id -u):$(id -g) \
    hazy/tabular-synthesiser:latest run --parameters /mnt/config/params.json
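
After generation completes, a quick standalone check (again, not part of the Hazy tooling) that the output holds the requested number of rows:

```python
import csv

def count_rows(csv_path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

# Usage, matching the num_rows requested in params.json:
#   assert count_rows("generate/output/synth_data.csv") == 10000
```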