Single container install

Overview

Single container (SC) installations allow users of Hazy to run a fully-featured version of the Hazy application from a single Docker image. This is particularly useful for scenarios where cloud-native platforms like Kubernetes are unavailable.

Requirements

Hardware

Hazy runs equally well on bare-metal or within a Virtual Machine (VM).

The hardware requirements below are for guidance only. Your installation may need more or fewer resources depending on your requirements and data.

For datasets <= 100MB and demo installations

  • Processor x86_64, 4+ cores
  • RAM: >= 32GB
  • AWS EC2 instance option: r5.xlarge
  • Disk Space for Hub docker volume: >= 128GB

For datasets > 100MB

  • Processor x86_64, 8+ cores
  • RAM: >= 128GB
  • AWS EC2 instance option: r5.4xlarge
  • Disk Space for Hub docker volume: >= 1TB

Disk space

Storage requirements are very installation specific. The Hub stores:

  • Synthesiser container images (~1 GB per image).
  • Generator Models (size heavily dependent on data, ~1 MB to 1 GB per Model).
  • Database State and snapshots (~1 GB).

Over time the number of Synthesiser images and Generator Models grows linearly.

The Synthesiser has no persistent state but does require available storage to read in source data and write out trained Generator Models.

Operating System

A Linux-based server. We recommend the latest Ubuntu Server LTS, but other distributions are equally valid.

Network

Hazy is designed to work entirely within an on-premises installation or within a private cloud environment and to respect the expected security constraints of production environments. As such, Hazy applications do not require internet access to function.

The single container install requires access to the source data, so it must be installed within the same network partition as the data.

Importing container images

Only the hazy/multi-table image needs to be pulled to a local registry. Other images, such as hazy/keycloak for multi-user authentication, are only required for more complex configurations.
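
For illustration, pulling the image and pushing it to a local registry might look like the following. The registry hosts and tag shown here are placeholders; Hazy will provide the actual image reference.

# Pull the multi-table image using the reference provided by Hazy (placeholder shown)
docker pull registry.example.com/hazy/multi-table:0.0.0

# Re-tag and push it to your local registry (placeholder host)
docker tag registry.example.com/hazy/multi-table:0.0.0 my-local-registry:5000/hazy/multi-table:0.0.0
docker push my-local-registry:5000/hazy/multi-table:0.0.0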

Install

Quick install (via hazy-install.sh)

You can run our convenience installation script with the following:

bash -c "$(wget -O - https://hazy.com/docs/scripts/hazy-install.sh)"
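
If you prefer to review the script before executing it, download it first:

# Download, inspect, then run the installation script
wget -O hazy-install.sh https://hazy.com/docs/scripts/hazy-install.sh
less hazy-install.sh
bash hazy-install.sh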

Running with the UI

The Hub can be started from the command line on any host that has access to the docker registry from which the Hazy synthesiser is made available.

CONFIGURATOR_INPUT=/host/input/volume                      # see note 1.
CONFIGURATOR_OUTPUT=/host/output/volume                    # see note 2.
CONFIGURATOR_APP_DATA=/host/data/volume                    # see note 3.
CONFIGURATOR_ENV_FILE=$(pwd)/.env                          # see note 4.
CONFIGURATOR_DOCKER_IMAGE="hazy/multi-table:0.0.0" # see note 5.
CONFIGURATOR_PORT=5001                                     # see note 6.
CONFIGURATOR_FEATURES=features.json                        # see note 7.
CONFIGURATOR_FEATURES_SIG=features.sig.json
CONFIGURATOR_DOCKER_SOCKET=/var/run/docker.sock            # see note 8.
CONFIGURATOR_WORK_DIR=/host/workdir/volume                 # see note 9.
docker run                                                      \
  --detach                                                      \
  -u $(id -u):$(id -g)                                          \
  --group-add $(getent group docker | cut -d: -f3)              \
  --read-only                                                   \
  --tmpfs /tmp:rw,noexec                                        \
  --pids-limit 20                                               \
  --security-opt=no-new-privileges                              \
  -m 4g --cpus=2                                                \
  --name hazy-configurator                                      \
  -p $CONFIGURATOR_PORT:5001                                    \
  -v $CONFIGURATOR_APP_DATA:/configurator-app-data              \
  -v $CONFIGURATOR_INPUT:/configurator-input:ro                 \
  -v $CONFIGURATOR_OUTPUT:/configurator-output                  \
  -v $CONFIGURATOR_FEATURES:/var/lib/hazy/features.json         \
  -v $CONFIGURATOR_FEATURES_SIG:/var/lib/hazy/features.sig.json \
  -v $CONFIGURATOR_DOCKER_SOCKET:/var/run/docker.sock           \
  -v $CONFIGURATOR_WORK_DIR:$CONFIGURATOR_WORK_DIR              \
  -e CONFIGURATOR_FEATURES=$CONFIGURATOR_FEATURES               \
  -e CONFIGURATOR_FEATURES_SIG=$CONFIGURATOR_FEATURES_SIG       \
  -e CONFIGURATOR_INPUT=$CONFIGURATOR_INPUT                     \
  -e CONFIGURATOR_OUTPUT=$CONFIGURATOR_OUTPUT                   \
  -e CONFIGURATOR_APP_DATA=$CONFIGURATOR_APP_DATA               \
  -e CONFIGURATOR_DB_URI=sqlite:////configurator-app-data/db    \
  -e CONFIGURATOR_WORK_DIR=$CONFIGURATOR_WORK_DIR               \
  --env-file $CONFIGURATOR_ENV_FILE                             \
  $CONFIGURATOR_DOCKER_IMAGE                                    \
  configure

Notes:

  • [1] A volume must be mounted into the docker container so that the configurator can access sample source data for analysis. The host volume can be any path on the filesystem, as long as permissions allow it to be mounted readable by the container. Here we specify the path as an environment variable and later pass it to the container as both a bind mount and an environment variable.
  • [2] A volume must be mounted into the docker container so that the configurator can write out synthetic data. The host volume can be any path on the filesystem, as long as permissions allow it to be mounted writeable by the container. Here we specify the path as an environment variable and later pass it to the container as both a bind mount and an environment variable.
  • [3] A host volume must be mounted to store the application's data. An SQLite database will be created in that folder when the application first runs.
  • [4] This .env file stores secrets as key-value pairs. It should be protected with appropriate permissions so that only authorised users can read its contents.
  • [5] This is the standard synth image which also houses the configurator. Hazy will inform the client which docker image to use.
  • [6] The configurator web server runs on port 5001; for the web UI to be accessible from the docker host, the port must be mapped. The example maps container port 5001 to the same port number on the host. The host port number (the left-hand side of the mapping) can be modified if necessary. A quick sanity check of the running container is sketched after these notes.
  • [7] As of version 2.3, licensing files are distributed separately and must be mounted into the container at launch in order to activate correctly.
  • [8] The docker socket must be mounted into the container. This is to allow generation from old model files: Models trained with previous releases must be loaded into their original synth image so that they deserialise correctly. For the container to communicate with the docker daemon, it is important to verify the user and group settings (-u and --group-add).
  • [9] The work directory is mapped to the same path inside the container so that paths remain consistent across docker images. This, again, is to allow generation from old model files.
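
Once the container has started, a minimal sanity check (assuming the default port mapping above) is to confirm the container is running and that the web server responds on the mapped host port:

# Confirm the container is up and inspect recent logs
docker ps --filter name=hazy-configurator
docker logs --tail 50 hazy-configurator

# Confirm the web server responds on the mapped host port
curl -I http://localhost:5001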

The container restrictions, copied below, are advisory and should be tuned based on the requirements of the host system and context.

  --read-only                                                   \
  --tmpfs /tmp:rw,noexec                                        \
  --pids-limit 20                                               \
  --security-opt=no-new-privileges                              \
  -m 4g --cpus=2                                                \

In .env

A set of environment variables can be configured to use a Secrets Manager.

Required

  • HAZY_DS_ROOT_KEY This should be a base64-encoded token of at least 32 bytes. It can be generated using the command openssl rand -base64 32 or openssl rand 32 | openssl enc -A -base64.
  • HAZY_ANALYSIS_ENCRYPTION_KEY This must be a base64-encoded token of exactly 32 bytes. It can be generated using the command openssl rand -base64 32 or openssl rand 32 | openssl enc -A -base64. Both keys can also be generated directly into the .env file, as sketched below.
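
For example, both keys can be generated and written to a new .env file in one step (a minimal sketch; the umask ensures the file is readable only by its owner):

# Generate both secrets as single-line base64 tokens and write them to .env
umask 177
cat > .env <<EOF
HAZY_DS_ROOT_KEY="$(openssl rand 64 | openssl enc -A -base64)"
HAZY_ANALYSIS_ENCRYPTION_KEY="$(openssl rand -base64 32)"
EOF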

Optional

  • BOOTSTRAP_DATA_JSON path (default=None) The path to a data_sources.json file which gets mounted into the container, containing data sources which should be configured by default. See below for schema and example contents.

The Database Subsetting feature can make use of an encrypted cache to speed up subsequent training runs. To use this feature you will need to set up the following:

  • HAZY_CACHE_FOLDER This defaults to the mounted "${CONFIGURATOR_APP_DATA}/cache". It can be overridden to point at cloud storage locations using s3:// or gs://.
  • HAZY_CACHE_PASSWORD There is no default; a value is required to use this feature. The customer is responsible for picking a strong password (see the example below).
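
For example, enabling the cache against an S3 location might look like this in the .env file (the bucket name and password are placeholders):

# Hypothetical cache settings; choose a strong, unique password
HAZY_CACHE_FOLDER="s3://my-hazy-cache-bucket/cache"
HAZY_CACHE_PASSWORD="a-strong-unique-password"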

The following env vars are optional and can be added to the .env file to alter the type of automated data type detection that occurs during the analysis stage of configuration:

  • ANALYSIS_VERSION Literal[0, 1] (default=1) When set to 1, an enhanced version of analysis is conducted during configuration. This should yield more detailed and accurate analysis results, making the configuration of a dataset a less manual process. It is important to note that the results of this analysis should always be reviewed by someone with an understanding of the data. If enhanced analysis is on, the following env vars can also be used to override defaults and change the behaviour of the analysis:
    • DTYPE_ANALYSIS_MIN_CONFIDENCE float (0, 1] (default=0.5) The confidence threshold above which a data type will be considered a match.
    • DTYPE_ANALYSIS_STOP_SEARCH_CONFIDENCE float (0, 1] (default=1.0) The confidence threshold above which further search for a data type match will be stopped and the first data type that crossed this confidence threshold will be chosen for the column. Decreasing this may reduce the accuracy of results.
    • ANALYSIS_NROWS int (default=500000) The number of records that will be considered for analysis. Increasing this may improve the accuracy of results but analysis will take longer to complete.
    • ANALYSER_SAMPLE_SIZE int (default=100) To improve performance, each analyser initially attempts to find matches on a sample of the column. Only if matches are found, will the analyser be run on the full column. This env var dictates how many records will be included in the initial sample. Increasing this may improve the accuracy of results but analysis will take longer to complete.
    • ANALYSE_TABLE_KEYS bool (default=True) When set to True, the analysis will attempt to deduce the primary and foreign key relationships of the tables in the dataset. This is done either by querying the key constraints when reading the data from a database, or when reading the data from disk/S3, an attempt is made to infer the keys by analysing the underlying data itself.
    • FALLBACK_KEY_ANALYSIS bool (default=True) When set to True, if the source data is being read from a database but the configurator is unable to read the key constraints, an attempt will be made to infer the constraints from the underlying data itself.
    • FKEY_ALLOWED_ORPHANED_RECORDS float [0, 1] (default=0.05) The proportion of orphaned records in a child table column above which it will no longer be considered a foreign key candidate. If the data contains a high number of orphaned records between tables then increasing this env var may yield better results.
    • ANALYSER_INCLUDE_SAMPLE_DATA bool (default=True) When set to True, table analysis results will include examples and statistics from the source data. This data will not be returned (nor persisted) if set to False.
Example of a .env file containing only required variables.
HAZY_ANALYSIS_ENCRYPTION_KEY="7fJGtyNPMz76ki0kua7sNB2cwOsSOowOhhRllatC6r8="
HAZY_DS_ROOT_KEY="yPykt9NFoxvNTDSeEl3RUdqoidnR1TwQz6ZqPx/Ar+CkbisSh+mFZ9PpSEA9i7Fvo5iPvYp7ixrbOJlyYQcPDw=="
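
An .env file that additionally tunes some of the optional analysis settings described above might look like this (the override values are illustrative only):

# Required keys as above, plus optional analysis overrides
HAZY_ANALYSIS_ENCRYPTION_KEY="7fJGtyNPMz76ki0kua7sNB2cwOsSOowOhhRllatC6r8="
HAZY_DS_ROOT_KEY="yPykt9NFoxvNTDSeEl3RUdqoidnR1TwQz6ZqPx/Ar+CkbisSh+mFZ9PpSEA9i7Fvo5iPvYp7ixrbOJlyYQcPDw=="
ANALYSIS_VERSION=1
DTYPE_ANALYSIS_MIN_CONFIDENCE=0.7
ANALYSIS_NROWS=1000000
FKEY_ALLOWED_ORPHANED_RECORDS=0.1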

In data_sources.json

  • The environment variable BOOTSTRAP_DATA_JSON needs to point to this file.
  • Each source must have a source_type of either "s3", "gcs", "azure", "disk", or "database".
  • Each source must have a name identifier which should be a unique string.
  • Each source must have an io type of either "input", "output", "input_output", or "upload".
  • Admins should set the is_global flag to true for bootstrap.
  • For database sources, the port, username, password and params fields are optional and may be omitted from the connection if not needed.
  • For s3, gcs, and azure sources, the path value should be a directory path starting with s3://, gs://, or azure://.
  • For azure, a connection_string is required to provide connection credentials.
Example of a data_sources.json file demonstrating database, S3, and GCS connection details.
{
    "sources": [
        // Database source example
        {
            "source_type": "database",  
            "io": "input",
            "name": "my-db2-database-source",
            "drivername": "db2",
            "username": "DB_USERNAME",
            "password": "DB_PASSWORD",
            "host": "DB_HOST",
            "port": "DB_PORT",
            "database": "DB_DATABASE",
            "params": {
                "Security": "SSL",
                "SSLServerCertificate": "/etc/ssl/certs/db2/DigiCertGlobalRootCA.crt"
            },
            "is_global": true
        },
        {
            "source_type": "database",
            "io": "input_output",
            "name": "my-mssql-database-source",
            "drivername": "mssql+pyodbc",
            "username": "SA",
            "password": "StrongPassword123!",
            "host": "localhost",
            "port": "1444",
            "database": "master",
            "is_global": true
        },
        // S3 source example
        {
            "source_type": "s3",
            "io": "output",
            "name": "my-s3-source-2",
            "path": "s3://my-s3-bucket",
            "is_global": true
        },
        // GCS source example
        {
            "source_type": "gcs",
            "io": "upload",
            "name": "my-gcs-source",
            "path": "gs://my-gs-bucket",
            "is_global": true
        }
    ]
}
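
To have the configurator pick this file up, it must be mounted into the container and referenced via BOOTSTRAP_DATA_JSON. A minimal sketch of the extra docker run arguments (the in-container path is illustrative, not mandated):

  -v $(pwd)/data_sources.json:/var/lib/hazy/data_sources.json:ro \
  -e BOOTSTRAP_DATA_JSON=/var/lib/hazy/data_sources.json         \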

JSON schema for data_sources.json

The hub database

The hub stores external state relating to the configuration workflows that have taken place or are currently in progress. The database should be located on an externally mounted host volume. Backups should be taken routinely and the backup files should be placed on reliable storage.
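
A minimal backup sketch, assuming the sqlite3 CLI is available on the host and the application data volume and database path used in the docker run example above (adjust paths and the destination to suit your environment):

# Take a consistent snapshot of the hub's SQLite database from the host volume
sqlite3 /host/data/volume/db ".backup '/backups/hazy-hub-$(date +%F).db'"

# Copy the snapshot to reliable, off-host storage (destination is a placeholder)
rsync -av /backups/hazy-hub-$(date +%F).db backup-host:/srv/backups/hazy/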