Distributed architecture: Further configuration

Training and generation jobs can be fine-tuned to meet specific resource requirements.

Each Kubernetes Job is created from a Kubernetes ConfigMap that can be configured via Helm. This includes:

  • Adding CPU/RAM resource requests and limits,
  • Adding tolerations (matching taints on the target nodes) to allow auto-scaling of memory/CPU-intensive Jobs on dedicated Kubernetes Nodes.

The following examples show how these can be configured.

Example 1: Auto-scaling training Jobs on a dedicated node

Training a Hazy model can take a lot of resources, especially memory. DA installations allow dedicated Nodes to be provisioned for training Jobs, scaling up and down depending on the requests that have been dispatched.

dispatcher:
  trainConfigMap:
    ...
    nodeSelector:
      node.kubernetes.io/instance-type: m5.large # AWS large EC2 instance
    tolerations:
    - key: "role" # must match the taint applied to the dedicated node pool
      operator: "Equal"
      value: "synth"
      effect: "NoSchedule"

This is particularly useful for auto-scaling expensive, large nodes on demand. With the above configuration, an m5.large node is provisioned to schedule the training Job. Only Pods that tolerate the role=synth taint (i.e. the training Job's Pods) can be scheduled on that node. Once all Jobs running on that node have finished, it is automatically torn down, reducing resource wastage.
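
The tolerations above only take effect if the auto-scaled nodes actually carry a matching taint. Applying the taint is cluster-specific (for example, via the managed node group or cluster-autoscaler configuration) and is not part of the Helm values; as a rough sketch, the target node is expected to look something like the following (the node name is a placeholder, and the instance-type label is normally set automatically by the cloud provider):

# Illustrative sketch of the dedicated training node; Nodes are usually
# created by the cloud provider / autoscaler rather than applied by hand.
apiVersion: v1
kind: Node
metadata:
  name: training-node-example # placeholder name
  labels:
    node.kubernetes.io/instance-type: m5.large # set automatically by the cloud provider
spec:
  taints:
  - key: "role"
    value: "synth"
    effect: "NoSchedule"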

Example 2: Resource requests and limits

It is also possible to set CPU/memory resource requests and limits for specific Jobs, as shown for a generation Job below.

dispatcher:
  generateConfigMap:
    ...
    resources:
      requests:
        memory: "16Gi"
        cpu: "1000m"
      limits:
        memory: "64Gi"
        cpu: "2000m"
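
The two mechanisms can also be combined. Assuming trainConfigMap and generateConfigMap accept the same keys (as the examples above suggest), a training Job can be pinned to the dedicated node pool and have its resource usage bounded in one place; note that the requests must fit on the selected instance type, or the Pod will never be scheduled:

dispatcher:
  trainConfigMap:
    ...
    # Schedule the training Job onto the dedicated, tainted node pool.
    nodeSelector:
      node.kubernetes.io/instance-type: m5.large
    tolerations:
    - key: "role"
      operator: "Equal"
      value: "synth"
      effect: "NoSchedule"
    # Bound the Job's resource usage. These values are sized to fit an
    # m5.large (2 vCPU / 8 GiB RAM); adjust them for your instance type.
    resources:
      requests:
        memory: "4Gi"
        cpu: "1000m"
      limits:
        memory: "6Gi"
        cpu: "2000m"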

Sub-chart configuration
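
As with any Helm chart, a sub-chart's values are overridden in the parent chart's values.yaml under a top-level key matching the sub-chart's name. A minimal, hypothetical override (the sub-chart name and keys below are placeholders; consult each sub-chart's values.yaml for the actual options) could look like:

# Hypothetical parent-chart values.yaml overriding a DA sub-chart.
# "hub" and the keys below are placeholders, not confirmed Hazy values.
hub:
  replicaCount: 2
  resources:
    requests:
      memory: "2Gi"
      cpu: "500m"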

Full configuration options are available for each of the DA services via their Helm sub-charts (values.yaml), which are listed below: