Distributed architecture: Further configuration
Training and generation jobs can be fine-tuned to meet specific resource requirements.
Each Kubernetes Job is created using a Kubernetes ConfigMap that can be configured from within Helm. This includes:
- Adding CPU/RAM resource requests and limits,
- Adding taints and tolerations to allow auto-scaling of memory- or CPU-intensive Jobs on purposed Kubernetes Nodes.
The following examples show how these can be configured.
Example 1: Auto-scaling a training Job on a purposed node
Training a Hazy model can take a lot of resources, especially memory. DA installs allow purposed Nodes to be provisioned for training Jobs, auto-scaling up and down depending on the requests that have been dispatched.
```yaml
dispatcher:
  trainConfigMap:
    ...
    nodeSelector:
      node.kubernetes.io/instance-type: m5.large # AWS large EC2 instance
    tolerations:
      - key: "role"
        operator: "Equal"
        value: "synth"
        effect: "NoSchedule"
```
This is particularly useful for auto-scaling expensive, large nodes on demand. The above configuration will cause an m5.large node to be provisioned to schedule the training Job. Only Pods that tolerate the role=synth taint, such as those running the training Job, can be scheduled on that node. Once all the Jobs running on that node have finished, it is automatically torn down, reducing resource wastage.
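
For the toleration above to take effect, the purposed node itself must carry the matching taint (and, for the nodeSelector, the matching instance-type label). In practice these are usually set through the node group or cluster-autoscaler configuration rather than on individual Nodes; the sketch below only illustrates the node-side counterpart, using a placeholder node name.

```yaml
# Sketch of the node-side taint and label that the training Pods are matched against.
# The node name is a placeholder; taints and labels are normally applied via the
# node group / cluster-autoscaler configuration or `kubectl taint`, not by hand.
apiVersion: v1
kind: Node
metadata:
  name: training-node-example   # placeholder
  labels:
    node.kubernetes.io/instance-type: m5.large
spec:
  taints:
    - key: "role"
      value: "synth"
      effect: "NoSchedule"
```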
Example 2: Resource requesting/limiting
It is also possible to request and limit CPU and memory resources for specific Jobs, as shown for a generation Job below.
```yaml
dispatcher:
  generateConfigMap:
    ...
    resources:
      requests:
        memory: "16Gi"
        cpu: "1000m"
      limits:
        memory: "64Gi"
        cpu: "2000m"
```
Sub-chart configuration
Full configuration options are available for each of the DA services via their Helm sub-charts (values.yaml), listed below: