Configuring Elastic Scaling for an Inference Job

For MindIE Motor inference jobs, you can configure job-level elastic scaling. If a hardware or software fault occurs and the current resources are insufficient to start all instances, the number of running instances can be reduced to ensure that inference jobs continue to run. When the fault is rectified or new hardware is added, the job instances that are waiting to be started can be rescheduled.

Restrictions

Currently, this function can be used only for MindIE Motor inference jobs.

Supported Products

Atlas 800I A2 inference server
Atlas 800I A3 SuperPoD Server

Principles

Figure 1 Principles of elastic scaling

You can configure multiple jobs that belong to one inference process, divide jobs into multiple groups, and configure a scaling rule.
The scaling rule is deployed in the cluster as a ConfigMap. Instances of different types correspond to different groups in the scaling rule. For example, all prefill instances can be classified as group0, and all decode instances can be classified as group1.
In the rescheduling scenario, when a hardware or software fault occurs, Ascend Device Plugin and NodeD report the fault, and Volcano deletes all pods of the corresponding instance.
ClusterD sends global-ranktable to MindIE Controller. For details about global-ranktable, see global-ranktable Description.
MindIE Controller determines the instance that needs to exit based on global-ranktable and instructs the non-zero process in the container to exit.
After detecting pod exceptions, volcano-scheduler deletes all pods of the corresponding instance.
After detecting pod deletion, Ascend Operator collects the running status of all instances under the scaling rule of MindIE Motor.
Ascend Operator determines whether to create pods for the current instance based on the scaling rule.
If pods can be created, the scheduler schedules them after they are created.
Pods in the Pending state are scheduled automatically once resources become sufficient.
If pods cannot be created, they are created after the other instances are running successfully.

Creating a ConfigMap for Scaling Rules

The following example illustrates how to set specific scaling rules and deploy them as a ConfigMap in the Kubernetes cluster:

apiVersion: v1
data:
  elastic_scaling.json: |          # Fixed field. Do not change it.
    {
      "version": "1.0",           # Fixed field. Do not change it.
      "elastic_scaling_list": [ # The following is a template. You can set it as required.
        {
          "group_list": [                # A normal ratio for job running
            {
              "group_name": "group0",     # User-defined
              "group_num": "2",            # User-defined. The value must remain non-increasing in sequence.
              "server_num_per_group": "2"  # User-defined. The value of this parameter must remain consistent across identical group_name entries.
            },
            {
              "group_name": "group1",      
              "group_num": "1",
              "server_num_per_group": "2"
            }
          ]
        },
        {
          "group_list": [                # Another normal ratio for job running
            {
              "group_name": "group0",
              "group_num": "1",
              "server_num_per_group": "2"
            },
            {
              "group_name": "group1",
              "group_num": "1",
              "server_num_per_group": "2"
            }
          ]
        }
      ]
    }
kind: ConfigMap
metadata:
  name: scaling-rule              # User-defined
  namespace: mindie-service     # User-defined, which must be the same as that of the inference job.

For example, if no job in group0 or group1 is running, the group_list entry at index 1 is selected. In this case, a job from either group0 or group1 must be run. The pod for the chosen job is created and then scheduled.
If a job in group0 is running and no job in group1 is running, a pod is created only for the group1 job. The pod for the next group0 job is created only after the group1 job is running successfully.
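
The two cases above can be read as a simple selection procedure. The following Python sketch is only one possible interpretation of that behavior, not the Ascend Operator implementation: the function name groups_below_target and the "walk from the last entry to the first" order are assumptions made for illustration. It returns the groups that are still below the target of the smallest unsatisfied group_list entry, and does not model which of those groups is started first.

```python
# Hypothetical sketch: which groups may create pods, given the running job counts.
# SCALING_LIST mirrors elastic_scaling_list from the ConfigMap example above.
SCALING_LIST = [
    {"group_list": [
        {"group_name": "group0", "group_num": "2", "server_num_per_group": "2"},
        {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"},
    ]},
    {"group_list": [
        {"group_name": "group0", "group_num": "1", "server_num_per_group": "2"},
        {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"},
    ]},
]


def groups_below_target(scaling_list, running):
    """Return the group names that are short of the smallest unsatisfied entry.

    scaling_list: parsed elastic_scaling_list (largest ratio first).
    running: dict mapping group_name to the number of running jobs.
    """
    # Assumption: entries are checked from the smallest ratio (last entry)
    # toward the largest ratio (first entry).
    for entry in reversed(scaling_list):
        short = [
            group["group_name"]
            for group in entry["group_list"]
            if running.get(group["group_name"], 0) < int(group["group_num"])
        ]
        if short:
            return short
    return []  # every entry is already satisfied


if __name__ == "__main__":
    print(groups_below_target(SCALING_LIST, {}))
    print(groups_below_target(SCALING_LIST, {"group0": 1}))
    print(groups_below_target(SCALING_LIST, {"group0": 1, "group1": 1}))
```

Under this reading, a second group0 job becomes eligible only after the index 1 entry (one job in each group) is satisfied, which matches the second case described above.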

The following table describes the fields that can be modified in the preceding ConfigMap.

**Table 1** Parameters
| Parameter | Description | Value | Mandatory |
| --- | --- | --- | --- |
| metadata.name | Name of the ConfigMap that contains the scaling rule. You can set it as required, but it must match the value of the job label mind-cluster/scaling-rule, which places the job under the control of the scaling rule. | String | Yes |
| metadata.namespace | Namespace of the ConfigMap that contains the scaling rule. You can set it as required, but it must be the same as the namespace of the inference job. If this parameter is not set, the namespace is default. | String | No |
| group_name | Name of a group. It corresponds to the job label mind-cluster/group-name, which assigns the job to the specified group. | String | Yes |
| group_num | Target number of jobs in a group. If the number of running jobs in a group is less than the target number, jobs in another group will be started. | String | Yes |
| server_num_per_group | Number of replicas of target jobs in a group. The value must be identical for the same group_name across all group_list entries. | String | Yes |
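
Before deploying a rule, it can help to check the constraints stated above. The following Python sketch is illustrative only: validate_scaling_rule is a hypothetical helper, not part of MindCluster. It assumes that "non-increasing in sequence" means that the group_num of a given group must not grow from one group_list entry to the next, and that values are strings of digits as in the examples. Note that the `#` annotations in the listing sit inside a YAML block scalar and would therefore become part of the JSON text, so this sketch assumes they have been removed.

```python
import json

# The rule from the ConfigMap example, with the "#" annotations removed.
EXAMPLE_RULE = """
{
  "version": "1.0",
  "elastic_scaling_list": [
    {"group_list": [
      {"group_name": "group0", "group_num": "2", "server_num_per_group": "2"},
      {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"}
    ]},
    {"group_list": [
      {"group_name": "group0", "group_num": "1", "server_num_per_group": "2"},
      {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"}
    ]}
  ]
}
"""


def validate_scaling_rule(rule_json):
    """Return a list of problems found in an elastic_scaling.json string."""
    problems = []
    entries = json.loads(rule_json).get("elastic_scaling_list", [])
    previous_num = {}  # group_name -> group_num in the preceding entries
    servers = {}       # group_name -> first server_num_per_group seen
    for index, entry in enumerate(entries):
        for group in entry.get("group_list", []):
            name = group.get("group_name")
            num = group.get("group_num")
            per_group = group.get("server_num_per_group")
            # Assumption: both values are strings of digits, as in the examples.
            if not all(isinstance(v, str) and v.isdigit() for v in (num, per_group)):
                problems.append(f"entry {index}, {name}: values must be digit strings")
                continue
            if name in previous_num and int(num) > previous_num[name]:
                problems.append(
                    f"entry {index}, {name}: group_num increases "
                    f"from {previous_num[name]} to {num}"
                )
            previous_num[name] = int(num)
            if servers.setdefault(name, per_group) != per_group:
                problems.append(
                    f"entry {index}, {name}: server_num_per_group "
                    f"differs from earlier entries"
                )
    return problems


if __name__ == "__main__":
    print(validate_scaling_rule(EXAMPLE_RULE) or "no problems found")
```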

Modifying Scaling Rules

Assume you need to add a job to group0 while two jobs in group0 and one job in group1 are already running. In this case, modify the scaling template and deliver the job as follows.

apiVersion: v1
data:
  elastic_scaling.json: |          # Fixed field. Do not change it.
    {
      "version": "1.0",           # Fixed field. Do not change it.
      "elastic_scaling_list": [   # The following is a template. You can set it as required.
        {
          "group_list": [                    # Add an entry to elastic_scaling_list.             
            {
              "group_name": "group0",     
              "group_num": "3",             # Change the value of group_num of group0.
              "server_num_per_group": "2"  
            },
            {
              "group_name": "group1",      
              "group_num": "1",
              "server_num_per_group": "2"
            }
          ]
        },
        {
          "group_list": [                             
            {
              "group_name": "group0",      
              "group_num": "2",            
              "server_num_per_group": "2"  
            },
            {
              "group_name": "group1",      
              "group_num": "1",
              "server_num_per_group": "2"
            }
          ]
        },
        {
          "group_list": [                # Another normal ratio for job running
            {
              "group_name": "group0",
              "group_num": "1",
              "server_num_per_group": "2"
            },
            {
              "group_name": "group1",
              "group_num": "1",
              "server_num_per_group": "2"
            }
          ]
        }
      ]
    }
kind: ConfigMap
metadata:
  name: scaling-rule              # User-defined
  namespace: mindie-service     # User-defined, which must be the same as that of the inference job.
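
The change shown above (a new, larger entry placed at the front of elastic_scaling_list) can also be produced programmatically. The following Python sketch shows one way to do so on the parsed JSON. The helper add_larger_ratio is hypothetical and only manipulates the data structure; deploying the updated ConfigMap to the cluster is outside its scope.

```python
import copy
import json

RULE = {
    "version": "1.0",
    "elastic_scaling_list": [
        {"group_list": [
            {"group_name": "group0", "group_num": "2", "server_num_per_group": "2"},
            {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"},
        ]},
        {"group_list": [
            {"group_name": "group0", "group_num": "1", "server_num_per_group": "2"},
            {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"},
        ]},
    ],
}


def add_larger_ratio(rule, group_name, new_group_num):
    """Return a copy of rule with a new first entry that raises one group's target."""
    updated = copy.deepcopy(rule)
    # Start from the current largest ratio, which is the first entry.
    top = copy.deepcopy(updated["elastic_scaling_list"][0])
    for group in top["group_list"]:
        if group["group_name"] == group_name:
            if int(new_group_num) < int(group["group_num"]):
                raise ValueError("new target must not be below the current first entry")
            group["group_num"] = str(new_group_num)  # values are stored as strings
            break
    else:
        raise KeyError(f"unknown group: {group_name}")
    # Insert at the front so that group_num stays non-increasing across entries.
    updated["elastic_scaling_list"].insert(0, top)
    return updated


if __name__ == "__main__":
    print(json.dumps(add_larger_ratio(RULE, "group0", 3), indent=2))
```

The output of json.dumps can be pasted as the value of elastic_scaling.json in the ConfigMap.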

If you need to reduce the number of jobs in group0 while the jobs are running properly, modify the template as follows before deleting the job:

apiVersion: v1
data:
  elastic_scaling.json: |          # Fixed field. Do not change it.
    {
      "version": "1.0",           # Fixed field. Do not change it.
      "elastic_scaling_list": [    # The following is a template. You can set it as required.
        {
          "group_list": [                 # Deleted a group_list.
            {
              "group_name": "group0",
              "group_num": "1",            # The target value of group_num of group0 is 1.
              "server_num_per_group": "2"
            },
            {
              "group_name": "group1",
              "group_num": "1",
              "server_num_per_group": "2"
            }
          ]
        }
      ]
    }
kind: ConfigMap
metadata:
  name: scaling-rule              # User-defined
  namespace: mindie-service     # User-defined, which must be the same as that of the inference job.

Preparing a Job YAML File

In the job YAML file, modify or add the following fields to enable job-level scaling.

... 
metadata:  
   labels:  
     ...  
     fault-scheduling: "force"
     fault-retry-times: "100000000"    # To rectify service plane faults, you must configure the number of unconditional retries on the service plane.
     jobID: mindie-xxx     # User-defined
     app: mindie-ms-server
     mind-cluster/scaling-rule: scaling-rule    # The value must be the same as the name of the ConfigMap of the scaling rule.
     mind-cluster/group-name: group0          # The value must be the same as the value of group_name in the ConfigMap of the scaling rule.
spec:
  schedulerName: volcano      # This parameter is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
  runPolicy:
    backoffLimit: 3      # Maximum number of times the job can be rescheduled
...
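
The job labels and the scaling-rule ConfigMap must agree on three points: the ConfigMap name, the group name, and the namespace. The following Python sketch checks these on already-parsed manifests (plain dictionaries). The function check_job_against_rule is a hypothetical helper written for illustration, and it assumes that the job manifest carries metadata.namespace, which the excerpt above does not show.

```python
import json


def check_job_against_rule(job, configmap):
    """Return a list of mismatches between a job manifest and a scaling-rule ConfigMap."""
    problems = []
    labels = job["metadata"].get("labels", {})
    rule = json.loads(configmap["data"]["elastic_scaling.json"])
    group_names = {
        group["group_name"]
        for entry in rule["elastic_scaling_list"]
        for group in entry["group_list"]
    }
    if labels.get("mind-cluster/scaling-rule") != configmap["metadata"]["name"]:
        problems.append("mind-cluster/scaling-rule does not match the ConfigMap name")
    if labels.get("mind-cluster/group-name") not in group_names:
        problems.append("mind-cluster/group-name is not a group_name in the rule")
    # An unset namespace is treated as "default", as stated in Table 1.
    job_ns = job["metadata"].get("namespace", "default")
    rule_ns = configmap["metadata"].get("namespace", "default")
    if job_ns != rule_ns:
        problems.append("job and ConfigMap are in different namespaces")
    return problems


if __name__ == "__main__":
    configmap = {
        "metadata": {"name": "scaling-rule", "namespace": "mindie-service"},
        "data": {"elastic_scaling.json": json.dumps({
            "version": "1.0",
            "elastic_scaling_list": [{"group_list": [
                {"group_name": "group0", "group_num": "1", "server_num_per_group": "2"},
                {"group_name": "group1", "group_num": "1", "server_num_per_group": "2"},
            ]}],
        })},
    }
    job = {"metadata": {
        "namespace": "mindie-service",
        "labels": {
            "mind-cluster/scaling-rule": "scaling-rule",
            "mind-cluster/group-name": "group0",
        },
    }}
    print(check_job_against_rule(job, configmap) or "labels are consistent")
```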
