Ascend Operator

YAML Parameters (acjob)

Before configuring the YAML file for an acjob, familiarize yourself with its YAML parameters. For details, see Table 1.

Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.

Table 1 YAML parameters

Parameter

Value

Description

framework

  • mindspore
  • pytorch
  • tensorflow

-

jobID

Unique ID of the MindIE Motor job in the cluster. Set this parameter as required.

This parameter is supported only on Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.

app

Role of MindIE Motor in the AscendJob. The value can be mindie-ms-controller, mindie-ms-coordinator, or mindie-ms-server.

  • If the YAML file of acjob contains both jobID and app, Ascend Operator automatically passes the environment variables MINDX_TASK_ID, APP_TYPE, and MINDX_SERVER_IP and identifies the job as a MindIE inference job.
  • For details about the preceding environment variables, see Table 2.
  • This parameter is supported only on Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.

mind-cluster/scaling-rule: scaling-rule

Name of the ConfigMap of the scaling rule.

This parameter can be used only for MindIE Motor inference jobs on the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.

mind-cluster/group-name: group0

Name of the group of the scaling rule.

This parameter can be used only for MindIE Motor inference jobs on the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.

podAffinity

The job is scheduled to the physical SuperPoD that has more affinity pods.

This parameter can be used only for MindIE Motor inference jobs on the Atlas 800I A3 SuperPoD Server.

sp-fit

SuperPoD scheduling policy.

idlest: The job is scheduled to a more idle physical SuperPoD.

This parameter can be used only for MindIE Motor inference jobs on the Atlas 800I A3 SuperPoD Server.

ring-controller.atlas

  • For Atlas A2 training product, A200T A3 Box8 SuperPoD Server, Atlas 900 A3 SuperPoD, and Atlas 800T A3 SuperPoD Server, the value is ascend-{xxx}b.
  • For an Atlas 800 training server and a server with Atlas 300T training cards, the value is ascend-910.

Processor type for specified products.

You need to set this parameter both in ConfigMap and task.

schedulerName

The default value is volcano. Set this parameter based on your actual requirements.

Scheduler selected when gang scheduling is enabled for Ascend Operator.

minAvailable

The default value is the total number of job replicas.

Total number of job replicas when Ascend Operator enables gang scheduling and the scheduler is Volcano.

queue

The default value is default. Set this parameter based on your actual requirements.

Queue to which a job belongs. This parameter takes effect when Ascend Operator enables gang scheduling and the scheduler is Volcano.

(Optional) successPolicy

  • Empty (default). If you do not set this parameter, an empty value is used.
  • AllWorkers

Condition for a job to be considered successful. An empty value means that the job is considered successful as long as one pod succeeds. AllWorkers means that the job is considered successful only when all pods succeed.

container.name

ascend

The name of the training container must be ascend.

(Optional) ports

If you do not set corresponding parameters, the system fills in the following values by default:

  • name: ascendjob-port
  • containerPort: 2222

Collective communication port for distributed training. You can set containerPort as required. If containerPort is not set, the default port 2222 is used.

replicas

  • Single server: 1
  • Distributed: N

N indicates the number of job replicas.

image

-

Training image name. Set this parameter as required.

(Optional) host-arch

Arm: huawei-arm

x86_64: huawei-x86

Architecture of the node where a training job is executed. Set this parameter as required.

In a distributed training job, ensure that the nodes running the training job have the same architecture.

huawei.com/schedule_policy

See Table 3 for its configurations.

Layout of the AI processors on which the job is to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type.

NOTE:

This field can be used only on the Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas A2 inference products, and Atlas A3 inference product.

sp-block

Number of processors on logical SuperPoDs.

  • For a single-server system, the value must be the same as the number of processors requested by a job.
  • For a distributed system, the value must be an integer multiple of the number of processors on a node, and the total number of processors requested by a job must be an integer multiple of the value.

Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling.

For details, see UnifiedBus Interconnect Device Network Description.

tor-affinity

  • large-model-schema: foundation model jobs or padding jobs
  • normal-schema: common job
  • null: switch affinity scheduling not used
    NOTE:

    You need to select a job type based on the number of job replicas. If the number of job replicas is less than 4, the job is a padding job. If the number of job replicas is greater than or equal to 4, the job is a foundation model job. The number of replicas of a common job is not limited.

The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type.

NOTE:
  • Switch affinity scheduling 1.0 supports Atlas training products and Atlas A2 training products, as well as PyTorch and MindSpore frameworks.
  • Switch affinity scheduling 2.0 supports Atlas A2 training products, as well as the PyTorch framework.

pod-rescheduling

  • on: Enable pod-level rescheduling.
  • Any other value, or the field is omitted: Disable pod-level rescheduling.

For pod-level rescheduling, if a job is faulty, the system does not delete all pods of the job. Instead, the system deletes the faulty pods, creates new pods, and reschedules the pods.

NOTE:
  • Job-level rescheduling is the default rescheduling mode. To enable pod-level rescheduling, add this field.
  • Currently, TensorFlow does not support pod-level rescheduling.

recover-strategy

Available recovery policy.

  • retry: process-level online recovery
  • recover: process-level rescheduling
  • recover-in-place: process-level in-place recovery
  • elastic-training: elastic training
  • dump: saving the dying gasp checkpoint
  • exit: exiting training

recover-strategy is configured in annotations of the job YAML file. The value can be any combination of the six policies. Use commas (,) to separate them.

process-recover-enable

  • on: Enable process-level rescheduling and process-level online recovery.

    Process-level rescheduling and graceful fault tolerance cannot be enabled at the same time. If both of them are enabled, training is resumed through job-level rescheduling.

  • pause: Temporarily disable process-level rescheduling and process-level online recovery.
  • off, or the field is omitted: Disable process-level rescheduling and process-level online recovery.

Ascend Operator automatically adds the process-recover-enable=on label to the job based on the configured recover-strategy. You do not need to manually specify the label.

subHealthyStrategy

  • ignore: Ignore the subhealthy node. The node is not preferentially scheduled during affinity scheduling of subsequent jobs.
  • graceExit: Stop using the subhealthy node and perform rescheduling after the dying gasp checkpoint file is saved. Subsequent jobs will not be scheduled to this node.
  • forceExit: Stop using the subhealthy node, exit the job without saving files, and perform rescheduling. Subsequent jobs will not be scheduled to this node.
  • hotSwitch: Execute hot switching. After starting the backup pod, suspend the training job, and restart the training job on the new node.
  • The default value is ignore.

Processing policy for nodes in the SubHealthy status.

NOTE:
  • When the graceExit policy is used, ensure that the training framework can receive the SIGTERM signal and save the checkpoint file.
  • For details about the restrictions on the hotSwitch policy, see Restrictions.

accelerator-type

  • Atlas 800 training server (full configuration of NPUs): module
  • Atlas 800 training server (half configuration of NPUs): half
  • Server (with Atlas 300T training cards): card
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit: module-{xxx}b-8
  • Atlas 200T A2 Box16 heterogeneous subrack and Atlas 200I A2 Box16 heterogeneous subrack: module-{xxx}b-16
  • Atlas 900 A3 SuperPoD: module-a3-16-super-pod

Set this parameter based on the type of the node where a training job is executed. For the Atlas 800 training server (NPU full configuration), this parameter can be omitted.

NOTE:

You can run the npu-smi info command to query the number in the processor model name, which is shown in the Name field of the command output. For example, if the number is 910, the value of {xxx} is 910.

huawei.com/Ascend910

Atlas 800 training server (full configuration of NPUs):
  • Single-server single-processor job: 1
  • Single-server multi-processor job: 2, 4, 8
  • Distributed job: 1, 2, 4, 8
Atlas 800 training server (half configuration of NPUs):
  • Single-server single-processor job: 1
  • Single-server multi-processor job: 2, 4
  • Distributed job: 1, 2, 4
Server (with Atlas 300T training cards):
  • Single-server single-processor job: 1
  • Single-server multi-processor job: 2
  • Distributed job: 2
Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit:
  • Single-server single-processor job: 1
  • Single-server multi-processor job: 2, 3, 4, 5, 6, 7, 8
  • Distributed job: 1, 2, 3, 4, 5, 6, 7, 8
Atlas 200T A2 Box16 heterogeneous subrack and Atlas 200I A2 Box16 heterogeneous subrack:
  • Single-server single-processor job: 1
  • Single-server multi-processor job: 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
  • Distributed job: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
Atlas 900 A3 SuperPoD, A200T A3 Box8 SuperPoD Server, and Atlas 800T A3 SuperPoD Server:
  • Single-server single-processor job: 1
  • Single-server multi-processor job: 2, 4, 6, 8, 10, 12, 14, 16
  • Distributed job: 2, 4, 6, 8, 10, 12, 14, 16
  • Logical SuperPoD affinity for Atlas 900 A3 SuperPoD: 16

Number of requested NPUs. Set this parameter as required.

(.kind=="AscendJob").spec.replicaSpecs.[Master|Scheduler|Worker].template.spec.containers[0].env[name==ASCEND_VISIBLE_DEVICES].valueFrom.fieldRef.fieldPath

The value is in the format of metadata.annotations['huawei.com/AscendXXX'], where XXX indicates the processor model (910, 310, or 310P). The value must be the same as the actual processor type in the environment.

Ascend Docker Runtime obtains the value of this parameter to mount NPUs of the corresponding type to the container.

NOTE:

This parameter applies only to full NPU scheduling with the Volcano scheduler. If you use static vNPU scheduling or another scheduler, delete the fields of this parameter from the example YAML file.

fault-scheduling

grace

Enables graceful deletion mode for the job: the original pod is deleted gracefully during rescheduling. If the failure persists after 15 minutes, the original pod is forcibly deleted.

Set this parameter to grace for process-level rescheduling and process-level online recovery.

force

Enables forcible deletion mode for the job: the original pod is forcibly deleted during rescheduling.

off, None (no fault-scheduling field), or any other value

The job does not use the resumable training feature, but maxRetry of Kubernetes still takes effect.

fault-retry-times

> 0

To rectify service plane faults, you must configure the number of unconditional retries on the service plane.

NOTE:
  • To use unconditional retry, ensure that a training process will cause the container to exit abnormally when it fails. If the container does not exit abnormally, the retry will fail.
  • Currently, only the Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit support unconditional retry.
  • Service plane faults will be triggered during process-level recovery. This parameter must be set if process-level recovery is required.

None (no fault-retry-times) or 0

The job does not use unconditional retry and cannot detect service plane faults, but maxRetry of VolcanoJob still takes effect.

backoffLimit

> 0

Maximum number of times a faulty job is rescheduled. When the number of rescheduling times reaches the value of backoffLimit, the job is no longer rescheduled.

NOTE:

If both backoffLimit and fault-retry-times are configured, rescheduling stops as soon as the number of rescheduling times reaches either value.

None (no backoffLimit) or backoffLimit set to 0

The total number of rescheduling times is not limited.

NOTE:

If backoffLimit is not configured but fault-retry-times is configured, the number of rescheduling times is specified by fault-retry-times.

restartPolicy

  • Never: never restart
  • Always: always restart
  • OnFailure: restart upon failures
  • ExitCode: determines whether to restart the pod based on the process exit error code. If the error code ranges from 1 to 127, the pod is not restarted. If the error code ranges from 128 to 255, the pod is restarted.
    NOTE:

    Training jobs of the vcjob type do not support ExitCode.

Container restart policy. When unconditional retry upon service plane faults is configured, the value of this parameter must be Never.

terminationGracePeriodSeconds

0 < terminationGracePeriodSeconds < value of grace-over-time

Duration from the time when the container receives SIGTERM to the time when the container is forcibly stopped by Kubernetes. The value must be greater than 0 and less than the value of grace-over-time in the volcano-v{version}.yaml file. In addition, ensure that the checkpoint file can be saved completely. Change the value as required. For details, see Container Lifecycle Hooks on the Kubernetes official website.

NOTE:

This field takes effect only when fault-scheduling is set to grace. If fault-scheduling is set to force, this field has no effect.

hostNetwork

  • true: The host IP address is used to create a pod.
  • false: The host IP address is not used to create a pod.
  • If the cluster scale is large (the number of nodes is greater than 1000), you are advised to use the host IP address to create a pod.
  • If this parameter is not specified, the host IP address is not used to create a pod by default.
    NOTE:

    If hostNetwork is set to true and the RankTable file path is mounted in the job YAML file, the training script can parse the RankTable file to obtain the host IP address of the pod and use it to establish a link. If the RankTable file path is not mounted in the job YAML file, the service IP address is used to establish a link.
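
The interaction between backoffLimit and fault-retry-times described in Table 1 can be summarized with a small helper. This is only a sketch of the rules as stated in the table, not code from Ascend Operator; the function name effective_reschedule_limit is hypothetical, and treating negative values like "not configured" is an assumption.

```python
from typing import Optional


def effective_reschedule_limit(
    backoff_limit: Optional[int] = None,
    fault_retry_times: Optional[int] = None,
) -> Optional[int]:
    """Return the rescheduling cap implied by the two fields, or None if unlimited."""
    # A missing field and a value of 0 are both treated as "not configured".
    # Assumption: negative values are treated the same way.
    limits = [v for v in (backoff_limit, fault_retry_times) if v is not None and v > 0]
    # Rescheduling stops when the count reaches either configured value,
    # so the smaller value is the effective cap.
    return min(limits) if limits else None


print(effective_reschedule_limit(backoff_limit=3, fault_retry_times=5))  # both set
print(effective_reschedule_limit(fault_retry_times=5))                   # only fault-retry-times
print(effective_reschedule_limit())                                      # neither set: unlimited
```

A return value of None corresponds to the "not limited" case in the table.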

YAML Parameters (deploy or vcjob)

Table 2 YAML parameters

Parameter

Value

Description

minAvailable

  • Single server: 1
  • Distributed: N

N indicates the number of nodes. This parameter is not required for Deployment jobs. You are advised to set this parameter to the same value as replicas.

replicas

  • Single server: 1
  • Distributed: N

N indicates the number of job replicas.

image

-

Training image name. Change it based on your actual requirements. (It matches the image name created in the image preparation section.)

(Optional) host-arch

Arm: huawei-arm

x86_64: huawei-x86

Architecture of the node where a training job is executed. Set this parameter as required.

In a distributed training job, ensure that the nodes running the training job have the same architecture.

huawei.com/schedule_policy

See Table 3 for its configurations.

Layout of the AI processors on which the job is to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type.

NOTE:

This field can be used only on the Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas A2 inference products, and Atlas A3 inference product.

sp-block

Number of processors on logical SuperPoDs.

  • For a single-server system, the value must be the same as the number of processors requested by a job.
  • For a distributed system, the value must be an integer multiple of the number of processors on a node, and the total number of processors requested by a job must be an integer multiple of the value.

Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling.

For details, see UnifiedBus Interconnect Device Network Description.

tor-affinity

  • large-model-schema: foundation model jobs or padding jobs
  • normal-schema: common job
  • null: switch affinity scheduling not used
    NOTE:

    You need to select a job type based on the number of job replicas. If the number of job replicas is less than 4, the job is a padding job. If the number of job replicas is greater than or equal to 4, the job is a foundation model job. The number of replicas of a common job is not limited.

The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type.

NOTE:
  • Switch affinity scheduling 1.0 supports Atlas training products and Atlas A2 training products, as well as PyTorch and MindSpore frameworks.
  • Switch affinity scheduling 2.0 supports Atlas A2 training products, as well as the PyTorch framework.

accelerator-type

The value varies according to the processor type, including:

  • Atlas 800 training server (full configuration of NPUs): module
  • Atlas 800 training server (half configuration of NPUs): half
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit: module-{xxx}b-8
  • Atlas 200T A2 Box16 heterogeneous subrack and Atlas 200I A2 Box16 heterogeneous subrack: module-{xxx}b-16
  • Atlas 900 A3 SuperPoD: module-a3-16-super-pod

Set this parameter based on the type of the node where a training job is executed. For the Atlas 800 training server (NPU full configuration), this parameter can be omitted.

NOTE:

You can run the npu-smi info command to query the number in the processor model name, which is shown in the Name field of the command output. For example, if the number is 910, the value of {xxx} is 910.

huawei.com/Ascend910

The value varies according to the processor type, including:

  • Atlas 800 training server (full configuration of NPUs):
    • Single-server single-processor: 1
    • Single-server multi-processor: 2, 4, 8
    • Distributed: 1, 2, 4, 8
  • Atlas 800 training server (half configuration of NPUs):
    • Single-server single-processor: 1
    • Single-server multi-processor: 2, 4
    • Distributed: 1, 2, 4
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit:
    • Single-server single-processor: 1
    • Single-server multi-processor: 2, 3, 4, 5, 6, 7, 8
    • Distributed: 1, 2, 3, 4, 5, 6, 7, 8
  • Atlas 200T A2 Box16 heterogeneous subrack and Atlas 200I A2 Box16 heterogeneous subrack:
    • Single-server single-processor: 1
    • Single-server multi-processor: 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
    • Distributed: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16

Number of requested NPUs. Set this parameter as required. The vNPU cannot be requested when the entire NPU is requested.

NOTE:
  • Graceful fault tolerance mode supports Atlas 800 training servers, and the number of requested resources can only be 4N or 8N (N indicates the number of training nodes).
  • Graceful fault tolerance mode supports Atlas 800T A2 training server or Atlas 900 A2 PoD cluster basic unit, and the number of requested resources can only be 8N (N indicates the number of training nodes).

ring-controller.atlas

  • For Atlas A2 training product, A200T A3 Box8 SuperPoD Server, Atlas 900 A3 SuperPoD, and Atlas 800T A3 SuperPoD Server, the value is ascend-{xxx}b.
  • For an Atlas 800 training server and a server with Atlas 300T training cards, the value is ascend-910.

Processor type used by a job. You need to set this parameter both in ConfigMap and task.

metadata.annotations['huawei.com/AscendXXX']

XXX indicates the processor model. The value can be 910, 310, or 310P. The value must be the same as the actual processor type in the environment.

Ascend Docker Runtime obtains the value of this parameter to mount NPUs of the corresponding type to the container.

fault-scheduling

grace

Enables graceful deletion mode for the job: the original pod is deleted gracefully during rescheduling. If the failure persists after 15 minutes, the original pod is forcibly deleted.

Set this parameter to grace for process-level rescheduling and process-level online recovery.

force

Enables forcible deletion mode for the job: the original pod is forcibly deleted during rescheduling.

off, None (no fault-scheduling field), or any other value

The job does not use the resumable training feature, but maxRetry of Kubernetes still takes effect.

recover-strategy

Job restoration policy.

  • retry: process-level online recovery
  • recover: process-level rescheduling
  • recover-in-place: process-level in-place recovery
  • dump: saving the dying gasp checkpoint
  • exit: exiting training

recover-strategy is configured in annotations of the job YAML file. The value can be any combination of the five policies. Use commas (,) to separate them.

pod-rescheduling

  • on: Enable pod-level rescheduling.
  • Any other value, or the field is omitted: Disable pod-level rescheduling.

For pod-level rescheduling, if a job is faulty, the system does not delete all pods of the job. Instead, the system deletes the faulty pods, creates new pods, and reschedules the pods.

NOTE:
  • Job-level rescheduling is the default rescheduling mode. To enable pod-level rescheduling, add this field.
  • Currently, TensorFlow does not support pod-level rescheduling.

subHealthyStrategy

  • ignore: Ignore the subhealthy node. The node is not preferentially scheduled during affinity scheduling of subsequent jobs.
  • graceExit: Stop using the subhealthy node and perform rescheduling after the dying gasp checkpoint file is saved. Subsequent jobs will not be scheduled to this node.
  • forceExit: Stop using the subhealthy node, exit the job without saving files, and perform rescheduling. Subsequent jobs will not be scheduled to this node.
  • The default value is ignore.

Processing policy for nodes in the SubHealthy status.

NOTE:

When the graceExit policy is used, ensure that the training framework can receive the SIGTERM signal and save the checkpoint file.

fault-retry-times

> 0

To rectify service plane faults, you must configure the number of unconditional retries on the service plane.

NOTE:
  • To use the unconditional retry function, ensure that a training process will cause the container to exit abnormally when it fails. If the container does not exit abnormally, the retry will fail.
  • Currently, only the Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit support unconditional retry.
  • Service plane faults will be triggered during process-level recovery. This parameter must be set if process-level recovery is required.

None (no fault-retry-times) or 0

The job does not use unconditional retry and cannot detect service plane faults, but maxRetry of VolcanoJob still takes effect.

policies

Options of event:

  • PodFailed: The pod has failed.
  • PodEvicted: The pod is evicted.

Pod status. This field is used together with the action field to indicate the processing policy of Volcano when the pod is in a certain status. The default value is PodEvicted.

Options of action:

  • RestartJob: restarts a training job.
  • Ignore: The open-source Volcano does not perform any operation. Instead, Ascend-volcano-plugin performs the operation.

Volcano specifies the policy for processing pods in a certain status. The default value is RestartJob.

maxRetry

0 < maxRetry

Maximum number of times a faulty job is rescheduled. When the number of rescheduling times reaches the value of maxRetry, the job is no longer rescheduled.

NOTE:

If both maxRetry and fault-retry-times are configured, rescheduling stops as soon as the number of rescheduling times reaches either value.

None (no maxRetry) or maxRetry is set to 0

If maxRetry is not set or is set to 0, the system performs rescheduling three times by default.

restartPolicy

  • Never: never restart
  • Always: always restart
  • OnFailure: restart upon failures
  • ExitCode: determines whether to restart the pod based on the process exit error code. If the error code ranges from 1 to 127, the pod is not restarted. If the error code ranges from 128 to 255, the pod is restarted.
    NOTE:

    Training jobs of the vcjob type do not support ExitCode.

Container restart policy. When unconditional retry upon service plane faults is configured, the value of this parameter must be Never.

terminationGracePeriodSeconds

0 < terminationGracePeriodSeconds < value of grace-over-time

Duration from the time when the container receives SIGTERM to the time when the container is forcibly stopped by Kubernetes. The value must be greater than 0 and less than the value of grace-over-time in the volcano-v{version}.yaml file. In addition, ensure that the checkpoint file can be saved completely. Change the value as required. For details, see Container Lifecycle Hooks on the Kubernetes official website.

NOTE:

This field takes effect only when fault-scheduling is set to grace. If fault-scheduling is set to force, this field has no effect.
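
The sp-block constraints listed in the tables above are simple arithmetic checks, so they can be verified before a job is submitted. The following is a minimal sketch under that reading of the rules; validate_sp_block and its parameter names are hypothetical and are not part of Volcano or Ascend Operator.

```python
from typing import List


def validate_sp_block(sp_block: int, npus_per_node: int,
                      total_npus: int, distributed: bool) -> List[str]:
    """Return a list of rule violations; an empty list means the value passes."""
    if sp_block <= 0 or npus_per_node <= 0 or total_npus <= 0:
        return ["all inputs must be positive integers"]
    errors = []
    if not distributed:
        # Single server: sp-block must equal the number of requested processors.
        if sp_block != total_npus:
            errors.append("single server: sp-block must equal the requested NPU count")
        return errors
    # Distributed: sp-block must be an integer multiple of the NPUs on a node ...
    if sp_block % npus_per_node != 0:
        errors.append("distributed: sp-block must be a multiple of the NPUs per node")
    # ... and the total requested NPUs must be an integer multiple of sp-block.
    if total_npus % sp_block != 0:
        errors.append("distributed: requested NPU count must be a multiple of sp-block")
    return errors


print(validate_sp_block(32, 16, 64, distributed=True))   # passes both rules
print(validate_sp_block(24, 16, 64, distributed=True))   # violates both distributed rules
print(validate_sp_block(8, 16, 16, distributed=False))   # single server: 8 != 16
```

For example, a job that requests 64 NPUs on nodes with 16 NPUs each can use an sp-block of 16, 32, or 64 under these rules.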

Table 3 huawei.com/schedule_policy configuration description

Configuration

Description

chip4-node8

One node has eight processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).

chip1-node2

One node has two processors. For example, one Atlas 300T training card can be equipped with only one processor, and one node can be equipped with a maximum of two Atlas 300T training cards.

chip4-node4

One node has four processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).

chip8-node8

One node has eight processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 800T A2 training server.

chip8-node16

One node has 16 processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack.

chip2-node16

One node has 16 processors, and two processors form one interconnection ring, for example, the processor layout of the Atlas 800T A3 SuperPoD Server.

chip2-node16-sp

One node has 16 processors, two processors form one interconnection ring, and multiple servers form a SuperPoD, for example, the processor layout of the Atlas 900 A3 SuperPoD.
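
The configuration names in Table 3 appear to follow a chip<X>-node<Y> pattern, optionally followed by -sp. The sketch below decodes a value on that assumption. The pattern is inferred from the table entries only, the names ProcessorLayout and parse_schedule_policy are hypothetical, and this is not how Volcano itself interprets the field.

```python
import re
from typing import NamedTuple


class ProcessorLayout(NamedTuple):
    group_size: int           # number after "chip": processors per interconnection ring
    processors_per_node: int  # number after "node": processors on one node
    super_pod: bool           # True when the value ends with "-sp"


# Assumed naming pattern, inferred from the entries in Table 3.
_POLICY_PATTERN = re.compile(r"^chip(\d+)-node(\d+)(-sp)?$")


def parse_schedule_policy(value: str) -> ProcessorLayout:
    """Decode a huawei.com/schedule_policy value such as 'chip4-node8'."""
    match = _POLICY_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized schedule policy: {value!r}")
    group, node, sp_suffix = match.groups()
    return ProcessorLayout(int(group), int(node), sp_suffix is not None)


for policy in ("chip4-node8", "chip8-node16", "chip2-node16-sp"):
    print(policy, parse_schedule_policy(policy))
```

For chip1-node2, the table describes one processor per training card rather than a ring, so group_size is best read loosely for that entry.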

rings-config-<job name>

Table 4 rings-config-<job name>

Field

Parameter

Description

Value

Remarks

hccl.json

version

Format version used by RankTable

1.0

-

server_count

Number of nodes used by a job

Integer

-

server_list

Information about the node used by a job

-

-

- server_id

AI server ID, which is globally unique.

String

-

- host_ip

Host IP address of the AI server

String

-

device

Information about the processor used by a job

-

-

- device_id

Physical ID of the processor used by a job

String

-

- device_ip

IP address of the processor used by a job

String

-

- rank_id

Rank ID of the processor used by a job

String

-

version

-

Version of the hccl.json file used by a job

String

-
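
To make the fields in Table 4 concrete, the sketch below builds an illustrative hccl.json payload and checks it for basic consistency. The nesting of device under each server_list entry is an assumption based on the table order, and the server ID and IP addresses are placeholders from a documentation address range, not values from a real cluster. The helper check_rank_table is hypothetical.

```python
import json
from typing import List

# Illustrative content only; every value below is a placeholder.
rank_table = {
    "version": "1.0",
    "server_count": 1,
    "server_list": [
        {
            "server_id": "node-0",
            "host_ip": "192.0.2.10",
            "device": [
                {"device_id": "0", "device_ip": "192.0.2.100", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.0.2.101", "rank_id": "1"},
            ],
        }
    ],
}

REQUIRED_DEVICE_KEYS = {"device_id", "device_ip", "rank_id"}


def check_rank_table(table: dict) -> List[str]:
    """Return a list of inconsistencies found in a RankTable-style dict."""
    problems = []
    servers = table.get("server_list", [])
    # server_count should describe the number of nodes listed in server_list.
    if int(table.get("server_count", -1)) != len(servers):
        problems.append("server_count does not match the length of server_list")
    for server in servers:
        for key in ("server_id", "host_ip"):
            if key not in server:
                problems.append(f"server entry is missing {key}")
        for device in server.get("device", []):
            missing = REQUIRED_DEVICE_KEYS - set(device)
            if missing:
                problems.append(f"device entry is missing {sorted(missing)}")
    return problems


print(json.dumps(rank_table, indent=2))
print(check_rank_table(rank_table))
```

A check like this can be run in a training script after loading the mounted file with json.load, before the ranks are used to set up communication.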