Job Information

fault-config-<job name>

Table 1 fault-config-<job name>

| Field | Meaning | Value | Remarks |
| --- | --- | --- | --- |
| fault-npus | Rank information of the faulty processor used by a faulty job | String | - |
| checkCode | Verification code | String | - |

reset-config-<job name>

Table 2 reset-config-<job name>

| Field | Parameter | Meaning | Value | Remarks |
| --- | --- | --- | --- | --- |
| reset.json | RankList | Processor list | - | - |
| reset.json | -RankId | Rank information used by the faulty job | Integer | - |
| reset.json | -LogicId | Logic ID of a processor | 32-bit integer | - |
| reset.json | -Status | Processor status | unrecovered: not recovered; recovered: recovered successfully; failed: recovery failed | - |
| reset.json | -Policy | Hot reset policy | empty: no fault found; ignore: ignore the fault; restart_request: re-execute the current request; restart: re-execute the training job; free_reset: restart the device when no job is running on the NPU; reset: restart the device; isolate: isolate the device | - |
| reset.json | -InitialPolicy | Initial hot reset policy | empty: no fault found; ignore: ignore the fault; restart_request: re-execute the current request; restart: re-execute the training job; free_reset: restart the device when no job is running on the NPU; reset: restart the device; isolate: isolate the device | - |
| reset.json | -ErrorCode | Decimal fault code | 64-bit integer array | - |
| reset.json | -ErrorCodeHex | Hexadecimal fault code | String | - |
| reset.json | GracefulExit | Management policy for training processes | 0 or 1. 1: all training processes are killed; 0: no action is performed | - |
| reset.json | UpdateTime | ConfigMap update time | - | - |
| reset.json | RetryTime | Number of times the pod has been rescheduled | Integer | - |
| reset.json | FaultFlushing | Notifies Elastic Agent whether a fault is being updated | true or false. true: a fault is being updated; false: no fault is being updated | Elastic Agent starts a training process only when this field is false and the faulty RankList contains no fault on the current node. |
| reset.json | RestartFaultProcess | Notifies Elastic Agent whether only the faulty process on the current node is restarted | true or false. true: Elastic Agent does not exit, and only the faulty process on the current node is restarted; false: Elastic Agent exits if there is a faulty process on the current node | - |
| restartType | - | reset.json update type | podReschedule or hotReset | podReschedule is used for single-pod rescheduling, and hotReset is used for hot reset. |
| checkCode | - | Verification code | String | - |
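The start condition given in the FaultFlushing remarks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON layout of reset.json (a top-level object whose RankList is an array of objects) is inferred from Table 2, the sample values are made up, and `may_start_training` and `local_ranks` are hypothetical names that do not come from Elastic Agent. The sketch also assumes that any RankList entry whose RankId belongs to the current node counts as a fault on that node.

```python
import json

# Hypothetical reset.json payload. The key layout is inferred from Table 2;
# the concrete values are invented for illustration only.
SAMPLE_RESET_JSON = """
{
  "RankList": [
    {"RankId": 3, "LogicId": 3, "Status": "unrecovered",
     "Policy": "reset", "InitialPolicy": "reset"}
  ],
  "GracefulExit": 0,
  "RetryTime": 0,
  "FaultFlushing": false,
  "RestartFaultProcess": false
}
"""


def may_start_training(reset: dict, local_ranks: set) -> bool:
    """Return True when the documented start condition holds.

    Assumption: a RankList entry whose RankId is one of this node's ranks
    means the faulty RankList contains a fault for this node.
    """
    if reset.get("FaultFlushing", False):
        return False  # a fault update is still in progress
    faulty_ranks = {entry["RankId"] for entry in reset.get("RankList") or []}
    return faulty_ranks.isdisjoint(local_ranks)


reset = json.loads(SAMPLE_RESET_JSON)
print(may_start_training(reset, {0, 1, 2, 3}))  # rank 3 is listed -> False
print(may_start_training(reset, {4, 5, 6, 7}))  # no local rank listed -> True
```

A real agent would also need to consider Status, Policy, and RestartFaultProcess; they are left out here to keep the sketch focused on the one rule the table states explicitly.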

data-trace-<job name>

This ConfigMap stores the on/off switch for each type of dotting data. Ascend Device Plugin mounts it to the compute node as a file. After that file is mounted to the training container, TaskD reads it to control which dotting data is collected.

Table 3 data-trace-<job name> ConfigMap fields

| Field | Meaning | Value | Type |
| --- | --- | --- | --- |
| Communication | Communication operator | on/off | String |
| Step | Step latency | on/off | String |
| SaveCheckpoint | Time taken by SaveCheckpoint | on/off | String |
| FP | Forward propagation data | on/off | String |
| DataLoader | Time taken by DataLoader | on/off | String |

  • This ConfigMap must be in the same namespace as the training job and be named data-trace-<job name>, with the label reset=true included.
  • Ascend Device Plugin mounts this ConfigMap to the /user/cluster-info/datatrace-config/<namespace>.data-trace-<job name>/ folder on the training node. The file is named profilingSwitch.
  • If this ConfigMap is not created, ClusterD will automatically create it when calling the gRPC interface ModifyTrainingDataTraceSwitch for the first time.
  • To use this function, mount the profilingSwitch file on the node to /user/cluster-info/datatrace-config/ in the container in hostPath mode.
  • Currently, Step, SaveCheckpoint, FP, and DataLoader are enabled by default, and these four types must be enabled or disabled together. Dotting is completely disabled only when all five types are set to off. In any other case, these four types stay enabled, and communication operator dotting follows the Communication switch.
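
The switch rule in the last note can be made concrete with a small Python sketch. It is not TaskD's implementation: `effective_switches` is a hypothetical helper, and the default used for a missing Communication key (off) is an assumption, since the note only states defaults for the other four types.

```python
# The four types that the note says are switched together.
FOUR_LINKED = ("Step", "SaveCheckpoint", "FP", "DataLoader")

# Defaults: the four linked types are on, as the note states.
# "Communication": "off" is an assumption made for this sketch.
DEFAULTS = {"Communication": "off", **{name: "on" for name in FOUR_LINKED}}


def effective_switches(data: dict) -> dict:
    """Resolve the switch state that takes effect for a data-trace ConfigMap."""
    merged = {**DEFAULTS, **data}
    # Dotting is fully disabled only when all five switches are off.
    if all(merged[name] == "off" for name in DEFAULTS):
        return {name: "off" for name in DEFAULTS}
    # Otherwise the four linked types stay on, and only the
    # Communication switch is honoured as configured.
    resolved = {name: "on" for name in FOUR_LINKED}
    resolved["Communication"] = merged["Communication"]
    return resolved


# All five off: dotting is disabled entirely.
print(effective_switches({name: "off" for name in DEFAULTS}))
# Step alone set to off: the four linked types remain on.
print(effective_switches({"Step": "off"}))
```

Turning off only some of the four linked types therefore has no effect in this model, which matches the statement that they must be enabled or disabled together.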

steptime-dtpgroup

This ConfigMap stores and controls the iteration latency and group information of a job. When a job starts, you can configure the ConfigMap parameters through the CCAE management platform to determine whether job performance has degraded.

Table 4 steptime-dtpgroup ConfigMap fields

| Level-1 Parameter | Level-2 Parameter | Meaning | Value | Remarks |
| --- | --- | --- | --- | --- |
| data | PerfDumpPath | Path for saving iteration latency and group information | String | - |
| data | PerfDumpConfig | Switch of iteration latency and group information | String | - |