Job Information

fault-config-<job name>

**Table 1** fault-config-*<job name>*
Field	Meaning	Value	Remarks
fault-npus	Rank information of the faulty processor used by a faulty job	String	-
checkCode	Verification code	String	-

reset-config-<job name>

**Table 2** reset-config-*<job name>*
Field	Parameter	Meaning	Value	Remarks
reset.json	RankList	Processor list	-	-
	-RankId	Rank information used by the faulty job	Integer	-
	-LogicId	Logic ID of a processor	32-bit integer	-
	-Status	Processor status	unrecovered: not recovered; recovered: recovered successfully; failed: recovery failed	-
	-Policy	Hot reset policy	empty: no fault found; ignore: ignore the fault; restart_request: re-execute the current request; restart: re-execute the training job; free_reset: restart the device when no job is running on the NPU; reset: restart the device; isolate: isolate the device	-
	-InitialPolicy	Initial hot reset policy	empty: no fault found; ignore: ignore the fault; restart_request: re-execute the current request; restart: re-execute the training job; free_reset: restart the device when no job is running on the NPU; reset: restart the device; isolate: isolate the device	-
	-ErrorCode	Decimal fault code	64-bit integer array	-
	-ErrorCodeHex	Hexadecimal fault code	String	-
	GracefulExit	Policy for managing training processes	0 or 1. 1: all training processes are killed; 0: no action is performed.	-
	UpdateTime	ConfigMap update time	-	-
	RetryTime	Number of times the pod has been rescheduled	Integer	-
	FaultFlushing	Notifies Elastic Agent whether a fault is being updated.	The value can be true or false. true: a fault is being updated; false: no fault is being updated.	Elastic Agent starts a training process only when this field is false and RankList does not contain a fault on the current node.
	RestartFaultProcess	Notifies Elastic Agent whether only the faulty process on the current node is restarted.	The value can be true or false. true: Elastic Agent does not exit, and only the faulty process on the current node is restarted. false: Elastic Agent exits if there is a faulty process on the current node.	-
restartType	-	reset.json update type	podReschedule or hotReset	podReschedule is used for single-pod rescheduling, and hotReset is used for hot reset.
checkCode	-	Verification code	String	-
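
The remark on FaultFlushing can be read as a simple start-up check. The following is a minimal sketch of that check, not Elastic Agent's actual implementation. The field names come from Table 2, but the JSON nesting, the sample values, the helper name `may_start_training`, and the idea that the caller knows which rank IDs run on the current node are all assumptions made for illustration.

```python
import json

# Illustrative reset.json content. Field names follow Table 2; the nesting
# and the concrete values are assumptions made for this sketch.
SAMPLE_RESET_JSON = """
{
  "RankList": [
    {"RankId": 3, "LogicId": 1, "Status": "unrecovered", "Policy": "reset"}
  ],
  "GracefulExit": 0,
  "RetryTime": 0,
  "FaultFlushing": false,
  "RestartFaultProcess": false
}
"""


def may_start_training(reset_cfg: dict, local_rank_ids: set) -> bool:
    """Hypothetical helper: return True only if no fault update is in
    progress and no rank listed in RankList runs on the current node."""
    if reset_cfg.get("FaultFlushing", False):
        return False
    faulty_ranks = {entry["RankId"] for entry in reset_cfg.get("RankList") or []}
    return faulty_ranks.isdisjoint(local_rank_ids)


reset_cfg = json.loads(SAMPLE_RESET_JSON)
print(may_start_training(reset_cfg, {0, 1}))  # faulty rank 3 is on another node
print(may_start_training(reset_cfg, {2, 3}))  # faulty rank 3 is on this node
```

With the sample data, the first call allows the training process to start and the second does not, because rank 3 is listed in RankList.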

data-trace-<job name>

Stores the on/off switch for each type of dotting data. Ascend Device Plugin mounts this ConfigMap to the compute node as a file. After the file is mounted to the training container, TaskD reads it to control which dotting data is collected.

**Table 3** data-trace-*<job name>* ConfigMap fields
Field	Meaning	Value	Type
Communication	Communication operator	on/off	String
Step	Step latency	on/off	String
SaveCheckpoint	Time taken by SaveCheckpoint	on/off	String
FP	Forward propagation data	on/off	String
DataLoader	Time taken by DataLoader	on/off	String

This ConfigMap must be in the same namespace as the training job, be named data-trace-<job name>, and carry the label reset=true.
Ascend Device Plugin mounts this ConfigMap to the /user/cluster-info/datatrace-config/namespace.data-trace-job name/* folder on the training node. The file is named profilingSwitch.
If this ConfigMap is not created, ClusterD will automatically create it when calling the gRPC interface ModifyTrainingDataTraceSwitch for the first time.
To use this function, mount the profilingSwitch file on the node to /user/cluster-info/datatrace-config/ in the container by using a hostPath volume.
Currently, Step, SaveCheckpoint, FP, and DataLoader are enabled by default, and these four types must be enabled or disabled together. If all five types are set to off, dotting is completely disabled. In any other case, the four types stay enabled, and communication operator dotting follows the Communication switch.
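
The switch rules above can be expressed as a small function. This is a sketch of the rule as described in this section, not TaskD's actual code. The function name `effective_switches` and the dictionary representation of the five switches are assumptions; the key names and the on/off values come from Table 3.

```python
# The four types that are enabled by default and switched together.
FOUR_DEFAULT_TYPES = ("Step", "SaveCheckpoint", "FP", "DataLoader")
ALL_TYPES = ("Communication",) + FOUR_DEFAULT_TYPES


def effective_switches(configured: dict) -> dict:
    """Hypothetical helper: map the configured on/off values of all five
    switches to the values that take effect under the rules above."""
    # Dotting is fully disabled only when every one of the five is off.
    if all(configured[name] == "off" for name in ALL_TYPES):
        return {name: "off" for name in ALL_TYPES}
    # Otherwise the four default types stay on, and the communication
    # operator follows its own switch.
    effective = {name: "on" for name in FOUR_DEFAULT_TYPES}
    effective["Communication"] = configured["Communication"]
    return effective


configured = {
    "Communication": "off",
    "Step": "off",
    "SaveCheckpoint": "on",
    "FP": "on",
    "DataLoader": "on",
}
print(effective_switches(configured))
```

In this example Step is configured as off but still takes effect as on, because not all five switches are off.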

steptime-dtpgroup

This file stores and controls the iteration latency and group information of a job. When the job is started, you can configure ConfigMap parameters through the CCAE management platform to determine whether job performance is degraded.

**Table 4** steptime-dtpgroup ConfigMap fields
Level-1 Parameter	Level-2 Parameter	Meaning	Value	Remarks
data	PerfDumpPath	Path for saving iteration latency and group information	String	-
data	PerfDumpConfig	Switch for iteration latency and group information	String	-
