Job Information

tor-share-cm

**Table 1** tor-share-cm
Parameter	Description	Value	Remarks
IsHealthy	Switch status corresponding to a node	String	-
IsSharedTor	Switch attribute corresponding to a node	String	-
NodeIP	Node IP address	String	-
NodeName	Node name	String	-
JobName	Job name	String	-

vcjob-fault-npu-cm

**Table 2** Description of the vcjob-fault-npu-cm fields
Parameter	Description	Value	Remarks
fault-node	Information about the faulty node	-	-
- NodeName	Node name	String	-
- UpdateTime	-	64-bit integer	-
- UnhealthyNPU	Faulty processors on the faulty node	String slice	-
- NetworkUnhealthyNPU	Processors with faulty network on the faulty node	String slice	-
- NodeDEnable	Whether to detect node status	True or False	-
- NodeHealthState	Node health status	String	-
FaultDeviceList	-	-	-
- fault_type	Fault type object	CardUnhealthy: processor fault; CardNetworkUnhealthy: processor network fault; NodeUnhealthy: node fault	-
- npu_name	Name of the faulty processor. This parameter is left empty if the node is faulty.	String	-
- fault_level	Fault handling type. This parameter is left empty for node faults.	NotHandleFault: requires no handling; RestartRequest: re-executes inference requests in the inference scenario, or re-executes training services in the training scenario; RestartBusiness: re-executes services; FreeRestartNPU: resets an idle processor when faults affect service execution; RestartNPU: directly resets processors and re-executes services; SeparateNPU: isolates processors; PreSeparateNPU: pre-isolates processors and determines whether to perform rescheduling based on the actual running status of the training job.	NOTE: fault_level, fault_handling, and large_model_fault_level serve the same function; fault_handling is recommended. When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the affected processor is isolated and the job is rescheduled.
- fault_handling	Same as fault_level	Same as fault_level	-
- large_model_fault_level	Same as fault_level	Same as fault_level	-
- fault_code	Fault code, a string of characters separated by commas (,).	String	Disconnected: The processor network is disconnected. heartbeatTimeOut: The node status is lost.
remain-retry-times	Remaining number of times each job can be rescheduled	-	-
- UUID	Job UID	String	-
- Times	Remaining number of times the job can be rescheduled	Integer	-
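
The FaultDeviceList fields in Table 2 could be consumed along the following lines. This is a minimal sketch, not the component's actual implementation: it assumes FaultDeviceList is serialized as a JSON array of objects keyed by the parameter names in Table 2, and the helper names (`handling_of`, `fault_codes`, `summarize`, `restart_request_triggers_reschedule`) and the sample payload are hypothetical.

```python
import json

# Table 2 notes that these three fields serve the same function and that
# fault_handling is recommended, so it is checked first.
HANDLING_KEYS = ("fault_handling", "fault_level", "large_model_fault_level")

# Threshold from the Table 2 note on RestartRequest faults for inference jobs.
RESTART_REQUEST_GRACE_SECONDS = 60


def handling_of(entry):
    """Return the fault handling type of one FaultDeviceList entry."""
    for key in HANDLING_KEYS:
        value = entry.get(key)
        if value:
            return value
    return ""  # the field is left empty for node faults


def fault_codes(entry):
    """Split the comma-separated fault_code string into a list of codes."""
    raw = entry.get("fault_code", "")
    return [code.strip() for code in raw.split(",") if code.strip()]


def restart_request_triggers_reschedule(fault_seconds):
    """Rescheduling happens only if the fault lasts more than 60 seconds."""
    return fault_seconds > RESTART_REQUEST_GRACE_SECONDS


def summarize(fault_device_list_json):
    """Flatten a FaultDeviceList (assumed JSON array) into plain dicts."""
    return [
        {
            "type": entry.get("fault_type", ""),
            "npu": entry.get("npu_name", ""),
            "handling": handling_of(entry),
            "codes": fault_codes(entry),
        }
        for entry in json.loads(fault_device_list_json)
    ]


# Hypothetical payload for illustration only; names and values are made up.
sample = json.dumps([
    {"fault_type": "CardNetworkUnhealthy", "npu_name": "npu-0",
     "fault_level": "NotHandleFault", "fault_code": "Disconnected"},
    {"fault_type": "NodeUnhealthy", "npu_name": "",
     "fault_handling": "", "fault_code": "heartbeatTimeOut"},
])
summary = summarize(sample)
```

The first sample entry only carries `fault_level`, which shows the fallback when `fault_handling` is absent; the second is a node fault, so its handling type and processor name stay empty.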

reset-config-<job name>

The MindCluster cluster scheduling components write information such as the device status and training job status to the reset-config-<job name> ConfigMap through Kubernetes, and the ConfigMap is mapped into the container. Elastic Agent reads this information to detect and handle faults.

**Table 3** reset-config-*<job-name>*
Field	Parameter	Description	Value	Remarks
reset.json	RankList	Processor list	-	-
	RankId	Rank information used by the faulty job	Integer	-
	LogicId	Logic ID of a processor	32-bit integer	-
	Status	Processor status	unrecovered: not recovered; recovered: recovered successfully; failed: recovery failed	-
	Policy	Hot reset policy	empty: no fault found; ignore: ignore the fault; restart_request: re-execute the current request; restart: re-execute the training job; free_reset: restart the device when no job is running on the NPU; reset: restart the device; isolate: isolate the device	-
	InitialPolicy	Initial hot reset policy	empty: no fault found; ignore: ignore the fault; restart_request: re-execute the current request; restart: re-execute the training job; free_reset: restart the device when no job is running on the NPU; reset: restart the device; isolate: isolate the device	-
	ErrorCode	Decimal fault code	64-bit integer array	-
	GracefulExit	Policy for managing training processes	0 or 1. 1: all training processes are killed. 0: no action is performed.	-
	FaultFlushing	Notifies Elastic Agent whether a fault is being updated.	true or false. true: a fault is being updated. false: no fault is being updated.	Elastic Agent starts a training process only when this field is false and RankList contains no fault on the current node.
	RestartFaultProcess	Notifies Elastic Agent whether to restart only the faulty process on the current node.	true or false. true: when the current node is faulty, only the faulty process is restarted. false: when the current node is faulty, all processes on the node and Elastic Agent exit.	This field takes effect only when RankList contains a fault on the current node.
	ErrorCodeHex	Hexadecimal fault code	String	-
restartType	-	reset.json update type	podReschedule or hotReset	podReschedule is used for single-pod rescheduling, and hotReset is used for hot reset.
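
The interplay of FaultFlushing, RestartFaultProcess, and RankList described in Table 3 can be sketched as a small decision function. This is an illustration under stated assumptions, not Elastic Agent's actual logic: it assumes reset.json is a JSON object in which RankList is an array of objects with a RankId field and the two flags are top-level JSON booleans. The function names, the returned action strings, and the idea of passing in the rank IDs that belong to the current node are all hypothetical.

```python
import json


def local_faults(reset, local_rank_ids):
    """Return the RankList entries whose RankId belongs to this node."""
    return [
        rank for rank in reset.get("RankList", [])
        if rank.get("RankId") in local_rank_ids
    ]


def decide(reset, local_rank_ids):
    """Pick an action for the current node from the parsed reset.json."""
    if local_faults(reset, local_rank_ids):
        # RestartFaultProcess only takes effect when this node is faulty.
        if reset.get("RestartFaultProcess", False):
            return "restart_faulty_process"
        return "exit_all_processes"
    if reset.get("FaultFlushing", False):
        # A fault is still being updated, so do not start training yet.
        return "wait"
    return "start_training"


# Hypothetical reset.json content; field names follow Table 3.
reset = json.loads("""
{
  "RankList": [
    {"RankId": 3, "LogicId": 3, "Status": "unrecovered",
     "Policy": "restart", "InitialPolicy": "restart",
     "ErrorCode": [], "ErrorCodeHex": ""}
  ],
  "GracefulExit": 0,
  "FaultFlushing": false,
  "RestartFaultProcess": true
}
""")

action_on_faulty_node = decide(reset, local_rank_ids={0, 1, 2, 3})
action_on_healthy_node = decide(reset, local_rank_ids={4, 5, 6, 7})
```

The sketch treats every RankList entry as a fault regardless of its Status value, and the table does not say what happens when FaultFlushing is true while the current node is faulty, so the ordering of the checks here is a design choice of the example.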

mindx-dl/job-reschedule-reason

This ConfigMap is used to record historical job rescheduling information. By default, the latest ten rescheduling records of a job are saved. When the ConfigMap content exceeds 950 KB, the earliest record of each job is deleted in sequence.
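
The retention rule above can be sketched as follows. This is a minimal illustration, not the Volcano plugin's code: it assumes records are kept oldest-first in a list per job, that size is measured on the JSON-serialized content with 1 KB taken as 1024 bytes, and that "in sequence" means dropping the earliest record of one job after another until the content fits. The names `add_record` and `content_size` are hypothetical.

```python
import json

MAX_RECORDS_PER_JOB = 10      # default number of records kept per job
MAX_BYTES = 950 * 1024        # assumes 1 KB = 1024 bytes


def content_size(jobs):
    """Size in bytes of the JSON-serialized content (an assumed measure)."""
    return len(json.dumps(jobs).encode("utf-8"))


def add_record(jobs, job_key, record, max_bytes=MAX_BYTES):
    """Append a record, then enforce the per-job and total-size limits."""
    records = jobs.setdefault(job_key, [])
    records.append(record)
    # Keep only the latest ten records of this job.
    del records[:-MAX_RECORDS_PER_JOB]

    # While the content is too large, drop the earliest record of each job
    # in turn, re-checking the size after every deletion.
    while content_size(jobs) > max_bytes:
        removed = False
        for job_records in jobs.values():
            if job_records:
                job_records.pop(0)
                removed = True
                if content_size(jobs) <= max_bytes:
                    break
        if not removed:
            break  # nothing left to delete
    return jobs


# Hypothetical usage: twelve reschedules of one job leave the latest ten.
jobs = {}
for i in range(12):
    add_record(jobs, "default/job-a", {"RescheduleTimeStamp": str(i)})
```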

**Table 4** Job field description
Field	Parameter	Description	Value	Remarks
Job ns/name	-	Name of the job to be rescheduled	String	-
	JobID	Job ID	String	-
	TotalRescheduleTimes	Total number of times the job has been rescheduled during the Volcano lifetime	Integer	-
	RescheduleRecords	Detailed information about job rescheduling	-	-

**Table 5** RescheduleRecords description
Field	Parameter	Description	Value	Remarks
RescheduleRecords	LogFileFormatTime	Rescheduling time recorded in the Volcano log format	String	-
	RescheduleTimeStamp	Timestamp when rescheduling occurs	String	-
	ReasonOfTask	Detailed information about the rescheduling	-	-

**Table 6** ReasonOfTask description
Field	Parameter	Description	Value	Remarks
ReasonOfTask	RescheduleReason	Reason for rescheduling	String	-
	PodName	Pod that is triggered first during rescheduling	String	-
	NodeName	Node name	String	Node that is triggered first during rescheduling
	NodeRankIndex	Rank of the node that is first triggered during rescheduling in a training process	String	-
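
Tables 4 to 6 describe a nested structure: a job holds RescheduleRecords, and each record holds ReasonOfTask entries. The sketch below flattens that nesting into one row per reason. It assumes each job value is a JSON object and that both RescheduleRecords and ReasonOfTask are JSON arrays, which the tables do not state explicitly; the function name `reschedule_events` and every sample value are hypothetical.

```python
import json


def reschedule_events(job_json):
    """Flatten one job's rescheduling history into a list of tuples.

    Each tuple is (timestamp, reason, pod name, node name).
    """
    job = json.loads(job_json)
    events = []
    for record in job.get("RescheduleRecords", []):
        for reason in record.get("ReasonOfTask", []):
            events.append((
                record.get("RescheduleTimeStamp", ""),
                reason.get("RescheduleReason", ""),
                reason.get("PodName", ""),
                reason.get("NodeName", ""),
            ))
    return events


# Hypothetical job entry; field names follow Tables 4 to 6, values are made up.
sample_job = json.dumps({
    "JobID": "example-job-id",
    "TotalRescheduleTimes": 1,
    "RescheduleRecords": [
        {
            "LogFileFormatTime": "example-log-time",
            "RescheduleTimeStamp": "1700000000",
            "ReasonOfTask": [
                {
                    "RescheduleReason": "example-reason",
                    "PodName": "job-a-worker-0",
                    "NodeName": "node-1",
                    "NodeRankIndex": "0",
                }
            ],
        }
    ],
})
events = reschedule_events(sample_job)
```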
