Job Information
tor-share-cm

| Parameter | Description | Value | Remarks |
|---|---|---|---|
| IsHealthy | Switch status corresponding to a node | String | - |
| IsSharedTor | Switch attribute corresponding to a node | String | - |
| NodeIP | Node IP address | String | - |
| NodeName | Node name | String | - |
| JobName | Job name | String | - |

vcjob-fault-npu-cm

| Parameter | Description | Value | Remarks |
|---|---|---|---|
| fault-node | Information about the faulty node | - | - |
| - NodeName | Node name | String | - |
| - UpdateTime | - | 64-bit integer | - |
| - UnhealthyNPU | Faulty processors on the faulty node | String slice | - |
| - NetworkUnhealthyNPU | Processors with faulty network on the faulty node | String slice | - |
| - NodeDEnable | Whether to detect node status | | - |
| - NodeHealthState | Node health status | String | - |
| FaultDeviceList | - | - | - |
| - fault_type | Fault type object | | - |
| - npu_name | Name of the faulty processor. This parameter is left empty if the node is faulty. | String | - |
| - fault_level | Fault handling type. This parameter is left empty for node faults. | | NOTE: |
| - fault_handling | | | |
| - large_model_fault_level | | | |
| - fault_code | Fault codes, as a single string with the codes separated by commas (,). | String | |
| remain-retry-times | Information about the remaining number of times a job can be rescheduled | - | - |
| - UUID | Job UID | String | - |
| - Times | Remaining number of times the job can be rescheduled | Integer | - |

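The fault-node parameters listed above can be modeled as a small record type. The following is a minimal sketch, assuming the ConfigMap value is a JSON array of objects keyed by the parameter names in the table. That serialization, the sample values, and the names `FaultNode`, `parse_fault_nodes`, and `split_fault_codes` are illustrative assumptions and are not taken from the MindCluster source.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultNode:
    """One fault-node entry; attribute names mirror the parameter names."""
    NodeName: str
    UpdateTime: int = 0
    UnhealthyNPU: List[str] = field(default_factory=list)
    NetworkUnhealthyNPU: List[str] = field(default_factory=list)
    NodeHealthState: str = ""


def parse_fault_nodes(raw: str) -> List[FaultNode]:
    """Parse a JSON array of fault-node objects (assumed layout)."""
    known = set(FaultNode.__dataclass_fields__)
    nodes = []
    for item in json.loads(raw):
        # Keys that this sketch does not model (for example NodeDEnable) are ignored.
        nodes.append(FaultNode(**{k: v for k, v in item.items() if k in known}))
    return nodes


def split_fault_codes(fault_code: str) -> List[str]:
    """Split the comma-separated fault_code string into individual codes."""
    return [code.strip() for code in fault_code.split(",") if code.strip()]


# Hypothetical sample data, for illustration only.
sample = json.dumps([{
    "NodeName": "node-1",
    "UpdateTime": 1700000000,
    "UnhealthyNPU": ["npu-0"],
    "NetworkUnhealthyNPU": [],
    "NodeDEnable": True,
    "NodeHealthState": "Unhealthy",
}])
nodes = parse_fault_nodes(sample)
print(nodes[0].NodeName, nodes[0].UnhealthyNPU)
print(split_fault_codes("A1, B2,,C3"))
```

Filtering on the known field names keeps the parser tolerant of parameters whose value type is not documented in the table.
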
reset-config-<job name>
The MindCluster cluster scheduling components write information such as the device status and training job status to the reset-config-<job name> ConfigMap through Kubernetes, and the ConfigMap is mapped into the container. Elastic Agent reads this information to detect and handle faults.

| Field | Parameter | Description | Value | Remarks |
|---|---|---|---|---|
| reset.json | RankList | Processor list | - | - |
| | RankId | Rank information used by the faulty job | Integer | - |
| | LogicId | Logic ID of a processor | 32-bit integer | - |
| | Status | Processor status | | - |
| | Policy | Hot reset policy | | - |
| | InitialPolicy | Initial hot reset policy | | - |
| | ErrorCode | Decimal fault code | 64-bit integer array | - |
| | GracefulExit | Policy for managing training processes | The value is 0 or 1. | - |
| | FaultFlushing | Notifies Elastic Agent whether a fault is being updated. | The value can be true or false. | Elastic Agent starts a training process only when the value of this field is false and the faulty RankList does not contain a fault on this node. |
| | RestartFaultProcess | Notifies Elastic Agent whether to restart only the faulty process on the current node. | The value can be true or false. | This field takes effect only when the faulty RankList contains a fault on this node. |
| | ErrorCodeHex | Hexadecimal fault code | String | - |
| restartType | - | reset.json update type | podReschedule or hotReset | podReschedule is used for single-pod rescheduling, and hotReset is used for hot reset. |

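The remarks on FaultFlushing and RestartFaultProcess describe a decision that can be sketched as follows. This is only an illustration under several assumptions: reset.json is taken to be a JSON object with RankList, FaultFlushing, and RestartFaultProcess at the top level, every entry in RankList is treated as faulty, and a fault is considered to be on this node when its RankId is among the ranks hosted by the node. The function names and the `local_rank_ids` argument are hypothetical and do not reflect the actual Elastic Agent implementation.

```python
import json
from typing import Iterable


def node_fault_in_rank_list(reset: dict, local_rank_ids: Iterable[int]) -> bool:
    """Return True if RankList lists a rank hosted on this node (assumed criterion)."""
    local = set(local_rank_ids)
    return any(entry.get("RankId") in local for entry in reset.get("RankList", []))


def may_start_training(reset: dict, local_rank_ids: Iterable[int]) -> bool:
    """FaultFlushing must be false and RankList must not list a fault on this node."""
    return (reset.get("FaultFlushing") is False
            and not node_fault_in_rank_list(reset, local_rank_ids))


def restart_only_faulty_process(reset: dict, local_rank_ids: Iterable[int]) -> bool:
    """RestartFaultProcess takes effect only when RankList lists a fault on this node."""
    return (reset.get("RestartFaultProcess") is True
            and node_fault_in_rank_list(reset, local_rank_ids))


# Hypothetical reset.json content, for illustration only.
reset = json.loads(
    '{"RankList": [{"RankId": 3, "LogicId": 1}],'
    ' "FaultFlushing": false, "RestartFaultProcess": true}'
)
print(may_start_training(reset, [0, 1]))           # rank 3 is not on this node
print(restart_only_faulty_process(reset, [2, 3]))  # rank 3 is on this node
```

The checks use `is False` and `is True` so that a missing field is never interpreted as permission to start or restart a process.
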
mindx-dl/job-reschedule-reason
This ConfigMap records historical job rescheduling information. By default, the latest ten rescheduling records of a job are saved. When the ConfigMap content exceeds 950 KB, the earliest record of each job is deleted in sequence.

| Field | Parameter | Description | Value | Remarks |
|---|---|---|---|---|
| Job ns/name | - | Name of the job to be rescheduled | String | - |
| | JobID | Job ID | String | - |
| | TotalRescheduleTimes | Total number of times the job has been rescheduled during the Volcano lifetime | Integer | - |
| | RescheduleRecords | Detailed information about job rescheduling | - | - |

| Field | Parameter | Description | Value | Remarks |
|---|---|---|---|---|
| RescheduleRecords | LogFileFormatTime | Rescheduling time recorded in the Volcano log format | String | - |
| | RescheduleTimeStamp | Timestamp when rescheduling occurs | String | - |
| | ReasonOfTask | Detailed information about the rescheduling | - | - |

| Field | Parameter | Description | Value | Remarks |
|---|---|---|---|---|
| ReasonOfTask | RescheduleReason | Reason for rescheduling | String | - |
| | PodName | Pod that is triggered first during rescheduling | String | - |
| | NodeName | Node name | String | Node that is triggered first during rescheduling |
| | NodeRankIndex | Rank of the node that is first triggered during rescheduling in a training process | String | - |

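The retention rule stated for this ConfigMap (ten latest records per job, and removal of the earliest record of each job once the content exceeds 950 KB) can be sketched as follows. This is a rough illustration, not the Volcano plugin's code. It assumes records are held in a dictionary from job name to a list ordered oldest first, that size is measured on the UTF-8 JSON serialization with 1 KB = 1024 bytes, and that "in sequence" means visiting the jobs one after another. The names `add_record` and `serialized_size` are hypothetical.

```python
import json
from typing import Dict, List

MAX_RECORDS_PER_JOB = 10          # default stated in the text
MAX_CONFIGMAP_BYTES = 950 * 1024  # 950 KB, assuming 1 KB = 1024 bytes


def serialized_size(records: Dict[str, List[dict]]) -> int:
    """Size in bytes of the JSON-serialized records (assumed size measure)."""
    return len(json.dumps(records).encode("utf-8"))


def add_record(records: Dict[str, List[dict]], job: str, record: dict,
               max_bytes: int = MAX_CONFIGMAP_BYTES) -> None:
    """Append a rescheduling record and apply both retention limits in place."""
    history = records.setdefault(job, [])
    history.append(record)
    # Keep only the latest ten records of this job.
    del history[:-MAX_RECORDS_PER_JOB]
    # While over the size limit, drop the earliest record of each job in turn.
    while serialized_size(records) > max_bytes and any(records.values()):
        for name in list(records):
            if records[name]:
                records[name].pop(0)
            if serialized_size(records) <= max_bytes:
                break


# Hypothetical usage: twelve records for one job leave the latest ten.
records: Dict[str, List[dict]] = {}
for i in range(12):
    add_record(records, "default/job-a", {"RescheduleTimeStamp": str(i)})
print(len(records["default/job-a"]))
```

Passing a smaller `max_bytes` makes the size-based trimming easy to exercise without building a 950 KB payload.
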