Job Information

tor-share-cm

Table 1 tor-share-cm

| Parameter | Description | Value | Remarks |
| --- | --- | --- | --- |
| IsHealthy | Switch status corresponding to a node | String | - |
| IsSharedTor | Switch attribute corresponding to a node | String | - |
| NodeIP | Node IP address | String | - |
| NodeName | Node name | String | - |
| JobName | Job name | String | - |

vcjob-fault-npu-cm

Table 2 Description of the vcjob-fault-npu-cm fields

Parameter

Description

Value

Remarks

fault-node

Information about the faulty node

-

-

- NodeName

Node name

String

-

- UpdateTime

-

64-bit integer

-

- UnhealthyNPU

Faulty processors on the faulty node

String slice

-

- NetworkUnhealthyNPU

Processors with faulty network on the faulty node

String slice

-

- NodeDEnable

Whether to detect node status

  • True
  • False

-

- NodeHealthState

Node health status

String

-

FaultDeviceList

-

-

-

- fault_type

Fault type object

  • CardUnhealthy: processor fault
  • CardNetworkUnhealthy: processor network fault
  • NodeUnhealthy: node fault

-

- npu_name

Name of the faulty processor. This parameter is left empty if the node is faulty.

String

-

- fault_level

Fault handling type. This parameter is left empty for node faults.

  • NotHandleFault: requires no handling.
  • RestartRequest: re-executes inference requests in the inference scenario, or re-executes training services in the training scenario.
  • RestartBusiness: re-executes services.
  • FreeRestartNPU: resets an idle processor when faults affect service execution.
  • RestartNPU: directly resets processors and re-executes services.
  • SeparateNPU: isolates processors.
  • PreSeparateNPU: pre-isolates processors and determines whether to perform rescheduling based on the actual running status of the training job.
NOTE:
  • The functions of large_model_fault_level, fault_level and fault_handling are the same. fault_handling is recommended.
  • When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

- fault_handling

- large_model_fault_level

- fault_code

Fault code, a string of characters separated by commas (,).

String

  • Disconnected: The processor network is disconnected.
  • heartbeatTimeOut: The node status is lost.

remain-retry-times

Information about the remaining number of times that jobs can be rescheduled

-

-

- UUID

Job UID

String

-

- Times

Remaining number of times that the job can be rescheduled

Integer

-
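The fault_code format, the equivalence of the three handling fields, and the 60-second rule from the NOTE in Table 2 can be sketched as follows. This is a minimal illustration, not the scheduler's implementation: it assumes one FaultDeviceList entry has already been parsed into a dictionary keyed by the parameter names above, and the function names and the `fault_duration_s` parameter are hypothetical.

```python
from typing import Dict, List

# Threshold from the NOTE in Table 2: a RestartRequest fault that lasts
# 60 seconds or less does not trigger rescheduling of an inference job.
RESTART_REQUEST_GRACE_S = 60


def split_fault_codes(fault_code: str) -> List[str]:
    """Split the comma-separated fault_code string into individual codes."""
    return [code.strip() for code in fault_code.split(",") if code.strip()]


def handling_of(device: Dict[str, str]) -> str:
    """Return the fault handling type of one FaultDeviceList entry.

    fault_handling is read first because it is the recommended field;
    fault_level and large_model_fault_level carry the same information.
    """
    for key in ("fault_handling", "fault_level", "large_model_fault_level"):
        value = device.get(key)
        if value:
            return value
    return ""  # the field is left empty for node faults


def restart_request_triggers_reschedule(fault_duration_s: float) -> bool:
    """Apply the 60-second rule to a RestartRequest fault on an inference card.

    How the duration is measured is not described in this section, so the
    caller is assumed to supply it.
    """
    return fault_duration_s > RESTART_REQUEST_GRACE_S


# Illustrative entry; the npu_name value is made up.
device = {
    "fault_type": "CardNetworkUnhealthy",
    "npu_name": "example-npu-0",
    "fault_handling": "RestartRequest",
    "fault_code": "Disconnected",
}

if handling_of(device) == "RestartRequest":
    print(split_fault_codes(device["fault_code"]))
    print(restart_request_triggers_reschedule(fault_duration_s=45))
```

Only the RestartRequest case is modeled because it is the only handling type for which this section states a timing rule.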

reset-config-<job name>

The MindCluster cluster scheduling components write information such as the device status and training job status to the reset-config-<job name> ConfigMap through Kubernetes, and the ConfigMap is mapped into the container. Elastic Agent reads this information to detect and handle faults.

Table 3 reset-config-<job name>

Field

Parameter

Description

Value

Remarks

reset.json

RankList

Processor list

-

-

RankId

Rank information used by the faulty job

Integer

-

LogicId

Logic ID of a processor

32-bit integer

-

Status

Processor status

  • unrecovered: not recovered
  • recovered: recovered successfully
  • failed: recovery failed

-

Policy

Hot reset policy

  • empty: no fault found
  • ignore: ignore the fault.
  • restart_request: re-execute the current request.
  • restart: re-execute the training job.
  • free_reset: restart the device when no job is running on the NPU.
  • reset: restart the device.
  • isolate: isolate the device.

-

InitialPolicy

Initial hot reset policy

  • empty: no fault found
  • ignore: ignore the fault.
  • restart_request: re-execute the current request.
  • restart: re-execute the training job.
  • free_reset: restart the device when no job is running on the NPU.
  • reset: restart the device.
  • isolate: isolate the device.

-

ErrorCode

Decimal fault code

64-bit integer array

-

GracefulExit

Policy for handling training processes

The value is 0 or 1.

  • The value 1 indicates that all training processes are killed.
  • The value 0 indicates that no action is performed.

-

FaultFlushing

Notifies Elastic Agent whether a fault is being updated.

The value can be true or false.

  • true: A fault is being updated.
  • false: No fault is being updated.

Elastic Agent starts a training process only when the value of this field is false and RankList does not contain a fault on this node.

RestartFaultProcess

Notifies Elastic Agent whether only the faulty process on the current node is restarted.

The value can be true or false.

  • true: When the current node is faulty, only the faulty process is restarted.
  • false: When the current node is faulty, all processes on the node and Elastic Agent exit.

This field takes effect only when RankList contains a fault on this node.

ErrorCodeHex

Hexadecimal fault code

String

-

restartType

-

reset.json update type

podReschedule or hotReset

podReschedule is used for single-pod rescheduling, and hotReset is used for hot reset.
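The interplay between FaultFlushing, RestartFaultProcess, and RankList can be sketched as a small decision function. This is an illustration under assumptions, not the Elastic Agent logic: the JSON layout (top-level flags next to RankList), the sample values, the returned action labels, and the `local_rank_ids` parameter are all hypothetical, and the ranks that belong to the current node are assumed to be known to the caller.

```python
import json
from typing import Iterable

# Illustrative reset.json content. The layout is an assumption based on
# Table 3, and every value is made up.
SAMPLE_RESET_JSON = """
{
  "RankList": [
    {"RankId": 3, "LogicId": 1, "Status": "unrecovered",
     "Policy": "restart", "InitialPolicy": "restart"}
  ],
  "FaultFlushing": false,
  "RestartFaultProcess": true
}
"""


def node_has_fault(reset: dict, local_rank_ids: Iterable[int]) -> bool:
    """Return True if RankList contains a rank that runs on this node."""
    local = set(local_rank_ids)
    return any(entry.get("RankId") in local for entry in reset.get("RankList", []))


def decide_action(reset: dict, local_rank_ids: Iterable[int]) -> str:
    """Map the reset.json flags to a hypothetical action label."""
    if node_has_fault(reset, local_rank_ids):
        # RestartFaultProcess takes effect only when this node is faulty.
        # Checking it before FaultFlushing is an assumption of this sketch.
        if reset.get("RestartFaultProcess"):
            return "restart_faulty_process"
        return "exit_all_processes_and_agent"
    # No fault on this node: start training only when no fault is being updated.
    if not reset.get("FaultFlushing", False):
        return "start_training"
    return "wait"


reset = json.loads(SAMPLE_RESET_JSON)
print(decide_action(reset, local_rank_ids=[3]))
print(decide_action(reset, local_rank_ids=[0, 1]))
```

GracefulExit is left out on purpose because this section does not state how it ranks against the other two flags.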

mindx-dl/job-reschedule-reason

This ConfigMap is used to record historical job rescheduling information. By default, the latest ten rescheduling records of a job are saved. When the ConfigMap content exceeds 950 KB, the earliest record of each job is deleted in sequence.
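The retention rule described above can be sketched as follows. This is a minimal illustration, not the component's implementation: it assumes the records are held as one list per job in a dictionary, that the size is measured on the JSON-serialized content, that 1 KB is 1024 bytes, and that "in sequence" means removing the earliest record of one job after another until the content fits. The function names are hypothetical.

```python
import json
from typing import Dict, List

MAX_RECORDS_PER_JOB = 10          # default number of records kept per job
MAX_CONFIGMAP_BYTES = 950 * 1024  # 950 KB, assuming 1 KB = 1024 bytes


def serialized_size(jobs: Dict[str, List[dict]]) -> int:
    """Size in bytes of the JSON-serialized records (an assumed measure)."""
    return len(json.dumps(jobs).encode("utf-8"))


def add_record(jobs: Dict[str, List[dict]], job_key: str, record: dict,
               max_bytes: int = MAX_CONFIGMAP_BYTES) -> None:
    """Append a record, then enforce the per-job and total-size limits."""
    records = jobs.setdefault(job_key, [])
    records.append(record)
    # Keep only the latest ten records of this job.
    del records[:-MAX_RECORDS_PER_JOB]

    # While the content is too large, delete the earliest record of each
    # job in turn and stop as soon as the content fits.
    while serialized_size(jobs) > max_bytes:
        removed = False
        for job_records in jobs.values():
            if job_records:
                job_records.pop(0)
                removed = True
                if serialized_size(jobs) <= max_bytes:
                    break
        if not removed:
            break  # nothing left to delete


history: Dict[str, List[dict]] = {}
for i in range(12):
    add_record(history, "default/example-job", {"seq": i})
print(len(history["default/example-job"]))
```

The `max_bytes` parameter exists only so that the size limit can be exercised with small inputs.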

Table 4 Job field description

| Field | Parameter | Description | Value | Remarks |
| --- | --- | --- | --- | --- |
| Job ns/name | - | Name of the job to be rescheduled | String | - |
| | JobID | Job ID | String | - |
| | TotalRescheduleTimes | Total number of times that the job has been rescheduled during the Volcano lifetime | Integer | - |
| | RescheduleRecords | Detailed information about job rescheduling | - | - |

Table 5 RescheduleRecords description

| Field | Parameter | Description | Value | Remarks |
| --- | --- | --- | --- | --- |
| RescheduleRecords | LogFileFormatTime | Rescheduling time recorded in the Volcano log format | String | - |
| | RescheduleTimeStamp | Timestamp when rescheduling occurs | String | - |
| | ReasonOfTask | Detailed information about the rescheduling | - | - |

Table 6 ReasonOfTask description

| Field | Parameter | Description | Value | Remarks |
| --- | --- | --- | --- | --- |
| ReasonOfTask | RescheduleReason | Reason for rescheduling | String | - |
| | PodName | Pod that is triggered first during rescheduling | String | - |
| | NodeName | Node name | String | Node that is triggered first during rescheduling |
| | NodeRankIndex | Rank of the node that is first triggered during rescheduling in a training process | String | - |
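The nesting described in Tables 4 to 6 (job, then RescheduleRecords, then ReasonOfTask) can be sketched with a few data classes. This is an illustration under assumptions: it supposes that the value stored for one job is JSON with the parameter names above as keys, and that RescheduleRecords and ReasonOfTask are lists. The class names, the helper functions, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskReason:
    reschedule_reason: str
    pod_name: str
    node_name: str
    node_rank_index: str  # listed as String in Table 6


@dataclass
class RescheduleRecord:
    log_file_format_time: str
    reschedule_timestamp: str  # listed as String in Table 5
    reasons: List[TaskReason]


@dataclass
class JobRescheduleInfo:
    job_id: str
    total_reschedule_times: int
    records: List[RescheduleRecord]


def parse_job(raw: dict) -> JobRescheduleInfo:
    """Build the nested structure from one job's parsed JSON value."""
    records = [
        RescheduleRecord(
            log_file_format_time=rec["LogFileFormatTime"],
            reschedule_timestamp=rec["RescheduleTimeStamp"],
            reasons=[
                TaskReason(
                    reschedule_reason=r["RescheduleReason"],
                    pod_name=r["PodName"],
                    node_name=r["NodeName"],
                    node_rank_index=r["NodeRankIndex"],
                )
                for r in rec.get("ReasonOfTask", [])
            ],
        )
        for rec in raw.get("RescheduleRecords", [])
    ]
    return JobRescheduleInfo(raw["JobID"], raw["TotalRescheduleTimes"], records)


def triggering_nodes(job: JobRescheduleInfo) -> List[str]:
    """List the NodeName of every recorded reason, oldest record first."""
    return [reason.node_name for record in job.records for reason in record.reasons]


# Made-up sample value for one job.
raw_job = {
    "JobID": "example-job-id",
    "TotalRescheduleTimes": 1,
    "RescheduleRecords": [
        {
            "LogFileFormatTime": "example-log-time",
            "RescheduleTimeStamp": "example-timestamp",
            "ReasonOfTask": [
                {
                    "RescheduleReason": "example-reason",
                    "PodName": "example-pod-0",
                    "NodeName": "example-node-1",
                    "NodeRankIndex": "0",
                }
            ],
        }
    ],
}

job = parse_job(raw_job)
print(triggering_nodes(job))
```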