Job Information
Fault-config-<job name>

| Field | Meaning | Value | Remarks |
|---|---|---|---|
| fault-npus | Rank information of the faulty processor used by a faulty job | String | - |
| checkCode | Verification code | String | - |
reset-config-<job name>

| Field | Parameter | Meaning | Value | Remarks |
|---|---|---|---|---|
| reset.json | RankList | Processor list | - | - |
| | -RankId | Rank information used by the faulty job | Integer | - |
| | -LogicId | Logic ID of a processor | 32-bit integer | - |
| | -Status | Processor status | | - |
| | -Policy | Hot reset policy | | - |
| | -InitialPolicy | Initial hot reset policy | | - |
| | -ErrorCode | Decimal fault code | 64-bit integer array | - |
| | -ErrorCodeHex | Hexadecimal fault code | String | - |
| | GracefulExit | Management policy for training processes | The value is 0 or 1. | - |
| | UpdateTime | ConfigMap update time | - | - |
| | RetryTime | Number of pod rescheduling times | Integer | - |
| | FaultFlushing | Notifies Elastic Agent whether a fault is being updated. | The value can be true or false. | Elastic Agent starts a training process only when this field is false and the faulty RankList does not contain a fault on this node. |
| | RestartFaultProcess | Notifies Elastic Agent whether only the faulty process on the current node is restarted. | The value can be true or false. | - |
| restartType | - | reset.json update type | podReschedule or hotReset | podReschedule is used for single-pod rescheduling, and hotReset is used for hot reset. |
| checkCode | - | Verification code | String | - |
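The FaultFlushing remark in the table above can be read as a simple gate on starting training. The following is a minimal Python sketch of that reading, not the Elastic Agent implementation. It assumes reset.json is a JSON object whose top-level keys are the parameters listed in the table, that FaultFlushing is a JSON boolean, and that the caller already knows which rank IDs run on the current node. The sample values, the `may_start_training` helper, and the `local_rank_ids` argument are hypothetical.

```python
import json

# Hypothetical reset.json content: the field names come from the table,
# but every value here is invented for illustration.
SAMPLE_RESET_JSON = """
{
  "RankList": [
    {"RankId": 3, "LogicId": 1}
  ],
  "RetryTime": 1,
  "FaultFlushing": false,
  "RestartFaultProcess": false
}
"""


def may_start_training(reset: dict, local_rank_ids: set) -> bool:
    """Return True only if no fault update is in progress and no rank on
    this node appears in RankList (an assumed reading of the remark)."""
    # Assumption: a missing FaultFlushing key is treated as "still updating".
    if bool(reset.get("FaultFlushing", True)):
        return False
    faulty_ranks = {entry["RankId"] for entry in reset.get("RankList", [])}
    return faulty_ranks.isdisjoint(local_rank_ids)


reset = json.loads(SAMPLE_RESET_JSON)
print(may_start_training(reset, {0, 1}))  # faulty rank 3 is not on this node
print(may_start_training(reset, {2, 3}))  # faulty rank 3 is on this node
```

Treating a missing FaultFlushing key as true is a conservative choice made for this sketch, so that an incomplete file never allows a start.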
data-trace-<job name>
This ConfigMap stores the on/off switch for each type of dotting data. Ascend Device Plugin mounts it to the compute node as a file. After the file is mounted to the training container, TaskD reads it to control the dotting data.

| Field | Meaning | Value | Type |
|---|---|---|---|
| Communication | Communication operator | on/off | String |
| Step | Step latency | on/off | String |
| SaveCheckpoint | Time taken by SaveCheckpoint | on/off | String |
| FP | Forward propagation data | on/off | String |
| DataLoader | Time taken by DataLoader | on/off | String |
- This ConfigMap must be in the same namespace as the training job, be named data-trace-<job name>, and carry the label reset=true.
- Ascend Device Plugin mounts this ConfigMap to the /user/cluster-info/datatrace-config/namespace.data-trace-job name/* folder on the training node. The mounted file is named profilingSwitch.
- If this ConfigMap is not created, ClusterD automatically creates it the first time the gRPC interface ModifyTrainingDataTraceSwitch is called.
- To use this function, mount the profilingSwitch file on the node to /user/cluster-info/datatrace-config/ in the container in hostPath mode.
- Currently, Step, SaveCheckpoint, FP, and DataLoader are enabled by default. These four types of data must be enabled or disabled at the same time. If all five types of data are set to off, dotting is completely disabled. Otherwise, these four types remain enabled, while communication operator dotting follows the Communication switch.
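
The switch rule in the last item can be sketched in Python as follows. This is an illustration of the rule as stated, not TaskD code. The on-disk format of profilingSwitch is not described here, so the sketch takes an already-parsed dictionary of on/off strings. The `effective_switches` helper is hypothetical, and treating a missing Communication key as off is an assumption.

```python
# The four types that are switched together, plus Communication.
GROUPED = ("Step", "SaveCheckpoint", "FP", "DataLoader")
ALL_TYPES = ("Communication",) + GROUPED


def effective_switches(configured: dict) -> dict:
    """Map configured on/off strings to the switches that take effect."""
    # The four grouped types are enabled by default.
    values = {name: configured.get(name, "on") for name in GROUPED}
    # Assumption: a missing Communication key counts as "off".
    values["Communication"] = configured.get("Communication", "off")

    # Only when all five are off is dotting completely disabled.
    if all(value == "off" for value in values.values()):
        return {name: "off" for name in ALL_TYPES}

    # Otherwise the four grouped types stay on, and Communication
    # follows its own switch.
    result = {name: "on" for name in GROUPED}
    result["Communication"] = values["Communication"]
    return result


all_off = {name: "off" for name in ALL_TYPES}
print(effective_switches(all_off))
print(effective_switches({"Step": "off", "Communication": "on"}))
```

In the second call, Step is configured as off but still takes effect as on, because the other grouped types are not all off.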
steptime-dtpgroup
This file stores and controls the iteration latency and group information of a job. When the job is started, you can configure ConfigMap parameters through the CCAE management platform to determine whether job performance is degraded.

| Level-1 Parameter | Level-2 Parameter | Meaning | Value | Remarks |
|---|---|---|---|---|
| data | PerfDumpPath | Path for saving iteration latency and group information | String | - |
| | PerfDumpConfig | Switch of iteration latency and group information | String | - |