Slow Node Diagnosis
Description
If training performance on nodes in an AI cluster deteriorates, this feature detects slow nodes caused by compute or network problems in real time, so that you can isolate them through switchover or other mitigation methods.
Currently, only integration with ClusterD and NodeD for online deployment is supported. For details about how to deploy ClusterD and NodeD, see Installation and Deployment.
The feature consists of the following components:
- Slow node algorithm: Identifies real-time performance degradation from key training metrics, and distinguishes slow-compute issues from slow-communication issues by analyzing the synchronization between their respective operators.
- Slow node cleaning: Converts and cleans the incremental data on a node and writes the cleaning result to a CSV file.
- Slow node scheduling: Orchestrates the overall slow node diagnosis process and controls both data cleaning and the slow node algorithm.
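The idea behind the slow node algorithm can be illustrated with a small sketch. It is not the product's implementation, which this document does not describe. It only shows one plausible way the parameters listed in Table 1 (normalNumber, nSigma, degradationPercentage, and nConsecAnomaliesSignifySlow) could work together. The function name detect_slow, the use of per-iteration step time as the metric, and the thresholding rule are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def detect_slow(step_times, normal_number=20, n_sigma=3,
                degradation_percentage=0.3, n_consec=3):
    """Return the index at which a slowdown is confirmed, or None.

    Hypothetical logic (an assumption, not the documented algorithm):
    the first `normal_number` samples form the baseline. A later sample is
    anomalous when it exceeds both baseline_mean + n_sigma * std and
    baseline_mean * (1 + degradation_percentage). A slowdown is confirmed
    after `n_consec` anomalous samples in a row.
    """
    if len(step_times) <= normal_number:
        return None  # not enough data beyond the baseline window

    baseline = step_times[:normal_number]
    mu = mean(baseline)
    sigma = pstdev(baseline)
    # Taking the max means a sample must exceed both thresholds.
    upper = max(mu + n_sigma * sigma, mu * (1 + degradation_percentage))

    consecutive = 0
    for i in range(normal_number, len(step_times)):
        if step_times[i] > upper:
            consecutive += 1
            if consecutive >= n_consec:
                return i
        else:
            consecutive = 0  # an in-range sample resets the streak
    return None
```

Requiring several consecutive anomalies keeps a single noisy iteration from flagging a node, which is presumably why the configuration exposes a consecutive-anomaly count.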
Usage Example
To start a slow node diagnosis task, perform the following steps:
- Add a function call that obtains parallel domain information to the training loop of the training script. The following uses the PyTorch-MindSpeed scenario as an example: in the ./mindspeed_llm/training/training.py file, add the m_iter counter lines and the dump_group_info import and call shown in the code.
    def train(forward_step_func, model, optimizer, opt_param_scheduler,
              train_data_iterator, valid_data_iterator,
              process_non_loss_data_func, config):
        ……
        if is_profile_enabled():
            prof = get_profiler()
            prof.start()
        m_iter = 0
        while iteration < args.train_iters:
            ……
            args.curr_iteration = iteration
            loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \
                train_step(forward_step_func,
                           train_data_iterator,
                           model,
                           optimizer,
                           opt_param_scheduler,
                           config)
            iteration += 1
            m_iter += 1
            if m_iter == 5:
                from taskd.python.adaptor.pytorch.group_info import dump_group_info
                dump_group_info()
            batch_size = mpu.get_data_parallel_world_size() * \
                         args.micro_batch_size * \
                         get_num_microbatches()
- Complete the operations in Before You Start and Deployment Mode.
- Run the kubectl apply -f ajob-2pod-16npu.yaml command to create a slow node diagnosis task and write the task to the ConfigMap.

- Check the content of ajob-2pod-16npu.yaml. For details about the fields, see Table 1.

The following YAML example is for reference only. Adapt it to your job before applying it; do not copy it as is.
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      # The value of jobName must be the same as the name attribute of the following job.
      # The prefix ras-feature-slownode- cannot be modified.
      name: ras-feature-slownode-default-test-pytorch-2pod-16npu
      namespace: mindx-dl
      labels:
        fd-ol-slow-node: "true"
    data:
      FeatConf: |
        {"jobName":"default-test-pytorch-2pod-16npu","jobNamespace":"default","normalNumber":20,"nSigma":3,"degradationPercentage":0.3,"nConsecAnomaliesSignifySlow":3,"nSecondsDoOneDetection":30,"clusterMeanDistance":1.3,"cardOneNode":16,"SlowNode":1}
    ---

Table 1 Field description of the YAML file

| Field | Default Value | Description |
|---|---|---|
| jobNamespace | default | Job namespace |
| jobName | - | Job name |
| normalNumber | 20 | Initial computing threshold |
| nSigma | 3 | Number of sigmas used to calculate the upper and lower thresholds |
| degradationPercentage | 0.3 | Deterioration rate. A value of 0.3 represents a 30% performance drop. |
| nConsecAnomaliesSignifySlow | 3 | Number of consecutive anomalies. A slow node is reported only after this many anomalies occur in a row. |
| nSecondsDoOneDetection | 30 | Detection interval, in seconds |
| clusterMeanDistance | 1.3 | Threshold for the distance between the means (mean1 and mean2) of the two clusters after clustering |
| cardOneNode | 16 | Number of cards on a node |
| slowNode | 1 | Whether to enable the task. 1: enabled; 0: disabled |
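Because the FeatConf value is a single-line JSON string and the ConfigMap name must be the fixed prefix followed by the job name, it can be convenient to generate the manifest rather than edit it by hand. The following is a minimal sketch under that assumption. The helper name build_configmap is hypothetical, the defaults are copied from the YAML example, and the SlowNode key is spelled as it appears in that example. The output is a JSON manifest, which kubectl apply -f also accepts.

```python
import json

# Fixed prefix required in the ConfigMap name.
PREFIX = "ras-feature-slownode-"


def build_configmap(job_name, job_namespace="default", **overrides):
    """Build a slow node diagnosis ConfigMap as a Python dict.

    Hypothetical helper: defaults mirror the YAML example in this section.
    Keyword overrides must use the FeatConf field names exactly.
    """
    feat_conf = {
        "jobName": job_name,
        "jobNamespace": job_namespace,
        "normalNumber": 20,
        "nSigma": 3,
        "degradationPercentage": 0.3,
        "nConsecAnomaliesSignifySlow": 3,
        "nSecondsDoOneDetection": 30,
        "clusterMeanDistance": 1.3,
        "cardOneNode": 16,
        "SlowNode": 1,  # key spelled as in the YAML example
    }
    unknown = set(overrides) - set(feat_conf)
    if unknown:
        raise KeyError(f"unknown FeatConf fields: {sorted(unknown)}")
    feat_conf.update(overrides)

    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": PREFIX + job_name,
            "namespace": "mindx-dl",
            "labels": {"fd-ol-slow-node": "true"},
        },
        # FeatConf is stored as a compact JSON string, as in the example.
        "data": {"FeatConf": json.dumps(feat_conf, separators=(",", ":"))},
    }


if __name__ == "__main__":
    manifest = build_configmap("default-test-pytorch-2pod-16npu")
    print(json.dumps(manifest, indent=2))
```

Rejecting unknown field names early avoids silently writing a misspelled key that the diagnosis service would presumably ignore.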
Querying Slow Node Diagnosis Results
After creating a slow node diagnosis task, you can query the logs of ClusterD and NodeD to view the task details.
Method 1: Querying Slow Node Diagnosis Logs in a Cluster Using Kubernetes Logs
- Run the kubectl get pods -n mindx-dl command to list the ClusterD and NodeD pods.

- Run the kubectl logs -n mindx-dl clusterd-7d5db546d8-kdslz | grep "got degradation, slow rank" command to query the log data. Replace the pod name with the ClusterD pod name returned in the previous step.
- Check the log information. If matching log entries are returned, the node has deteriorated.

Method 2: Querying Slow Node Diagnosis Logs in a Cluster Using Flushed Logs
- Run the cat /var/log/mindx-dl.clusterd.clusterd.log | grep "got degradation, slow rank" command to query the log data.
- Check the log information. If matching log entries are returned, the node has deteriorated.

Method 3: Querying Slow Node Diagnosis Logs on a Node
Run the kubectl logs -n mindx-dl node-9ld8k | grep "is degradation" command to query the log data, replacing the pod name with the name of your NodeD pod. If matching log entries are returned, the node has deteriorated.
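The kubectl-based checks above can also be scripted. The following is a minimal sketch that assumes kubectl is on the PATH and already configured for the cluster. The function names are hypothetical, and the two marker strings are the ones used in the grep commands in this section. No particular log line format is assumed beyond the presence of the marker.

```python
import subprocess

# Marker strings taken from the grep commands in this section.
CLUSTERD_MARKER = "got degradation, slow rank"
NODED_MARKER = "is degradation"


def filter_lines(lines, marker):
    """Keep only the lines that contain the marker (same effect as grep)."""
    return [line for line in lines if marker in line]


def pod_degradation_lines(pod, marker, namespace="mindx-dl"):
    """Fetch a pod's logs with kubectl and return the lines with the marker.

    Raises subprocess.CalledProcessError if kubectl exits with an error,
    for example when the pod name does not exist.
    """
    result = subprocess.run(
        ["kubectl", "logs", "-n", namespace, pod],
        capture_output=True,
        text=True,
        check=True,
    )
    return filter_lines(result.stdout.splitlines(), marker)
```

For example, calling pod_degradation_lines with a ClusterD pod name and CLUSTERD_MARKER returns a non-empty list when ClusterD has reported a slow rank.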

Known Slow Node Faults
| Fault Code | Fault Description | Fault Level |
|---|---|---|
| 110001010 | Slow node fault (single-event reporting) | SubHealthFault |
| 100001011 | Deterioration rectified | NotHandleFault |