Slow Node Diagnosis

Description

If the training performance of a node in an AI cluster deteriorates, this feature detects slow nodes caused by compute or network problems in real time, so that you can isolate them through switchover or other mitigation measures.

Currently, only integration with ClusterD and NodeD for online deployment is supported. For details about how to deploy ClusterD and NodeD, see Installation and Deployment.

The feature consists of the following components:
Slow node algorithm: Identifies performance degradation in real time from key training metrics and determines whether the cause is slow computation or slow communication by analyzing the synchronization between compute operators and communication operators.
Slow node cleaning: Converts and cleans the incremental data on a node and generates a CSV file that contains the cleaning result.
Slow node scheduling: Orchestrates the overall slow node diagnosis process and controls data cleaning and the slow node algorithm.

Usage Example

To start a slow node diagnosis task, perform the following steps:

In the training script, add a function call that obtains the parallel domain information during the training iterations. The following uses the PyTorch-MindSpeed scenario as an example: in the ./mindspeed_llm/training/training.py file, add the m_iter counter (m_iter = 0 and m_iter += 1) and the conditional call to dump_group_info() to the train function, as shown below.

def train(forward_step_func, model, optimizer, opt_param_scheduler,
          train_data_iterator, valid_data_iterator,
          process_non_loss_data_func, config):
    ……
    if is_profile_enabled():
        prof = get_profiler()
        prof.start()
    m_iter = 0  # Added: counts the iterations executed in this run
    while iteration < args.train_iters:
        ……
        args.curr_iteration = iteration
        loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \
            train_step(forward_step_func,
                       train_data_iterator,
                       model,
                       optimizer,
                       opt_param_scheduler,
                       config)
        iteration += 1
        m_iter += 1
        # Added: obtain the parallel domain information once, at the fifth iteration of this run
        if m_iter == 5:
            from taskd.python.adaptor.pytorch.group_info import dump_group_info
            dump_group_info()
        batch_size = mpu.get_data_parallel_world_size() * \
                     args.micro_batch_size * \
                     get_num_microbatches()

Complete the operations described in Before You Start and Deployment Mode.
Run the kubectl apply -f ajob-2pod-16npu.yaml command to create a slow node diagnosis task and write the task to the ConfigMap.

Check the content of ajob-2pod-16npu.yaml. For details about the fields, see Table 1.

The following YAML example is for reference only and is not intended to be copied and applied as is.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ras-feature-slownode-default-test-pytorch-2pod-16npu    # The value of JobName must be the same as the name attribute of the following job. The prefix ras-feature-slownode- cannot be modified.
  namespace: mindx-dl
  labels:
    fd-ol-slow-node: "true"
data:
  FeatConf: |
    {"jobName":"default-test-pytorch-2pod-16npu","jobNamespace":"default","normalNumber":20,"nSigma":3,"degradationPercentage":0.3,"nConsecAnomaliesSignifySlow":3,"nSecondsDoOneDetection":30,"clusterMeanDistance":1.3,"cardOneNode":16,"SlowNode":1}
---
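
If the ConfigMap is created from a script rather than written by hand, the same structure can be generated programmatically. The following Python sketch is an illustration only and is not part of the product. The helper name build_slownode_configmap is hypothetical, and the sketch assumes that the ConfigMap name is the fixed prefix ras-feature-slownode- followed by the jobName value, as in the example above. The default values are copied from that example.

```python
import json

# Fixed prefix that the documentation says cannot be modified.
PREFIX = "ras-feature-slownode-"

# Values copied from the YAML example; adjust them for your job.
DEFAULTS = {
    "jobNamespace": "default",
    "normalNumber": 20,
    "nSigma": 3,
    "degradationPercentage": 0.3,
    "nConsecAnomaliesSignifySlow": 3,
    "nSecondsDoOneDetection": 30,
    "clusterMeanDistance": 1.3,
    "cardOneNode": 16,
    "SlowNode": 1,
}


def build_slownode_configmap(job_name, **overrides):
    """Return the ConfigMap as a dict (hypothetical helper, not a product API)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown FeatConf fields: {sorted(unknown)}")
    feat_conf = {"jobName": job_name, **DEFAULTS, **overrides}
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": PREFIX + job_name,
            "namespace": "mindx-dl",
            "labels": {"fd-ol-slow-node": "true"},
        },
        # FeatConf is stored as a single-line JSON string, as in the example.
        "data": {"FeatConf": json.dumps(feat_conf, separators=(",", ":"))},
    }


configmap = build_slownode_configmap("default-test-pytorch-2pod-16npu")
print(json.dumps(configmap, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output can be saved to a file and passed to kubectl apply -f.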

**Table 1** Field description of the YAML file
Field	Default Value	Description
jobNamespace	default	Job namespace
jobName	-	Job name
normalNumber	20	Initial computing threshold
nSigma	3	Number of sigmas for calculating upper and lower thresholds
degradationPercentage	0.3	Deterioration rate. A value of 0.3 represents a 30% performance drop.
nConsecAnomaliesSignifySlow	3	Number of consecutive anomalies. A slow node is reported only after anomalies occur this many times in a row.
nSecondsDoOneDetection	30	Detection interval, in seconds.
clusterMeanDistance	1.3	Threshold for the distance between the means (mean1 and mean2) of the two clusters after clustering
cardOneNode	16	Number of cards on a node
slowNode	1	Whether to enable the task. 1: enabled 0: disabled
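
This document does not describe how the thresholds in Table 1 are combined. As an illustration only, the following Python sketch shows one plausible reading: a baseline is built from the first normalNumber samples, a sample is treated as anomalous when it exceeds both the nSigma limit and the degradationPercentage limit, and the node is reported as slow after nConsecAnomaliesSignifySlow consecutive anomalies. The class name SlowDetector, the use of step duration as the metric, and the exact rule are assumptions and do not represent the actual ClusterD algorithm.

```python
import statistics


class SlowDetector:
    """Hypothetical detector; parameter names mirror the fields in Table 1."""

    def __init__(self, normal_number=20, n_sigma=3,
                 degradation_percentage=0.3, n_consec=3):
        self.normal_number = normal_number
        self.n_sigma = n_sigma
        self.degradation_percentage = degradation_percentage
        self.n_consec = n_consec
        self.baseline = []    # step durations observed while learning "normal"
        self.consecutive = 0  # current run of anomalous samples

    def observe(self, step_seconds):
        """Feed one step duration; return True while the node is considered slow."""
        if len(self.baseline) < self.normal_number:
            self.baseline.append(step_seconds)
            return False
        mean = statistics.fmean(self.baseline)
        std = statistics.pstdev(self.baseline)
        # Assumed rule: a sample must exceed both limits to count as anomalous.
        limit = max(mean + self.n_sigma * std,
                    mean * (1 + self.degradation_percentage))
        if step_seconds > limit:
            self.consecutive += 1
        else:
            self.consecutive = 0
        return self.consecutive >= self.n_consec


detector = SlowDetector()
samples = [1.0] * 20 + [1.5, 1.5, 1.5, 1.0]
results = [detector.observe(s) for s in samples]
print(results[-4:])
```

In this sketch, a single slow step does not trigger a report. Only a run of consecutive anomalies does, which matches the intent of nConsecAnomaliesSignifySlow.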

Querying Slow Node Diagnosis Results

After creating a slow node diagnosis task, you can query the logs of ClusterD and NodeD to view the task details.

Method 1: Querying Slow Node Diagnosis Logs in a Cluster Using Kubernetes Logs

Run the kubectl get pods -n mindx-dl command to obtain the names of the ClusterD and NodeD pods.
Then, run the kubectl logs -n mindx-dl clusterd-7d5db546d8-kdslz | grep "got degradation, slow rank" command to query the log data. Replace clusterd-7d5db546d8-kdslz with the actual name of the ClusterD pod.
Check the log information. If matching log entries are displayed, the performance of the node has deteriorated.

Method 2: Querying Slow Node Diagnosis Logs in a Cluster Using Flushed Logs

Run the cat /var/log/mindx-dl.clusterd.clusterd.log | grep "got degradation, slow rank" command to query the log data.
Check the log information. If matching log entries are displayed, the performance of the node has deteriorated.

Method 3: Querying Slow Node Diagnosis Logs on a Node

Run the kubectl logs -n mindx-dl node-9ld8k | grep "is degradation" command to query the log data. If matching log entries are displayed, the performance of the node has deteriorated.
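
When logs are collected by a script, the same filtering can be done without grep. The following Python sketch is an illustration only. The function name find_degradation_lines is hypothetical, and only the two marker strings are taken from the commands above. The sketch makes no assumption about the rest of the log line format, and the sample lines are placeholders rather than real log output.

```python
# Marker strings taken from the grep commands in Methods 1 to 3.
CLUSTERD_MARKER = "got degradation, slow rank"
NODED_MARKER = "is degradation"


def find_degradation_lines(lines, markers=(CLUSTERD_MARKER, NODED_MARKER)):
    """Return the lines that contain any marker (hypothetical helper)."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line for marker in markers)]


# Placeholder input; real log lines have a product-specific format.
sample = [
    "<placeholder> got degradation, slow rank <placeholder>\n",
    "<placeholder> unrelated entry\n",
    "<placeholder> is degradation\n",
]
for hit in find_degradation_lines(sample):
    print(hit)
```

To apply the function to a flushed log file, open the file and pass the file object as lines, because a file object yields one line per iteration.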

Known Slow Node Faults

Fault Code	Fault Description	Fault Level
110001010	Slow node fault (single-event reporting)	SubHealthFault
100001011	Deterioration rectified	NotHandleFault
