Slow Node Diagnosis

Description

When training performance on nodes in an AI cluster deteriorates, this feature detects slow nodes caused by compute or network problems in real time, so that you can isolate them through switchover or other mitigation methods.

Currently, the feature supports only online deployment integrated with ClusterD and NodeD. For details about how to deploy ClusterD and NodeD, see Installation and Deployment.

The feature consists of the following components:

  • Slow node algorithm: Identifies performance degradation in real time from key training metrics, and distinguishes slow computation from slow communication by analyzing the synchronization between compute operators and communication operators.
  • Slow node cleaning: Converts and cleans the incremental data on a node and writes the cleaning result to a CSV file.
  • Slow node scheduling: Orchestrates the overall slow node diagnosis process, controlling both data cleaning and the slow node algorithm.
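As a rough illustration of threshold-based degradation detection, the sketch below flags a slowdown when a throughput metric stays below a baseline-derived limit for several consecutive samples. It is not the product's algorithm: the function name, the choice of throughput as the metric, and the way the thresholds are combined are all assumptions. Only the parameter names mirror the fields normalNumber, nSigma, degradationPercentage, and nConsecAnomaliesSignifySlow described in Table 1 under Usage Example.

```python
from statistics import mean, pstdev


def detect_degradation(throughputs, normal_number=20, n_sigma=3,
                       degradation_percentage=0.3, n_consec=3):
    """Return the index at which a slowdown is confirmed, or None.

    Hypothetical logic: the first `normal_number` samples form the baseline.
    A later sample is anomalous if it falls below both
    mean - n_sigma * std and mean * (1 - degradation_percentage).
    """
    if len(throughputs) <= normal_number:
        return None  # not enough data to build a baseline and test against it
    baseline = throughputs[:normal_number]
    mu = mean(baseline)
    sigma = pstdev(baseline)
    limit = min(mu - n_sigma * sigma, mu * (1 - degradation_percentage))

    consecutive = 0
    for i in range(normal_number, len(throughputs)):
        if throughputs[i] < limit:
            consecutive += 1
            if consecutive >= n_consec:
                return i  # enough consecutive anomalies: report a slowdown
        else:
            consecutive = 0  # a normal sample resets the streak
    return None
```

With a flat baseline of 100 followed by three consecutive samples of 60, the function reports the third low sample. Isolated dips that recover immediately reset the counter and are not reported.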

Usage Example

To start a slow node diagnosis task:

  1. In the training script, add a call inside the training iteration that obtains parallel domain information. The following uses the PyTorch-MindSpeed scenario as an example: add the m_iter counter and the dump_group_info() call to the train function in the ./mindspeed_llm/training/training.py file.
    def train(forward_step_func, model, optimizer, opt_param_scheduler,
              train_data_iterator, valid_data_iterator,
              process_non_loss_data_func, config):
        ……
        if is_profile_enabled():
            prof = get_profiler()
            prof.start()
        m_iter = 0  # Added: counts the iterations completed in this run.
        while iteration < args.train_iters:
            ……
            args.curr_iteration = iteration
            loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \
                train_step(forward_step_func,
                           train_data_iterator,
                           model,
                           optimizer,
                           opt_param_scheduler,
                           config)
            iteration += 1
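            m_iter += 1  # Added: advance the counter after each training step.
            if m_iter == 5:
                # Added: dump the parallel domain information once, at the fifth iteration.
                from taskd.python.adaptor.pytorch.group_info import dump_group_info
                dump_group_info()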
            batch_size = mpu.get_data_parallel_world_size() * \
                         args.micro_batch_size * \
                         get_num_microbatches()
    
  2. Complete operations in Before You Start and Deployment Mode.
  3. Run the kubectl apply -f ajob-2pod-16npu.yaml command to create a slow node diagnosis task. The task configuration is written to a ConfigMap.

  4. Check the content of ajob-2pod-16npu.yaml. For details about the fields, see Table 1.

    The following YAML example is for reference only. Adapt it to your environment before use.

    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ras-feature-slownode-default-test-pytorch-2pod-16npu    # The value of JobName must be the same as the name attribute of the following job. The prefix ras-feature-slownode- cannot be modified.
      namespace: mindx-dl
      labels:
        fd-ol-slow-node: "true"
    data:
      FeatConf: |
        {"jobName":"default-test-pytorch-2pod-16npu","jobNamespace":"default","normalNumber":20,"nSigma":3,"degradationPercentage":0.3,"nConsecAnomaliesSignifySlow":3,"nSecondsDoOneDetection":30,"clusterMeanDistance":1.3,"cardOneNode":16,"SlowNode":1}
    ---
    Table 1 Field description of the YAML file

    Field                        Default Value  Description
    ---------------------------  -------------  -----------
    jobNamespace                 default        Job namespace.
    jobName                      -              Job name.
    normalNumber                 20             Initial computing threshold.
    nSigma                       3              Number of sigmas used to calculate the upper and lower thresholds.
    degradationPercentage        0.3            Deterioration rate. A value of 0.3 represents a 30% performance drop.
    nConsecAnomaliesSignifySlow  3              Number of consecutive anomalies required before a node is reported as slow.
    nSecondsDoOneDetection       30             Detection interval, in seconds.
    clusterMeanDistance          1.3            Threshold for the distance between the means (mean1 and mean2) of the two clusters after clustering.
    cardOneNode                  16             Number of cards on a node.
    slowNode                     1              Whether to enable the task. 1: enabled. 0: disabled.
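A ConfigMap like the one in the YAML example can also be generated by a script, which keeps the ConfigMap name and jobName in sync. The following is a minimal sketch that uses only the Python standard library and emits a JSON manifest, which kubectl accepts in place of YAML. The helper name build_slow_node_configmap, the rule that the name is the fixed prefix followed by jobName (inferred from the example), and the validation checks are assumptions. The keys and default values are copied from the YAML example.

```python
import json

NAME_PREFIX = "ras-feature-slownode-"  # fixed prefix from the YAML example


def build_slow_node_configmap(job_name, job_namespace="default", enable=True,
                              **overrides):
    """Build the ConfigMap as a dict (hypothetical helper, not a product API)."""
    feat_conf = {
        "jobName": job_name,
        "jobNamespace": job_namespace,
        "normalNumber": 20,
        "nSigma": 3,
        "degradationPercentage": 0.3,
        "nConsecAnomaliesSignifySlow": 3,
        "nSecondsDoOneDetection": 30,
        "clusterMeanDistance": 1.3,
        "cardOneNode": 16,
        "SlowNode": 1 if enable else 0,  # key spelled as in the YAML example
    }
    unknown = set(overrides) - set(feat_conf)
    if unknown:
        raise ValueError(f"unknown FeatConf keys: {sorted(unknown)}")
    feat_conf.update(overrides)
    # Assumed sanity check: the rate is a fraction, not a percentage.
    if not 0 < feat_conf["degradationPercentage"] < 1:
        raise ValueError("degradationPercentage must be between 0 and 1")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            # Assumed naming rule, following the example: prefix + jobName.
            "name": NAME_PREFIX + job_name,
            "namespace": "mindx-dl",
            "labels": {"fd-ol-slow-node": "true"},
        },
        # FeatConf is stored as a compact JSON string, as in the example.
        "data": {"FeatConf": json.dumps(feat_conf, separators=(",", ":"))},
    }


if __name__ == "__main__":
    manifest = build_slow_node_configmap("default-test-pytorch-2pod-16npu")
    print(json.dumps(manifest, indent=2))
```

Redirect the printed manifest to a file and pass that file to kubectl apply -f. Individual fields can be overridden by keyword, for example build_slow_node_configmap("my-job", nSigma=4).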

Querying Slow Node Diagnosis Results

After creating a slow node diagnosis task, you can query the logs of ClusterD and NodeD to view the task details.

Method 1: Querying Slow Node Diagnosis Logs in a Cluster Using Kubernetes Logs

  1. Run the kubectl get pods -n mindx-dl command to obtain the names of the ClusterD and NodeD pods.

  2. Run the kubectl logs -n mindx-dl clusterd-7d5db546d8-kdslz | grep "got degradation, slow rank" command to query the ClusterD log. Replace clusterd-7d5db546d8-kdslz with the actual name of the ClusterD pod.
  3. Check the output. A line containing "got degradation, slow rank" indicates that node performance has deteriorated.

Method 2: Querying Slow Node Diagnosis Logs in a Cluster Using Flushed Logs

  1. Run the cat /var/log/mindx-dl/clusterd/clusterd.log | grep "got degradation, slow rank" command to query the log data.
  2. Check the output. A line containing "got degradation, slow rank" indicates that node performance has deteriorated.
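The same filtering can be scripted, for example to run the check periodically. The sketch below returns every line that contains one of the marker strings used in the grep commands in this section. The function name is hypothetical, and the sample lines in the usage example are placeholders, not real ClusterD or NodeD output.

```python
import io

# Marker strings taken from the grep commands in this section.
MARKERS = ("got degradation, slow rank", "is degradation")


def find_degradation_lines(lines, markers=MARKERS):
    """Return the lines (without trailing newline) that contain any marker."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line for marker in markers)]


# Usage with placeholder lines; in practice, pass an open log file instead.
sample = io.StringIO(
    "placeholder line without a marker\n"
    "placeholder prefix got degradation, slow rank placeholder suffix\n"
)
hits = find_degradation_lines(sample)
print(len(hits))
```

Because the function accepts any iterable of lines, it works equally on an open file object and on the captured output of kubectl logs.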

Method 3: Querying Slow Node Diagnosis Logs on a Node

Run the kubectl logs -n mindx-dl node-9ld8k | grep "is degradation" command to query the NodeD log. Replace node-9ld8k with the actual name of the NodeD pod on the node. A line containing "is degradation" indicates that node performance has deteriorated.

Known Slow Node Faults

Fault Code  Fault Description                         Fault Level
----------  ----------------------------------------  --------------
110001010   Slow node fault (single-event reporting)  SubHealthFault
100001011   Deterioration rectified                   NotHandleFault