MindCluster Adaptation

  1. Obtain the MindCluster code.
    mkdir -p /data/atlas_dls/public/code
    cd /data/atlas_dls/public/code
    git clone https://gitcode.com/Ascend/mind-cluster.git
    cd ./mind-cluster/component/clusterd
    git checkout v7.3.0   # v7.3.0 is a release tag of the repository. Replace it with the tag of your target version.
  2. Modify the ClusterD code.
    1. Open the pkg/application/faultmanager/jobprocess/faultrank/job_fault_rank_processor.go file.
      vi pkg/application/faultmanager/jobprocess/faultrank/job_fault_rank_processor.go
    2. Press i to enter insert mode and add the collector import and the collector.ReportInfoCollector.ReportRetryInfo call shown below:
      package faultrank
      
      import (
      ...
          "clusterd/pkg/domain/faultdomain/collector"
      ...
      )
      ...
      func (processor *jobRankFaultInfoProcessor) findFaultRankForJob(
      ...
              if deviceDetail, ok := processor.retryInBusinessPlane(podInfo.jobId, nodeName, deviceName); ok {
      			faultRankList = append(faultRankList, constant.FaultRank{RankId: deviceInfo.RankID, PodUid: podUid,
      				PodRank: podRankStr, FaultCode: faultdomain.GetRetryCodeByFaultType(deviceDetail.FaultType),
      				FaultLevel:  constant.RestartBusiness,
      				DoStepRetry: processor.canDoStepRetry(podInfo.jobId, nodeName, deviceName),
      				DeviceId:    deviceInfo.DeviceID,
      			})
      			collector.ReportInfoCollector.ReportRetryInfo(podInfo.jobId, deviceInfo.RankID, constant.JobNotRecover, constant.UceFaultType) // Set the service-plane fault time to an invalid time to prevent a single fault from repeatedly triggering process-level online recovery.
      		}
      ...
    3. Press Esc, type :wq!, and press Enter to save the changes and exit.
  3. Compile ClusterD.
    cd ./build/
    chmod +x build.sh && dos2unix build.sh
    sed -i 's|build_version="v[^"]\+"|build_version="xxx"|g' build.sh  # Replace xxx with the version number, for example, v7.3.0.
    sed -i 's|export CGO_ENABLED=0|export CGO_ENABLED=1|g' build.sh  # Enable CGO.
    ./build.sh  # Build ClusterD. Go 1.21 or later is required; Go 1.21 is recommended.
    After the build succeeds, the generated files are in the ../output/ directory. Run the following command to view them:
    ll ../output/
    Command output:
    -r-x------. 1 root root 45891128 Aug 13 10:52 clusterd
    -r--------. 1 root root     4021 Aug 13 10:52 clusterd-v7.3.0.yaml
    -r--------. 1 root root      946 Aug 13 10:52 Dockerfile
    -r--------. 1 root root      209 Aug 13 10:52 faultDuration.json
    -r--------. 1 root root      207 Aug 13 10:52 fdConfig.yaml
    -r--------. 1 root root      467 Aug 13 10:52 publicFaultConfiguration.json
    -r--------. 1 root root      756 Aug 13 10:52 relationFaultCustomization.json
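    Before running build.sh, it can help to confirm that the local Go toolchain meets the 1.21 minimum stated in this step. The following is a minimal sketch and is not part of the MindCluster repository: the helper names version_ge and go_version_from are hypothetical, and the comparison assumes GNU sort with the -V (version sort) option.

    ```shell
    # Return success (0) when version $1 is greater than or equal to version $2.
    # Assumption: GNU sort is available, because -V (version sort) is used.
    version_ge() {
        [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
    }

    # Extract the numeric version from `go version` output,
    # for example "go version go1.21.5 linux/amd64".
    go_version_from() {
        printf '%s\n' "$1" | sed -n 's/^go version go\([0-9][0-9.]*\).*/\1/p'
    }

    # Check the local toolchain against the documented minimum of 1.21.
    if command -v go >/dev/null 2>&1; then
        current="$(go_version_from "$(go version)")"
        if version_ge "$current" "1.21"; then
            echo "Go $current satisfies the 1.21 minimum"
        else
            echo "Go $current is older than 1.21; upgrade before running build.sh" >&2
        fi
    else
        echo "go not found in PATH" >&2
    fi
    ```

    The check only reports the result; it does not stop the build, so run it manually before ./build.sh.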
  4. Go to the output directory and create the ClusterD image.
    cd ../output/
    docker build --no-cache -t clusterd:{tag} ./  # {tag} must match the build_version value set in step 3.
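    To avoid a mismatch between {tag} and build_version, the tag can be read back from build.sh instead of being typed by hand. The following is a minimal sketch, assuming build.sh contains a single-line assignment of the form build_version="..." (as the sed command in step 3 implies); read_build_version is a hypothetical helper, not part of the repository.

    ```shell
    # Print the value of the first build_version="..." assignment in the given script.
    read_build_version() {
        sed -n 's/.*build_version="\([^"]*\)".*/\1/p' "$1" | head -n 1
    }

    # When run from the output directory, print the docker build command
    # with the tag taken from ../build/build.sh (the command is not executed).
    if [ -f ../build/build.sh ]; then
        tag="$(read_build_version ../build/build.sh)"
        echo "docker build --no-cache -t clusterd:${tag} ./"
    fi
    ```

    The same value can be reused for the clusterd-{tag}.yaml file name in the later steps.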
  5. (Optional) Save the image, then upload the saved image file and the clusterd-{tag}.yaml file to the master node. Skip this step if you performed steps 1 to 4 on the master node.
    docker save -o clusterd.tar clusterd:{tag}  # Save the image.
    docker load -i clusterd.tar  # Import the image on the master node.
  6. Restart ClusterD on the master node.
    kubectl delete -f clusterd-{tag}.yaml  # Delete the old ClusterD container.
    kubectl apply -f clusterd-{tag}.yaml  # Start the new container.