Configuring Graceful Fault Tolerance

This feature is deprecated. It will not be supported in PyTorch versions later than 7.2.RC1 or in MindSpore versions later than 7.1.RC1.

This section describes how to configure graceful fault tolerance. For details about its features, restrictions, supported products, and working principles, see (Optional) Graceful Fault Tolerance.

Building an Image

Build the container image from a Dockerfile, and add the following command to the Dockerfile.

# MindCluster resumable training adaptation script. MINDIO_TTP_PKG is the path of the MindIO whl installation package. Set it as required.
RUN pip3 install $MINDIO_TTP_PKG 
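
A wrong or unset MINDIO_TTP_PKG only shows up as a pip3 error during the image build, so a check before the build can help. The following is a minimal sketch, not part of MindCluster. The helper name check_whl_path is hypothetical, and the sketch assumes only that MINDIO_TTP_PKG points to a .whl file, as stated above. Run it wherever that path is visible.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the argument is an existing file ending in .whl.
check_whl_path() {
    case "$1" in
        *.whl) [ -f "$1" ] ;;
        *)     return 1 ;;
    esac
}

# Report whether MINDIO_TTP_PKG looks usable before the image is built.
if check_whl_path "${MINDIO_TTP_PKG:-}"; then
    echo "MindIO package found: ${MINDIO_TTP_PKG}"
else
    echo "MINDIO_TTP_PKG is unset or does not point to a .whl file" >&2
fi
```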

Adapting the Training Script

Add the following environment variable to the shell script that starts training, for example, train_start.sh.
...
export MS_ENABLE_TFT='{RSC:1}'      # Enable graceful fault tolerance in MindSpore scenarios.
...
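
As a minimal sketch of where the export might sit in train_start.sh, the variable can be set and echoed to the job log before the launch command runs. The function name enable_graceful_fault_tolerance is hypothetical. Only MS_ENABLE_TFT and its value come from this section.

```shell
#!/bin/bash
# Hypothetical helper; only MS_ENABLE_TFT and the value '{RSC:1}' come from this section.
enable_graceful_fault_tolerance() {
    export MS_ENABLE_TFT='{RSC:1}'
    # Echo the setting so that it is visible in the job log.
    echo "MS_ENABLE_TFT=${MS_ENABLE_TFT}"
}

enable_graceful_fault_tolerance
# The existing training launch command of train_start.sh would follow here.
```

Exporting the variable before the launch command makes it part of the environment that the training processes inherit.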

Configuring the Startup YAML File

In the startup YAML file of Ascend Device Plugin, set -hotReset to 1. This enables hot reset and, with it, the graceful fault tolerance mode.

Note: Graceful fault tolerance, process-level rescheduling, and process-level online recovery cannot be enabled at the same time.

...
      containers:
      - image: ascend-k8sdeviceplugin:v{version}
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin  
                 -useAscendDocker=true 
                 -volcanoType=true                    # Volcano must be used in the rescheduling scenario.
                 -autoStowing=true                    # Whether to enable automatic management. The default value is true. If this parameter is set to false, a processor is not automatically returned to the schedulable resource pool after its health status changes from unhealthy to healthy or after a network fault on its parameter plane is recovered. This parameter applies only to Atlas training products.
                 -listWatchPeriod=5                   # Health status check period, in seconds. The value range is [3, 1800].
                 -hotReset=1                          # Enable hot reset on top of job-level or pod-level rescheduling when resumable training is used. This enables graceful fault tolerance.
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                 -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
...
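
Before applying the manifest, it can be worth confirming that the flag is really set to 1. The following is a minimal sketch that only searches the file text for -hotReset=1 and does not parse YAML. The helper name has_hot_reset_enabled and the temporary sample manifest are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the given file contains the flag -hotReset=1.
# The trailing group rejects longer values such as -hotReset=10.
has_hot_reset_enabled() {
    grep -Eq -- '-hotReset=1([^0-9]|$)' "$1"
}

# Usage example against a small sample file that mimics the args section above.
manifest="$(mktemp)"
printf '%s\n' \
    '        args: [ "device-plugin' \
    '                 -hotReset=1' \
    '                 -logLevel=0" ]' > "$manifest"

if has_hot_reset_enabled "$manifest"; then
    echo "hot reset enabled"
else
    echo "hot reset not enabled"
fi
rm -f "$manifest"
```

In practice, the path of the real Ascend Device Plugin startup YAML file would be passed instead of the sample file.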