Single-Server Single-Device and Single-Server Multi-Device Training

This section describes how to configure resource information through environment variables to start training tasks in single-server single-device and single-server multi-device scenarios.

Prerequisites

To start a training task using this method, the following environment variables must be configured. For details, refer to the little-demo configuration file or Environment Variable Configuration.

CM_CHIEF_IP={host_ip}
CM_CHIEF_PORT=60000
CM_CHIEF_DEVICE=0
CM_WORKER_IP={host_ip}
CM_WORKER_SIZE=8
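
Before launching, it can help to confirm that all five variables are set in the current shell. The following is a minimal sketch, not part of the SDK; the function name check_cm_env is hypothetical.

```shell
# Hypothetical helper (not part of the SDK): report every required
# variable that is unset or empty, and return non-zero if any is missing.
check_cm_env() {
    missing=0
    for name in CM_CHIEF_IP CM_CHIEF_PORT CM_CHIEF_DEVICE CM_WORKER_IP CM_WORKER_SIZE; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, calling check_cm_env || exit 1 near the top of a launch script stops the script early when a variable has not been exported.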

Procedure

  1. Optional: To customize the number of devices, set local_rank_size to the custom number of devices and set CM_WORKER_SIZE to local_rank_size × number of training nodes.
    • The custom number of devices must not exceed the number of devices visible in the container (or process).
    • If the container is started in privileged mode, it sees all devices in the environment. If the container is started in non-privileged mode, it sees only the devices that were mounted when the container was started.
    • To further restrict the devices visible to the training process, set the environment variable ASCEND_RT_VISIBLE_DEVICES to the subset of the container's visible devices that the current process may use. Generally, this environment variable lists the devices that participate in training.
    • If the number of devices visible to the current process is 8, the custom number of devices can be 1, 2, 4, or 8.
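
    The relationship between local_rank_size and CM_WORKER_SIZE can be sketched as follows. This is only an illustration under stated assumptions: the values are examples, and the names node_count and visible_count are hypothetical and do not come from the SDK.

    ```shell
    local_rank_size=4   # custom number of devices (example value)
    node_count=1        # hypothetical name; single-server training has one node
    visible_count=8     # devices visible to the process (example value)

    # The custom number of devices must not exceed the visible devices.
    if [ "$local_rank_size" -gt "$visible_count" ]; then
        echo "local_rank_size exceeds the number of visible devices" >&2
    else
        # CM_WORKER_SIZE = local_rank_size x number of training nodes
        export CM_WORKER_SIZE=$((local_rank_size * node_count))
        echo "CM_WORKER_SIZE=${CM_WORKER_SIZE}"
    fi
    ```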

    The following example assumes eight devices with the logical device ID list [0,1,2,3,4,5,6,7].

    • If local_rank_size = 1, set CM_CHIEF_DEVICE to any ID from [0-7].
    • local_rank_size = 2:

      Contiguous allocation

      export CM_CHIEF_DEVICE=0

      Uses devices 0 and 1.

      export CM_CHIEF_DEVICE=4

      Uses devices 4 and 5.

      Non-contiguous allocation

      export ASCEND_RT_VISIBLE_DEVICES=1,3,5,7

      Set this if the container was created with contiguous devices but non-contiguous access is required.

      In this case, the devices visible to the process are [1, 3, 5, 7], and CM_CHIEF_DEVICE selects a position in this list, counting from 0.

      export CM_CHIEF_DEVICE=0

      Uses devices 1 and 3.

      export CM_CHIEF_DEVICE=1

      Uses devices 3 and 5.

    • local_rank_size = 4:
      • export CM_CHIEF_DEVICE=0 (Uses devices 0, 1, 2, and 3.)
      • export CM_CHIEF_DEVICE=4 (Uses devices 4, 5, 6, and 7.)
    • local_rank_size = 8:

      export CM_CHIEF_DEVICE=0 (Uses devices 0, 1, 2, 3, 4, 5, 6, and 7.)
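
    The allocation pattern in the examples above can be reproduced with a small shell function. This is a sketch inferred from those examples, not SDK code: it assumes that CM_CHIEF_DEVICE is a zero-based position in the visible device list and that the job takes local_rank_size consecutive entries from that position. The function name devices_used is hypothetical.

    ```shell
    # Hypothetical helper: print the devices a job would use.
    # $1 = comma-separated visible devices
    # $2 = CM_CHIEF_DEVICE (zero-based position in the visible list)
    # $3 = local_rank_size (number of consecutive entries to take)
    devices_used() {
        echo "$1" | tr ',' '\n' | tail -n +"$(($2 + 1))" | head -n "$3" | paste -sd, -
    }
    ```

    For example, devices_used 1,3,5,7 1 2 prints 3,5, which matches the non-contiguous case with CM_CHIEF_DEVICE=1.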

    Rec SDK TensorFlow defaults to 8-device training. To enable 16-device training, run the following:

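    # Disable PCIe ACS on every device that exposes an ACSCtl register (requires root privileges).
    for pdev in $(lspci -vvv | grep -E "^[a-f]|^[0-9]|ACSCtl" | grep ACSCtl -B1 | grep -E "^[a-f]|^[0-9]" | awk '{print $1}')
    do
        setpci -s "$pdev" ECAP_ACS+06.w=0000
    done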
    
  2. Append the listening IP address of the master node's host to the startup command. The command format is as follows:
    bash run.sh main.py {host_ip}  
    • The training job is started from the environment variable settings only if a valid, available IP address is passed in the startup command.
    • If a resource configuration file exists and no IP address (or an invalid IP address) is provided, the task starts from the configuration file instead.
    • If no resource configuration file exists and no valid IP address is provided, the system returns: "the rank table file does not exist." In this case, create the resource configuration file or provide a valid IP address, then restart the training task.
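
    The three cases above amount to a simple precedence rule, sketched below. This is an illustration of the described behavior only, not the actual logic of run.sh; the function name select_launch_mode and its "yes"/"no" arguments are hypothetical.

    ```shell
    # Hypothetical sketch of the selection rule described above.
    # $1 = "yes" if a valid, available IP address was passed
    # $2 = "yes" if a resource configuration file exists
    select_launch_mode() {
        if [ "$1" = "yes" ]; then
            echo "environment variables"
        elif [ "$2" = "yes" ]; then
            echo "configuration file"
        else
            echo "the rank table file does not exist." >&2
            return 1
        fi
    }
    ```

    A valid IP address takes precedence in this sketch, so the configuration file is consulted only when no usable address is given.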

    If the script is successfully executed, the following information is displayed:

    ip: {host_ip} available.
    The ranktable solution is removed.
    CM_CHIEF_IP={host_ip}  
    CM_CHIEF_PORT=60000
    CM_CHIEF_DEVICE=0
    CM_WORKER_IP={host_ip}  
    CM_WORKER_SIZE=8
    ASCEND_VISIBLE_DEVICES=0-8
    py is main.py
    use horovod to start tasks
    ...

    After the execution is complete, the following log information is displayed:

    ASC manager has been destroyed.
    MPI has been destroyed.
    Demo done!