Building the Rec SDK Torch Training Image

Building a Training Image Based on Debian 12

Build the image by following the Dockerfile and README in Basic Image Creation.

Container Startup

#!/bin/bash
# Usage: pass the container name as $1 and the image name as $2.
container_name=$1
image_name=$2
# Variables are quoted so that names containing spaces or glob characters are passed intact.
docker run \
-it \
--name "${container_name}" \
--shm-size="300g" \
-m 300g \
-v /etc/localtime:/etc/localtime:ro \
-e ASCEND_VISIBLE_DEVICES=0-7 \
-v /etc/ascend_install.info:/etc/ascend_install.info:ro \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
"${image_name}" \
/bin/bash

Some parameters are described as follows:

  • -m 300g limits the memory available to the container to 300 GB. Adjust this value to suit your environment.
  • -e ASCEND_VISIBLE_DEVICES=0-7 mounts the NPU devices numbered device0 to device7 on the server to the container. Adjust the range to suit your environment.
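
To choose a value for ASCEND_VISIBLE_DEVICES, it can help to check which NPU device nodes the host actually exposes. The following is a minimal sketch, not part of the Rec SDK tooling: the function name npu_device_range is hypothetical, and it assumes that device nodes are named davinci<N> (for example /dev/davinci0) and numbered contiguously.

```shell
# Hypothetical helper: print the "first-last" range of NPU device nodes in a directory.
# Assumption: device nodes are named davinci<N>; other entries are ignored.
npu_device_range() {
    local dev_dir="${1:-/dev}"
    local ids
    # Keep only the numeric suffix of entries named exactly davinci<N>, sorted numerically.
    ids=$(ls "${dev_dir}" 2>/dev/null | sed -n 's/^davinci\([0-9][0-9]*\)$/\1/p' | sort -n)
    if [ -z "${ids}" ]; then
        echo "no davinci devices found in ${dev_dir}" >&2
        return 1
    fi
    # Print the lowest and highest device number as a range.
    echo "$(echo "${ids}" | head -n 1)-$(echo "${ids}" | tail -n 1)"
}
```

The result could then be passed to the startup script, for example as -e ASCEND_VISIBLE_DEVICES="$(npu_device_range)". If the device numbers on the host are not contiguous, the printed range includes devices that do not exist, so check the output before using it.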

Installing the Rec SDK Torch

  1. Obtain the Rec SDK Torch software package by referring to Rec SDK Torch Software Package Download.
  2. Copy the software package to the container. Use either of the following methods:
    • When starting a container, specify a host directory to be mounted to the container and place the downloaded software package in the directory so that the container can access the software package.
      Example of the docker parameter for mounting a host directory to a container:
      -v /dir1:/dir1
    • On the host machine, run the docker cp command to copy the software package to the container.
      Example of the docker cp command:
      docker cp host_file_path container_name:container_file_path

      In the preceding command, host_file_path indicates the file path on the host machine, container_name indicates the name of the Docker container to which the file is to be copied, and container_file_path indicates the file path in the Docker container to which the file is to be copied.
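
The docker cp method can be wrapped in a small helper that checks the host file before copying. This is a sketch under stated assumptions: the function name copy_to_container is hypothetical, and the target directory is assumed to already exist inside the container.

```shell
# Hypothetical wrapper around `docker cp`.
# Arguments: <host_file_path> <container_name> <container_dir>
copy_to_container() {
    local host_file="$1"
    local container_name="$2"
    local container_dir="$3"
    # Fail early if the package is not where we expect it on the host.
    if [ ! -f "${host_file}" ]; then
        echo "host file not found: ${host_file}" >&2
        return 1
    fi
    # Assumption: container_dir already exists in the container.
    docker cp "${host_file}" "${container_name}:${container_dir}/"
}
```

For example, copy_to_container ./package.tar.gz my_container /opt would copy a file named package.tar.gz (a placeholder name) into /opt of the container my_container.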

  3. Perform the following steps to compile and install the package:
    1. Install the TorchRec Ascend registration package.

      The TorchRec Ascend registration package adapts TorchRec to NPU devices and is based on the TorchRec source code. You can compile it from the patch file provided by Rec SDK Torch and the fixed branch of the TorchRec source code.

      For details about source code compilation and installation, see README. (The compilation procedure for this package is the same for all PyTorch versions.)

    2. Install the Ascend-mindxsdk-hybrid-torchrec-*-linux-*.tar.gz package.
      # Install Ascend-mindxsdk-hybrid-torchrec-*-linux-*.tar.gz.
      tar zxvf Ascend-mindxsdk-hybrid-torchrec-*-linux-*.tar.gz
      pip3 install hybrid_torchrec-*-py3-none-linux_*.whl
      pip3 install torchrec_embcache-*-py3-none-linux_*.whl
    3. Install the operator package.

      Download the repository source code from https://gitcode.com/Ascend/RecSDK, go to the source code directory, and run the following commands to compile and install the operator package (operator package compilation is the same for all PyTorch versions):

      # Note: Before compiling an operator, load the CANN environment variables. If the CANN package is installed in the default path, run the following command:
      source /usr/local/Ascend/cann/set_env.sh
      unset ASCEND_CUSTOM_OPP_PATH
       
      cd cust_op/ascendc_op/build
      # Note: The compilation of some operators depends on external components. Download the dependencies by referring to the README file in the build folder. Otherwise, the compilation fails.
      # Compile the operator package. (The package will be automatically installed during compilation. If only some of the packages need to be installed, you can compile them in other containers and then copy them to the current environment for installation.)
      bash build_ai_core_op.sh v220
       
      # Optional: Install a specified operator package.
      # Method 1: Install operator packages in batches. The following command installs all operator packages whose names do not end with 310p.run or A3.run. Modify the matching keywords to suit your device environment.
      find . -name "*.run" ! -name "*310p.run" ! -name "*A3.run" -exec bash {} \;
      # Method 2: Select the required operator package for installation. The following is an example:
      bash mxrec_opp_split_embedding_codegen_forward_unweighted.run
       
      # Install the operator adaptation layer libfbgemm_npu_api.so.
      cd ../../../../framework/torch_plugin/torch_library/common/
      bash build_ops.sh
      Table 1 describes the options available when installing an operator package.
      Table 1 Parameters

      Option                   Description
      --help | -h              Queries help information.
      --info                   Queries information about the installation package.
      --list                   Queries the file list of the installation package.
      --check                  Checks the integrity of the compressed package.
      --quiet                  Performs a silent installation.
      --nox11                  Does not start the xterm terminal.
      --noexec                 Does not execute the embedded installation script.
      --extract=<path>         Decompresses the package to the target directory. Usually used together with --noexec to decompress the package without running the script.
      --tar arg1 [arg2 ...]    Runs the tar command to access the package content.
      --install-path           Installs the package to a specified directory.

    After the operator packages are installed, folders such as split_embedding_codegen_forward_unweighted, backward_codegen_adagrad_unweighted_exact, asynchronous_complete_cumsum, and permute2d_sparse_data are generated in the /usr/local/Ascend/cann/opp/vendors/ directory. If these folders are missing, run unset ASCEND_CUSTOM_OPP_PATH to clear the environment variable, and then reinstall the operator packages.
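
The folder check described above can be scripted. The following is a minimal sketch: the function name check_operator_folders is hypothetical, and the folder names passed to it are the examples listed above, so the full set on your system may differ.

```shell
# Hypothetical helper: report which expected operator folders exist in the vendors directory.
# Arguments: <vendors_dir> <folder_name>...
# Returns 0 if every folder exists, 1 if at least one is missing.
check_operator_folders() {
    local vendors_dir="$1"
    shift
    local missing=0
    local name
    for name in "$@"; do
        if [ -d "${vendors_dir}/${name}" ]; then
            echo "found:   ${name}"
        else
            echo "missing: ${name}"
            missing=1
        fi
    done
    return "${missing}"
}

# Example call (default CANN path from this guide):
# check_operator_folders /usr/local/Ascend/cann/opp/vendors \
#     split_embedding_codegen_forward_unweighted \
#     backward_codegen_adagrad_unweighted_exact \
#     asynchronous_complete_cumsum \
#     permute2d_sparse_data
```

A non-zero return value suggests that at least one operator package was not installed to the expected location. In that case, clear ASCEND_CUSTOM_OPP_PATH and reinstall as described above.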