Description of Command-Line Options

This section describes the command-line options of the HCCL Performance Tester.

Command

  • In the MPICH installation scenario:
    mpirun [-f <hostfile>] -n <number> ./bin/<executable_file> [-p <npus>] [-b <minbytes>] [-e <maxbytes>] [-i <incsize>] [-f <incfactor>] [-o <operator>] [-r <root>] [-d <datatype>] [-z <0/1>] [-n <iters_count>] [-w <warmup_iters_count>] [-c <0/1>]
  • In the Open MPI installation scenario:
    mpirun [--prefix <mpi_install_path>] [-hostfile <hostfile>] -n <number> -x <env> [--allow-run-as-root] [--mca <key value>] ./bin/<executable_file> [-p <npus>] [-b <minbytes>] [-e <maxbytes>] [-i <incsize>] [-f <incfactor>] [-o <operator>] [-r <root>] [-d <datatype>] [-z <0/1>] [-n <iters_count>] [-w <warmup_iters_count>] [-c <0/1>]
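
Both command forms reuse the option letters -f and -n: before ./bin/<executable_file> they are mpirun options, and after it they are HCCL Performance Tester options. The following minimal Python sketch assembles an MPICH-style command to make that split visible. The helper name mpich_command, the hostfile path, and all option values are illustrative assumptions, not defaults of the tool.

```python
import shlex
from typing import List, Optional, Sequence


def mpich_command(total_npus: int,
                  executable: str,
                  hostfile: Optional[str] = None,
                  tester_args: Sequence[str] = ()) -> List[str]:
    """Build an argv list that follows the MPICH synopsis (hypothetical helper)."""
    argv = ["mpirun"]
    if hostfile is not None:
        argv += ["-f", hostfile]        # mpirun option: hostfile
    argv += ["-n", str(total_npus)]     # mpirun option: total number of NPUs
    argv.append(f"./bin/{executable}")
    argv += list(tester_args)           # arguments after the executable go to the tester
    return argv


# Example values only: 16 NPUs in total, 8 per node, 100 MB to 400 MB with factor 2.
cmd = mpich_command(
    16, "all_reduce_test", hostfile="hostfile",
    tester_args=["-p", "8", "-b", "100M", "-e", "400M", "-f", "2", "-n", "20"],
)
print(shlex.join(cmd))
```

Keeping the command as a list until the last moment avoids quoting mistakes when a path contains spaces.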

MPICH Command-Line Options

Only common MPICH options are listed below. For more options, see the official MPICH documentation.

Table 1 MPICH options

Option

Optional/Required

Description

-f <hostfile>

Optional

Path to the hostfile, which lists the participating nodes.

Configure this file in multi-server scenarios. Set this option to either the absolute path of the hostfile or a path relative to the directory where the command is executed.

For details about the configuration example, see 4.

-n <number>

Required

Total number of NPUs to start, that is, the number of nodes × the number of NPUs participating in training on each node.
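
As a quick check of this arithmetic, the sketch below computes the value of -n from a node count and a per-node NPU count. The function name and the 2 × 8 example are illustrative assumptions.

```python
def total_npus(nodes: int, npus_per_node: int) -> int:
    """Value for mpirun -n: nodes x NPUs participating on each node."""
    if nodes < 1 or npus_per_node < 1:
        raise ValueError("both counts must be positive")
    return nodes * npus_per_node


# Example: 2 nodes with 8 participating NPUs each.
print(total_npus(2, 8))  # 16
```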

Open MPI Command-Line Options

Only common Open MPI options are listed below. For more options, see the Open MPI documentation.

Table 2 Open MPI options

Option

Optional/Required

Description

--prefix <mpi_install_path>

Optional

Installation path of Open MPI.

This option is generally not needed in single-server scenarios. In multi-server scenarios it is required; otherwise, the MPI library files may not be found.

-hostfile <hostfile>

Optional

Path to the hostfile, which lists the participating nodes.

Configure this file in multi-server scenarios. Set this option to either the absolute path of the hostfile or a path relative to the directory where the command is executed.

For details about the configuration example, see 4.

-n <number>

Required

Total number of NPUs to start, that is, the number of nodes × the number of NPUs participating in training on each node.

-x <env>

Required

Name of an environment variable to pass to the remote nodes.

--allow-run-as-root

Optional

Allows the root user to run the mpirun command.

--mca <key value>

Optional

Open MPI is built around the Modular Component Architecture (MCA). You can pass --mca parameters to mpirun at run time to select and configure the Open MPI components that implement specific features.

Common settings:

  • --mca btl_tcp_if_include <nic_name>

    Use a specified NIC for inter-node communication. For example:

    --mca btl_tcp_if_include eth0
  • --mca opal_set_max_sys_limits 1

    Raise system resource limits (such as the number of file descriptors), as ulimit would, so that they do not constrain Open MPI. You are advised to configure these limits when a cluster has a large number of NICs, to prevent resources from running out.

HCCL Performance Tester Options

Table 3 HCCL Performance Tester options

Option

Optional/Required

Description

./bin/<executable_file>

Required

Command of the HCCL Performance Tester.

<executable_file> is the executable file of the HCCL Performance Tester, that is, one of the supported test commands:

  • For the Atlas A3 training products / Atlas A3 inference products, the supported test commands are all_gather_test, all_gatherv_test, all_reduce_test, alltoall_test, alltoallv_test, alltoallvc_test, broadcast_test, reduce_scatter_test, reduce_scatterv_test, reduce_test, and scatter_test.
  • For the Atlas A2 training products / Atlas A2 inference products, the supported test commands are all_gather_test, all_gatherv_test, all_reduce_test, alltoall_test, alltoallv_test, alltoallvc_test, broadcast_test, reduce_scatter_test, reduce_scatterv_test, reduce_test, and scatter_test.
  • For the Atlas training products, the supported test commands are all_gather_test, all_reduce_test, alltoallv_test, alltoall_test, broadcast_test, reduce_scatter_test, reduce_test, and scatter_test.
  • For the Atlas inference products, the supported test commands are all_gather_test, all_gatherv_test, all_reduce_test, alltoall_test, alltoallv_test, reduce_scatter_test, and reduce_scatterv_test.

Options supported by the collective communication performance test

-p <npus>

or --npus <npus>

Optional

Number of NPUs participating in training on a single compute node.

The default value is the total number of NPUs on the current node. If the number of NPUs involved in training on a single compute node is less than the total number of NPUs on the current node, this option is required.

Note: The HCCL Performance Tester launches the corresponding devices based on the configured number of NPUs used in training. For details about the configuration restrictions on the parameter, see Restrictions.

-b <minbytes>

or --minbytes <minbytes>

Optional

Test data size used for the collective communication operation. The -b, -e, -i, and -f options work together:
  • -b: start value of the test data size, that is, the minimum value. The default value is 64 MB. The unit is KB, MB, or GB.
  • -e: end value of the test data size, that is, the maximum value. The default value is 64 MB. The unit is KB, MB, or GB.
  • -i/-f: data increment type.
    • -i indicates the incremental step, in bytes. For example, if it is set to 100, the incremental step is 100 bytes. (Enter digits only after -i; do not append a unit.)
    • -f indicates the multiplication factor.

    By default, the incremental step mode (-i) is used. The default step size is calculated as follows: (End value of the test data size – Start value of the test data size)/10.

Notes:
  • If the value of -b is equal to that of -e, every iteration uses the same fixed data size.
  • If the value of -e is greater than that of -b, you need to set the data increment type, either -i or -f.
  • If the value of -i is 0, every iteration uses the start value of the test data size (that is, the data size defined by -b).
  • For some operators, the HCCL Performance Tester slightly adjusts the data sizes specified by -b, -e, and -i to match the address alignment or a multiple of the rank size, in order to achieve better performance.

Examples:

  • Configuration example: -b 100M -e 400M -i 0

    Every iteration uses the start value of the test data size, 100 MB.

  • Configuration example: -b 100M -e 400M -i 500

    The test starts at 100 MB and increases the data size by 500 bytes at each step until the test is complete.

  • Configuration example: -b 100M -e 400M -f 2

    The start value of the test data size is 100 MB, the end value is 400 MB, and the multiplication factor is 2. The test runs with 100 MB, 200 MB, and 400 MB of data in turn.

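-e <maxbytes>

or --maxbytes <maxbytes>

Optional

See the description of -b <minbytes>.

-i <incsize>

or --stepbytes <incsize>

Optional

See the description of -b <minbytes>.

-f <incfactor>

or --stepfactor <incfactor>

Optional

See the description of -b <minbytes>.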
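
The interaction of -b, -e, -i, and -f can be sketched in a few lines of Python. This is a minimal illustration of the rules described for these options, not the tester's implementation. It ignores the alignment adjustment, assumes that the end value is included in the sweep (as in the -f 2 example), assumes that M means binary megabytes, and rejects -i combined with -f because that case is not described. The name planned_sizes is hypothetical.

```python
from typing import List, Optional

MB = 1024 * 1024  # assumption: the M suffix is read as binary megabytes


def planned_sizes(min_bytes: int,
                  max_bytes: int,
                  step_bytes: Optional[int] = None,
                  step_factor: Optional[int] = None) -> List[int]:
    """List the data sizes (in bytes) implied by -b, -e and -i or -f."""
    if min_bytes <= 0:
        raise ValueError("-b must be positive")
    if max_bytes < min_bytes:
        raise ValueError("-e must not be smaller than -b")
    if step_bytes is not None and step_factor is not None:
        raise ValueError("this sketch accepts -i or -f, not both")
    # Fixed-size cases: -b equals -e, or -i is 0.
    if min_bytes == max_bytes or step_bytes == 0:
        return [min_bytes]

    sizes = []
    size = min_bytes
    if step_factor is not None:
        if step_factor < 2:
            raise ValueError("multiplication factor must be at least 2")
        while size <= max_bytes:
            sizes.append(size)
            size *= step_factor
        return sizes

    if step_bytes is None:
        # Documented default step: (end - start) / 10; kept at 1 byte or more here.
        step_bytes = max(1, (max_bytes - min_bytes) // 10)
    if step_bytes < 0:
        raise ValueError("step must not be negative")
    while size <= max_bytes:
        sizes.append(size)
        size += step_bytes
    return sizes


# -b 100M -e 400M -f 2
print([s // MB for s in planned_sizes(100 * MB, 400 * MB, step_factor=2)])  # [100, 200, 400]
# -b 100M -e 400M with the default step of (400 MB - 100 MB) / 10 = 30 MB
print(len(planned_sizes(100 * MB, 400 * MB)))  # 11
```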

-o <operator>

or --op <operator>

Optional

Reduction operation used by the reduce-related commands. The value can be sum, prod, max, or min. The default value is sum.

Reduce-related commands include all_reduce_test, reduce_scatter_test, reduce_scatterv_test, and reduce_test.

  • The reduce_scatterv_test command supports fewer operation types:
    • For the Atlas A3 training products / Atlas A3 inference products, the supported operation types are sum, max, and min.
    • For the Atlas A2 training products / Atlas A2 inference products, the supported operation types are sum, max, and min.
    • For the Atlas inference products, only the sum operation type is supported.

-r <root>

or --root <root>

Optional

When the broadcast_test, reduce_test, or scatter_test command is executed, you can use this option to specify the device ID of the root node.

Value range: [0, Actual number of devices – 1]

Default value: 0

-d <datatype>

or --datatype <datatype>

Optional

Data type used by the HCCL test command. The default type is fp32.

  • For the all_reduce_test, reduce_scatter_test, and reduce_test commands:
    • The Atlas A3 training products / Atlas A3 inference products support int8, int16, int32, int64, fp16, fp32, and bfp16. The prod operation does not support the int16 and bfp16 data types.
    • The Atlas A2 training products / Atlas A2 inference products support int8, int16, int32, int64, fp16, fp32, and bfp16. The prod operation does not support the int16 and bfp16 data types.
    • The Atlas training products support int8, int32, int64, fp16, and fp32.
    • The Atlas inference products support int8, int16, int32, fp16, and fp32. The prod, max, and min operations do not support the int16 data type.
  • The broadcast_test, all_gather_test, alltoallv_test, alltoallvc_test, alltoall_test, scatter_test, and all_gatherv_test commands support the following data types: int8, uint8, int16, uint16, int32, uint32, int64, uint64, fp16, fp32, fp64, and bfp16.

    Note: The bfp16 data type is supported only by the following products:
      • Atlas A3 training products / Atlas A3 inference products
      • Atlas A2 training products

  • For the reduce_scatterv_test command:
    • For the Atlas A3 training products / Atlas A3 inference products, the supported data types include int8, int16, int32, fp16, fp32, and bfp16.
    • For the Atlas A2 training products / Atlas A2 inference products, the supported data types include int8, int16, int32, fp16, fp32, and bfp16.
    • For the Atlas inference products, the supported data types include int16, fp16, and fp32.
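
The data type rules for the all_reduce_test, reduce_scatter_test, and reduce_test commands can be written down as a lookup table, which is handy when a script generates many -d/-o combinations. The sketch below only transcribes the lists given for those three commands. The product keys ("A3", "A2", "training", "inference") and the function name are hypothetical labels chosen for this example.

```python
# Data types listed for all_reduce_test, reduce_scatter_test and reduce_test.
REDUCE_DATATYPES = {
    "A3": {"int8", "int16", "int32", "int64", "fp16", "fp32", "bfp16"},
    "A2": {"int8", "int16", "int32", "int64", "fp16", "fp32", "bfp16"},
    "training": {"int8", "int32", "int64", "fp16", "fp32"},
    "inference": {"int8", "int16", "int32", "fp16", "fp32"},
}

# Data types that a given operation does not support on a given product.
EXCLUDED_BY_OP = {
    "A3": {"prod": {"int16", "bfp16"}},
    "A2": {"prod": {"int16", "bfp16"}},
    "training": {},
    "inference": {"prod": {"int16"}, "max": {"int16"}, "min": {"int16"}},
}


def reduce_combo_supported(product: str, datatype: str, op: str = "sum") -> bool:
    """Check a -d/-o pair against the lists above (illustrative helper)."""
    if datatype not in REDUCE_DATATYPES[product]:
        return False
    return datatype not in EXCLUDED_BY_OP[product].get(op, set())


print(reduce_combo_supported("A3", "bfp16", "sum"))   # True
print(reduce_combo_supported("A3", "bfp16", "prod"))  # False
```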

-z <0/1>

or --zero_copy <0/1>

Optional

Whether to enable the zero-copy function.

In single-operator mode, the input and output buffers change dynamically, so HCCL transfers data through intermediate buffers to complete collective communication, which introduces extra memory copy overhead. The zero-copy function reduces this overhead by operating directly on the memory passed in by the service, which improves performance.

Note: The zero-copy function is provided for trial use and may change in later versions. Therefore, it cannot be used in commercial products.

This option can be set to:
  • 0 (default): disabled.
  • 1: enabled.

The zero-copy function has the following restrictions:

  • It is supported only by the Atlas A3 training products / Atlas A3 inference products.
  • It can be used only with the reduce_scatter_test, all_gather_test, all_reduce_test, and broadcast_test commands.
  • It is supported only when the communication algorithm orchestration is expanded on the AI CPU.

    For details, see the environment variable HCCL_OP_EXPANSION_MODE.
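
A launch script could check these three restrictions before passing -z 1. The following is a minimal sketch of such a pre-check. The function name and its boolean inputs are assumptions for illustration; how a script determines the product type or the expansion mode is outside the scope of this example.

```python
from typing import List

# Commands listed above as usable with the zero-copy function.
ZERO_COPY_COMMANDS = {
    "reduce_scatter_test", "all_gather_test", "all_reduce_test", "broadcast_test",
}


def zero_copy_violations(is_atlas_a3: bool,
                         command: str,
                         ai_cpu_expansion: bool) -> List[str]:
    """Return the restrictions that a planned -z 1 run would violate."""
    problems = []
    if not is_atlas_a3:
        problems.append("product is not an Atlas A3 training/inference product")
    if command not in ZERO_COPY_COMMANDS:
        problems.append(f"{command} is not in the zero-copy command list")
    if not ai_cpu_expansion:
        problems.append("algorithm orchestration is not expanded on the AI CPU")
    return problems


# An empty list means that none of the three restrictions is violated.
print(zero_copy_violations(True, "all_reduce_test", True))
```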

Performance test options

-n <iters_count>

or --iters <iters_count>

Optional

Number of iterations. The default value is 20.

-w <warmup_iters_count>

or --warmup_iters <warmup_iters_count>

Optional

Number of warm-up iterations. Warm-up iterations add to the execution duration of the HCCL Performance Tester but are not counted in the performance statistics. The default value is 10.

Note: The first few iterations may include one-off operations that distort the measurement, such as socket establishment in the first iteration. You are advised to run them as warm-up iterations so that they are excluded from the performance statistics.
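
The general idea of warm-up iterations can be illustrated with a small timing loop: the warm-up calls are executed but not timed, and only the following iterations contribute to the average. This is a generic sketch of the technique, not the measurement code of the HCCL Performance Tester. The function name is hypothetical, and the defaults simply mirror the documented defaults of -n and -w.

```python
import time
from typing import Callable


def average_seconds(operation: Callable[[], None],
                    iters: int = 20,
                    warmup_iters: int = 10) -> float:
    """Average duration of operation over iters timed runs, after untimed warm-up runs."""
    if iters < 1:
        raise ValueError("iters must be at least 1")
    for _ in range(warmup_iters):
        operation()                    # executed, but excluded from the statistics
    start = time.perf_counter()
    for _ in range(iters):
        operation()                    # timed iterations
    return (time.perf_counter() - start) / iters


# Example with a stand-in workload.
print(average_seconds(lambda: sum(range(1000))) >= 0.0)
```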

Result check options

-c <0/1>

or --check <0/1>

Optional

Whether to verify the correctness of the HCCL operation results.
  • 0: disabled.
  • 1: enabled.

Default value: 1

Note: In large-scale cluster scenarios, enabling the result check increases the execution duration of the HCCL Performance Tester.