Parsing Communication Profile Data

The msprof communication profile data parsing function collects statistics on communication, such as the time consumed in each phase, copy information, and bandwidth, for use in communication data analysis. Communication data exists only in multi-rank, multi-node, or cluster scenarios.

Prerequisites

  • You have performed operations in Before You Start.
  • You have run the msprof command to export the PROF_XXX directory with the clear mode disabled.
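The second prerequisite can be checked before the analysis is run. The following is a minimal sketch, assuming that an export with clear mode disabled leaves a directory named sqlite somewhere under PROF_XXX (the --clear option description states that this directory is deleted when clear mode is enabled). The helper name has_sqlite_dir and the device_0 subdirectory used in the demonstration are hypothetical.

```python
import os
import tempfile


def has_sqlite_dir(prof_dir):
    """Return True if a directory named 'sqlite' exists anywhere under prof_dir."""
    for _root, dirs, _files in os.walk(prof_dir):
        if "sqlite" in dirs:
            return True
    return False


if __name__ == "__main__":
    # Demonstration on a throwaway directory tree, not on real profile data.
    # The device_0 level is a made-up layout used only for this demo.
    with tempfile.TemporaryDirectory() as tmp:
        prof = os.path.join(tmp, "PROF_XXX")
        os.makedirs(os.path.join(prof, "device_0", "sqlite"))
        print(has_sqlite_dir(prof))
```

If the function returns False for a real PROF_XXX directory, the data was probably exported with clear mode enabled and may need to be exported again.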

Procedure (msprof commands)

Run the analysis command.

Example:

msprof --analyze=on [--type=<type>] [--rule=communication] --output=<dir> [--clear=on]
Table 1 Options

Option

Description

Required/Optional

--analyze

Whether to analyze the profile data file. The value can be on or off (default).

Required

--type

Format of the result file generated when the profile data collected by the msprof command is parsed. The available formats are:

  • text: parsed into a .json file and a communication_analyzer.db file.
  • db: parsed into a communication_analyzer.db file.

The default value is text.

Optional

--rule

Analysis rule. Possible values are as follows:

  • communication: analyzes communication data.
    • If --type is set to text, the communication.json file is generated in the PROF_XXX/analyze directory to display details about the communication duration and bandwidth of all communication operators on a single rank (see Figure 4), and the communication_analyzer.db file is also generated.
    • If --type is set to db, only the communication_analyzer.db file is generated in the PROF_XXX/analyze directory to save the CommAnalyzerTime (communication duration) and CommAnalyzerBandwidth (communication bandwidth) information tables.
  • communication_matrix: analyzes communication matrix data.
    • If --type is set to text, the communication_matrix.json file is generated in the PROF_XXX/analyze directory to display basic information about communication operators, including the communication size, bandwidth, and rank information used to analyze communication details (see Figure 5). The communication_analyzer.db file is also generated.
    • If --type is set to db, only the communication_analyzer.db file is generated in PROF_XXX/analyze to store the CommAnalyzerMatrix (communication matrix) information table.

Both values can be set at the same time. Use a comma (,) to separate them, for example, --rule=communication,communication_matrix.

By default, they are both set.

Optional

--output

Directory for storing the profile data file. The value must be a PROF_XXX directory, for example, /home/HwHiAiUser/profiler_data/PROF_XXX.

The following special characters are not allowed in the path: "\n", "\\n", "\f", "\\f", "\r", "\\r", "\b", "\\b", "\t", "\\t", "\v", "\\v", "\u007F", "\\u007F", "\"", "\\\"", "'", "\'", "\\", "\\\\", "%", "\\%", ">", "\\>", "<", "\\<", "|", "\\|", "&", "\\&", "$", "\\$", ";", "\\;", "`", "\\`".

Required

--clear

Data simplification mode. After this option is enabled, the sqlite directory in PROF_XXX is deleted after profile data is exported, so as to save storage space. The value can be on or off (default).

Optional
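The options in Table 1 can be assembled programmatically. The following is a minimal sketch that builds the argument list from the options and values described above and rejects values that the table does not list. The function name build_msprof_analyze_cmd and its parameters are hypothetical, not part of msprof.

```python
def build_msprof_analyze_cmd(output_dir, result_type=None, rules=None, clear=False):
    """Assemble the argument list for `msprof --analyze=on ...`."""
    allowed_types = {"text", "db"}
    allowed_rules = {"communication", "communication_matrix"}

    cmd = ["msprof", "--analyze=on"]
    if result_type is not None:
        if result_type not in allowed_types:
            raise ValueError(f"unsupported --type value: {result_type}")
        cmd.append(f"--type={result_type}")
    if rules:
        unknown = [r for r in rules if r not in allowed_rules]
        if unknown:
            raise ValueError(f"unsupported --rule value(s): {unknown}")
        # Multiple rules are joined with a comma, as described for --rule.
        cmd.append("--rule=" + ",".join(rules))
    cmd.append(f"--output={output_dir}")
    if clear:
        cmd.append("--clear=on")
    return cmd


if __name__ == "__main__":
    cmd = build_msprof_analyze_cmd(
        "/home/HwHiAiUser/profiler_data/PROF_XXX",
        result_type="db",
        rules=["communication", "communication_matrix"],
    )
    print(" ".join(cmd))
    # To run it on a machine where msprof is installed:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than as a single shell string avoids shell interpretation of the path.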

Procedure (msprof.py script)

  1. Log in as the running user to the development environment where the CANN Toolkit package and ops operator package are located.
  2. Switch to the directory where the msprof.py script is located.

    ${INSTALL_DIR}/tools/profiler/profiler_tool/analysis/msprof. Replace ${INSTALL_DIR} with the actual CANN component directory. If the Ascend-CANN-Toolkit package is installed as the root user, the CANN component directory is /usr/local/Ascend/ascend-toolkit/latest.

  3. Run the analysis command.
    Example:
    python3 msprof.py analyze [--type <type>] --rule communication -dir <dir> [--clear]
Table 2 Options

Option

Description

Required/Optional

analyze

Analyzes the profile data file.

Required

--type

Format of the result file generated when the profile data collected by the msprof.py script is parsed. The available formats are:

  • text: parsed into a .json file and a communication_analyzer.db file.
  • db: parsed into a communication_analyzer.db file.

The default value is text.

Optional

-r or --rule

Analysis rule. Possible values are as follows:

  • communication: analyzes communication data.
    • If --type is set to text, the communication.json file is generated in the PROF_XXX/analyze directory to display details about the communication duration and bandwidth of all communication operators on a single rank (see Figure 4), and the communication_analyzer.db file is also generated.
    • If --type is set to db, only the communication_analyzer.db file is generated in the PROF_XXX/analyze directory to save the CommAnalyzerTime (communication duration) and CommAnalyzerBandwidth (communication bandwidth) information tables.
  • communication_matrix: analyzes communication matrix data.
    • If --type is set to text, the communication_matrix.json file is generated in the PROF_XXX/analyze directory to display basic information about communication operators, including the communication size, bandwidth, and rank information used to analyze communication details (see Figure 5). The communication_analyzer.db file is also generated.
    • If --type is set to db, only the communication_analyzer.db file is generated in PROF_XXX/analyze to store the CommAnalyzerMatrix (communication matrix) information table.

You can set either or both of these values. To set both, use a comma (,) to separate them, for example, --rule communication,communication_matrix.

Required

-dir or --collection-dir

Directory for storing the profile data file. The value must be a PROF_XXX directory, for example, /home/HwHiAiUser/profiler_data/PROF_XXX.

The following special characters are not allowed in the path: "\n", "\\n", "\f", "\\f", "\r", "\\r", "\b", "\\b", "\t", "\\t", "\v", "\\v", "\u007F", "\\u007F", "\"", "\\\"", "'", "\'", "\\", "\\\\", "%", "\\%", ">", "\\>", "<", "\\<", "|", "\\|", "&", "\\&", "$", "\\$", ";", "\\;", "`", "\\`".

Required

--clear

Data simplification mode. When this option is specified, the sqlite directory in PROF_XXX is deleted after profile data is exported, so as to save storage space. This option is not specified by default.

Optional
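The path restrictions on the profile data directory can be checked before the command is run. The following is a minimal sketch, not the check that msprof itself performs. It assumes that "must be PROF_XXX" means the last path component starts with PROF_. The names FORBIDDEN_CHARS and check_prof_path are hypothetical.

```python
import os

# Characters listed as not allowed in the path. The escaped variants in the
# list (for example "\\n" or "\\%") all contain a backslash, so rejecting the
# backslash itself covers them.
FORBIDDEN_CHARS = set("\n\f\r\b\t\v\x7f\"'\\%><|&$;`")


def check_prof_path(path):
    """Return a list of problems found in path; an empty list means none found."""
    problems = []
    bad = sorted({ch for ch in path if ch in FORBIDDEN_CHARS})
    if bad:
        problems.append("forbidden characters: " + ", ".join(repr(ch) for ch in bad))
    # Assumption: the last path component must start with "PROF_".
    if not os.path.basename(path.rstrip("/")).startswith("PROF_"):
        problems.append("last path component does not start with PROF_")
    return problems


if __name__ == "__main__":
    for candidate in ("/home/HwHiAiUser/profiler_data/PROF_XXX", "/tmp/PROF_$X"):
        print(candidate, check_prof_path(candidate))
```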

Parsing Result

  • --type=text or --type=db, and --rule=communication
    Figure 1 CommAnalyzerTime
    Table 3 CommAnalyzerTime

    Field                   Description
    hccl_op_name            Communication operator name.
    group_name              Group of communication operators.
    start_timestamp         Communication start timestamp.
    elapse_time             Total operator communication duration, in milliseconds.
    transit_time            Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty.
    wait_time               Waiting duration, in milliseconds. Before two nodes communicate, they first synchronize so that both are ready.
    synchronization_time    Synchronization duration, in milliseconds. It is the duration required for synchronization between nodes.
    idle_time               Idle time, in milliseconds. Idle time (idle_time) = Total operator communication duration (elapse_time) - Communication duration (transit_time) - Waiting duration (wait_time)
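The idle_time relation in Table 3 can be checked with a short calculation. The following is a minimal sketch. The function name idle_time_ms and the sample values are made up for illustration and do not come from real profile data.

```python
def idle_time_ms(elapse_time, transit_time, wait_time):
    """idle_time = elapse_time - transit_time - wait_time, all in milliseconds."""
    return elapse_time - transit_time - wait_time


if __name__ == "__main__":
    # Hypothetical values for one communication operator, in milliseconds.
    row = {"elapse_time": 12.5, "transit_time": 7.25, "wait_time": 3.0}
    print(idle_time_ms(**row))  # 12.5 - 7.25 - 3.0 = 2.25
```

A large idle_time relative to elapse_time means that much of the operator's duration was spent neither transferring data nor waiting.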

    Figure 2 CommAnalyzerBandwidth
    Table 4 CommAnalyzerBandwidth

    Field                   Description
    hccl_op_name            Communication operator name.
    group_name              Group of communication operators.
    transport_type          Communication transmission type, including LOCAL, SDMA, RDMA, PCIE, SIO, and HCCS.
    transit_size            Communication data volume, in MB.
    transit_time            Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty.
    bandwidth               Communication bandwidth, in GB/s.
    large_packet_ratio      Ratio of large communication data packets.
    package_size            Size of a communication data packet transmitted at a time, in MB.
    count                   Number of communication transmissions.
    total_duration          Total duration of data transmission, in milliseconds.
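One way to inspect the CommAnalyzerBandwidth table is with Python's sqlite3 module. The following is a sketch, assuming that communication_analyzer.db is an SQLite database and that the table columns carry the field names in Table 4. The column types, the sample rows, and the helper names are made up for illustration, and the demonstration table contains only a subset of the fields. The sketch also assumes that 1 GB is 1000 MB, so that MB per millisecond equals GB/s.

```python
import sqlite3


def bandwidth_by_transport(conn):
    """Return total transit_size / total transit_time per transport_type.

    transit_size is in MB and transit_time in ms, so the ratio is in GB/s
    under the assumption that 1 GB = 1000 MB.
    """
    query = """
        SELECT transport_type,
               SUM(transit_size) AS total_mb,
               SUM(transit_time) AS total_ms
        FROM CommAnalyzerBandwidth
        GROUP BY transport_type
        ORDER BY transport_type
    """
    result = {}
    for transport, total_mb, total_ms in conn.execute(query):
        result[transport] = total_mb / total_ms if total_ms else 0.0
    return result


def make_demo_db():
    """Build an in-memory stand-in with made-up rows; the real schema may differ."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CommAnalyzerBandwidth ("
        "hccl_op_name TEXT, group_name TEXT, transport_type TEXT, "
        "transit_size REAL, transit_time REAL, bandwidth REAL)"
    )
    rows = [
        ("allreduce_0", "group_a", "HCCS", 64.0, 4.0, 16.0),
        ("allreduce_1", "group_a", "HCCS", 32.0, 4.0, 8.0),
        ("allreduce_0", "group_a", "RDMA", 10.0, 2.0, 5.0),
    ]
    conn.executemany("INSERT INTO CommAnalyzerBandwidth VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(bandwidth_by_transport(make_demo_db()))
```

To try it on real output, replace make_demo_db() with sqlite3.connect() on the communication_analyzer.db file in the PROF_XXX/analyze directory.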

  • --type=text or --type=db, and --rule=communication_matrix
    Figure 3 CommAnalyzerMatrix
    Table 5 CommAnalyzerMatrix

    Field                   Description
    hccl_op_name            Communication operator name.
    group_name              Group of communication operators.
    src_rank                Rank of the communication source.
    dst_rank                Rank of the communication destination.
    transport_type          Communication transmission type, including LOCAL, SDMA, RDMA, PCIE, SIO, and HCCS.
    transit_size            Communication data volume, in MB.
    transit_time            Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty.
    bandwidth               Communication bandwidth, in GB/s.
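Rows shaped like those in Table 5 can be pivoted into a rank-to-rank matrix to locate slow links. The following is a minimal sketch, assuming that each row supplies src_rank, dst_rank, transit_size (MB), and transit_time (ms) and that ranks are integers. The function names and the sample rows are made up for illustration.

```python
from collections import defaultdict


def build_bandwidth_matrix(rows):
    """Aggregate (src_rank, dst_rank, transit_size, transit_time) rows.

    matrix[src][dst] is total transit_size / total transit_time for that link.
    """
    size = defaultdict(float)
    time = defaultdict(float)
    for src, dst, transit_size, transit_time in rows:
        size[(src, dst)] += transit_size
        time[(src, dst)] += transit_time
    matrix = defaultdict(dict)
    for (src, dst), total_size in size.items():
        total_time = time[(src, dst)]
        matrix[src][dst] = total_size / total_time if total_time else 0.0
    return {src: dict(dsts) for src, dsts in matrix.items()}


def slowest_link(matrix):
    """Return (src, dst, bandwidth) for the link with the lowest bandwidth."""
    return min(
        ((src, dst, bw) for src, dsts in matrix.items() for dst, bw in dsts.items()),
        key=lambda item: item[2],
    )


if __name__ == "__main__":
    # Hypothetical rows: (src_rank, dst_rank, transit_size in MB, transit_time in ms).
    demo_rows = [(0, 1, 8.0, 1.0), (0, 1, 8.0, 1.0), (1, 0, 6.0, 2.0), (1, 1, 4.0, 0.5)]
    matrix = build_bandwidth_matrix(demo_rows)
    print(matrix)
    print(slowest_link(matrix))
```

Summing size and time before dividing weights each link by the data it actually carried, rather than averaging per-row bandwidth values.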

  • --type=text, --rule=communication
    Figure 4 communication.json
  • --type=text, --rule=communication_matrix
    Figure 5 communication_matrix.json