Communication Profile Data Parsing

The msprof communication profile data parsing function collects statistics on communication-related information, such as per-segment time consumption, copy information, and bandwidth, for communication data analysis. Communication data exists only in multi-device, multi-node, or cluster scenarios.

Prerequisites

  • You have completed the operations in Before You Start.
  • You have run the msprof command to export the profile data to the PROF_XXX directory with --clear disabled.

Procedure (msprof commands)

Run the analysis command.

Example:

msprof --analyze=on [--type=<type>] [--rule=communication] --output=<dir> [--clear=on]
Table 1 Command-line options

Option

Description

Required/Optional

--analyze

Whether to analyze the profile data. The value can be on or off (default).

Required

--type

Format of the result file generated when the profile data collected by the msprof command is parsed. The available formats are:

  • text: parsed into a JSON file.
  • db: parsed into a communication_analyzer.db file.

The default value is text.

Optional

--rule

Analysis rule. Possible values are as follows:

  • communication: analyzes communication data.
    • If --type is set to text, the communication.json file is generated in the PROF_XXX/analyze directory. The file displays detailed information such as the communication duration and bandwidth of all communication operators on a single device. See Figure 1.
    • If --type is set to db, the communication_analyzer.db file is generated in the PROF_XXX/analyze directory to save the CommAnalyzerTime (communication duration) and CommAnalyzerBandwidth (communication bandwidth) information tables.
  • communication_matrix: analyzes communication matrix data.
    • If --type is set to text, the communication_matrix.json file is generated in the PROF_XXX/analyze directory. The file displays basic information about communication operators, including the communication size, bandwidth, and rank information, which is used to analyze communication details. See Figure 2.
    • If --type is set to db, the communication_analyzer.db file is generated in PROF_XXX/analyze to store the CommAnalyzerMatrix (communication matrix) information table.

Both values can be set together. Use a comma (,) to separate them, for example, --rule=communication,communication_matrix.

By default, they are both set.

Optional

--output

Directory that stores the profile data. The value must be a PROF_XXX directory, for example, /home/HwHiAiUser/profiler_data/PROF_XXX.

The following special characters are not allowed in the path: "\n", "\\n", "\f", "\\f", "\r", "\\r", "\b", "\\b", "\t", "\\t", "\v", "\\v", "\u007F", "\\u007F", "\"", "\\\"", "'", "\'", "\\", "\\\\", "%", "\\%", ">", "\\>", "<", "\\<", "|", "\\|", "&", "\\&", "$", "\\$", ";", "\\;", "`", "\\`".

Required

--clear

Data simplification mode. When this option is enabled, the sqlite directory in PROF_XXX is deleted after the profile data is exported, which saves storage space. The value can be on or off (default).

Optional
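
To illustrate how the options in Table 1 combine, the following Python sketch assembles the command line as an argument list. The helper name build_analyze_command is hypothetical and not part of msprof; only the option names and values come from the table. The sketch builds the list and does not run msprof.

```python
from typing import List, Optional


def build_analyze_command(
    output_dir: str,
    result_type: Optional[str] = None,
    rules: Optional[List[str]] = None,
    clear: bool = False,
) -> List[str]:
    """Assemble an msprof analysis command from the options in Table 1.

    build_analyze_command is a hypothetical helper used for illustration only.
    """
    if result_type not in (None, "text", "db"):
        raise ValueError("--type must be text or db")
    allowed_rules = {"communication", "communication_matrix"}
    if rules is not None and (not rules or not set(rules) <= allowed_rules):
        raise ValueError("--rule accepts communication and/or communication_matrix")

    cmd = ["msprof", "--analyze=on"]       # --analyze is required
    if result_type is not None:            # optional, defaults to text
        cmd.append(f"--type={result_type}")
    if rules is not None:                  # optional, defaults to both rules
        cmd.append("--rule=" + ",".join(rules))
    cmd.append(f"--output={output_dir}")   # required, a PROF_XXX directory
    if clear:                              # optional, defaults to off
        cmd.append("--clear=on")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_analyze_command(
        "/home/HwHiAiUser/profiler_data/PROF_XXX",
        result_type="db",
        rules=["communication", "communication_matrix"],
    )))
```

Returning a list rather than a single string keeps each option as a separate argument, so the result can be passed to a process launcher without shell quoting.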

Procedure (msprof.py script)

Run the analysis command.

Example:

python3 msprof.py analyze [--type <type>] --rule communication -dir <dir> [--clear]
Table 2 Command-line options

Option

Description

Required/Optional

analyze

Analyze the profile data file.

Required

--type

Format of the result file generated when the profile data collected by the msprof.py script is parsed. The available formats are:

  • text: parsed into a JSON file.
  • db: parsed into a communication_analyzer.db file.

The default value is text.

Optional

-r or --rule

Analysis rule. Possible values are as follows:

  • communication: analyzes communication data.
    • If --type is set to text, the communication.json file is generated in the PROF_XXX/analyze directory. The file displays detailed information such as the communication duration and bandwidth of all communication operators on a single device. See Figure 1.
    • If --type is set to db, the communication_analyzer.db file is generated in the PROF_XXX/analyze directory to save the CommAnalyzerTime (communication duration) and CommAnalyzerBandwidth (communication bandwidth) information tables.
  • communication_matrix: analyzes communication matrix data.
    • If --type is set to text, the communication_matrix.json file is generated in the PROF_XXX/analyze directory. The file displays basic information about communication operators, including the communication size, bandwidth, and rank information, which is used to analyze communication details. See Figure 2.
    • If --type is set to db, the communication_analyzer.db file is generated in PROF_XXX/analyze to store the CommAnalyzerMatrix (communication matrix) information table.

You can set either or both of these values. If you set both, use a comma (,) to separate them, for example, --rule communication,communication_matrix.

Required

-dir or --collection-dir

Directory that stores the profile data. The value must be a PROF_XXX directory, for example, /home/HwHiAiUser/profiler_data/PROF_XXX.

The following special characters are not allowed in the path: "\n", "\\n", "\f", "\\f", "\r", "\\r", "\b", "\\b", "\t", "\\t", "\v", "\\v", "\u007F", "\\u007F", "\"", "\\\"", "'", "\'", "\\", "\\\\", "%", "\\%", ">", "\\>", "<", "\\<", "|", "\\|", "&", "\\&", "$", "\\$", ";", "\\;", "`", "\\`".

Required

--clear

Data simplification mode. When this option is specified, the sqlite directory in PROF_XXX is deleted after the profile data is exported, which saves storage space. This option is not specified by default.

Optional
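
The path restriction in Tables 1 and 2 can be checked before the command is run. The following is a minimal sketch; the names FORBIDDEN_CHARS and find_forbidden_chars are hypothetical and not part of msprof. It relies on the observation that every escaped variant in the list (for example "\\n") contains a backslash, so rejecting the backslash itself also covers those variants.

```python
# Characters rejected in the path, taken from the list in Tables 1 and 2:
# newline, form feed, carriage return, backspace, tab, vertical tab, DEL,
# double quote, single quote, backslash, % > < | & $ ; and backtick.
FORBIDDEN_CHARS = set('\n\f\r\b\t\v\x7f"\'\\%><|&$;`')


def find_forbidden_chars(path: str) -> list:
    """Return the forbidden characters found in path, in order of first appearance."""
    found = []
    for ch in path:
        if ch in FORBIDDEN_CHARS and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    # An empty list means the path contains none of the listed characters.
    print(find_forbidden_chars("/home/HwHiAiUser/profiler_data/PROF_XXX"))
```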

Parsing Result

  • --type=text, --rule=communication
    Figure 1 communication.json
  • --type=text, --rule=communication_matrix
    Figure 2 communication_matrix.json
  • --type=db, --rule=communication
    Figure 3 CommAnalyzerTime
    Table 3 CommAnalyzerTime

    Field

    Description

    hccl_op_name

    Name of an HCCL communication operator.

    group_name

    Group of communication operators.

    start_timestamp

    Communication start timestamp.

    elapse_time

    Total operator communication duration, in milliseconds.

    transit_time

    Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty.

    wait_time

    Waiting duration, in milliseconds. Before communication between two nodes starts, the nodes must first complete synchronization with each other.

    synchronization_time

    Synchronization duration, in milliseconds. It is the duration required for synchronization between nodes.

    idle_time

    Duration for communication operator delivery, in milliseconds. It is calculated as follows: idle_time = elapse_time - transit_time - wait_time.
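
The idle_time relation in Table 3 can be recomputed from the other columns. The following is a sketch using the Python standard sqlite3 module. The table and column names come from Table 3; the column types and the sample row are assumptions made only so that the example runs on its own.

```python
import sqlite3


def idle_times(conn: sqlite3.Connection) -> list:
    """Return (hccl_op_name, idle_time) pairs recomputed from CommAnalyzerTime.

    idle_time = elapse_time - transit_time - wait_time, as stated in Table 3.
    """
    rows = conn.execute(
        "SELECT hccl_op_name, elapse_time, transit_time, wait_time "
        "FROM CommAnalyzerTime ORDER BY start_timestamp"
    ).fetchall()
    return [(name, elapse - transit - wait) for name, elapse, transit, wait in rows]


if __name__ == "__main__":
    # In-memory stand-in with made-up values and assumed column types.
    # A real run would instead open communication_analyzer.db in PROF_XXX/analyze.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CommAnalyzerTime (hccl_op_name TEXT, group_name TEXT, "
        "start_timestamp REAL, elapse_time REAL, transit_time REAL, "
        "wait_time REAL, synchronization_time REAL, idle_time REAL)"
    )
    conn.execute(
        "INSERT INTO CommAnalyzerTime VALUES "
        "('op_a', 'group_0', 1.0, 10.0, 6.0, 3.0, 1.0, 1.0)"
    )
    print(idle_times(conn))
```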

    Figure 4 CommAnalyzerBandwidth
    Table 4 CommAnalyzerBandwidth

    Field

    Description

    hccl_op_name

    Name of an HCCL communication operator.

    group_name

    Group of communication operators.

    transport_type

    Communication transmission type, including LOCAL, SDMA, RDMA, PCIE, and HCCS.

    transit_size

    Communication data volume, in MB.

    transit_time

    Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty.

    bandwidth

    Communication bandwidth, in GB/s.

    large_packet_ratio

    Ratio of large communication data packets.

    package_size

    Size of a communication data packet transmitted at a time, in MB.

    count

    Number of transmissions.

    total_duration

    Total duration of data transmission, in milliseconds.
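
The units in Table 4 fit together: with decimal units, 1 MB per millisecond equals 1 GB/s. The sketch below assumes that bandwidth is the data volume divided by the transmission time, which this section does not state explicitly, and uses that assumption to aggregate a per-transport-type bandwidth. The function names are hypothetical.

```python
import sqlite3


def bandwidth_gb_per_s(transit_size_mb: float, transit_time_ms: float) -> float:
    """Convert MB over ms to GB/s, assuming 1 MB = 10**6 B and 1 GB = 10**9 B.

    Assumption: bandwidth = transit_size / transit_time.
    """
    if transit_time_ms <= 0:
        return 0.0
    return transit_size_mb / transit_time_ms


def bandwidth_by_transport(conn: sqlite3.Connection) -> dict:
    """Map each transport_type to a bandwidth from summed size and time."""
    rows = conn.execute(
        "SELECT transport_type, SUM(transit_size), SUM(transit_time) "
        "FROM CommAnalyzerBandwidth GROUP BY transport_type"
    ).fetchall()
    return {t: bandwidth_gb_per_s(size, time) for t, size, time in rows}
```

To try it on real output, pass a connection opened with sqlite3.connect on the communication_analyzer.db file in PROF_XXX/analyze.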

  • --type=db, --rule=communication_matrix
    Figure 5 CommAnalyzerMatrix
    Table 5 CommAnalyzerMatrix

    Field

    Description

    hccl_op_name

    Name of an HCCL communication operator.

    group_name

    Group of communication operators.

    src_rank

    Rank of the communication source.

    dst_rank

    Rank of the communication destination.

    transport_type

    Communication transmission type, including LOCAL, SDMA, RDMA, PCIE, and HCCS.

    transit_size

    Communication data volume, in MB.

    transit_time

    Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty.

    bandwidth

    Communication bandwidth, in GB/s.
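
Because an unusually long communication duration may point to a faulty link, one way to use the CommAnalyzerMatrix table is to list the rank pairs with the lowest bandwidth. The following is a sketch using the Python standard sqlite3 module. The table and column names come from Table 5; the function name slowest_links and the choice to skip rows with no data volume are assumptions.

```python
import sqlite3


def slowest_links(conn: sqlite3.Connection, limit: int = 3) -> list:
    """Return up to `limit` (src_rank, dst_rank, transport_type, bandwidth) rows,
    lowest bandwidth first.

    Rows with transit_size of 0 are skipped (an assumption), since they carry
    no data from which a bandwidth can be judged.
    """
    return conn.execute(
        "SELECT src_rank, dst_rank, transport_type, bandwidth "
        "FROM CommAnalyzerMatrix WHERE transit_size > 0 "
        "ORDER BY bandwidth ASC LIMIT ?",
        (limit,),
    ).fetchall()
```

A low bandwidth alone does not prove a fault; comparing links of the same transport_type gives a fairer picture, because different transport types are expected to differ in speed.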