Communication Profile Data Parsing
The msprof communication profile data parsing function collects communication statistics, such as per-segment time consumption, copy information, and bandwidth, for communication data analysis. Communication data exists only in multi-device, multi-node, or cluster scenarios.
Prerequisites
- You have performed operations in Before You Start.
- You have run the msprof command to export the profile data to the PROF_XXX directory with the --clear option disabled.
Procedure (msprof commands)
Run the analysis command.
Example:
msprof --analyze=on [--type=<type>] [--rule=communication] --output=<dir> [--clear=on]
| Option | Description | Required/Optional |
|---|---|---|
| --analyze | Whether to analyze the profile data. The value can be on or off (default). | Required |
| --type | Format of the result file generated when the profile data collected by the msprof command is parsed. The available formats include text and db. The default value is text. | Optional |
| --rule | Analysis rule. The possible values are communication and communication_matrix. Both values can be set together, separated by a comma (,), for example, --rule=communication,communication_matrix. By default, both are set. | Optional |
| --output | Directory that stores the profile data. The value must be a PROF_XXX directory, for example, /home/HwHiAiUser/profiler_data/PROF_XXX. Some special characters are not allowed in the path; they are listed below this table. | Required |
| --clear | Data simplification mode. When this option is enabled, the sqlite directory in PROF_XXX is deleted after the profile data is exported to save storage space. The value can be on or off (default). | Optional |

The following special characters are not allowed in the --output path: "\n", "\\n", "\f", "\\f", "\r", "\\r", "\b", "\\b", "\t", "\\t", "\v", "\\v", "\u007F", "\\u007F", "\"", "\\\"", "'", "\'", "\\", "\\\\", "%", "\\%", ">", "\\>", "<", "\\<", "|", "\\|", "&", "\\&", "$", "\\$", ";", "\\;", "`", "\\`".
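The path restriction on --output can be checked before msprof is invoked. The following is a minimal sketch, not part of msprof: the names FORBIDDEN_PATH_CHARS and find_forbidden_chars are hypothetical, and it assumes that rejecting the backslash character also covers the escaped forms (such as "\\n") in the list.

```python
# Hypothetical pre-check for the --output path; not part of msprof.
# Characters listed as not allowed: newline, form feed, carriage return,
# backspace, tab, vertical tab, DEL (U+007F), quotes, backslash, and
# the shell metacharacters % > < | & $ ; `
FORBIDDEN_PATH_CHARS = frozenset("\n\f\r\b\t\v\x7f\"'\\%><|&$;`")


def find_forbidden_chars(path):
    """Return the forbidden characters found in path, in order of first appearance."""
    found = []
    for ch in path:
        if ch in FORBIDDEN_PATH_CHARS and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    for candidate in ("/home/HwHiAiUser/profiler_data/PROF_XXX", "/tmp/PROF_XXX;ls"):
        bad = find_forbidden_chars(candidate)
        status = "ok" if not bad else "rejected, contains %r" % bad
        print(candidate, "->", status)
```

An empty result means the path contains none of the listed characters; it does not confirm that the directory exists or is a valid PROF_XXX directory.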
Procedure (msprof.py script)
Run the analysis command.
Example:
python3 msprof.py analyze [--type <type>] --rule communication -dir <dir> [--clear]
| Option | Description | Required/Optional |
|---|---|---|
| analyze | Analyzes the profile data. | Required |
| --type | Format of the result file generated when the profile data is parsed by the msprof.py script. The available formats include text and db. The default value is text. | Optional |
| -r or --rule | Analysis rule. The possible values are communication and communication_matrix. You can set either or both of these values. To set both, separate them with a comma (,), for example, --rule communication,communication_matrix. | Required |
| -dir or --collection-dir | Directory that stores the profile data. The value must be a PROF_XXX directory, for example, /home/HwHiAiUser/profiler_data/PROF_XXX. Some special characters are not allowed in the path; they are listed below this table. | Required |
| --clear | Data simplification mode. When this option is specified, the sqlite directory in PROF_XXX is deleted after the profile data is exported to save storage space. This option is not specified by default. | Optional |

The following special characters are not allowed in the -dir path: "\n", "\\n", "\f", "\\f", "\r", "\\r", "\b", "\\b", "\t", "\\t", "\v", "\\v", "\u007F", "\\u007F", "\"", "\\\"", "'", "\'", "\\", "\\\\", "%", "\\%", ">", "\\>", "<", "\\<", "|", "\\|", "&", "\\&", "$", "\\$", ";", "\\;", "`", "\\`".
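When the script is driven from automation, the argument list can be assembled programmatically. The sketch below is illustrative only: build_analyze_argv and VALID_RULES are hypothetical names, and the function only builds the list of arguments shown in the example command above without running anything.

```python
# Hypothetical helper that assembles the msprof.py analyze command line.
VALID_RULES = ("communication", "communication_matrix")


def build_analyze_argv(collection_dir, rules, result_type=None, clear=False,
                       script="msprof.py"):
    """Return the argv list for `python3 msprof.py analyze ...`."""
    if not rules:
        raise ValueError("--rule is required for the msprof.py script")
    unknown = [rule for rule in rules if rule not in VALID_RULES]
    if unknown:
        raise ValueError("unknown rule(s): %s" % ", ".join(unknown))

    argv = ["python3", script, "analyze"]
    if result_type is not None:
        argv += ["--type", result_type]
    # Multiple rules are passed as one comma-separated value.
    argv += ["--rule", ",".join(rules), "-dir", collection_dir]
    if clear:
        argv.append("--clear")
    return argv


if __name__ == "__main__":
    print(" ".join(build_analyze_argv(
        "/home/HwHiAiUser/profiler_data/PROF_XXX",
        ["communication", "communication_matrix"],
    )))
```

The returned list can be handed to subprocess.run(argv, check=True) from the directory that contains msprof.py; passing a list rather than a shell string avoids shell interpretation of the path.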
Parsing Result
- --type=text, --rule=communication
- --type=text, --rule=communication_matrix
- --type=db, --rule=communication
Figure 3 CommAnalyzerTime
Table 3 CommAnalyzerTime

| Field | Description |
|---|---|
| hccl_op_name | Name of an HCCL communication operator. |
| group_name | Group of communication operators. |
| start_timestamp | Communication start timestamp. |
| elapse_time | Total operator communication duration, in milliseconds. |
| transit_time | Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty. |
| wait_time | Waiting duration, in milliseconds. Two nodes must finish synchronizing before communication between them can start. |
| synchronization_time | Synchronization duration, in milliseconds. It is the time required for synchronization between nodes. |
| idle_time | Duration for communication operator delivery, in milliseconds. It is calculated as idle_time = elapse_time - transit_time - wait_time. |
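The idle_time relationship can be verified against exported rows. This is a minimal sketch: the function name and the sample values are invented for illustration and are not msprof output.

```python
def idle_time_ms(elapse_time, transit_time, wait_time):
    """idle_time = elapse_time - transit_time - wait_time (all in milliseconds)."""
    return elapse_time - transit_time - wait_time


if __name__ == "__main__":
    # Invented sample row using the CommAnalyzerTime field names.
    row = {"elapse_time": 10.0, "transit_time": 6.5, "wait_time": 2.0}
    idle = idle_time_ms(row["elapse_time"], row["transit_time"], row["wait_time"])
    print(idle)  # prints 1.5
```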
Figure 4 CommAnalyzerBandwidth
Table 4 CommAnalyzerBandwidth

| Field | Description |
|---|---|
| hccl_op_name | Name of an HCCL communication operator. |
| group_name | Group of communication operators. |
| transport_type | Communication transmission type, including LOCAL, SDMA, RDMA, PCIE, and HCCS. |
| transit_size | Communication data volume, in MB. |
| transit_time | Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty. |
| bandwidth | Communication bandwidth, in GB/s. |
| large_packet_ratio | Ratio of large communication data packets. |
| package_size | Size of a communication data packet transmitted at a time, in MB. |
| count | Number of communication transmissions. |
| total_duration | Total duration of data transmission, in milliseconds. |
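The units of these fields fit together: in decimal units, 1 MB per millisecond equals 1 GB/s. The sketch below uses that to cross-check bandwidth values. It assumes that bandwidth corresponds to transit_size divided by transit_time, which this section does not state explicitly; the function names and sample rows are invented for illustration.

```python
def bandwidth_gb_per_s(transit_size_mb, transit_time_ms):
    """MB / ms equals GB/s in decimal units; returns 0.0 when no time was recorded."""
    if transit_time_ms <= 0:
        return 0.0
    return transit_size_mb / transit_time_ms


def bandwidth_by_transport(rows):
    """Aggregate size and time per transport_type, then derive bandwidth (assumed formula)."""
    totals = {}
    for row in rows:
        size, time = totals.get(row["transport_type"], (0.0, 0.0))
        totals[row["transport_type"]] = (size + row["transit_size"],
                                         time + row["transit_time"])
    return {kind: bandwidth_gb_per_s(size, time) for kind, (size, time) in totals.items()}


if __name__ == "__main__":
    # Invented sample rows using the CommAnalyzerBandwidth field names.
    sample = [
        {"transport_type": "HCCS", "transit_size": 100.0, "transit_time": 2.0},
        {"transport_type": "HCCS", "transit_size": 300.0, "transit_time": 6.0},
        {"transport_type": "RDMA", "transit_size": 50.0, "transit_time": 5.0},
    ]
    print(bandwidth_by_transport(sample))
```

If a recomputed value differs noticeably from the reported bandwidth field, treat that as a prompt to check how the tool derives the field rather than as an error in the data.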
- --type=db, --rule=communication_matrix
Figure 5 CommAnalyzerMatrix
Table 5 CommAnalyzerMatrix

| Field | Description |
|---|---|
| hccl_op_name | Name of an HCCL communication operator. |
| group_name | Group of communication operators. |
| src_rank | Rank of the communication source. |
| dst_rank | Rank of the communication destination. |
| transport_type | Communication transmission type, including LOCAL, SDMA, RDMA, PCIE, and HCCS. |
| transit_size | Communication data volume, in MB. |
| transit_time | Communication duration, in milliseconds. If the communication duration is too long, a link may be faulty. |
| bandwidth | Communication bandwidth, in GB/s. |
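Rows with the CommAnalyzerMatrix fields can be pivoted into a rank-by-rank view to spot a slow link. The following is a minimal sketch under stated assumptions: the rows are already loaded as dictionaries and belong to a single communication operator, and the function names and sample values are invented for illustration.

```python
def bandwidth_matrix(rows):
    """Map src_rank -> dst_rank -> bandwidth (GB/s) for rows of one operator."""
    matrix = {}
    for row in rows:
        matrix.setdefault(row["src_rank"], {})[row["dst_rank"]] = row["bandwidth"]
    return matrix


def slowest_link(rows):
    """Return the row with the lowest bandwidth, a candidate for closer inspection."""
    return min(rows, key=lambda row: row["bandwidth"])


if __name__ == "__main__":
    # Invented sample rows using the CommAnalyzerMatrix field names.
    sample = [
        {"src_rank": 0, "dst_rank": 1, "transport_type": "HCCS", "bandwidth": 18.5},
        {"src_rank": 0, "dst_rank": 2, "transport_type": "RDMA", "bandwidth": 3.2},
        {"src_rank": 1, "dst_rank": 0, "transport_type": "HCCS", "bandwidth": 17.9},
    ]
    print(bandwidth_matrix(sample))
    worst = slowest_link(sample)
    print("lowest bandwidth: rank %d -> rank %d" % (worst["src_rank"], worst["dst_rank"]))
```

Bandwidth differs by transport_type, so comparing links of the same type gives a fairer picture than a single global minimum.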

