SuperPoD Bandwidth Test Fails in NPU Disconnection Scenarios

Symptom

The card-level SuperPoD bandwidth test fails and reports an error when the following commands are run:

ascend-dmi -bw -t p2p --sp 1 --ip xx.xx.xx.xx --spp /home/superpod/share --hip xx.xx.xx.xx --mode card -q -d 5
ascend-dmi -bw -t p2p --sp 0 --ip xx.xx.xx.xx --spp /home/superpod/share --hip xx.xx.xx.xx --mode card -q -d 5

The figure below shows the error information in the log.

Run the npu-smi info command to query the environment status. As shown in the following figure, NPU 0 in the environment is disconnected.

Possible Causes

The card-level bandwidth test identifies each NPU by its logic ID (device ID). When an NPU is disconnected, the device IDs at the two ends no longer match, so the peer file cannot be found.

Solution

Run the npu-smi info -m command to query the chip IDs and chip logic IDs. Then run the card-level bandwidth test again, using -d to specify the NPU whose chip logic ID is the same at both ends.

Example:

ascend-dmi -bw -t p2p --sp 1 --ip xx.xx.xx.xx --spp /home/superpod/share --hip xx.xx.xx.xx --mode card -q -d 6
ascend-dmi -bw -t p2p --sp 0 --ip xx.xx.xx.xx --spp /home/superpod/share --hip xx.xx.xx.xx --mode card -q -d 5
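
To make the ID lookup in the solution easier to repeat, the mapping reported by npu-smi info -m can be turned into a small lookup table. The following is a minimal sketch, not part of the tool. It assumes that each data row of the output starts with three whitespace-separated columns (NPU ID, Chip ID, Chip Logic ID) and that rows without a numeric chip logic ID (for example, MCU rows shown as "-") can be skipped. The function name parse_npu_mapping and the SAMPLE text are hypothetical and only illustrate a node where NPU 0 is missing; check the real output on your nodes before relying on it.

```python
def parse_npu_mapping(text):
    """Return {chip_logic_id: (npu_id, chip_id)} from `npu-smi info -m`-style text.

    Assumed layout (not guaranteed): NPU ID, Chip ID, Chip Logic ID, Chip Name.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a data row.
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        npu_id, chip_id, logic_id = fields[0], fields[1], fields[2]
        # Rows without a numeric chip logic ID (e.g. "-") are ignored.
        if not logic_id.isdigit():
            continue
        mapping[int(logic_id)] = (int(npu_id), int(chip_id))
    return mapping


# Hypothetical sample output for a node where NPU 0 is disconnected.
SAMPLE = """\
NPU ID    Chip ID    Chip Logic ID    Chip Name
1         0          0                Ascend
1         1          -                Mcu
2         0          1                Ascend
2         1          -                Mcu
"""

if __name__ == "__main__":
    for logic_id, (npu_id, chip_id) in sorted(parse_npu_mapping(SAMPLE).items()):
        print(f"chip logic ID {logic_id}: NPU ID {npu_id}, chip ID {chip_id}")
```

Running the same parser on the output captured from each node and comparing the two tables side by side shows which IDs differ between the two ends before the -d values are chosen.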