Eye Diagram Test

Ascend DMI supports the eye diagram test on the network to query the current signal quality.

This function returns detailed signal quality data. To check whether the signal quality of the current port is normal, use the signal quality diagnosis function instead. For details, see Eye Diagram Diagnosis.

Function

Query the signal quality of the PCIe, HCCS, and RoCE communication ports on the NPU.

Parameters

You can run either of the following commands to view the available parameters of the signal quality query command:

ascend-dmi --sq -h

ascend-dmi --sq --help

Table 1 lists only the test-specific parameters. For details about the other common parameters, see Common Parameters.

Table 1 Parameters

Parameter

Description

Mandatory

[-sq, --sq, --signal-quality]

Queries the signal quality of the PCIe, HCCS, and RoCE communication ports on the NPU and of the HCCS communication ports on the CPU.

Yes

[-d, --device]

Specifies the device ID of the NPU or CPU to be queried. If multiple processors are specified, use commas (,) to separate them. If this parameter is not specified, all NPUs or CPUs on the device are queried by default.

  • For the Atlas A2 training product, Atlas 800I A2 inference product, and A200I A2 Box heterogeneous component, if the HCCS type is specified, two devices must be specified. For the Atlas 200T A2 Box16/Atlas 200I A2 Box16 heterogeneous subrack, if the HCCS type is specified, at least two devices from the first eight NPUs or from the last eight NPUs must be specified.
  • For the Atlas 300I Duo inference card, if type is set to pcie, this parameter cannot specify the secondary chip.

No

[-t, --type]

Specifies the type of the communication port. Currently, PCIe, HCCS, and RoCE are supported. Use commas (,) to separate multiple communication port types.

  • For the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, Atlas 900 A2 PoDc cluster basic unit, Atlas 200T A2 Box16/Atlas 200I A2 Box16 heterogeneous subrack, Atlas 800I A2 inference server, A200I A2 Box heterogeneous component, and A200T A3 Box8 SuperPoD Server, the PCIe and RoCE signal quality are queried by default.
  • For the Atlas 800I A3 SuperPoD Server, Atlas 9000 A3 SuperPoD, and Atlas 900 A3 SuperPoD, the RoCE signal quality is queried by default.
  • For servers equipped with Atlas 300I Duo inference cards, the primary chip supports both PCIe and HCCS, and the secondary chip supports only HCCS. By default, the PCIe signal quality is queried.
  • The Atlas 800I A2 inference server (32 GB PCIe) does not support the query of the HCCS signal quality.
  • The Atlas 300I A2 inference card supports only the query of the PCIe signal quality.

The options are as follows:

  • pcie: A PCIe link carries communication between the NPU and the CPU. The PCIe signal quality is determined by the quad-eye diagram values of the macro port connected to the PCIe link on the NPU.
  • hccs: An HCCS link is set up between multiple NPUs. The HCCS signal quality is determined by the signal-to-noise ratio (SNR) and half-eye height of the macro port connected to the HCCS link on the NPU.
  • roce: A RoCE link is used by NPUs for external cluster communication. The RoCE signal quality is determined by the SNR and half-eye height of the macro port connected to the RoCE link on the NPU.

No

[-m, --module]

Specifies whether to query the CPU or the NPU eye diagram. If this parameter is not specified, the NPU eye diagram is queried by default. This parameter is supported only by the Atlas A3 training product and Atlas A3 inference product.

The options are as follows:

  • cpu: queries the signal quality of the HCCS link connected to the CPU.
  • npu: queries the signal quality of the PCIe, HCCS, and RoCE links connected to the NPU.

No

[-r, --result]

Specifies the output path for the full eye diagram results, for example, /test. The specified path must meet security requirements and cannot contain the wildcard (*).

  • If you specify a path for saving the result, the ascend_check folder is created in the specified path under the root directory. If you do not specify a path, the result is saved in the default path /var/log/ascend_check.
  • For security purposes, you can set the permission on the ascend_check directory to 700 so that only the directory owner can read, write, or enter it.

No
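
As a rough illustration of how the options in Table 1 combine on the command line, the following Python sketch assembles an argument list for ascend-dmi --sq. The helper name build_sq_command and its keyword arguments are hypothetical and are not part of Ascend DMI. Only the flags that appear in this section (--sq, -t, -d, -m, -r, and --fmt json) are used, and product-specific rules, such as the two-device requirement for HCCS, are not enforced.

```python
from typing import Iterable, List, Optional

VALID_TYPES = ("pcie", "hccs", "roce")   # options of -t listed in Table 1
VALID_MODULES = ("cpu", "npu")           # options of -m listed in Table 1


def build_sq_command(
    types: Optional[Iterable[str]] = None,
    devices: Optional[Iterable[int]] = None,
    module: Optional[str] = None,
    result_dir: Optional[str] = None,
    json_output: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble an `ascend-dmi --sq` argument list."""
    cmd = ["ascend-dmi", "--sq"]

    # Multiple port types are joined with commas, as Table 1 requires.
    type_list = [t.lower() for t in types] if types is not None else []
    unknown = [t for t in type_list if t not in VALID_TYPES]
    if unknown:
        raise ValueError(f"unsupported port type(s): {unknown}")
    if type_list:
        cmd += ["-t", ",".join(type_list)]

    # Multiple device IDs are joined with commas as well.
    device_list = [str(d) for d in devices] if devices is not None else []
    if device_list:
        cmd += ["-d", ",".join(device_list)]

    if module is not None:
        if module not in VALID_MODULES:
            raise ValueError(f"unsupported module: {module!r}")
        cmd += ["-m", module]

    if result_dir is not None:
        # Table 1 states that the result path cannot contain the wildcard (*).
        if "*" in result_dir:
            raise ValueError("the result path cannot contain the wildcard (*)")
        cmd += ["-r", result_dir]

    if json_output:
        cmd += ["--fmt", "json"]
    return cmd


if __name__ == "__main__":
    # Prints: ascend-dmi --sq -t hccs,pcie,roce -d 0,1
    print(" ".join(build_sq_command(["hccs", "pcie", "roce"], [0, 1])))
```

On a host where ascend-dmi is installed, the returned list could be passed to subprocess.run to execute the query.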

Example

The command output on an inference server is similar to that on a training server. The following uses the screenshots on a training server as an example.

  • PCIe, HCCS, and RoCE signal quality of device 0 and device 1

    ascend-dmi --sq -t hccs,pcie,roce -d 0,1

    If information shown in Figure 1 is displayed, the tool is running properly.

    Figure 1 Example of device signal quality detection

  • Output in JSON format

    ascend-dmi --signal-quality -t roce -d 0 --fmt json

    If information shown in Figure 2 is displayed, the tool is running properly.

    Figure 2 JSON output example of device signal quality detection

    The following tables describe the parameters displayed in Figure 1.

    Table 2 Parameters of the HCCS signal quality detection

    Parameter          Description
    type               Type of the communication port
    device             Logical ID of an NPU
    M* (macro port)    Macro port number. For example, M0 and M1 indicate macro ports 0 and 1, respectively.
    L* (LANE)          Lane number in an HCCS link. For example, L0 and L1 indicate lane 0 and lane 1, respectively.
    S (SNR)            SNR of a lane
    H (HEH)            Half-eye height of a lane
    B/T/L/R            Values of the bottom, top, left, and right positions of the quad-eye diagram

    Description:

    • In the HCCS signal quality command output, if SNR ≥ 400000 and HEH ≥ 350, the lane signal quality is normal.
    • If the SNR and HEH values are not within the preceding ranges, the HCCS signal quality is abnormal. In this case, check whether the macro connector is loose and whether the link is dirty.
    • If the values of SNR and HEH are 0, no HCCS link is established between the specified devices.
    • In the command output of the NPU's HCCS signal quality on the Atlas 300I Duo inference card, or of the CPU's HCCS signal quality on the Atlas 900 A3 SuperPoD, Atlas 800I A3 SuperPoD Server, and Atlas 9000 A3 SuperPoD, only type, device, M* (macro port), L* (LANE), and B/T/L/R are displayed. The B/T/L/R requirements are B(bottom) ≤ -30, T(top) ≥ 30, L(left) ≤ -5, and R(right) ≥ 5.
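
The HCCS rules above can be sketched as a small Python check. This is only an illustration of the stated thresholds: the function name classify_hccs_lane and the returned labels are hypothetical, and the quad-eye (B/T/L/R) variant used by the Atlas 300I Duo and the A3 CPU output is not covered.

```python
HCCS_MIN_SNR = 400000  # SNR threshold stated in the description above
HCCS_MIN_HEH = 350     # half-eye height threshold stated in the description above


def classify_hccs_lane(snr: int, heh: int) -> str:
    """Hypothetical helper: label one HCCS lane from its SNR and HEH values."""
    if snr == 0 and heh == 0:
        # Both values being 0 means no HCCS link is established.
        return "no-link"
    if snr >= HCCS_MIN_SNR and heh >= HCCS_MIN_HEH:
        return "normal"
    # Anything else is outside the normal range: check connector and link.
    return "abnormal"
```
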
    Table 3 Parameters of the PCIe signal quality detection

    Parameter          Description
    type               Type of the communication port
    device             Logical ID of an NPU
    M* (macro port)    Macro port number. For example, M9 and M10 indicate macro ports 9 and 10, respectively.
    L* (LANE)          Lane number in a PCIe link. For example, L0 and L1 indicate lane 0 and lane 1, respectively.
    B/T/L/R            Values of the bottom, top, left, and right positions of the quad-eye diagram

    Description:

    • In the PCIe signal quality command output, if B(bottom) ≤ -17, T(top) ≥ 17, L(left) ≤ -3, and R(right) ≥ 3 (all values must meet the requirements), the lane signal quality is normal.
    • If the values of the quad-eye diagram are not within the preceding range, the PCIe signal quality is abnormal. In this case, check whether the macro connector is loose and whether the link is dirty.
    • For the Atlas 300I Duo inference card, the B/T/L/R values must instead meet the following requirements: B(bottom) ≤ -30, T(top) ≥ 30, L(left) ≤ -5, and R(right) ≥ 5.
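
The quad-eye rule above requires all four values to pass at once. A minimal Python sketch of that check follows. The names QuadEyeLimits, DEFAULT_LIMITS, DUO_LIMITS, and quad_eye_is_normal are hypothetical, and the limits are copied from the description above.

```python
from typing import NamedTuple


class QuadEyeLimits(NamedTuple):
    bottom_max: int  # B must be less than or equal to this value
    top_min: int     # T must be greater than or equal to this value
    left_max: int    # L must be less than or equal to this value
    right_min: int   # R must be greater than or equal to this value


DEFAULT_LIMITS = QuadEyeLimits(-17, 17, -3, 3)  # general PCIe limits
DUO_LIMITS = QuadEyeLimits(-30, 30, -5, 5)      # Atlas 300I Duo limits


def quad_eye_is_normal(
    bottom: int,
    top: int,
    left: int,
    right: int,
    limits: QuadEyeLimits = DEFAULT_LIMITS,
) -> bool:
    """Hypothetical helper: True only if all four quad-eye values pass."""
    return (
        bottom <= limits.bottom_max
        and top >= limits.top_min
        and left <= limits.left_max
        and right >= limits.right_min
    )
```

A lane that passes the general limits can still fail the stricter Atlas 300I Duo limits, so the limits are passed explicitly rather than inferred.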
    Table 4 Parameters of the RoCE signal quality detection

    Parameter          Description
    type               Type of the communication port
    device             Logical ID of an NPU
    M* (macro port)    Macro port number. For example, M0 indicates macro port 0.
    S (SNR)            SNR of a lane
    H (HEH)            Half-eye height of a lane
    L* (LANE)          Lane number in a RoCE link. For example, L0 and L1 indicate lane 0 and lane 1, respectively.

    Description:

    • In the RoCE signal quality output:
      • For 100G optical modules, the lane signal quality is normal if SNR ≥ 260000 and HEH ≥ 350.
      • For 200G optical modules, the lane signal quality is normal if SNR ≥ 400000 and HEH ≥ 350.
    • If the SNR and HEH values are not within the preceding ranges, the RoCE signal quality is abnormal. In this case, check whether the macro connector is loose and whether the link is dirty.
    • If the values of SNR and HEH are 0, no RoCE link is established between the specified devices.
    Example of the command output when the SNR and HEH values are 0:
    [root@*****~]#  ascend-dmi --sq -t roce
    type: roce
    Prompt message: M*: macro port, L*: lane, S: SNR, H: HEH
    100G Optical Normal range: S(SNR) >= 260000 and H(HEH) >= 350
    200G Optical Normal range: S(SNR) >= 400000 and H(HEH) >= 350
    ----------------------------------------------------------------------------------------------
        device                  signal-to-noise ratio
    ----------------------------------------------------------------------------------------------
         0                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         1                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         2                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         3                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         4                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         5                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         6                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
         7                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                        L2: S:0      H:0         L3: S:0      H:0
    ----------------------------------------------------------------------------------------------
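
Text output like the listing above can be post-processed with a short script. The following Python sketch is modeled only on the layout shown in that listing and may need adjusting for other products or port types. The names parse_sq_lanes and classify_roce_lane are hypothetical, the thresholds are the RoCE values stated above, and the embedded sample abbreviates the separator lines.

```python
import re
from typing import List, Optional, Tuple

ROCE_MIN_SNR = {"100G": 260000, "200G": 400000}  # per optical module speed
ROCE_MIN_HEH = 350

_DEVICE = re.compile(r"^\s*(\d+)\s")                   # row that starts a device
_MACRO = re.compile(r"\bM(\d+):")                      # macro port label, e.g. M0:
_LANE = re.compile(r"\bL(\d+):\s*S:(\d+)\s+H:(\d+)")   # lane with SNR and HEH

Lane = Tuple[int, int, int, int, int]  # (device, macro, lane, snr, heh)


def parse_sq_lanes(text: str) -> List[Lane]:
    """Hypothetical helper: extract lane readings from the text output."""
    lanes: List[Lane] = []
    device: Optional[int] = None
    macro: Optional[int] = None
    for line in text.splitlines():
        found = _DEVICE.match(line)
        if found:
            device = int(found.group(1))
        found = _MACRO.search(line)
        if found:
            macro = int(found.group(1))
        # Continuation rows carry lanes only and reuse the last device/macro.
        for lane, snr, heh in _LANE.findall(line):
            if device is not None and macro is not None:
                lanes.append((device, macro, int(lane), int(snr), int(heh)))
    return lanes


def classify_roce_lane(snr: int, heh: int, module_speed: str = "100G") -> str:
    """Hypothetical helper: apply the RoCE thresholds to one lane."""
    if snr == 0 and heh == 0:
        return "no-link"
    if snr >= ROCE_MIN_SNR[module_speed] and heh >= ROCE_MIN_HEH:
        return "normal"
    return "abnormal"


SAMPLE = """\
type: roce
Prompt message: M*: macro port, L*: lane, S: SNR, H: HEH
100G Optical Normal range: S(SNR) >= 260000 and H(HEH) >= 350
----------
    device                  signal-to-noise ratio
----------
     0                       M0:    L0: S:0      H:0         L1: S:0      H:0
                                    L2: S:0      H:0         L3: S:0      H:0
----------
"""

if __name__ == "__main__":
    for device, macro, lane, snr, heh in parse_sq_lanes(SAMPLE):
        print(device, macro, lane, classify_roce_lane(snr, heh))
```

The optical module speed is passed in by the caller because the listing itself does not indicate which module is installed.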
  • PCIe signal quality of device 0, with the full eye diagram result saved to a specified path

    ascend-dmi -sq -t pcie -d 0 -r /home/

    If information shown in Figure 3 is displayed, the tool is running properly.

    Figure 3 Full eye diagram test on device signal quality

    The full eye diagram result is saved as a CSV file. You are advised to plot the data in the file as a scatter chart and examine the distribution of the sampling points. Figure 4 shows a normal full eye diagram, and Figure 5 shows an abnormal one.

    Figure 4 Full eye diagram with normal signal quality
    Figure 5 Full eye diagram with abnormal signal quality
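
One possible way to produce such a scatter chart is sketched below in Python. The layout of the CSV file is not described in this section, so the column positions (x_col and y_col) are assumptions that must be adapted to the actual file. The helper names load_points and save_scatter are hypothetical, and plotting requires the third-party matplotlib package.

```python
import csv
from typing import Iterable, List, Tuple


def load_points(
    lines: Iterable[str], x_col: int = 0, y_col: int = 1
) -> List[Tuple[float, float]]:
    """Hypothetical helper: read (x, y) sampling points from CSV lines.

    The column indices are assumptions; rows that are not numeric in the
    chosen columns (for example, a header row) are skipped.
    """
    points: List[Tuple[float, float]] = []
    for row in csv.reader(lines):
        try:
            points.append((float(row[x_col]), float(row[y_col])))
        except (ValueError, IndexError):
            continue
    return points


def save_scatter(points: List[Tuple[float, float]], png_path: str) -> None:
    """Hypothetical helper: write the points to a PNG scatter chart."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([p[0] for p in points], [p[1] for p in points], s=4)
    ax.set_xlabel("x (assumed column)")
    ax.set_ylabel("y (assumed column)")
    fig.savefig(png_path)
    plt.close(fig)
```

A typical call would open the saved CSV file with open(path, newline="") and pass the file object to load_points, then hand the result to save_scatter.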