One-Click Stream Test

Function

A one-click stream test sends and receives streams on the NPU's RoCE network port through a loopback external to the NPU (a CDR loopback or a fiber optic circulator).

Test Item: One-click stream test
Supported Stream Test Mode: Stream test in a CDR loopback or fiber optic circulator
Instructions: When the one-click stream test command is executed, Ascend DMI automatically sends and receives streams on all lanes of the specified device. After the test duration elapses, it stops the streams and queries the result.

Test Item: Customized stream test
Supported Stream Test Mode: Stream test in a CDR loopback, in a fiber optic circulator, or through direct NPU connection
Instructions: The customized stream test splits the one-click stream test into separate steps, allowing you to control the TX and RX directions independently and to select the lane to be tested.

Test Principle

After streams are sent in the TX direction on the SerDes port of a specified NPU, data flows are transmitted to the loopback unit (CDR or fiber optic circulator) through the link to be tested, and then sent back through the link in the RX direction. In the RX direction, bit errors generated when data flows pass through the link during the sending and receiving processes are collected to check the link's signal quality.

One-click stream test supports the following two modes:

  • CDR loopback: A single device sends and receives streams at the same time, which is used to check the signal quality between the physical SerDes port of the NPU and the CDR unit. Before the stream test, ensure that the optical module is in position and then configure the CDR loopback.
  • Fiber optic circulator: A single device sends and receives streams at the same time without setting a loopback, which is used to check the signal quality between the physical SerDes port of the NPU and the optical module.
    Figure 1 One-click stream test principle
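
The relationship between the bit errors collected in the RX direction and the reported error rate can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function names are hypothetical, and the per-lane line rate of 53.125 Gbit/s is an assumption inferred from the sample figures in the Example section, not a documented value.

```python
# Assumption: per-lane line rate inferred from the sample output in the
# Example section; it is not a value stated in this document.
ASSUMED_LANE_RATE_BPS = 53.125e9


def bit_error_rate(error_count: int, duration_ms: int,
                   lane_rate_bps: float = ASSUMED_LANE_RATE_BPS) -> float:
    """Return bit errors divided by the bits transmitted during the test."""
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    bits_transmitted = lane_rate_bps * (duration_ms / 1000.0)
    return error_count / bits_transmitted


def as_percent_string(ber: float) -> str:
    """Format a BER as a percentage with 10 decimals, like the sample output."""
    return f"{ber * 100:.10f}%"


# Lane 0 of device 8 in the Example section: 21 errors in 5020 ms.
ber = bit_error_rate(error_count=21, duration_ms=5020)
print(as_percent_string(ber))  # prints 0.0000000079%
```

Under this assumption, 21 errors over about 5 seconds correspond to a BER of roughly 7.9 x 10^-11, far below the 10^-5 limit given in Table 2.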

Application Scenario

The stream test is used to query the signal quality of the RoCE network port. For details about how to locate signal quality problems of the RoCE network port, refer to PRBS Stream Diagnosis.

Preparations

  • The stream test will interrupt training or inference services. Before the test, ensure that no service is running.
  • If an external fiber optic circulator is used for the stream test, no additional configuration is required. If a CDR loopback is used, ensure that the optical module can work properly before configuring the CDR loopback.

Parameters

You can run either of the following commands to list the parameters of the stream test command:

ascend-dmi --prbs-check -h

ascend-dmi --prbs-check --help

Table 1 lists only the test-specific parameters. For details about other common parameters, see Common Parameters.

Table 1 Parameters

Parameter: [-pc, --pc, --prbs-check]
Description: Performs a PRBS stream test.
Mandatory: Yes

Parameter: [-dur, --dur, --duration]
Description: Specifies the duration of a stream test.
  • The value ranges from 3 to 10, in seconds.
  • If the parameter value is not specified, 3 is used by default.
Mandatory: No
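
When the command is generated by a script, the duration range in Table 1 can be checked before the tool is called. The sketch below is illustrative only: build_prbs_command is a hypothetical helper, it uses only the options that appear in this section (--prbs-check, -d, -dur, and -fmt json), and requiring at least one device ID is a choice made for this sketch rather than documented behavior.

```python
from typing import List, Optional, Sequence

MIN_DURATION_S = 3   # lower bound from Table 1
MAX_DURATION_S = 10  # upper bound from Table 1


def build_prbs_command(devices: Sequence[int],
                       duration_s: Optional[int] = None,
                       json_output: bool = False) -> List[str]:
    """Build the argument list for a stream test (hypothetical helper)."""
    if not devices:
        raise ValueError("at least one device ID is required by this sketch")
    cmd = ["ascend-dmi", "--prbs-check",
           "-d", ",".join(str(d) for d in devices)]
    if duration_s is not None:
        # Reject values outside the documented range before calling the tool.
        if not MIN_DURATION_S <= duration_s <= MAX_DURATION_S:
            raise ValueError(
                f"duration must be {MIN_DURATION_S} to {MAX_DURATION_S} seconds")
        cmd += ["-dur", str(duration_s)]
    # If duration_s is None, -dur is omitted and the tool default (3) applies.
    if json_output:
        cmd += ["-fmt", "json"]
    return cmd


print(" ".join(build_prbs_command([8, 9], duration_s=5)))
# prints ascend-dmi --prbs-check -d 8,9 -dur 5
```

Returning a list instead of a single string lets the result be passed directly to subprocess.run without shell quoting. Note that the tool asks for confirmation before it brings the network port down, so a calling script must be prepared to answer that prompt.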

Example

  • Stream test on device 8 and device 9
    [***@***]# ascend-dmi --prbs-check -d 8,9 -dur 5
    This operation will make network port on devices down, please make sure no business is running on devices.
     Do you want to continue?(Y/N)y
    PRBS31 on device 8:
    -----------------------------------------------------------------------------------------------
    lane                error count         error rate          alos                time(ms)
    -----------------------------------------------------------------------------------------------
    0                   21                  0.0000000079%       0                   5020
    1                   12                  0.0000000045%       0                   5020
    2                   34                  0.0000000128%       0                   5014
    3                   4                   0.0000000015%       0                   5015
    -----------------------------------------------------------------------------------------------
    PRBS31 on device 9:
    -----------------------------------------------------------------------------------------------
    lane                error count         error rate          alos                time(ms)
    -----------------------------------------------------------------------------------------------
    0                   24                  0.0000000090%       0                   5033
    1                   71                  0.0000000266%       0                   5027
    2                   12                  0.0000000045%       0                   5026
    3                   70                  0.0000000262%       0                   5023
    -----------------------------------------------------------------------------------------------
  • Output in JSON format

    ascend-dmi -pc -d 9 -dur 5 -fmt json

    If the following information is displayed, the bit error rate (BER) falls in a normal range.

    [***@***]# ascend-dmi -pc -d 9 -dur 5 -fmt json
    This operation will make network port on devices down, please make sure no business is running on devices.
     Do you want to continue?(Y/N)y
    {
        "prbs": [
            {
                "device": 9,
                "pattern": "PRBS31",
                "prbs_result": [
                    {
                        "alos": 0,
                        "error_cnt": 19,
                        "error_rate": "0.0000000071%",
                        "lane": 0,
                        "time": 5018
                    },
                    {
                        "alos": 0,
                        "error_cnt": 194,
                        "error_rate": "0.0000000728%",
                        "lane": 1,
                        "time": 5017
                    },
                    {
                        "alos": 0,
                        "error_cnt": 12,
                        "error_rate": "0.0000000045%",
                        "lane": 2,
                        "time": 5019
                    },
                    {
                        "alos": 0,
                        "error_cnt": 6,
                        "error_rate": "0.0000000023%",
                        "lane": 3,
                        "time": 5017
                    }
                ]
            }
        ]
    }

    The following table describes the parameters in the command output.

    Table 2 Parameters

    Parameter      Description
    device         Logic ID of an NPU.
    lane           Lane ID of a RoCE link.
    error count    Number of bit errors. In JSON format, this parameter is represented by error_cnt. The maximum value is 67092480, which indicates that the number of errors has reached the upper limit.
    error rate     Bit error rate (BER). If the BER is less than 10^-5, the signal quality is normal.
    alos           0: The input signal amplitude is normal. 1: The input signal amplitude is too low.
    time(ms)       Traffic generation duration, in milliseconds. In JSON format, this parameter is represented by time.

    The number of bit errors may reach 67092480 in the following situations:

    • NPU and CDR adaptation is automatically disabled during the stream test. However, running the stream test command multiple times enables and disables adaptation repeatedly. If adaptation does not complete in time, the number of bit errors can occasionally reach 67092480.
    • The CDR loopback is not configured or fails to be configured.
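
The checks described in Table 2 can be applied to the JSON output with a short script. The following is a sketch under stated assumptions, not part of Ascend DMI: the function names are hypothetical, the error_rate string is assumed to be the BER expressed as a percentage (so it is divided by 100 before comparison with 10^-5), and the second lane in the sample data is invented to show a failing case.

```python
import json
from typing import Any, Dict, List

BER_LIMIT = 1e-5                # normal-range limit from Table 2
ERROR_COUNT_CEILING = 67092480  # upper limit of the error counter


def extract_json(raw_output: str) -> Dict[str, Any]:
    """Parse the JSON object that follows the confirmation prompt text."""
    start = raw_output.find("{")
    if start == -1:
        raise ValueError("no JSON object found in output")
    obj, _end = json.JSONDecoder().raw_decode(raw_output, start)
    return obj


def percent_to_ratio(text: str) -> float:
    """Convert a string such as '0.0000000071%' to a plain ratio (assumed)."""
    return float(text.strip().rstrip("%")) / 100.0


def find_problem_lanes(report: Dict[str, Any]) -> List[str]:
    """Return one message per lane that fails any check in Table 2."""
    problems = []
    for entry in report.get("prbs", []):
        for lane in entry.get("prbs_result", []):
            reasons = []
            if lane["error_cnt"] >= ERROR_COUNT_CEILING:
                reasons.append("error counter reached its upper limit")
            if percent_to_ratio(lane["error_rate"]) >= BER_LIMIT:
                reasons.append("BER is not below 1e-5")
            if lane["alos"] != 0:
                reasons.append("input signal amplitude too low")
            if reasons:
                problems.append(
                    f"device {entry['device']} lane {lane['lane']}: "
                    + ", ".join(reasons))
    return problems


# Lane 0 is taken from the example above; lane 1 is invented for illustration.
SAMPLE = """This operation will make network port on devices down, please make sure no business is running on devices.
 Do you want to continue?(Y/N)y
{"prbs": [{"device": 9, "pattern": "PRBS31", "prbs_result": [
  {"alos": 0, "error_cnt": 19, "error_rate": "0.0000000071%", "lane": 0, "time": 5018},
  {"alos": 1, "error_cnt": 67092480, "error_rate": "0.0251700000%", "lane": 1, "time": 5017}
]}]}
"""

for message in find_problem_lanes(extract_json(SAMPLE)):
    print(message)
```

An empty result means that every lane passed all three checks. If a lane reports 67092480 errors, check the two situations listed above before concluding that the link is faulty.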

Follow-up Procedure

  • To avoid affecting training or inference services, disable the stream test after it is finished.
  • If the CDR loopback is used for the stream test, release the CDR loopback after the test. Otherwise, services cannot run properly.