PCIe
The simplest way to confirm a PCIe bottleneck is to increase the load: the PCIe time becomes noticeably longer, while the other times change only slightly.
In the timeline file, the PCIe time appears as model@inputcopy, as shown in the figure.

Actual rate: The summary information is written to the pcie_*.csv file. (For details about how to capture PCIe information, see the document; the related option is --sys-interconnection-profiling.) As shown in the following figure, Tx_p_avg corresponds to the input data, and its average rate is 99.628 MB/s.

Theoretical rate calculation: Assume that the PCIe version is PCIe 5.0. The bidirectional transmission rate of each PCIe lane is close to 8 GB/s (about 4 GB/s in each direction). Generally, the card interface is x4. Therefore, the theoretical maximum rate is close to 32 GB/s bidirectional, or about 16 GB/s in each direction. In practice, the maximum rate generally reaches about 80% of the theoretical bandwidth, and the actual bandwidth depends closely on the shape and size of the data being sent.
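The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (PCIe 5.0, close to 8 GB/s bidirectional per lane, an x4 interface, an 80% practical ceiling); the function names are hypothetical, and 1 GB is taken as 1000 MB.

```python
def theoretical_bandwidth_gb_s(per_lane_gb_s: float, lanes: int) -> float:
    """Theoretical link bandwidth in GB/s: per-lane rate times lane count."""
    return per_lane_gb_s * lanes


def practical_ceiling_gb_s(theoretical_gb_s: float, efficiency: float = 0.8) -> float:
    """Rate that can realistically be reached (80% of theoretical, per the text)."""
    return theoretical_gb_s * efficiency


def utilization(actual_mb_s: float, theoretical_gb_s: float) -> float:
    """Fraction of the theoretical bandwidth in use (assumes 1 GB = 1000 MB)."""
    return actual_mb_s / (theoretical_gb_s * 1000.0)


# Values taken from the text: 8 GB/s per lane (bidirectional), x4 interface,
# and a measured Tx_p_avg of 99.628 MB/s.
theoretical = theoretical_bandwidth_gb_s(per_lane_gb_s=8.0, lanes=4)
ceiling = practical_ceiling_gb_s(theoretical)
used = utilization(actual_mb_s=99.628, theoretical_gb_s=theoretical)

print(f"theoretical: {theoretical:.1f} GB/s")
print(f"practical ceiling: {ceiling:.1f} GB/s")
print(f"utilization: {used:.3%}")
```

With these inputs the measured rate is a very small fraction of the link capacity, which suggests that the link bandwidth itself is not saturated and that the transfer pattern (for example, many small transfers) deserves a closer look.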
Analysis Directions
- Check how many cards share each CPU, and whether measures such as NUMA node binding can reduce the time lost to PCIe contention.
- Increase the batch size so that PCIe transmission header overhead makes up a smaller proportion of each transfer.
- Convert the data from float32 to float16, which halves the amount of data to be transferred. (Check whether the accuracy still meets the requirements or is accepted by the customer.)
- On the host, combine the data so that many small data blocks are transmitted as a few large data blocks. (The GE framework has this function. If the customer calls the aclrt interface directly, the customer needs to implement this function.)
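The last suggestion, combining small blocks on the host, could look roughly like the sketch below. It is a minimal illustration, not the GE framework's implementation: the helper names and the alignment value are assumptions, and the actual host-to-device copy is omitted. The idea is to pack the blocks into one contiguous buffer, record where each block starts, and send that buffer in a single transfer.

```python
from typing import List, Sequence, Tuple


def pack_blocks(blocks: Sequence[bytes],
                alignment: int = 64) -> Tuple[bytearray, List[Tuple[int, int]]]:
    """Pack small blocks into one contiguous host buffer.

    Returns the buffer and a layout list of (offset, length), one per block.
    Each block starts at a multiple of `alignment` (a hypothetical choice).
    """
    buffer = bytearray()
    layout: List[Tuple[int, int]] = []
    for block in blocks:
        # Zero-pad so that the next block starts on an aligned offset.
        padding = (-len(buffer)) % alignment
        buffer.extend(b"\x00" * padding)
        layout.append((len(buffer), len(block)))
        buffer.extend(block)
    return buffer, layout


def unpack_block(buffer: bytes, layout: Sequence[Tuple[int, int]], index: int) -> bytes:
    """Recover one original block from the packed buffer."""
    offset, length = layout[index]
    return bytes(buffer[offset:offset + length])


# Three small inputs become one buffer, so one large copy replaces three small ones.
small_blocks = [b"abc", b"defgh", b"i"]
packed, layout = pack_blocks(small_blocks, alignment=4)
restored = [unpack_block(packed, layout, i) for i in range(len(small_blocks))]
```

The layout list has to be available on the receiving side as well, so that each input can be located inside the combined buffer after the transfer.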