HCCL_BUFFSIZE

Description

Sets the size of the buffer for sharing data between two NPUs. The unit is MB. The value must be an integer greater than or equal to 1. The default value is 200.

In a collective communication network, each HCCL communicator allocates buffer memory sized by HCCL_BUFFSIZE. If the cluster network contains many HCCL communicators, total buffer usage grows and may leave too little memory for model data. In this case, lower HCCL_BUFFSIZE to reduce the buffer size occupied by each communicator. Conversely, if the model data size of the service is small but the communication data size is large, raise HCCL_BUFFSIZE to enlarge the buffer of each HCCL communicator and improve data communication efficiency.

The recommended value for LLMs is as follows:

(MicrobatchSize × SequenceLength × HiddenSize × sizeof(DataType)) / (1024 × 1024), rounded up to the next integer.
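
The formula above can be evaluated directly in the shell. The following is a minimal sketch, not an official tool: the helper name hccl_buffsize_mb and the example shape (micro-batch size 1, sequence length 4096, hidden size 4096, 2-byte elements) are assumptions chosen for illustration.

```shell
# Hypothetical helper: recommended HCCL_BUFFSIZE in MB for the given
# micro-batch size, sequence length, hidden size, and bytes per element.
hccl_buffsize_mb() {
    bytes=$(( $1 * $2 * $3 * $4 ))
    mb=$(( 1024 * 1024 ))
    # Integer ceiling division: round up to the next whole MB.
    echo $(( (bytes + mb - 1) / mb ))
}

# Assumed example shape: micro-batch 1, sequence length 4096,
# hidden size 4096, 2-byte elements (for example FP16).
HCCL_BUFFSIZE=$(hccl_buffsize_mb 1 4096 4096 2)
export HCCL_BUFFSIZE
echo "HCCL_BUFFSIZE=${HCCL_BUFFSIZE}"   # prints HCCL_BUFFSIZE=32
```

The ceiling division ensures that a payload of even one byte over a whole number of MB is rounded up, so the result always satisfies the "integer greater than or equal to 1" requirement for non-empty inputs.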

Note the following when using this environment variable:
  • The memory requested through this environment variable is used exclusively by HCCL and cannot be reused by other services.
  • Each communicator occupies 2 × HCCL_BUFFSIZE of memory: one buffer for sending and one for receiving.
  • The resource is managed per communicator. Each communicator exclusively owns its own 2 × HCCL_BUFFSIZE of memory, so that operators running concurrently in different communicators do not affect each other.
  • For collective communication operators, performance may deteriorate when the data size exceeds HCCL_BUFFSIZE. Set HCCL_BUFFSIZE to a value greater than the data size where possible.
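
To weigh a larger buffer against the memory it consumes, the per-communicator cost described in the notes can be totalled. This is a rough sketch under stated assumptions: the helper name hccl_buffer_total_mb is hypothetical, and the count of 4 communicators is an illustrative figure, not a value taken from any real job.

```shell
# Hypothetical helper: total HCCL buffer memory in MB for a given number of
# communicators, using 2 x HCCL_BUFFSIZE per communicator.
hccl_buffer_total_mb() {
    echo $(( $1 * 2 * $2 ))
}

# Assumed example: 4 communicators with the default value of 200 MB.
total_mb=$(hccl_buffer_total_mb 4 200)
echo "HCCL buffers: ${total_mb} MB"   # prints HCCL buffers: 1600 MB
```

Because this memory is reserved for HCCL alone, the total is effectively subtracted from what remains available for model data.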

Example

export HCCL_BUFFSIZE=200

Restrictions

None

Applicability

Atlas Training Series Product