HCCL_BUFFSIZE
Description
Sets the size, in MB, of the shared data buffer used by the communicator. The value must be an integer greater than or equal to 1. The default value is 200.
In collective communication, each communicator allocates buffer space sized by HCCL_BUFFSIZE. When a cluster uses many communicators, total buffer usage grows and may leave too little memory for model data. In this case, decrease the value of this environment variable to reduce the buffer space each communicator occupies. Conversely, if the model data size is small but the communication data size is large, increase the value to give each communicator more buffer space and improve communication efficiency.
The recommended value for LLMs is as follows:
(MicrobatchSize × SequenceLength × hiddenSize × sizeof(DataType)) / (1024 × 1024), rounded up to an integer. Here, sizeof(DataType) is the size in bytes of one element of the data type.
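As a worked illustration of the formula above, the following Python sketch computes the recommended value. The function name `recommended_buffsize_mb` and the sample model dimensions are hypothetical and chosen only for illustration. The `bytes_per_element` argument stands in for sizeof(DataType), for example 2 for FP16 or BF16 and 4 for FP32.

```python
def recommended_buffsize_mb(micro_batch_size, sequence_length, hidden_size, bytes_per_element):
    """Return the recommended HCCL_BUFFSIZE in MB, rounded up to an integer."""
    total_bytes = micro_batch_size * sequence_length * hidden_size * bytes_per_element
    bytes_per_mb = 1024 * 1024
    # Integer ceiling division avoids floating-point rounding issues.
    rounded_up = -(-total_bytes // bytes_per_mb)
    # HCCL_BUFFSIZE must be an integer greater than or equal to 1.
    return max(1, rounded_up)


# Hypothetical example: micro-batch 1, sequence length 4096, hidden size 8192, FP16 (2 bytes).
print(recommended_buffsize_mb(1, 4096, 8192, 2))  # 64
```

With these sample dimensions the result is 64, which would be applied with `export HCCL_BUFFSIZE=64`.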
This environment variable applies to the following scenarios:
- Dynamic shape network scenario
- Scenario where developers call the C language APIs of the HCCL for framework interconnection
- The memory requested by this environment variable is used exclusively by HCCL and cannot be reused by other services.
- Each communicator occupies 2 × HCCL_BUFFSIZE of memory: one buffer for sending and one for receiving.
- Buffers are managed per communicator. Each communicator has exclusive use of its own 2 × HCCL_BUFFSIZE of memory, so that concurrent operators in different communicators do not affect each other.
- For a collective communication operator, performance may deteriorate when the data size exceeds the value of HCCL_BUFFSIZE. It is recommended that the value of HCCL_BUFFSIZE be greater than the data size.
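The memory accounting in the notes above can be sketched in Python. This is a rough estimate only: the helper names are hypothetical, and summing over communicators assumes that all of them allocate their buffers from the same device memory.

```python
import os

DEFAULT_BUFFSIZE_MB = 200  # documented default for HCCL_BUFFSIZE


def buffsize_mb_from_env(environ=None):
    """Read HCCL_BUFFSIZE (in MB), falling back to the documented default."""
    env = os.environ if environ is None else environ
    value = int(env.get("HCCL_BUFFSIZE", DEFAULT_BUFFSIZE_MB))
    if value < 1:
        raise ValueError("HCCL_BUFFSIZE must be an integer >= 1")
    return value


def total_buffer_mb(buffsize_mb, num_communicators):
    """Estimate total buffer memory: 2 x HCCL_BUFFSIZE per communicator."""
    # Each communicator holds one send buffer and one receive buffer.
    return 2 * buffsize_mb * num_communicators


def exceeds_buffer(data_bytes, buffsize_mb):
    """Return True if one operator's data is larger than HCCL_BUFFSIZE."""
    return data_bytes > buffsize_mb * 1024 * 1024


# Hypothetical example: 8 communicators at the default size.
print(total_buffer_mb(200, 8))  # 3200
# A 256 MB payload is larger than a 200 MB buffer, so performance may deteriorate.
print(exceeds_buffer(256 * 1024 * 1024, 200))  # True
```

Under these assumptions, 8 communicators at the default setting occupy about 3200 MB in total, which shows why lowering the value can matter when many communicators exist.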
Example
export HCCL_BUFFSIZE=200
Restrictions
If you call the HCCL C APIs to initialize a communicator with specific configurations and specify the shared data buffer size using the hcclBufferSize parameter of HcclCommConfig, the communicator's configuration takes precedence over this environment variable.
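The precedence described above can be modeled as a small resolution function. This Python sketch is not HCCL code: the function name is hypothetical, and using `None` to mean "not specified" is an assumption made for illustration, not a description of how HCCL represents an unset value.

```python
from typing import Optional

DEFAULT_BUFFSIZE_MB = 200  # documented default for HCCL_BUFFSIZE


def effective_buffsize_mb(comm_config_mb: Optional[int], env_value: Optional[str]) -> int:
    """Resolve the buffer size in MB for one communicator.

    comm_config_mb: value given through hcclBufferSize, or None if not specified.
    env_value: raw HCCL_BUFFSIZE string, or None if the variable is not set.
    """
    # A per-communicator configuration takes precedence over the environment variable.
    if comm_config_mb is not None:
        return comm_config_mb
    if env_value is not None:
        return int(env_value)
    return DEFAULT_BUFFSIZE_MB


print(effective_buffsize_mb(512, "200"))   # 512
print(effective_buffsize_mb(None, "100"))  # 100
print(effective_buffsize_mb(None, None))   # 200
```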