NPU Memory Is Insufficient When train_and_evaluate Is Executed in Estimator Mode

Symptom

When train_and_evaluate is executed in Estimator mode, the following error message is displayed, indicating insufficient device memory: "Sum of total mem_offset:26496001536 and var_mem_size:11776003072 is greater than memory manager malloc max size 33285996544". The two reported values add up to 38272004608 bytes, which exceeds the 33285996544-byte limit.
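The arithmetic behind the message can be checked with a short script. This is only a sketch: the helper name parse_memory_error and the regular expression are illustrative assumptions based on the single message quoted above, not part of any SDK, and other versions may word the message differently.

```python
import re

# Error text copied verbatim from the symptom above.
ERROR = (
    "Sum of total mem_offset:26496001536 and var_mem_size:11776003072 "
    "is greater than memory manager malloc max size 33285996544"
)


def parse_memory_error(message):
    """Return (mem_offset, var_mem_size, max_size) in bytes, or None.

    Hypothetical helper: the pattern is derived only from the message
    quoted in this section.
    """
    match = re.search(
        r"mem_offset:(\d+) and var_mem_size:(\d+) .*max size (\d+)", message
    )
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


mem_offset, var_mem_size, max_size = parse_memory_error(ERROR)
requested = mem_offset + var_mem_size   # total memory the job asks for
shortfall = requested - max_size        # how far the request exceeds the limit

print(f"requested: {requested} bytes")
print(f"limit:     {max_size} bytes")
print(f"shortfall: {shortfall} bytes ({shortfall / 2**30:.2f} GiB)")
```

The shortfall gives a rough idea of how much memory has to be freed, for example by avoiding a second copy of the table or by reducing the batch size.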

Possible Cause

If dynamic capacity expansion is not enabled, the train_and_evaluate mode of Estimator recreates the table each time the mode is switched from train to eval. If the table is too large, the device memory may be insufficient.
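To see why a recreated table can push the job over the limit, the following rough sizing sketch may help. Everything in it is an assumption made for illustration: the table shape (20,000,000 rows by 64 float32 values) is hypothetical, the table is treated as the only contributor to var_mem_size, and the old and new copies are assumed to occupy memory at the same time during the switch. Only the mem_offset and limit values come from the error message above.

```python
# Values taken from the error message in the Symptom section.
MAX_SIZE = 33285996544    # memory manager malloc max size, in bytes
MEM_OFFSET = 26496001536  # total mem_offset, in bytes


def table_bytes(rows, dim, bytes_per_value=4):
    """Size of a dense rows x dim table; 4 bytes per value assumes float32."""
    return rows * dim * bytes_per_value


def fits(mem_offset, var_mem_size, max_size=MAX_SIZE):
    """Same comparison the error message reports: the sum must not exceed the limit."""
    return mem_offset + var_mem_size <= max_size


# Hypothetical table shape, chosen only to illustrate the effect.
rows, dim = 20_000_000, 64
one_copy = table_bytes(rows, dim)

# Assumption: a recreated table briefly means two copies on the device.
print("one copy fits:  ", fits(MEM_OFFSET, one_copy))
print("two copies fit: ", fits(MEM_OFFSET, 2 * one_copy))
```

Under these assumptions a single copy of the table stays within the limit while two copies do not, which matches the behavior described above.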

Solution

Enable dynamic capacity expansion to avoid this problem. In this mode, tables are created only once, so they are not recreated when the mode is switched from train to eval. Alternatively, reduce the batch size.