Dynamic Capacity Expansion Mode of the On-Chip Memory

TensorFlow stores embeddings in variables. You must estimate the size of each table and then create the variables by calling create_table. Once set, the size of an embedding table cannot be increased or reduced, which can either waste NPU memory or leave the table short of space. In recommendation scenarios, the sizes of multiple sparse tables cannot be estimated in advance. To address this, automatic capacity expansion has been introduced for sparse tables in on-chip memory: the NPU memory used by the tables grows as model training proceeds.
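
The difference between a fixed-size table and an expanding one can be illustrated with a small, framework-independent sketch. The FixedTable and ExpandingTable classes and the capacity-doubling policy are assumptions made for illustration only; they do not describe how mx_rec allocates or grows NPU memory.

```python
class FixedTable:
    """Embedding table whose capacity is fixed at creation time."""

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.dim = dim
        self.rows = {}  # feature id -> embedding vector

    def lookup(self, key):
        # Allocate a zero row for an unseen key, if there is room left.
        if key not in self.rows:
            if len(self.rows) >= self.capacity:
                raise MemoryError("table is full; capacity cannot be changed")
            self.rows[key] = [0.0] * self.dim
        return self.rows[key]


class ExpandingTable(FixedTable):
    """Same table, but it grows when it runs out of space (assumed policy)."""

    def lookup(self, key):
        if key not in self.rows and len(self.rows) >= self.capacity:
            # Illustrative policy: double the capacity instead of failing.
            self.capacity = max(1, self.capacity * 2)
        return super().lookup(key)


fixed = FixedTable(capacity=2, dim=4)
growing = ExpandingTable(capacity=2, dim=4)
for feature_id in (10, 20, 30):
    growing.lookup(feature_id)  # the third new id triggers an expansion
```

With a fixed table, the third new feature ID would raise an error unless the capacity had been over-provisioned up front; the expanding table absorbs it by growing.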

In this mode, feature eviction is not supported.

Training Process

This part describes how to use the dynamic capacity expansion mode for training. For details, see Figure 1.

Figure 1 Training process in dynamic capacity expansion mode of the on-chip memory

The training process contains two phases. For details about the sample code of the overall process, see Sample Code.

Model Adaptation
Training Startup

Model Adaptation

The key steps are as follows:

1. Initialize the framework.
   Call the init API and set use_dynamic_expansion to True to enable dynamic capacity expansion. (The default value of this parameter is False.)
2. Import a sparse optimizer.
   Call the create_hash_optimizer_by_address API of the corresponding optimizer in the mx_rec.optimizers package to create the sparse optimizer (sparse_optimizer). The available optimizers are:
   - SGDByAddr
   - LazyAdamByAddress
3. Obtain the embedding representation result (emb) and mapping address (addr).
   Use the tf.get_collection("ASCEND_SPARSE_LOOKUP_LOCAL_EMB") API to obtain the embedding representation result used for training, and use the tf.get_collection("ASCEND_SPARSE_LOOKUP_ID_OFFSET") API to obtain the mapping address used for training.
4. Perform backward gradient calculation.
   Use the tf.gradients(loss, emb) API to compute the gradient (grad) of the loss with respect to the embedding representation result obtained in step 3.
5. Perform the backward sparse table update.
   Call sparse_optimizer.apply_gradients([(grad, addr)]) on the sparse optimizer created in step 2 to update the sparse table entries at the mapping address.
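
The idea behind the last three steps (look up rows by address, compute the gradient with respect to the looked-up rows, then update only those rows) can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions: the table is a dictionary keyed by integer address, the loss is a toy sum of squares, and sgd_update_by_address is a hypothetical helper rather than an mx_rec API.

```python
def sgd_update_by_address(table, addresses, grads, learning_rate):
    """Apply one SGD step to the rows of `table` selected by `addresses`."""
    for addr, grad in zip(addresses, grads):
        row = table[addr]
        for i, g in enumerate(grad):
            row[i] -= learning_rate * g


# Toy sparse table: address -> embedding row.
table = {0: [1.0, 1.0], 1: [2.0, 2.0], 2: [3.0, 3.0]}

addr = [0, 2]                              # mapping addresses for this batch
emb = [list(table[a]) for a in addr]       # looked-up embeddings (copies)

# Toy loss = 0.5 * sum(emb ** 2), so d(loss)/d(emb) equals emb itself.
grad = [list(e) for e in emb]

sgd_update_by_address(table, addr, grad, learning_rate=0.5)
# Rows 0 and 2 are halved; row 1 was not looked up and stays unchanged.
```

Only the rows that took part in the lookup are touched, which is why the optimizer needs the addresses alongside the gradients.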

Training Startup

Start model training.
After model training is complete, call the terminate_config_initializer API to release resources.

Sample Code

Initialize the framework.

import os

# Read the switch from the environment; unset or "0" disables dynamic expansion.
use_dynamic_expansion = bool(int(os.getenv("USE_DYNAMIC_EXPANSION", "0")))
init(use_mpi, train_steps=args.train_steps, eval_steps=args.eval_steps,
     use_dynamic_expansion=use_dynamic_expansion)
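
Because the snippet above parses the environment variable with int(), a value such as "true" raises ValueError. If a more tolerant parse is wanted, something like the following sketch could be used. read_bool_env is a hypothetical helper, not part of mx_rec, and the accepted spellings are an assumption.

```python
import os


def read_bool_env(name, default=False):
    """Interpret an environment variable as a boolean switch."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default  # unset or empty: fall back to the default
    return raw.strip().lower() in ("1", "true", "yes", "on")


use_dynamic_expansion = read_bool_env("USE_DYNAMIC_EXPANSION")
```

Any value outside the accepted spellings is treated as False instead of raising an error.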

Import a sparse optimizer.

def get_dense_and_sparse_optimizer(cfg):
    dense_optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=cfg.learning_rate)
    use_dynamic_expansion = get_use_dynamic_expansion()
    if use_dynamic_expansion:
        sparse_optimizer = create_hash_optimizer_by_address(learning_rate=cfg.learning_rate)
        logging.info("optimizer lazy_adam_by_addr")
    else:
        sparse_optimizer = create_hash_optimizer(learning_rate=cfg.learning_rate)
        logging.info("optimizer lazy_adam")
    return dense_optimizer, sparse_optimizer

Obtain the embedding representation result and the mapping address.

train_emb_list = tf.compat.v1.get_collection(ASCEND_SPARSE_LOOKUP_LOCAL_EMB)
train_address_list = tf.compat.v1.get_collection(ASCEND_SPARSE_LOOKUP_ID_OFFSET)

Perform backward gradient calculations.

local_grads = tf.gradients(loss, train_emb_list)  # local_embedding

Perform backward sparse table update.
grads_and_vars = [(grad, address) for grad, address in zip(local_grads, train_address_list)]
train_ops.append(sparse_optimizer.apply_gradients(grads_and_vars))
- When sparse_optimizer.apply_gradients(grads_and_vars) is called to apply the gradients, if a var (such as address) is a tensor rather than a variable, ensure that its dimension is the same as the first dimension of the corresponding grad.
- train_address_list must be valid; obtain it as described in step 3. Using an invalid address causes runtime errors such as "AI Core Error".
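
The dimension requirement in the first note can be checked before the update op is built. The sketch below works on plain shape tuples and assumes that each address is one-dimensional with a length equal to the first dimension of its gradient. check_grads_and_addresses is a hypothetical helper, not an mx_rec API, and it cannot detect whether an address is otherwise invalid.

```python
def check_grads_and_addresses(grad_shapes, address_shapes):
    """Raise ValueError if an address shape does not match its gradient.

    Both arguments are lists of shape tuples, one entry per sparse table.
    """
    if len(grad_shapes) != len(address_shapes):
        raise ValueError(
            f"got {len(grad_shapes)} grads but {len(address_shapes)} addresses"
        )
    for index, (grad_shape, addr_shape) in enumerate(
        zip(grad_shapes, address_shapes)
    ):
        # Assumed rule: address is 1-D and as long as grad's first dimension.
        if len(addr_shape) != 1 or addr_shape[0] != grad_shape[0]:
            raise ValueError(
                f"table {index}: address shape {addr_shape} does not match "
                f"first dimension of grad shape {grad_shape}"
            )


# Two tables: 8 and 4 looked-up rows, embedding dimension 16.
check_grads_and_addresses([(8, 16), (4, 16)], [(8,), (4,)])
```

A mismatch raises ValueError with the index of the offending table, which is usually easier to diagnose than a device-side failure.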
