Model Code Tuning Strategy

Model performance can be further improved by using the profile data and NPU features. Figure 1 shows the tuning process, and Table 1 lists common cases.

Figure 1 Model code tuning process
  1. Obtain the PR profile data of the model.
  2. Locate the model performance problems, such as a single-point CPU operation or a long operator execution duration.
  3. Locate the faulty code segment based on the call stack relationship in the profile data.
  4. Analyze the faulty code segment in depth to find the specific cause.
  5. Take appropriate tuning measures, such as eliminating redundant code or replacing the original code with an implementation that has better NPU affinity, to improve performance.
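
Steps 2 and 3 can be sketched as a simple aggregation over the profile data. The following is a minimal illustration, assuming the operator records have already been exported as (operator name, duration) pairs. The record layout, the sample values, and the rank_operators helper are hypothetical and are not part of any profiler API.

```python
from collections import defaultdict

# Hypothetical profile records: (operator name, duration in microseconds).
records = [
    ("TransData", 420.0),
    ("MatMul", 310.0),
    ("TransData", 380.0),
    ("SelectV2", 950.0),
    ("aten::item", 120.0),
]


def rank_operators(records):
    """Return (name, total_us, share) tuples, largest total duration first."""
    totals = defaultdict(float)
    for name, duration_us in records:
        totals[name] += duration_us
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, total, total / grand_total) for name, total in ranked]


ranking = rank_operators(records)
for name, total_us, share in ranking:
    print(f"{name:<12}{total_us:>10.1f} us{share:>8.1%}")
```

Operators at the top of such a ranking are the first candidates to trace back to a code segment through the call stack.
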
Table 1 Common tuning cases

Category | Model Fault | Code Tuning Suggestions

Format conversion

Based on the operator data, if the time consumption of the TransData operator is high, see Figure 2.

Try to disable automatic format conversion.

torch.npu.config.allow_internal_format = False

The variable x1 is a non-contiguous tensor produced by transpose. Each subsequent call that consumes x1 then introduces an extra transpose.

def forward(self, x):
    x = self.fc1(x)
    x1 = F.relu(x).transpose(1, 2)  # .contiguous() omitted: x1 stays non-contiguous
    x2_1 = self.fc2_1(x1)
    x2_2 = self.fc2_2(x1)
    x3 = torch.add(x2_1, x2_2)
    x4 = self.fc3(x3)[:, 0]
    return x4

Eliminate the redundant transposes generated by the subsequent calls: call contiguous() once, immediately after the transpose.

x1 = F.relu(x).transpose(1, 2).contiguous()
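
The layout change made by contiguous() can be checked on any device. The following is a minimal sketch on a CPU tensor; the shape (2, 3, 4) is an illustrative assumption and is not taken from the original model. It shows only the memory layout, not the NPU performance effect.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4)  # illustrative shape (assumption)

# transpose returns a view with swapped strides; no data is moved.
x1_view = F.relu(x).transpose(1, 2)

# contiguous() materializes the transposed layout once.
x1 = x1_view.contiguous()

print(x1_view.is_contiguous())  # False
print(x1.is_contiguous())       # True
print(x1_view.stride(), x1.stride())

# Both tensors hold the same values; only the memory layout differs.
assert torch.equal(x1_view, x1)
```

Because x1 is already contiguous, fc2_1 and fc2_2 both read it directly instead of each converting the layout again.
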

Redundant code

A variable that is defined but never used still causes extra memory operation overhead.

tasks = torch.tensor(tasks).to(self.device)  # The defined variable is never used.

Delete redundant code.

Many small-batch memory movements generate a large number of memory operators. Merge the movements to improve performance.

tasks = torch.cat([self.task_tokenizer(x["task"]).to(self.device).unsqueeze(0) for x in batched_inputs], dim=0)

Complete the operations on the CPU first, and then transfer the result to the NPU in a single operation.

tasks = torch.cat([self.task_tokenizer(x["task"]).unsqueeze(0) for x in batched_inputs], dim=0)
tasks = tasks.to(self.device)
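
The two variants produce the same tensor; only the number of device transfers differs. The following is a minimal sketch of that equivalence. The tokenize function is a hypothetical stand-in for self.task_tokenizer, the sample inputs are invented, and the device is set to the CPU as an assumption so that the example runs without an NPU.

```python
import torch


def tokenize(task: str) -> torch.Tensor:
    """Hypothetical stand-in for self.task_tokenizer: 4 token ids per task."""
    return torch.tensor([ord(c) % 7 for c in task[:4].ljust(4)], dtype=torch.long)


device = torch.device("cpu")  # assumption: replace with the NPU device
batched_inputs = [{"task": "detect"}, {"task": "segment"}, {"task": "caption"}]

# Original pattern: one transfer per sample, then concatenation on the device.
per_item = torch.cat(
    [tokenize(x["task"]).to(device).unsqueeze(0) for x in batched_inputs], dim=0
)

# Tuned pattern: build the whole batch on the CPU, then transfer it once.
merged = torch.stack([tokenize(x["task"]) for x in batched_inputs], dim=0)
merged = merged.to(device)

assert torch.equal(per_item, merged)
print(merged.shape)
```

Here torch.stack replaces the unsqueeze(0) plus torch.cat combination; either form keeps the batch on the CPU until the single transfer.
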

Code non-affinity

The performance of an operator deteriorates greatly in extreme shapes. The following uses the SelectV2 operator as an example. For details, see Figure 3.

fg_scores_mask = fg_mask[:, :, None].repeat(1, 1, self.num_classes)
target_scores = torch.where(fg_scores_mask > 0, target_scores, 0)

Avoid calling this operator. Replace it with an element-wise multiplication by the mask, which relies on broadcasting instead of repeat.

fg_scores_mask = fg_mask.unsqueeze(-1)
target_scores *= (fg_scores_mask > 0).float()
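
The replacement can be checked against the original torch.where form. The following is a minimal sketch on the CPU; the tensor shapes and the value of num_classes are illustrative assumptions. The two forms agree for finite scores. If target_scores may contain NaN or inf at masked positions, multiplying by zero does not clear them, so the substitution is not exact in that case.

```python
import torch

num_classes = 3                                # illustrative value (assumption)
fg_mask = torch.tensor([[1, 0], [0, 1]])       # (batch, anchors)
target_scores = torch.rand(2, 2, num_classes)  # finite values in [0, 1)

# Original form: expand the mask with repeat, then select with torch.where.
mask_repeated = fg_mask[:, :, None].repeat(1, 1, num_classes)
via_where = torch.where(
    mask_repeated > 0, target_scores, torch.zeros_like(target_scores)
)

# Replacement: broadcast a (batch, anchors, 1) mask and multiply.
fg_scores_mask = fg_mask.unsqueeze(-1)
via_mul = target_scores * (fg_scores_mask > 0).float()

assert torch.equal(via_where, via_mul)
```
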
Figure 2 High time consumption ratio of the TransData operator
Figure 3 Performance deterioration of the SelectV2 operator in extreme shapes