Model Code Tuning Strategy

Model performance can be improved further by tuning the model code based on the profile data and NPU features. Figure 1 shows the tuning process, and Table 1 lists common cases.

Figure 1 Model code tuning process

Obtain the PR profile data of the model.
Identify the model performance problems, such as single-point CPU operations or long operator execution durations.
Locate the problematic code segment based on the call stack relationship in the profile data.
Analyze the problematic code segment in depth to find the specific cause.
Take appropriate tuning measures, such as eliminating redundant code or replacing the original code with an implementation that has better NPU affinity, to improve performance.
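
The second and third steps can be sketched as a simple aggregation over operator records. The sketch below is illustrative only: the record layout (operator name, device, duration, call stack), the sample values, and the `rank_operators` helper are assumptions made for this example and do not reflect the actual profile data format.

```python
from collections import defaultdict

# Hypothetical operator records, as they might be exported from profile data:
# (operator name, device, duration in microseconds, call-stack location)
records = [
    ("TransData", "NPU", 420.0, "model.py:42 forward"),
    ("MatMul", "NPU", 310.0, "model.py:40 forward"),
    ("TransData", "NPU", 380.0, "model.py:43 forward"),
    ("aten::item", "CPU", 95.0, "loss.py:17 compute"),
    ("SelectV2", "NPU", 900.0, "loss.py:21 compute"),
]


def rank_operators(records):
    """Sum the duration per operator and sort by total time, longest first.

    Returns a list of (name, total_us, share_of_total, call_stacks) tuples.
    """
    totals = defaultdict(float)
    stacks = defaultdict(set)
    for name, _device, duration_us, stack in records:
        totals[name] += duration_us
        stacks[name].add(stack)
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, total, total / grand_total, sorted(stacks[name]))
        for name, total in ranked
    ]


# Print the most expensive operators together with the code locations that call them.
for name, total, share, locations in rank_operators(records):
    print(f"{name:12s} {total:8.1f} us {share:6.1%}  {locations}")
```

The operators at the top of the ranking, together with their call-stack locations, point to the code segments that are worth analyzing first.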

**Table 1** Common tuning cases
Category	Model Fault	Code Tuning Suggestions
Format conversion	The operator data shows that the TransData operator accounts for a high share of the execution time (see Figure 2).	Try disabling automatic format conversion. torch.npu.config.allow_internal_format = False
Format conversion	The variable x1 is non-contiguous after the transpose, so a transpose is introduced in each subsequent call that uses x1. def forward(self, x): x = self.fc1(x) x1 = F.relu(x).transpose(1, 2)  # .contiguous() x2_1 = self.fc2_1(x1) x2_2 = self.fc2_2(x1) x3 = torch.add(x2_1, x2_2) x4 = self.fc3(x3)[:, 0] return x4	Eliminate the redundant transposes generated during the calls by invoking contiguous() once, right after the transpose. x1 = F.relu(x).transpose(1, 2).contiguous()
Redundant code	A variable is defined but never used, which causes extra memory operation overhead. tasks = torch.tensor(tasks).to(self.device)  # The defined variable is not used.	Delete the redundant code.
Redundant code	Many small memory transfers generate a large number of memory operators. Merge the transfers to improve performance. tasks = torch.cat([self.task_tokenizer(x["task"]).to(self.device).unsqueeze(0) for x in batched_inputs], dim=0)	Complete the operations on the CPU first, and then transfer the result to the NPU in a single step. tasks = torch.cat([self.task_tokenizer(x["task"]).unsqueeze(0) for x in batched_inputs], dim=0) tasks = tasks.to(self.device)
Code non-affinity	The performance of an operator can deteriorate greatly with extreme shapes. The following uses the SelectV2 operator as an example (see Figure 3). fg_scores_mask = fg_mask[:, :, None].repeat(1, 1, self.num_classes) target_scores = torch.where(fg_scores_mask > 0, target_scores, 0)	Avoid calling this operator and replace it with a matrix operation. fg_scores_mask = fg_mask.unsqueeze(-1) target_scores *= (fg_scores_mask > 0).float()
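
The contiguity case in Table 1 can be reproduced on the CPU with plain PyTorch. The following is a minimal sketch, not the original model: the class name `Net` and all layer sizes are assumptions chosen only so that the `forward` method from the table runs end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    # Layer sizes are hypothetical; they only need to be mutually consistent.
    def __init__(self, in_dim=8, hidden=16, seq_len=4, out_dim=3):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2_1 = nn.Linear(seq_len, hidden)
        self.fc2_2 = nn.Linear(seq_len, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        x = self.fc1(x)
        # Convert to a contiguous layout once, so that fc2_1 and fc2_2
        # both consume the same already-converted tensor.
        x1 = F.relu(x).transpose(1, 2).contiguous()
        x2_1 = self.fc2_1(x1)
        x2_2 = self.fc2_2(x1)
        x3 = torch.add(x2_1, x2_2)
        return self.fc3(x3)[:, 0]


x = torch.randn(2, 4, 8)  # (batch, seq_len, in_dim)
view = x.transpose(1, 2)  # same storage, swapped strides
print(view.is_contiguous())               # False
print(view.contiguous().is_contiguous())  # True
print(Net()(x).shape)
```

`transpose` returns a view with swapped strides, so the result is non-contiguous until `contiguous()` copies it into a dense layout. Doing that copy once, at the point where `x1` is created, is what the suggestion in the table amounts to.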

Figure 2 High time consumption ratio of the TransData operator

Figure 3 Performance deterioration of the SelectV2 operator in extreme shapes
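
The replacement for the SelectV2 case in Table 1 can be checked for equivalence on the CPU. This is a small sketch with assumed shapes and random data; the variable names follow the table, and `torch.zeros_like` stands in for the scalar `0` so that the call works across PyTorch versions.

```python
import torch

torch.manual_seed(0)
batch, anchors, num_classes = 2, 5, 3  # hypothetical sizes
fg_mask = torch.randint(0, 2, (batch, anchors))
target_scores = torch.rand(batch, anchors, num_classes)

# Original form: expand the mask to the full shape, then select element-wise.
fg_scores_mask = fg_mask[:, :, None].repeat(1, 1, num_classes)
selected = torch.where(
    fg_scores_mask > 0, target_scores, torch.zeros_like(target_scores)
)

# Suggested form: keep the mask at shape (batch, anchors, 1) and let
# broadcasting apply it as a multiplication.
masked = target_scores * (fg_mask.unsqueeze(-1) > 0).float()

print(torch.equal(selected, masked))
```

The two forms agree only when `target_scores` contains finite values, because multiplying NaN or infinity by zero does not yield zero. If the scores can contain such values, the results of the two forms differ at the masked positions.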
