Model Code Tuning Strategy
Model performance can be further improved based on the profile data and NPU features. Figure 1 shows the tuning process, and Table 1 lists common cases.
- Obtain the PR profile data of the model.
- Locate the model performance problems, such as single-point CPU operations or long operator execution durations.
- Locate the faulty code segment based on the call stack relationships in the profile data.
- Analyze the faulty code segment in depth to find the specific cause.
- Take appropriate tuning measures, such as eliminating redundant code or replacing the original code with an implementation that has better NPU affinity, to improve performance.
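
The localization steps above can be sketched offline. This is a minimal sketch, assuming the profile data has already been exported as a list of records; the field names `op_name`, `duration_us`, and `call_stack` and all sample values are hypothetical and do not reflect the output format of any particular profiler.

```python
from collections import defaultdict

# Hypothetical, hand-written records standing in for exported profile data.
records = [
    {"op_name": "TransData", "duration_us": 120.0, "call_stack": "model.py:42 forward"},
    {"op_name": "MatMul", "duration_us": 80.0, "call_stack": "model.py:40 forward"},
    {"op_name": "TransData", "duration_us": 130.0, "call_stack": "model.py:42 forward"},
    {"op_name": "SelectV2", "duration_us": 300.0, "call_stack": "loss.py:17 compute_loss"},
]


def rank_hotspots(records):
    """Sum the duration per (operator, call site) and sort from slowest to fastest."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["op_name"], rec["call_stack"])] += rec["duration_us"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


hotspots = rank_hotspots(records)
for (op_name, call_stack), total_us in hotspots:
    print(f"{op_name:10s} {total_us:8.1f} us  {call_stack}")
```

Grouping by call site as well as operator name points directly at the code segment to inspect, which is the purpose of the third step.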
| Category | Model Fault | Code Tuning Suggestions |
|---|---|---|
| Format conversion | The operator data shows that the TransData operator takes a long time (see Figure 2). | Try disabling automatic format conversion: `torch.npu.config.allow_internal_format = False` |
| | The variable `x1` is a non-contiguous tensor produced by `transpose`, so a transpose is introduced in each subsequent call that uses it:<br>`def forward(self, x):`<br>`    x = self.fc1(x)`<br>`    x1 = F.relu(x).transpose(1, 2)  # .contiguous()`<br>`    x2_1 = self.fc2_1(x1)`<br>`    x2_2 = self.fc2_2(x1)`<br>`    x3 = torch.add(x2_1, x2_2)`<br>`    x4 = self.fc3(x3)[:, 0]`<br>`    return x4` | Eliminate the redundant transposes generated in later calls by invoking the contiguous conversion function right after the conversion: `x1 = F.relu(x).transpose(1, 2).contiguous()` |
| Redundant code | A variable that is defined but never used causes extra memory operation overhead: `tasks = torch.tensor(tasks).to(self.device)  # The defined variable is not used.` | Delete the redundant code. |
| | Multiple small-batch memory movements generate a large number of memory operators. Merging them improves performance: `tasks = torch.cat([self.task_tokenizer(x["task"]).to(self.device).unsqueeze(0) for x in batched_inputs], dim=0)` | Complete the operations on the CPU, then transfer the result to the NPU in a single step:<br>`tasks = torch.cat([self.task_tokenizer(x["task"]).unsqueeze(0) for x in batched_inputs], dim=0)`<br>`tasks = tasks.to(self.device)` |
| Code non-affinity | The performance of an operator deteriorates sharply for extreme shapes. The SelectV2 operator is used as an example (see Figure 3):<br>`fg_scores_mask = fg_mask[:, :, None].repeat(1, 1, self.num_classes)`<br>`target_scores = torch.where(fg_scores_mask > 0, target_scores, 0)` | Avoid calling this operator and replace it with a matrix operation:<br>`fg_scores_mask = fg_mask.unsqueeze(-1)`<br>`target_scores *= (fg_scores_mask > 0).float()` |
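
The format conversion case with `transpose` can be checked on any device. The following is a minimal sketch that runs on the CPU, with an arbitrary input shape chosen for illustration; it only shows that `transpose` yields a non-contiguous view and that `.contiguous()` materializes the layout once, not the NPU timing itself.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4)  # arbitrary illustrative shape

# transpose returns a non-contiguous view over the same storage.
x1_view = F.relu(x).transpose(1, 2)

# .contiguous() copies the data into the transposed layout a single time,
# so later consumers of x1 read a plain contiguous tensor.
x1 = F.relu(x).transpose(1, 2).contiguous()

print(x1_view.is_contiguous())
print(x1.is_contiguous())
print(torch.equal(x1_view, x1))  # same values, different memory layout
```

Checking `is_contiguous()` on intermediate tensors is a quick way to find candidates for this fix before confirming the gain in the profile data.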

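The merged-transfer suggestion for redundant code can be sketched as follows. The `task_tokenizer` function below is a hypothetical stand-in for `self.task_tokenizer`, and the device is set to the CPU as an assumption so that the sketch runs anywhere; in a real run it would be the NPU device.

```python
import torch

device = torch.device("cpu")  # assumption: replace with the NPU device in a real run


def task_tokenizer(task, length=4):
    """Hypothetical tokenizer: maps a string to a fixed-length integer tensor."""
    codes = [ord(ch) % 100 for ch in task[:length]]
    codes += [0] * (length - len(codes))
    return torch.tensor(codes)


batched_inputs = [{"task": "seg"}, {"task": "detect"}, {"task": "pose"}]

# Original pattern: one small host-to-device transfer per sample.
per_item = torch.cat(
    [task_tokenizer(x["task"]).to(device).unsqueeze(0) for x in batched_inputs], dim=0
)

# Tuned pattern: assemble the batch on the host, then transfer it once.
merged = torch.cat(
    [task_tokenizer(x["task"]).unsqueeze(0) for x in batched_inputs], dim=0
).to(device)

print(merged.shape)
print(torch.equal(per_item, merged))
```

Both patterns produce the same tensor; only the number of transfers differs, which is what reduces the count of memory operators.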

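The replacement for the SelectV2 case relies on the selection and the mask multiplication giving the same result. The following is a minimal sketch that checks this on the CPU, with small hypothetical shapes and values chosen for illustration.

```python
import torch

num_classes = 5
fg_mask = torch.tensor([[1, 0, 1], [0, 0, 1]])  # (batch, anchors), hypothetical values
target_scores = torch.rand(2, 3, num_classes)   # (batch, anchors, classes)

# Original form: expand the mask explicitly, then select element-wise.
fg_scores_mask = fg_mask[:, :, None].repeat(1, 1, num_classes)
selected = torch.where(
    fg_scores_mask > 0, target_scores, torch.zeros_like(target_scores)
)

# Replacement: broadcast a (batch, anchors, 1) mask and multiply by 0.0 or 1.0.
masked = target_scores * (fg_mask.unsqueeze(-1) > 0).float()

print(torch.equal(selected, masked))
```

The two forms agree only when `target_scores` is finite: multiplying NaN or infinity by zero yields NaN, whereas `torch.where` would return 0 at those positions.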
