Optimization Modifications
GEGLU Operator Modification
Modify the “src/diffusers/models/activations.py” script to adapt it to the npu_geglu fused operator for acceleration.
Before the modification:
class GEGLU(nn.Module):
    r"""
    A [variant](https://arxiv.org/abs/2002.05202) of the gated linear unit activation function.

    Parameters:
        dim_in (`int`): The number of channels in the input.
        dim_out (`int`): The number of channels in the output.
        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
    """

    def __init__(self, dim_in: int, dim_out: int, bias: bool = True):
        super().__init__()
        linear_cls = LoRACompatibleLinear if not USE_PEFT_BACKEND else nn.Linear
        self.proj = linear_cls(dim_in, dim_out * 2, bias=bias)

    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
        if gate.device.type != "mps":
            return F.gelu(gate)
        # mps: gelu is not implemented for float16
        return F.gelu(gate.to(dtype=torch.float32)).to(dtype=gate.dtype)

    def forward(self, hidden_states, scale: float = 1.0):
        args = () if USE_PEFT_BACKEND else (scale,)
        hidden_states, gate = self.proj(hidden_states, *args).chunk(2, dim=-1)
        return hidden_states * self.gelu(gate)
After the modification:
class GEGLU(nn.Module):
    r"""
    A [variant](https://arxiv.org/abs/2002.05202) of the gated linear unit activation function.

    Parameters:
        dim_in (`int`): The number of channels in the input.
        dim_out (`int`): The number of channels in the output.
        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
    """

    def __init__(self, dim_in: int, dim_out: int, bias: bool = True):
        super().__init__()
        linear_cls = LoRACompatibleLinear if not USE_PEFT_BACKEND else nn.Linear
        self.proj = linear_cls(dim_in, dim_out * 2, bias=bias)

    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
        if gate.device.type != "mps":
            return F.gelu(gate)
        # mps: gelu is not implemented for float16
        return F.gelu(gate.to(dtype=torch.float32)).to(dtype=gate.dtype)

    def forward(self, hidden_states, scale: float = 1.0):
        args = () if USE_PEFT_BACKEND else (scale,)
        hidden_states = self.proj(hidden_states, *args)
        return torch_npu.npu_geglu(hidden_states, dim=-1, approximate=1)[0]  # use the NPU-affinity fused operator
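The rewritten forward depends on torch_npu, so “src/diffusers/models/activations.py” must import it. If the import is not already present, it can be added near the top of the file after the existing torch import; a minimal sketch:

import torch_npu  # provides the npu_geglu fused operator used in GEGLU.forward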
FA Operator Modification
Use the FlashAttention (FA) fused operator to accelerate computation of the attention module.
- Modify the “src/diffusers/models/attention_processor.py” script.
For PyTorch 2.0 and later, Diffusers uses PyTorch's native F.scaled_dot_product_attention interface, which already integrates FA. This interface is now also supported on the NPU, although the supported range is not exactly the same as on GPU, so use it according to your actual situation.
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)
- Modify the “src/diffusers/models/attention_processor.py” script.
For PyTorch versions below 2.0, Diffusers uses the FA interface from the third-party library xFormers; it can be manually replaced with the NPU torch_npu.npu_fusion_attention interface.
if query.dtype in (torch.float16, torch.bfloat16):
    hidden_states = torch_npu.npu_fusion_attention(
        query, key, value, attn.heads,
        input_layout="BNSD",
        pse=None,
        atten_mask=attention_mask,
        scale=1.0 / math.sqrt(query.shape[-1]),
        pre_tockens=65536,
        next_tockens=65536,
        keep_prob=1.,
        sync=False,
        inner_precise=0,
    )[0]
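The fused kernel is invoked only for half-precision inputs, as the dtype guard above shows. If float32 inputs can reach this code path, a fallback branch is needed; the sketch below is one illustrative option, assuming BNSD-shaped query/key/value. The else path is a plain scaled-dot-product computation written for this example only (it is not taken from the Diffusers source and omits attention_mask handling):

if query.dtype in (torch.float16, torch.bfloat16):
    hidden_states = torch_npu.npu_fusion_attention(
        query, key, value, attn.heads, input_layout="BNSD", pse=None,
        atten_mask=attention_mask, scale=1.0 / math.sqrt(query.shape[-1]),
        pre_tockens=65536, next_tockens=65536, keep_prob=1., sync=False, inner_precise=0,
    )[0]
else:
    # Illustrative fallback: unfused scaled dot-product attention on (B, N, S, D) tensors.
    attention_scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(query.shape[-1])
    hidden_states = torch.matmul(attention_scores.softmax(dim=-1), value)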
conv Operator Speed Optimization
Add torch.npu.config.allow_internal_format = False to the training Python script. This disallows conv private (NPU-internal) formats, reducing format conversions and thereby speeding up the conv operator.
The training scripts that have already been adapted with this modification are located as follows:
“examples/text_to_image/train_text_to_image_lora_sdxl.py”
“examples/text_to_image/train_text_to_image_sdxl_pretrain.py”
“examples/controlnet/train_controlnet_sdxl.py”
torch.npu.config.allow_internal_format = False
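The setting is a global switch, so it typically only needs to appear once near the top of the script, before the model is built. A minimal sketch of where it might be placed in one of the training scripts above (the surrounding lines are illustrative only, not the scripts' actual code):

import torch
import torch_npu  # Ascend PyTorch adapter; makes the torch.npu namespace available

# Disallow NPU-private (internal) storage formats for conv, reducing format conversions.
torch.npu.config.allow_internal_format = False

# ... argument parsing, model construction, and the training loop follow as in the original script ...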
Model Saving
Parts of the Hugging Face code may conflict with DeepSpeed. For example, when saving a model with DeepSpeed, every node must perform the save. Modify the code below to fix the hang that otherwise occurs during model saving; when resuming training from a checkpoint, the checkpoints saved on all nodes must be consolidated and transferred to every node.
The affected code covers SDXL pretraining, LoRA fine-tuning, and ControlNet fine-tuning; the script locations are as follows:
“examples/text_to_image/train_text_to_image_lora_sdxl.py”
“examples/text_to_image/train_text_to_image_sdxl_pretrain.py”
“examples/controlnet/train_controlnet_sdxl.py”
Before the modification:
if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:
        # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
After the modification:
if global_step % args.checkpointing_steps == 0:
    # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
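Dropping the accelerator.is_main_process guard means every rank reaches the save call, which DeepSpeed requires so that each node can write the state it holds. For context, the checkpointing block in these scripts roughly follows the pattern sketched below (simplified; the checkpoints_total_limit rotation logic is omitted):

if global_step % args.checkpointing_steps == 0:
    # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
    # (rotation of older checkpoints omitted in this sketch)
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    accelerator.save_state(save_path)  # executed on every rank so DeepSpeed can save per-node state
    logger.info(f"Saved state to {save_path}")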