Modify the "src/diffusers/models/activations.py" script to use the npu_geglu fused operator for acceleration.
Before the modification:
class GEGLU(nn.Module):
    r"""
    A [variant](https://arxiv.org/abs/2002.05202) of the gated linear unit activation function.

    Parameters:
        dim_in (`int`): The number of channels in the input.
        dim_out (`int`): The number of channels in the output.
        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
    """

    def __init__(self, dim_in: int, dim_out: int, bias: bool = True):
        super().__init__()
        linear_cls = LoRACompatibleLinear if not USE_PEFT_BACKEND else nn.Linear

        self.proj = linear_cls(dim_in, dim_out * 2, bias=bias)

    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
        if gate.device.type != "mps":
            return F.gelu(gate)
        # mps: gelu is not implemented for float16
        return F.gelu(gate.to(dtype=torch.float32)).to(dtype=gate.dtype)

    def forward(self, hidden_states, scale: float = 1.0):
        args = () if USE_PEFT_BACKEND else (scale,)
        hidden_states, gate = self.proj(hidden_states, *args).chunk(2, dim=-1)
        return hidden_states * self.gelu(gate)
After the modification:
class GEGLU(nn.Module):
    r"""
    A [variant](https://arxiv.org/abs/2002.05202) of the gated linear unit activation function.

    Parameters:
        dim_in (`int`): The number of channels in the input.
        dim_out (`int`): The number of channels in the output.
        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
    """

    def __init__(self, dim_in: int, dim_out: int, bias: bool = True):
        super().__init__()
        linear_cls = LoRACompatibleLinear if not USE_PEFT_BACKEND else nn.Linear

        self.proj = linear_cls(dim_in, dim_out * 2, bias=bias)

    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
        if gate.device.type != "mps":
            return F.gelu(gate)
        # mps: gelu is not implemented for float16
        return F.gelu(gate.to(dtype=torch.float32)).to(dtype=gate.dtype)

    def forward(self, hidden_states, scale: float = 1.0):
        args = () if USE_PEFT_BACKEND else (scale,)
        hidden_states = self.proj(hidden_states, *args)
        return torch_npu.npu_geglu(hidden_states, dim=-1, approximate=1)[0]  # use the NPU-affinity fused operator
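For reference, below is a minimal sketch (an assumption for illustration, not part of the upstream code) of how the fused call can be guarded so that the same activations.py still runs on devices without torch_npu. It assumes the surrounding module already defines F and USE_PEFT_BACKEND, and the fallback simply reproduces the original chunk-and-GELU path.

# Sketch: only use the fused operator when torch_npu is importable and the tensor
# actually lives on an NPU; otherwise fall back to the original eager path.
try:
    import torch_npu
    _HAS_NPU_GEGLU = hasattr(torch_npu, "npu_geglu")
except ImportError:
    _HAS_NPU_GEGLU = False

    def forward(self, hidden_states, scale: float = 1.0):
        args = () if USE_PEFT_BACKEND else (scale,)
        hidden_states = self.proj(hidden_states, *args)
        if _HAS_NPU_GEGLU and hidden_states.device.type == "npu":
            # fused split + GELU gating in a single NPU kernel
            return torch_npu.npu_geglu(hidden_states, dim=-1, approximate=1)[0]
        # original path: split the projection and gate with GELU
        hidden_states, gate = hidden_states.chunk(2, dim=-1)
        return hidden_states * self.gelu(gate)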
Use the FlashAttention (FA) fused operator to accelerate computation of the attention module.
Modify the "src/diffusers/models/attention_processor.py" script.
Diffusers calls FA through PyTorch's native F.scaled_dot_product_attention interface. This interface is also supported on the NPU, but its supported feature range is not fully identical to that on the GPU, so call it according to the actual situation.
In the current version, the sdpa (scaled_dot_product_attention) interface is provided only as a trial feature and may be adjusted or improved in later versions. Pay attention to subsequent version iterations when using it.
hidden_states = F.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
)
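As a quick check before modifying attention_processor.py, the standalone snippet below (a sketch, not part of Diffusers) verifies that F.scaled_dot_product_attention runs on the NPU with the (batch, heads, seq_len, head_dim) layout used by the Diffusers attention processors. The torch.npu.is_available() guard is an assumption so the same script also runs on CPU.

import torch
import torch.nn.functional as F

try:
    import torch_npu  # noqa: F401  registers the "npu" device with PyTorch
    device = "npu" if torch.npu.is_available() else "cpu"
except ImportError:
    device = "cpu"

dtype = torch.float16 if device == "npu" else torch.float32
# (batch, heads, seq_len, head_dim), the layout used inside the attention processors
query = torch.randn(2, 8, 77, 64, dtype=dtype, device=device)
key = torch.randn(2, 8, 77, 64, dtype=dtype, device=device)
value = torch.randn(2, 8, 77, 64, dtype=dtype, device=device)

out = F.scaled_dot_product_attention(
    query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False
)
print(out.shape)  # expected: torch.Size([2, 8, 77, 64])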
Add torch.npu.config.allow_internal_format = False to the training Python scripts to disallow the private (internal) conv format. This reduces format conversions and therefore speeds up conv operators.
The training scripts that have already been adapted with this modification are as follows:
"examples/text_to_image/train_text_to_image_lora_sdxl.py"
"examples/text_to_image/train_text_to_image_sdxl_pretrain.py"
"examples/controlnet/train_controlnet_sdxl.py"
torch.npu.config.allow_internal_format = False
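As a placement example, the sketch below (an assumption, not taken verbatim from the adapted scripts) sets the switch once near the top of the training script, before any modules or tensors are moved to the device, and guards it so the script still imports on machines without torch_npu.

import torch

try:
    import torch_npu  # noqa: F401  registers the torch.npu namespace
    # disable private/internal conv formats to avoid extra format conversions
    torch.npu.config.allow_internal_format = False
except ImportError:
    pass  # not running on an Ascend NPU; nothing to configure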
Some code in Huggingface may conflict with DeepSpeed. For example, when saving a model with DeepSpeed, every node must perform the save. Modify the following code to fix the hang during model saving; when resuming training from a checkpoint, the checkpoints saved on all nodes must be merged and transferred to every node.
The affected code covers SDXL pretraining, LoRA fine-tuning, and ControlNet fine-tuning. The scripts are located at:
"examples/text_to_image/train_text_to_image_lora_sdxl.py"
"examples/text_to_image/train_text_to_image_sdxl_pretrain.py"
"examples/controlnet/train_controlnet_sdxl.py"
Before the modification:
if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:
        # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
After the modification:
if global_step % args.checkpointing_steps == 0:
    # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
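For context, here is a hedged sketch of the surrounding checkpointing block once the is_main_process guard is removed; the save path and accelerator.save_state call follow the upstream training scripts, and the checkpoints_total_limit cleanup is elided. With DeepSpeed, accelerator.save_state is a collective across ranks, which is why every node must enter this branch.

if global_step % args.checkpointing_steps == 0:
    # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
    ...  # unchanged cleanup logic from the upstream script

    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    # every rank calls save_state so DeepSpeed can write its own shard; with the old
    # `accelerator.is_main_process` guard, the other ranks never reached this collective
    # call and the save hung
    accelerator.save_state(save_path)
    logger.info(f"Saved state to {save_path}")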