SelfAttentionParam

| Attribute | Type | Default | Description |
|---|---|---|---|
| quant_type | torch_atb.SelfAttentionParam.QuantType | torch_atb.SelfAttentionParam.QuantType.TYPE_QUANT_UNQUANT | No quantization is performed. |
| out_data_type | torch_atb.AclDataType | torch_atb.AclDataType.ACL_DT_UNDEFINED | The output tensor data types are inferred automatically from the input tensors. |
| head_num | int | 0 | The default value is not usable; this parameter must be set by the user. |
| kv_head_num | int | 0 | - |
| q_scale | float | 1.0 | - |
| qk_scale | float | 1.0 | - |
| batch_run_status_enable | bool | False | - |
| is_triu_mask | int | 0 | - |
| calc_type | torch_atb.SelfAttentionParam.CalcType | torch_atb.SelfAttentionParam.CalcType.UNDEFINED | Default covers both decoder and encoder flash attention. |
| kernel_type | torch_atb.SelfAttentionParam.KernelType | torch_atb.SelfAttentionParam.KernelType.KERNELTYPE_DEFAULT | - |
| clamp_type | torch_atb.SelfAttentionParam.ClampType | torch_atb.SelfAttentionParam.ClampType.CLAMP_TYPE_UNDEFINED | No clamping is applied. |
| clamp_min | float | 0.0 | - |
| clamp_max | float | 0.0 | - |
| mask_type | torch_atb.SelfAttentionParam.MaskType | torch_atb.SelfAttentionParam.MaskType.MASK_TYPE_UNDEFINED | All-zero mask. |
| kvcache_cfg | torch_atb.SelfAttentionParam.KvCacheCfg | torch_atb.SelfAttentionParam.KvCacheCfg.K_CACHE_V_CACHE | - |
| scale_type | torch_atb.SelfAttentionParam.ScaleType | torch_atb.SelfAttentionParam.ScaleType.SCALE_TYPE_TOR | - |
| input_layout | torch_atb.InputLayout | torch_atb.InputLayout.TYPE_BSND | - |
| mla_v_head_size | int | 0 | - |
| cache_type | torch_atb.SelfAttentionParam.CacheType | torch_atb.SelfAttentionParam.CacheType.CACHE_TYPE_NORM | - |
| window_size | int | 0 | - |
SelfAttentionParam.QuantType
Enum values:
- TYPE_QUANT_UNQUANT
- TYPE_DEQUANT_FUSION
- TYPE_QUANT_QKV_OFFLINE
- TYPE_QUANT_QKV_ONLINE
SelfAttentionParam.CalcType
Enum values:
- UNDEFINED
- ENCODER
- DECODER
- PA_ENCODER
- PREFIX_ENCODER
SelfAttentionParam.KernelType
Enum values:
- KERNELTYPE_DEFAULT
- KERNELTYPE_HIGH_PRECISION
SelfAttentionParam.ClampType
Enum values:
- CLAMP_TYPE_UNDEFINED
- CLAMP_TYPE_MIN_MAX
SelfAttentionParam.MaskType
Enum values:
- MASK_TYPE_UNDEFINED
- MASK_TYPE_NORM
- MASK_TYPE_ALIBI
- MASK_TYPE_NORM_COMPRESS
- MASK_TYPE_ALIBI_COMPRESS
- MASK_TYPE_ALIBI_COMPRESS_SQRT
- MASK_TYPE_ALIBI_COMPRESS_LEFT_ALIGN
- MASK_TYPE_SLIDING_WINDOW_NORM
- MASK_TYPE_SLIDING_WINDOW_COMPRESS
SelfAttentionParam.KvCacheCfg
Enum values:
- K_CACHE_V_CACHE
- K_BYPASS_V_BYPASS
SelfAttentionParam.ScaleType
Enum values:
- SCALE_TYPE_TOR
- SCALE_TYPE_LOGN
- SCALE_TYPE_MAX
SelfAttentionParam.CacheType
Enum values:
- CACHE_TYPE_NORM
- CACHE_TYPE_SWA
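The enum names suggest which fields are meant to be used together; for instance, a sliding-window configuration presumably combines window_size with the sliding-window mask and cache types. The pairing in the sketch below is inferred from the names only and is not stated by this reference.

```python
import torch_atb

param = torch_atb.SelfAttentionParam(head_num=32, kv_head_num=8)

# Assumed sliding-window setup (inferred from enum names, not confirmed here):
# limit attention to the most recent 1024 tokens.
param.window_size = 1024
param.mask_type = torch_atb.SelfAttentionParam.MaskType.MASK_TYPE_SLIDING_WINDOW_NORM
param.cache_type = torch_atb.SelfAttentionParam.CacheType.CACHE_TYPE_SWA

op = torch_atb.Operation(param)
```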
Invocation example
```python
import torch
import torch_atb

def self_attention():
    # Build the parameter object; head_num/kv_head_num must be set explicitly.
    self_attention_param = torch_atb.SelfAttentionParam(head_num=24, kv_head_num=24)
    self_attention_param.calc_type = torch_atb.SelfAttentionParam.CalcType.PA_ENCODER
    self_attention = torch_atb.Operation(self_attention_param)

    # Packed query/key/value tensors: 4096 tokens, 24 heads, head size 64.
    q = torch.ones(4096, 24, 64, dtype=torch.float16).npu()
    k = torch.ones(4096, 24, 64, dtype=torch.float16).npu()
    v = torch.ones(4096, 24, 64, dtype=torch.float16).npu()
    seqlen = torch.tensor([4096], dtype=torch.int32)  # one sequence of 4096 tokens
    intensors = [q, k, v, seqlen]
    print("intensors: ", intensors)

    def self_attention_run():
        outputs = self_attention.forward(intensors)
        return [outputs]

    outputs = self_attention_run()
    print("outputs: ", outputs)

if __name__ == "__main__":
    self_attention()
```
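In this example calc_type is PA_ENCODER, so q, k and v are passed as packed [ntokens, head_num, head_size] tensors (here 4096 tokens, 24 heads of size 64) together with a seqlen tensor giving the per-sequence token counts; seqlen = [4096] therefore describes a single 4096-token sequence. This reading of the input layout is inferred from the example itself rather than stated elsewhere on this page.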