Long Sequence

A long sequence is a text whose length exceeds 32K tokens and may reach 1M tokens. The goal of the long sequence feature is to preserve the model's answer quality and performance even when the input text is very long. In long sequence scenarios, the graphics memory consumed by attention and the KV cache grows rapidly with sequence length: the KV cache grows linearly, and the attention score matrix grows quadratically unless it is computed in blocks. Optimizing graphics memory is therefore the key to the long sequence feature. Key algorithmic technologies include KV cache quantization, KV multi-head compression, and short-sequence training with long-sequence inference.
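To make the memory pressure concrete, the following sketch estimates the KV cache size for a single sequence. The function name and the model dimensions (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values taken from any model supported by MindIE LLM, and the INT8 figure ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate the KV cache size in bytes for one sequence.

    Each token stores one key vector and one value vector
    per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical model dimensions, chosen only for illustration.
    layers, kv_heads, head_dim = 32, 8, 128
    for seq_len in (32 * 1024, 1024 * 1024):
        fp16 = kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2)
        int8 = kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=1)
        print(f"{seq_len} tokens: FP16 {fp16 / GIB:.0f} GiB, INT8 {int8 / GIB:.0f} GiB")
```

Under these assumptions the FP16 cache grows from 4 GiB at 32K tokens to 128 GiB at 1M tokens, which is why quantizing the cache or reducing the number of KV heads matters for long sequences.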

Long-sequence training/inference: The model weights are trained on long texts, so the model retains good capability on long-sequence input during inference.
Short-sequence training and long-sequence inference: The model uses technologies such as ALiBi position encoding or position scaling algorithms (such as NTK and YaRN) to extrapolate beyond its training length. In this way, a model trained on short sequences can still perform well in the long-sequence inference phase.
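As an illustration of how such position scaling can work, the sketch below applies the commonly cited NTK-aware adjustment to the rotary position embedding (RoPE) base. It is a generic sketch, not the MindIE LLM implementation; the function name, the default base of 10000, and the head dimension are assumptions, and individual models may use a different scaling rule.

```python
def rope_inv_freq(head_dim, base=10000.0, ntk_factor=1.0):
    """Return RoPE inverse frequencies, optionally with NTK-aware base scaling."""
    # NTK-aware scaling enlarges the base so that the lowest frequency is
    # divided by ntk_factor while the highest frequency stays unchanged.
    scaled_base = base * ntk_factor ** (head_dim / (head_dim - 2))
    return [scaled_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


if __name__ == "__main__":
    plain = rope_inv_freq(128)
    scaled = rope_inv_freq(128, ntk_factor=4.0)
    print("highest frequency:", plain[0], scaled[0])
    print("lowest frequency ratio:", plain[-1] / scaled[-1])
```

The high-frequency components, which encode local token order, are left nearly untouched, while the low-frequency components are stretched to cover a longer range of positions.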

Constraints

For details about the models that support the long sequence feature, see the list of supported LLMs.
This feature cannot be used together with the SplitFuse, parallel decoding, or multi-server inference features, or in scenarios where MTP exceeds 1.
The maximum sequence length supported by MindIE LLM is determined by the following factors:
- Hardware graphics memory specifications and the number of model parameters: These determine the maximum input length that the model can accept during inference within the hardware limits. Take the 64 GB Atlas 800I A2 inference server as an example. When the Glm4-9B-Chat model runs on eight devices, sequences of up to 1M tokens can be inferred with sufficient graphics memory.
- Model weights and structure: These determine the generation and dialog quality of the model in the long sequence scenario. For a model trained and inferred on long sequences (such as Glm-4-9B-Chat-1M), MindIE LLM ensures the same long-sequence inference quality as the open-source model. For a model trained on short sequences and inferred on long sequences, MindIE LLM uses technologies such as NTK (with the related features enabled) to ensure the same long-sequence inference capability as the open-source model. Note that if you use a model that natively supports only short sequences to process long sequence input, MindIE LLM cannot guarantee that the long sequence inference output is reasonable.
- Currently, NTK is supported for Llama3, and YaRN is supported for models that use the Qwen2 modeling code, such as Qwen2, Qwen2.5, and Qwen3.

Running Inference

Determine a proper sequence length based on the hardware specifications, the model parameters, and the maximum valid inference length supported by the model. For details about the specifications, see the official documentation of the corresponding model. Unlike standard inference, some models that support the long sequence feature require a configuration change to enable it. Take Qwen2-72B-Instruct as an example. To enable the long sequence feature, add the rope_scaling field to config.json in the weight directory. (If the long sequence feature is not required, do not add this field.)

{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  // ...
  "vocab_size": 152064,
  
  // Add the following field. Comments are for illustration only; remove them from the actual config.json.
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
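If you prefer to apply this change with a script instead of editing the file by hand, a minimal sketch using only the Python standard library could look as follows. The function name enable_yarn and the example path are hypothetical; the field values are the ones shown above for Qwen2-72B-Instruct and may differ for other models. Back up config.json before rewriting it.

```python
import json
from pathlib import Path


def enable_yarn(config_path, factor=4.0, original_max_position_embeddings=32768):
    """Add a YaRN rope_scaling field to config.json and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Overwrites any existing rope_scaling entry.
    config["rope_scaling"] = {
        "factor": factor,
        "original_max_position_embeddings": original_max_position_embeddings,
        "type": "yarn",
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python enable_yarn.py /path/to/weights/config.json
    updated = enable_yarn(sys.argv[1])
    print(json.dumps(updated["rope_scaling"], indent=2))
```

With a factor of 4.0 and an original length of 32768, the scaled context length is 4.0 × 32768 = 131072 tokens.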

The method of enabling the long sequence feature varies by model. For some models (such as LLaMA3.1-70B-Instruct), the feature works without any modification. For details about how to enable the long sequence feature, see the README file of each model that supports this feature (see README Document Interpretation).

Once the long sequence feature is enabled, pass the long sequence text to the model through the standard inference process to complete long sequence inference. For details about the model inference process, see ATB Models for Pure Models.

After the long sequence configuration is added, inference runs as usual and you can choose the input text length freely. Inputs longer than the value of original_max_position_embeddings are then inferred as long sequences. The following commands are an example:

cd ${ATB_SPEED_HOME_PATH}
torchrun --nproc_per_node [Number of running devices] --master_port 20030 -m examples.run_pa --model_path [Model weight path] --max_output_length [Maximum output length] --max_input_length [Maximum input length] --input_texts [Input text, which can be a file or character string]

You are advised to use a text file (such as *.txt) as the input for long sequence inference.
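Before launching a run, it can help to check where a planned input length falls relative to the configured values. The sketch below is a hypothetical helper, not part of MindIE LLM. It assumes that the token count is obtained separately (for example, with the model's tokenizer) and that the scaled context length equals factor × original_max_position_embeddings.

```python
def describe_input_length(num_tokens, rope_scaling):
    """Classify an input length against a rope_scaling entry from config.json."""
    original = rope_scaling["original_max_position_embeddings"]
    # Assumption: the scaled context window is factor * original length.
    extended = int(rope_scaling["factor"] * original)
    if num_tokens <= original:
        return "within original context"
    if num_tokens <= extended:
        return "long sequence"
    return "exceeds scaled context"


if __name__ == "__main__":
    scaling = {"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}
    for n in (1000, 100000, 200000):
        print(n, "->", describe_input_length(n, scaling))
```

The hardware memory limits described under Constraints still apply independently of this check.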
