Configuring Fault Isolation

The native proxy script provided by vLLM-Ascend encapsulates each vLLM service as an instance object. It also creates min-heaps for prefill and decode instances, ordered by load score. (A min-heap is a data structure that keeps its smallest element at the top.) When a request reaches the proxy, the proxy selects the instance with the smallest load score from the corresponding heap. It then updates that instance's load score, which balances load across instances.
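The selection step described above can be sketched with Python's heapq module. This is a minimal illustration, not the vLLM-Ascend proxy code: the (load_score, name) tuple layout, the pick_instance helper, and the fixed request_cost are assumptions made for the example.

```python
import heapq

def pick_instance(heap, request_cost):
    """Pop the least-loaded instance, charge it for the request, push it back.

    `heap` holds (load_score, name) tuples; this layout is hypothetical.
    """
    score, name = heapq.heappop(heap)
    heapq.heappush(heap, (score + request_cost, name))
    return name

# Two illustrative prefill instances that start with the same load score.
prefill_heap = [(0.0, "prefill-0"), (0.0, "prefill-1")]
heapq.heapify(prefill_heap)

# Each selection raises the chosen instance's score, so requests alternate.
chosen = [pick_instance(prefill_heap, request_cost=1.0) for _ in range(4)]
print(chosen)
```

Because the chosen instance's score grows with every request, the next request goes to whichever instance is now least loaded.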

The native proxy script lacks built-in instance availability detection. The fault isolation function addresses this by isolating instances that cannot run properly, which prevents traffic from reaching them. A health check mechanism has also been added to support fault isolation.

Fault Isolation Principles

The proxy script now tracks the failure count and fault status of each instance object. Figure 1 shows the process.

Figure 1 Fault isolation flowchart

Select an instance from heaps.
Route a request to the instance.
- If the request is successful, no action is required.
- If the request fails, the instance's failure count increments by one.
If the count exceeds the threshold of three, the instance is marked as faulty and its load score is set to infinity.
Because the heaps are min-heaps, the infinite load score moves the faulty instance to the bottom of its heap, so healthy instances are always preferred over it.

Because the proxy script keeps faulty instances in the heaps, the selection logic now iterates within a single selection cycle until it finds a non-faulty instance. If all instances in the heaps are marked as faulty, the proxy service becomes unavailable.
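The failure counting and fault-aware selection described above could look roughly like the following. This is a sketch under stated assumptions rather than the actual proxy implementation: the Instance dataclass, record_failure, and select_instance are hypothetical names, and re-heapifying after a score change is just one simple way to restore heap order.

```python
import heapq
import math
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 3  # from the text: a count above three marks a fault

@dataclass(order=True)
class Instance:
    # Only load_score takes part in ordering, so the heap sorts by load.
    load_score: float
    name: str = field(compare=False)
    failures: int = field(default=0, compare=False)
    faulty: bool = field(default=False, compare=False)

def record_failure(heap, inst):
    """Count a failed request; isolate the instance once the threshold is exceeded."""
    inst.failures += 1
    if inst.failures > FAILURE_THRESHOLD and not inst.faulty:
        inst.faulty = True
        inst.load_score = math.inf
        heapq.heapify(heap)  # restore heap order after the in-place score change

def select_instance(heap):
    """Return the least-loaded non-faulty instance, or None if all are faulty."""
    for inst in sorted(heap):  # faulty instances stay in the heap but are skipped
        if not inst.faulty:
            return inst
    return None

a, b = Instance(0.0, "decode-0"), Instance(1.0, "decode-1")
decode_heap = [a, b]
heapq.heapify(decode_heap)

for _ in range(4):  # the fourth failure exceeds the threshold of three
    record_failure(decode_heap, a)
survivor = select_instance(decode_heap)
```

Returning None when every instance is faulty corresponds to the case where the proxy service becomes unavailable.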

Health Check Mechanism

In the proxy script, the health check coroutine is enabled by default. It checks every instance object and updates each one according to the check result. For details, see Figure 2.

Figure 2 Health check flowchart

The health check mechanism probes every instance object every 5 seconds through the /health API.
If a health check request fails, the instance's failure count increments by one.
There are two cases for a successful request:
- If the current instance is not marked as faulty, it does not need to be changed.
- If the current instance is marked as faulty, its load score is reset to 0 and its status is updated to normal, so that the instance can continue request processing.
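The health check rules above can be sketched as an asyncio coroutine. Everything named here is illustrative: the Instance class, apply_health_result, and the injected probe callable (which stands in for whatever HTTP client the proxy uses) are assumptions. Resetting the failure count on recovery is also an assumption, because the text does not say what happens to the count.

```python
import asyncio

HEALTH_CHECK_INTERVAL = 5  # seconds, as stated in the text

class Instance:
    """Hypothetical instance record used only for this sketch."""
    def __init__(self, url):
        self.url = url
        self.load_score = 0.0
        self.failures = 0
        self.faulty = False

def apply_health_result(inst, healthy):
    """Update one instance according to the outcome of its health check."""
    if not healthy:
        inst.failures += 1  # a failed check counts like a failed request
        return
    if inst.faulty:
        # Recovery: reset the load score and return the instance to service.
        inst.load_score = 0.0
        inst.faulty = False
        inst.failures = 0  # assumption: the source does not specify this reset

async def health_check_loop(instances, probe, interval=HEALTH_CHECK_INTERVAL):
    """Probe each instance's /health endpoint every `interval` seconds.

    `probe` is an async callable taking a URL and returning True on success.
    """
    while True:
        for inst in instances:
            try:
                healthy = await probe(inst.url + "/health")
            except Exception:
                healthy = False  # treat connection errors as a failed check
            apply_health_result(inst, healthy)
        await asyncio.sleep(interval)
```

Passing the probe in as a parameter keeps the update rules testable without a running vLLM service.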

Deployment and Usage

For details, see Deploying Inference Jobs Using a Script in One-Click Mode.

Parent topic: Best Practices of vLLM Inference Jobs