Viewing Rescheduling Results After an Inference Card Fault

If a fault occurs while an inference job is running, Volcano reschedules the job to another NPU.

Procedure

  1. Run the following command to check the job running status:
    kubectl get pod --all-namespaces
    If the job name changes from resnetinfer1-2-scpr5 to resnetinfer1-2-xsdsf, as shown in the following command output, the rescheduling is successful. The suffix of the job name is a randomly generated string, so use the actual job name from your environment.
    NAMESPACE   NAME                   READY   STATUS    RESTARTS   AGE
    ...
    default     resnetinfer1-2-xsdsf   1/1     Running   0          10s
    ...
  2. Run the following command to view the job logs:
    kubectl logs -f resnetinfer1-2-xsdsf
    Command output:
    [2025-02-24 19:13:09,331] [2269] [281472887965984] [llm] [INFO] [logging.py-331] : Answer[0]:  Deep learning is a subset of machine learning that uses neural networks with multiple layers to model complex relationships between
    [2025-02-24 19:13:09,331] [2269] [281472887965984] [llm] [INFO] [logging.py-331] : Generate[0] token num: (0, 20)
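
The name check in step 1 can also be scripted. The following is a minimal sketch, not part of Volcano or kubectl: pods_with_prefix is a hypothetical helper, and the prefix resnetinfer1-2- is assumed from the example above. It reads the output of kubectl get pod --all-namespaces on standard input and prints the name and status of every entry whose name starts with the given prefix, so a changed suffix is easy to spot.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not a kubectl feature).
# Reads `kubectl get pod --all-namespaces` output on stdin, whose columns are
# NAMESPACE NAME READY STATUS RESTARTS AGE, and prints "NAME STATUS" for each
# row whose NAME column starts with the prefix given as the first argument.
pods_with_prefix() {
    awk -v prefix="$1" 'index($2, prefix) == 1 { print $2, $4 }'
}

# Usage (requires access to the cluster; replace the prefix with your own):
#   kubectl get pod --all-namespaces | pods_with_prefix resnetinfer1-2-
```

Running the usage line before and after the fault should print different name suffixes if the rescheduling succeeded. To observe the change as it happens instead, kubectl get pod --all-namespaces -w keeps the listing open and prints updates.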