Viewing Rescheduling Results After an Inference Card Fault
If a fault occurs while an inference job is running, Volcano reschedules the job to another NPU.
Procedure
- Run the following command to check the job running status:
kubectl get pod --all-namespaces
If the pod name changes from resnetinfer1-2-scpr5 to resnetinfer1-2-xsdsf, as shown in the following command output, the rescheduling is successful. The name suffix is a randomly generated character string, so use the actual name in your environment.
NAMESPACE   NAME                   READY   STATUS    RESTARTS   AGE
...
default     resnetinfer1-2-xsdsf   1/1     Running   0          10s
...
- Run the following command to view job logs:
kubectl logs -f resnetinfer1-2-xsdsf
Command output:
[2025-02-24 19:13:09,331] [2269] [281472887965984] [llm] [INFO] [logging.py-331] : Answer[0]: Deep learning is a subset of machine learning that uses neural networks with multiple layers to model complex relationships between
[2025-02-24 19:13:09,331] [2269] [281472887965984] [llm] [INFO] [logging.py-331] : Generate[0] token num: (0, 20)
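The check in the first step can also be scripted. The following is a minimal sketch, not part of Volcano or kubectl: find_rescheduled_pod is a hypothetical helper that reads the output of kubectl get pod --all-namespaces on standard input and prints the name of any Running pod that starts with the given prefix but differs from the old pod name. It assumes the default column layout shown above (NAMESPACE, NAME, READY, STATUS, ...), and the name prefix resnetinfer1-2- is taken from the example on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not a Volcano or kubectl feature).
# Usage: kubectl get pod --all-namespaces | find_rescheduled_pod PREFIX OLD_NAME
# Prints the names of Running pods whose name starts with PREFIX and is not OLD_NAME.
find_rescheduled_pod() {
  local prefix="$1" old_name="$2"
  awk -v prefix="$prefix" -v old="$old_name" '
    # Skip the header row of the kubectl table.
    NR == 1 && $1 == "NAMESPACE" { next }
    # Column 2 is NAME and column 4 is STATUS when --all-namespaces is used.
    index($2, prefix) == 1 && $2 != old && $4 == "Running" { print $2 }
  '
}
```

For example, on a live cluster, new_pod=$(kubectl get pod --all-namespaces | find_rescheduled_pod resnetinfer1-2- resnetinfer1-2-scpr5) stores the name of the rescheduled pod, which can then be passed to kubectl logs. An empty result means that no replacement pod is running yet.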