When a fault occurs while an inference task is running, Volcano reschedules the task to other NPUs.
kubectl get pod --all-namespaces
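Example output is shown below. The task name changes from resnetinfer1-2-scpr5 to resnetinfer1-2-xsdsf, which indicates that the fault rescheduling feature ran successfully. The task name is generated with a random string, so use the actual name in your environment.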
[2025-02-24 19:13:09,331] [2269] [281472887965984] [llm] [INFO] [logging.py-331] : Answer[0]: Deep learning is a subset of machine learning that uses neural networks with multiple layers to model complex relationships between
[2025-02-24 19:13:09,331] [2269] [281472887965984] [llm] [INFO] [logging.py-331] : Generate[0] token num: (0, 20)
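The name comparison described above can also be checked programmatically. The following is a minimal sketch, not part of Volcano or kubectl: `find_rescheduled` is a hypothetical helper, and it assumes the random string is the part of the name after the last hyphen. It takes the pod names collected before and after the fault and pairs each vanished name with a new name that shares the same prefix.

```python
def find_rescheduled(before, after):
    """Pair pod names that disappeared with new names sharing the same prefix.

    Hypothetical helper for illustration only.
    Assumption: the randomly generated part follows the last hyphen.
    """
    def prefix(name):
        # "resnetinfer1-2-scpr5" -> "resnetinfer1-2"
        return name.rsplit("-", 1)[0]

    gone = sorted(set(before) - set(after))   # names present only before the fault
    fresh = sorted(set(after) - set(before))  # names present only after the fault

    pairs = []
    for old in gone:
        for new in fresh:
            if prefix(new) == prefix(old):
                pairs.append((old, new))
    return pairs


# Names taken from the example in this section.
before = ["resnetinfer1-2-scpr5"]
after = ["resnetinfer1-2-xsdsf"]
print(find_rescheduled(before, after))
# -> [('resnetinfer1-2-scpr5', 'resnetinfer1-2-xsdsf')]
```

The name lists would come from the pod names shown by `kubectl get pod --all-namespaces` before and after the fault. An empty result means no name with a matching prefix changed.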