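MindIE Pod Log Collection
File Description
- Collect the MindIE Pod console (stdout) logs with K8s commands or the collection script. The MindIE Pod logs contain instance node information and are stored uniformly as JSON files.
- Naming constraint: ${pod_name}.json.
- Storage path constraint:
  - <collection directory>/mindie/log/mindie_cluster_log/
  - ${path specified by the --mindie_log parameter}/
- For details, see "Log Collection Directory Structure".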
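Usage Example
Collection Methods
The fault diagnosis tool supports the following ways of collecting MindIE Pod console logs:
- Script collection. Among the log collection scripts, use the pod_log_collect.sh script to collect the MindIE Pod console logs.
- Command collection. Collect the MindIE Pod console logs by running a command.
Command Collection
- After the MindIE service has started and is running stably, run the following command to collect the MindIE Pod console logs.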
kubectl logs -f -n ${namespace} ${podname} | head -n 1000 > ${log_dir}/${podname}.log 2>&1 &
View the ${podname}.log file in the ${log_dir} directory.
The log content looks like this:
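The collection command above can be wrapped in a small function so that the same steps apply to any Pod. This is only a minimal sketch: the function name collect_pod_log and its argument order are hypothetical, and the pipeline itself is the one shown above. Because of the -f flag the call can block while it waits for log lines, so run it in the background with & as in the original command.

```shell
# Hypothetical wrapper around the collection command shown above.
# Arguments: $1 = namespace, $2 = Pod name, $3 = output directory.
collect_pod_log() {
    # Create the output directory if it does not exist yet.
    mkdir -p "$3"
    # Same pipeline as the documented command: keep the first 1000 log lines.
    kubectl logs -f -n "$1" "$2" | head -n 1000 > "$3/$2.log" 2>&1
}
```

For example, `collect_pod_log "${namespace}" "${podname}" "${log_dir}" &` writes ${log_dir}/${podname}.log.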
……
INFO:root:status of ranktable is not completed, waiting for file update.
INFO:root:status of ranktable is not completed, waiting for file update.
INFO:root:status of ranktable is not completed, waiting for file update.
{"IsMindIEEPJob":true,"status":"completed","server_list":[{"device":[{"device_id":"0","device_ip":"10.0.2.41","super_device_id":"113246208","rank_id":"0"},{"device_id":"1","device_ip":"10.0.3.41","super_device_id":"113311745","rank_id":"1"},{"device_id":"2","device_ip":"10.0.2.42","super_device_id":"113508354","rank_id":"2"},{"device_id":"3","device_ip":"10.0.3.42","super_device_id":"113573891","rank_id":"3"},{"device_id":"4","device_ip":"10.0.2.43","super_device_id":"113770500","rank_id":"4"},{"device_id":"5","device_ip":"10.0.3.43","super_device_id":"113836037","rank_id":"5"},{"device_id":"6","device_ip":"10.0.2.44","super_device_id":"114032646","rank_id":"6"},{"device_id":"7","device_ip":"10.0.3.44","super_device_id":"114098183","rank_id":"7"},{"device_id":"8","device_ip":"10.0.2.45","super_device_id":"114294792","rank_id":"8"},{"device_id":"9","device_ip":"10.0.3.45","super_device_id":"114360329","rank_id":"9"},{"device_id":"10","device_ip":"10.0.2.46","super_device_id":"114556938","rank_id":"10"},{"device_id":"11","device_ip":"10.0.3.46","super_device_id":"114622475","rank_id":"11"},{"device_id":"12","device_ip":"10.0.2.47","super_device_id":"114819084","rank_id":"12"},{"device_id":"13","device_ip":"10.0.3.47","super_device_id":"114884621","rank_id":"13"},{"device_id":"14","device_ip":"10.0.2.48","super_device_id":"115081230","rank_id":"14"},{"device_id":"15","device_ip":"10.0.3.48","super_device_id":"115146767","rank_id":"15"}],"server_id":"141.61.57.128","container_ip":"192.168.247.11"}],"server_count":"1","version":"1.2","super_pod_list":[{"super_pod_id":"1","server_list":[{"server_id":"141.61.57.128"}]}]}
……
- server_list: lists all nodes of the instance that the Pod belongs to.
- container_ip: IP address of the container.
- device_id: card number of the device.
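To inspect the fields described above without a JSON parser, the ranktable line can be pulled out of the collected log with grep. The following is a rough sketch: the helper names extract_ranktable and list_field are hypothetical, and it assumes that the ranktable JSON is printed on a single log line and that the field values are quoted strings, as in the sample log.

```shell
# Hypothetical helper: print the last log line that contains "server_list".
# Assumes the ranktable JSON occupies a single line of the log file ($1).
extract_ranktable() {
    grep '"server_list"' "$1" | tail -n 1
}

# Hypothetical helper: read JSON text on stdin and print every value of the
# string field named in $1 (for example container_ip or device_id).
# The leading quote in the pattern keeps device_id from matching super_device_id.
list_field() {
    grep -o "\"$1\":\"[^\"]*\"" | cut -d'"' -f4
}
```

For example, `extract_ranktable "${log_dir}/${podname}.log" | list_field container_ip` prints one container IP per line, and replacing container_ip with device_id lists the card numbers.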

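After the service is started, the MindIE Pod log records instance-related entries. Because the logs are subject to an aging mechanism, a MindIE Pod log that is collected without these instance-related entries cannot be used by the component for multi-instance fault diagnosis.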
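A quick check after collection can show whether the instance-related entries are still present. The sketch below is an assumption-based illustration, not part of the tool: the function name has_instance_info is hypothetical, and it treats the presence of a "server_list" key in the log as the sign that the instance information was captured.

```shell
# Hypothetical check: succeed (exit status 0) when the collected log ($1)
# contains a line with a "server_list" key, which this sketch takes as the
# marker for instance-related entries.
has_instance_info() {
    grep -q '"server_list"' "$1"
}
```

For example, `has_instance_info "${log_dir}/${podname}.log" || echo "instance information missing, collect the log again"` flags a log that was collected after the entries aged out.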