MindIE Pod Log Collection
File Description
- You can run Kubernetes commands or use a collection script to collect MindIE pod console logs. MindIE pod logs contain instance node information and are stored in JSON files.
- Naming constraint: ${pod_name}.json.
- Constraints on the storage path:
  - Collection_directory/mindie/log/mindie_cluster_log/
  - ${Path specified by --mindie_log}/
- For details, see Log Collection Directory Structure.
Usage Example
- Write a collection script, using pod_log_collect.sh as a reference.
- Ensure that the output path of the script is Collection_directory/mindie/log/mindie_cluster_log/. You can run the script from any directory to collect logs.
Output path example:
log_dir="Collection_directory/mindie/log/mindie_cluster_log/"
Example:
bash pod_log_collect.sh
The ${pod_name}.json file is generated in the output directory.
Collection Mode Description
The fault diagnosis tool can collect MindIE pod console logs in either of the following ways:
- Script-based collection: Use the pod_log_collect.sh script to collect MindIE pod console logs.
- CLI-based collection: Collect MindIE pod console logs using commands.
CLI-based Collection
- After the MindIE service has started and is running stably, run the following command to collect MindIE pod console logs.
kubectl logs -f -n ${namespace} ${podname} | head -n 1000 > ${log_dir}/${podname}.log 2>&1 &
View the ${podname}.log file in the ${log_dir} directory.
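The command above can be wrapped in a small helper so that the output directory exists before the log is redirected into it. The following is a minimal sketch, not part of the fault diagnosis tool: the function name collect_pod_log and its optional max_lines argument are assumptions introduced for illustration. Only the kubectl logs and head invocations come from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the tool): wraps the documented
# kubectl command and creates the output directory first.
collect_pod_log() {
    local namespace="$1" podname="$2" log_dir="$3"
    local max_lines="${4:-1000}"   # assumed default, same as "head -n 1000" above

    # Create the target directory so the redirection below cannot fail
    # because of a missing path.
    mkdir -p "${log_dir}" || return 1

    # Follow the pod console log and keep only the first max_lines lines.
    kubectl logs -f -n "${namespace}" "${podname}" \
        | head -n "${max_lines}" > "${log_dir}/${podname}.log" 2>&1
}
```

To mirror the background execution of the original command, the helper could be called as collect_pod_log "${namespace}" "${podname}" "${log_dir}" &.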
Log content:
...
INFO:root:status of ranktable is not completed, waiting for file update.
INFO:root:status of ranktable is not completed, waiting for file update.
INFO:root:status of ranktable is not completed, waiting for file update.
{"IsMindIEEPJob":true,"status":"completed","server_list":[{"device":[{"device_id":"0","device_ip":"10.0.2.41","super_device_id":"113246208","rank_id":"0"},{"device_id":"1","device_ip":"10.0.3.41","super_device_id":"113311745","rank_id":"1"},{"device_id":"2","device_ip":"10.0.2.42","super_device_id":"113508354","rank_id":"2"},{"device_id":"3","device_ip":"10.0.3.42","super_device_id":"113573891","rank_id":"3"},{"device_id":"4","device_ip":"10.0.2.43","super_device_id":"113770500","rank_id":"4"},{"device_id":"5","device_ip":"10.0.3.43","super_device_id":"113836037","rank_id":"5"},{"device_id":"6","device_ip":"10.0.2.44","super_device_id":"114032646","rank_id":"6"},{"device_id":"7","device_ip":"10.0.3.44","super_device_id":"114098183","rank_id":"7"},{"device_id":"8","device_ip":"10.0.2.45","super_device_id":"114294792","rank_id":"8"},{"device_id":"9","device_ip":"10.0.3.45","super_device_id":"114360329","rank_id":"9"},{"device_id":"10","device_ip":"10.0.2.46","super_device_id":"114556938","rank_id":"10"},{"device_id":"11","device_ip":"10.0.3.46","super_device_id":"114622475","rank_id":"11"},{"device_id":"12","device_ip":"10.0.2.47","super_device_id":"114819084","rank_id":"12"},{"device_id":"13","device_ip":"10.0.3.47","super_device_id":"114884621","rank_id":"13"},{"device_id":"14","device_ip":"10.0.2.48","super_device_id":"115081230","rank_id":"14"},{"device_id":"15","device_ip":"10.0.3.48","super_device_id":"115146767","rank_id":"15"}],"server_id":"141.61.57.128","container_ip":"192.168.247.11"}],"server_count":"1","version":"1.2","super_pod_list":[{"super_pod_id":"1","server_list":[{"server_id":"141.61.57.128"}]}]}
...
- server_list: all nodes hosting the pod instance
- container_ip: container IP address
- device_id: device ID
Instance logs are recorded after the MindIE pod service starts. Because logs are subject to an aging mechanism, logs collected too late may no longer contain the instance logs. If the collected MindIE pod logs do not contain instance logs, multi-instance fault diagnosis is not supported.
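Before running multi-instance fault diagnosis, it can be useful to confirm that a collected log still contains the instance information. The check below is a minimal sketch under stated assumptions: the function names has_instance_info and list_container_ips are hypothetical, and the sketch assumes that the instance record sits on a single line containing both "server_list" and "status":"completed", as in the sample log content shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds (exit status 0) if the log file has a line
# that contains both "server_list" and "status":"completed".
has_instance_info() {
    local log_file="$1"
    grep -F '"server_list"' "${log_file}" | grep -qF '"status":"completed"'
}

# Hypothetical helper: prints each distinct container_ip value in the log.
list_container_ips() {
    local log_file="$1"
    # Match "container_ip":"<value>" and keep only <value> (4th quote-delimited field).
    grep -o '"container_ip":"[^"]*"' "${log_file}" | cut -d'"' -f4 | sort -u
}
```

A caller might run has_instance_info "${log_dir}/${podname}.log" and collect the log again if the check fails.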