(Optional) Processor Fault Level Configuration
To customize fault levels, create a fault code configuration file and pass its path as the value of the -faultConfigPath parameter when starting Container Manager. The following example uses the node status detection fault (dmp_daemon), fault code 80E21007. To change its fault handling policy from NotHandleFault to RestartNPU, perform the following steps:
- Log in to the environment and go to any directory (for example, /home/container-manager).
- Create a fault code configuration file, for example, faultCode.json.
vi faultCode.json
- Press i to enter insert mode and paste the default fault code configuration from "Default Fault Code Configuration" into the file.
- Locate the fault code 80E21007.
"NotHandleFaultCodes":[ "80E21007","80E38003","80F78006","80C98006","80CB8006","81318006","80A18006","80A18005","80FB8000","8C1F8609", ... ], ...
A fault code can be listed under multiple fault levels. In that case, the fault is handled at the highest configured level by default.
- Delete the fault code 80E21007 from NotHandleFaultCodes and add it to RestartNPUCodes.
"NotHandleFaultCodes":[ "80E38003","80F78006","80C98006","80CB8006","81318006","80A18006","80A18005","80FB8000","8C1F8609", ... ], ... "RestartNPUCodes":[ "8C204E00","A8028802","A4302003","A4302004","A4302005","A4302006","A4302009","A430200A","80CF8009","80CF8008","80E21007",... ... ],
- After the modification, press Esc and enter :wq! to save the file and exit.
- Check the permissions of the custom fault code configuration file and ensure that they are not higher than 640.
- Start Container Manager. If Container Manager has been installed, restart it for the configuration to take effect.
systemctl daemon-reload && systemctl restart container-manager.service # Reload the service configuration and restart the installed Container Manager.
If "load custom fault config file from /home/container-manager/faultCode.json success" is displayed in the log, the fault code is configured successfully.
- The fault code is defined in the system configuration. Do not modify it unless necessary, as changes may cause errors during troubleshooting.
- After the custom fault code configuration file is modified, restart Container Manager for the modification to take effect. If the content of the configuration file is incorrect, Container Manager reports an error and exits.
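The manual edit described in the steps above (move a fault code from one level list to another, then restrict the file permissions) can also be scripted. The following is a minimal sketch, not part of Container Manager. It assumes the configuration file is a single JSON object whose top-level keys (such as NotHandleFaultCodes and RestartNPUCodes) map to lists of fault code strings, and it interprets "not higher than 640" as "no permission bits beyond rw-r-----". The function names move_fault_code, permission_ok, and update_config_file are hypothetical, and the sample data is a shortened stand-in for the default configuration.

```python
import json
import os
import stat
import tempfile


def move_fault_code(config, code, src_key, dst_key):
    """Move `code` from the list at src_key to the list at dst_key (in place)."""
    src = config.get(src_key, [])
    if code not in src:
        raise ValueError(f"{code} is not listed under {src_key}")
    config[src_key] = [c for c in src if c != code]
    dst = config.setdefault(dst_key, [])
    if code not in dst:  # avoid listing the code twice at the same level
        dst.append(code)
    return config


def permission_ok(path, limit=0o640):
    """Return True if the file has no permission bits outside `limit`.

    Assumption: "not higher than 640" means no bits beyond rw-r-----.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & ~limit) == 0


def update_config_file(path, code, src_key, dst_key):
    """Rewrite the JSON file with `code` moved, then set its mode to 640."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)  # raises json.JSONDecodeError on malformed JSON
    move_fault_code(config, code, src_key, dst_key)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    os.chmod(path, 0o640)


if __name__ == "__main__":
    # Shortened, hypothetical stand-in for the default configuration.
    sample = {
        "NotHandleFaultCodes": ["80E21007", "80E38003"],
        "RestartNPUCodes": ["8C204E00", "A8028802"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "faultCode.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sample, f)
        update_config_file(path, "80E21007", "NotHandleFaultCodes", "RestartNPUCodes")
        with open(path, encoding="utf-8") as f:
            print(json.load(f))
        print("permission ok:", permission_ok(path))
```

Loading the file with json.load before restarting Container Manager also catches malformed JSON early. It does not check that the content is otherwise acceptable to Container Manager, so the restart and log check described above are still required.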