Hardware Stress Test
Before a large cluster task is run, the hardware must be stress-tested to identify and remove nodes with accuracy issues. The stress tests are of two kinds:
- Model-based stress test: Split the cards or machines into groups, run the same single-node or multi-node task on each group, and single out any group whose accuracy is inconsistent with that of most cards or machines.
- Command-based stress test: Run the following ascend-dmi command to perform the stress test:
ascend-dmi -dg -i aicore -s -sc 60 -q
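For the model-based stress test, the comparison across groups can be automated once each group has reported a result. The following is a minimal sketch, not part of any Ascend tool: it assumes every group ran the same task with the same seed and reported one scalar metric (for example, the final loss), and it flags groups whose metric deviates from the median by more than a relative tolerance. The function name find_inconsistent_groups, the group names, the metric values, and the tolerance are all hypothetical and should be adapted to the actual task.

```python
from statistics import median


def find_inconsistent_groups(results, rel_tol=1e-3):
    """Return the names of groups whose metric deviates from the median.

    results: dict mapping a group name to the scalar metric it reported
             (hypothetical format; any comparable scalar would do).
    rel_tol: maximum allowed relative deviation from the median
             (assumed value; tune it for the model and precision in use).
    """
    baseline = median(results.values())
    # Guard against division by zero when the median itself is 0.
    scale = max(abs(baseline), 1e-12)
    return sorted(
        name
        for name, value in results.items()
        if abs(value - baseline) / scale > rel_tol
    )


# Hypothetical final-loss values reported by four groups running the same task.
results = {
    "group-0": 2.3101,
    "group-1": 2.3102,
    "group-2": 2.3100,
    "group-3": 2.4570,
}

print(find_inconsistent_groups(results))  # prints ['group-3']
```

The median is used as the baseline rather than the mean so that a single faulty group does not shift the reference value. A flagged group can then be split into smaller groups and retested to narrow the fault down to a machine or card.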