Hardware Stress Test

Before a large cluster task is launched, the hardware should be stress-tested to identify and remove nodes with accuracy issues. The stress test procedure is as follows:

  1. Model-based stress test: Divide the cluster into groups and run single-node or multi-node tasks on each group to find any group whose accuracy is inconsistent with that of most cards or machines.
  2. Command-based stress test: Run the ascend-dmi command to stress-test the hardware directly:
    ascend-dmi -dg -i aicore -s -sc 60 -q
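
The comparison in step 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of any Ascend tooling: it assumes each group has run the same task and reported a single scalar metric such as the final loss. The function name find_inconsistent_groups, the group names, the loss values, and the rel_tol threshold are all hypothetical and would need tuning to the task's normal run-to-run variation.

```python
import math
from statistics import median


def find_inconsistent_groups(results, rel_tol=1e-3):
    """Return the names of groups whose metric deviates from the majority.

    results: dict mapping a group name to the metric (for example, the final
             loss) obtained by running the same task on that group.
    rel_tol: hypothetical relative tolerance around the median value.
    """
    # NaN or infinite metrics are always suspicious, so keep them out of
    # the reference value.
    finite = [v for v in results.values() if math.isfinite(v)]
    if not finite:
        return sorted(results)

    # The median stands in for "the value most cards or machines agree on".
    reference = median(finite)
    # Scale the tolerance by the reference; fall back to an absolute
    # tolerance when the reference is exactly zero.
    threshold = rel_tol * abs(reference) if reference != 0 else rel_tol

    return sorted(
        name
        for name, value in results.items()
        if not math.isfinite(value) or abs(value - reference) > threshold
    )


# Hypothetical per-group losses after an identical short run.
losses = {
    "group-0": 2.3051,
    "group-1": 2.3050,
    "group-2": 2.3052,
    "group-3": 2.4710,  # deviates from the others
}
print(find_inconsistent_groups(losses))  # prints ['group-3']
```

A flagged group can then be split into smaller groups and the task rerun, narrowing the search down to a single machine or card before it is removed from the cluster.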