New Features

Feature

Description

MindIO ACP

MindIO ACP is open-sourced.

MindIO TFT

  • MindIO TFT supports hot switchover on sub-healthy faults in the MindSpore scenario.
  • MindIO TFT is open-sourced.

MindCluster ToolBox

  • Adds the PCIe full eye diagram test, and optimizes the duration and algorithms of the on-chip memory stress test for A3 servers (A+X).
  • Adds the 310P power consumption stress test.
  • Adds the DSA stress test.

MindCluster Ascend FaultDiag

Adds fault events for A3 AI servers.

MindCluster Ascend Deployer

N/A

MindCluster components

  • Process-level online recovery for UnifiedBus L1-L2 link faults is supported when operator re-execution is disabled.
  • Traffic isolation for faulty NPU instances can be deployed based on AIBrix vLLM.
  • NPU Exporter can output the SN.
  • AIBrix vLLM-based serving instance-level rescheduling is supported.
  • Based on the CRD definition of the AIBrix community, YAML files can be generated in one click, and one-click configuration and delivery are supported.
  • Based on the native CRD definition, YAML files can be generated in one click, and one-click configuration and delivery are supported.
  • SGLang OME deployment and instance-level rescheduling are supported.
  • The reliability of UnifiedBus fault reporting is enhanced.
  • An adaptation layer is added to Volcano to hide the differences between job controllers and to support affinity scheduling for all podGroups that meet the format requirements.
  • Scheduling resource usage is optimized: a job whose scheduling does not complete can be re-enqueued after a period of time.
  • Pre-isolation processing is supported for public faults.
  • NPU Exporter supports custom metrics.
  • Multi-instance inference job scheduling is supported for A3 servers.
  • A3 servers are compatible with the accelerator-type of A2 servers.
  • Compatibility with ecosystem components is verified.
  • A reference design for the inference job daemon is introduced.
  • NPU fault detection and rectification are supported for appliances.
  • Volcano scheduling supports StatefulSet.
  • Hot switchover in the MindSpore scenario is supported.
  • The usability of quick training recovery is enhanced.