Overview

  • The MindIO ACP SDK can be deployed on host machines and in containers.
  • To deploy in a container, you need to create a container image, deploy the image, and harden the image's security.
  • Only specific versions of the DeepSpeed framework, X1 framework, MindSpeed-LLM, and Kubernetes are supported.
  • When using the MindIO ACP service, ensure that the user who starts the training job and the user who starts the MindIO ACP daemon process belong to the same primary group.

After the MindIO ACP SDK is installed, you can use the cache acceleration capability of MindIO ACP by replacing the Torch load/save functions in the Python files of your training model with the load/save functions of the MindIO ACP SDK.
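
The replacement described above can be sketched as a single place where the checkpoint functions are chosen, so the rest of the training code does not change. This is a minimal sketch, not the SDK's documented usage: the names mindio_acp.save and mindio_acp.load, and the assumption that they take the same (obj, path) and (path) arguments as the Torch functions, are inferred from the text. When the SDK is not installed, a pickle-based stand-in is used so that the sketch still runs; in a real training job that fallback would be torch.save and torch.load.

```python
import pickle
import tempfile
from pathlib import Path

try:
    # Present only on machines where the MindIO ACP SDK is installed.
    import mindio_acp

    # Assumed names and call shapes: save(obj, path) and load(path).
    ckpt_save, ckpt_load = mindio_acp.save, mindio_acp.load
    backend = "mindio_acp"
except ImportError:
    # Stand-in for torch.save / torch.load so the sketch runs anywhere.
    def ckpt_save(obj, path):
        with open(path, "wb") as f:
            pickle.dump(obj, f)

    def ckpt_load(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    backend = "pickle stand-in"

# Hypothetical training state; a real job would pass a model state dict.
state = {"epoch": 3, "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = str(Path(tmp) / "checkpoint.pt")
    ckpt_save(state, ckpt_path)      # was: torch.save(state, ckpt_path)
    restored = ckpt_load(ckpt_path)  # was: torch.load(ckpt_path)

print(f"backend: {backend}, round trip ok: {restored == state}")
```

Routing every call through ckpt_save and ckpt_load keeps the switch in one place, which makes it easy to fall back to the Torch functions on machines without the SDK.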

  • The same data can be saved to multiple paths. If your training model calls torch.save in a loop to save the same data to several paths, replace those calls with the mindio_acp.multi_save function of the MindIO ACP SDK.
  • The MindIO ACP SDK provides register_checker(callback, check_dict, user_context, timeout_sec) for checking the number of files in folders. check_dict registers the folders to be observed and the number of regular files expected in each folder. Within the time specified by timeout_sec, MindIO ACP checks whether the number of files in each folder matches the number specified in check_dict, and then invokes the registered callback function to notify the application. user_context is passed to the callback function as its second parameter, so you can supply any parameters that the callback function requires. timeout_sec is the timeout interval of the registered check: if the folders still do not meet the requirements when the interval expires, an error is reported in the callback function. You can then process the subsequent service logic based on the check result.
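
The behavior of the two functions in the list above can be illustrated with a pure-Python emulation. This is only a sketch of the described semantics, not the SDK's implementation: multi_save_standin and check_folders are hypothetical helpers, the (obj, path_list) argument order is an assumption, the check result is modeled as a boolean, and the check runs synchronously, whereas the real register_checker presumably returns after registration and calls back later.

```python
import pickle
import tempfile
import time
from pathlib import Path


def multi_save_standin(obj, path_list):
    """Write the same object to every path, serializing it only once."""
    payload = pickle.dumps(obj)
    for path in path_list:
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)


def check_folders(callback, check_dict, user_context, timeout_sec, poll_sec=0.05):
    """Poll until each folder holds its expected file count or time runs out.

    check_dict maps a folder path to the number of regular files expected.
    The callback receives (result, user_context); result is True on a match
    and False on timeout (a boolean result is an assumption of this sketch).
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        matched = all(
            Path(folder).is_dir()
            and sum(1 for p in Path(folder).iterdir() if p.is_file()) == expected
            for folder, expected in check_dict.items()
        )
        if matched or time.monotonic() >= deadline:
            callback(matched, user_context)
            return
        time.sleep(poll_sec)


results = []


def on_checked(result, user_context):
    # user_context arrives as the second parameter, as described in the text.
    results.append((result, user_context["step"]))


with tempfile.TemporaryDirectory() as root:
    folders = [Path(root) / "ckpt_a", Path(root) / "ckpt_b"]

    # One call replaces a loop of save calls over the same data.
    multi_save_standin({"step": 100}, [d / "model.pt" for d in folders])

    # Both folders hold exactly one file, so the check succeeds at once.
    check_folders(on_checked, {str(d): 1 for d in folders}, {"step": 100}, 1.0)

    # Two files are expected but only one exists, so the check times out.
    check_folders(on_checked, {str(folders[0]): 2}, {"step": 100}, 0.2)

print(results)
```

In this emulation the callback decides what happens next, for example marking a checkpoint as complete on success or triggering a retry on timeout, which matches the intent of processing the service logic based on the check result.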