The execution log reports error EI0004, "The ranktable or rank is invalid", as shown below.
custom_group: None dtype: float32 data: 1 iter: 1 profiling: false pid: 1040540
[2024-04-24 06:31:38.087571: F ge_plugin.cc:338] [GePlugin] Initialize ge failed, ret: failed
Error Message is:
EI0004: 2024-04-24-06:31:33.915.634 The ranktable or rank is invalid, Reason:[The ip in ranktable is not a valid ip address]. Please check the configured ranktable. [The ranktable path configured in the training can be found in the plogs.]
Solution: Try again with a valid cluster configuration in the ranktable file. Ensure that the configuration matches the operating environment.
TraceBack (most recent call last):
PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:89]
OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:241]
GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:169]
GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:307]
This error is typically raised while the rank id or ranktable data is being validated, and it usually has one of several causes.
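The "Reason" field in the log message usually indicates where the problem lies. In the log above, for example, it reports that an IP address in the ranktable is not valid. Using that hint, check the rank id or ranktable passed in the delivered configuration to pinpoint the fault.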
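The check described above can be sketched in Python. This is a minimal illustration, not part of any official tooling: it pulls the "Reason" text out of an EI0004 message and then walks a ranktable that has already been parsed from JSON, looking for IP fields that do not parse. The helper names `extract_reason` and `find_invalid_ips` are hypothetical, and the key name `device_ip` is an assumption about the ranktable layout; adjust `ip_keys` to match the fields in the actual file.

```python
import ipaddress
import re

# Matches the bracketed text that follows "Reason:" in the error message.
REASON_RE = re.compile(r"Reason:\s*\[([^\]]*)\]")


def extract_reason(log_text):
    """Return the text inside 'Reason:[...]', or None if it is absent."""
    match = REASON_RE.search(log_text)
    return match.group(1) if match else None


def find_invalid_ips(node, ip_keys=("device_ip",), path=""):
    """Recursively collect (path, value) pairs whose value is not a valid IP.

    `ip_keys` lists the keys treated as IP fields. The default is an
    assumption about the ranktable layout, not a documented schema.
    """
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in ip_keys:
                # Only strings are accepted: ip_address() would also
                # accept a bare integer, which is not wanted here.
                if not isinstance(value, str):
                    bad.append((child, value))
                    continue
                try:
                    ipaddress.ip_address(value)
                except ValueError:
                    bad.append((child, value))
            else:
                bad.extend(find_invalid_ips(value, ip_keys, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            bad.extend(find_invalid_ips(item, ip_keys, f"{path}[{index}]"))
    return bad


if __name__ == "__main__":
    # Illustrative data only; a real ranktable would be loaded with json.load().
    sample = {"server_list": [{"device": [
        {"device_ip": "192.168.1.8", "rank_id": "0"},
        {"device_ip": "192.168.1.256", "rank_id": "1"},
    ]}]}
    for location, value in find_invalid_ips(sample):
        print(f"invalid IP at {location}: {value!r}")
```

Running the reason extraction first tells you which check is worth doing; the IP walk above only covers the case shown in this log, and other "Reason" values (for example a rank id problem) would need their own checks.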