分布式任务端口号被占用

问题现象

回显信息中存在关键字“Address already in use”,类似如下屏显信息:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/root/miniconda3/envs/pt2.1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/root/miniconda3/envs/pt2.1/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 54, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

故障根因

关键过程:刚启动分布式任务时,就报该错误。

根本原因分析:拉起的分布式任务中断后有残留进程,或者环境中存在同样端口号的进程,导致该端口号被占用。

处理方法

Error Code

故障事件名称

端口号被占用

故障解释/可能原因

环境中已有进程占用该端口号

故障影响

端口号被占用,不能正常通信,分布式任务失败。

故障自处理模式

查看分布式代码所有的端口号,并修改为其他端口号。

查看并修改脚本中os.environ['MASTER_PORT']设置的端口号,例如:

修改前
os.environ['MASTER_PORT'] = '29500'
修改后
os.environ['MASTER_PORT'] = '29580'

系统处理建议

无需操作