Hi! I'm running launch_multi_node with pytorch-lightning
@SmugDolphin23
Logs of rank0:
Environment setup completed successfully
Starting Task Execution:
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
rank=0
DEVICE_COUNT: 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [gpuvm-01.pv]:29500 (errno: 97 - Address family not supported by protocol).
1718702425204 gpuvm-01:gpu3,0 DEBUG ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
1718702430310 gpuvm-01:gpu3,0 DEBUG ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank0]: main()
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank0]: run(task, current_conf.get('node_rank'), args)
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank0]: trainer.fit(model, datamodule)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank0]: self.__setup_profiler()
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank0]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank0]: dirpath = self.strategy.broadcast(dirpath)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]: broadcast(object_sizes_tensor, src=src, group=group)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]: work = default_pg.broadcast([tensor], opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
1718702435318 gpuvm-01:gpu3,0 DEBUG Process failed, exit code 1
Logs of rank1:
Environment setup completed successfully
Starting Task Execution:
1718702279944 gpuvm-11:gpu0,5 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/342596a41c8344f5b75dfe082002d130/output/log
task id [342596a41c8344f5b75dfe082002d130]
world=4
rank=1
DEVICE_COUNT: 2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [terminal]:29500 (errno: 97 - Address family not supported by protocol).
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
1718702435519 gpuvm-11:gpu0,5 DEBUG [rank1]: Traceback (most recent call last):
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank1]: main()
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank1]: run(task, current_conf.get('node_rank'), args)
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank1]: trainer.fit(model, datamodule)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank1]: self.__setup_profiler()
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank1]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank1]: dirpath = self.strategy.broadcast(dirpath)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=src, group=group)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank1]: work = default_pg.broadcast([tensor], opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank1]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank1]: Last error:
[rank1]: socketProgress: Connection closed by remote peer terminal<34282>
1718702445543 gpuvm-11:gpu0,5 DEBUG Process failed, exit code 1
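The rank 0 error above suggests rerunning with `NCCL_DEBUG=INFO` for details. A minimal sketch of how that could be set at the top of the training script, before `trainer.fit` is reached, is shown here. The helper names `nccl_debug_env` and `apply_env` and the interface name `eth0` are hypothetical and not part of the original script; `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables.

```python
import os


def nccl_debug_env(interface=None):
    """Build environment variables that make NCCL log more detail.

    `interface` is an optional network interface name (an assumption,
    e.g. "eth0"); when given, NCCL is asked to use it for its sockets.
    """
    env = {
        "NCCL_DEBUG": "INFO",             # verbose NCCL logging, as the error message suggests
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # limit output to init and network subsystems
    }
    if interface is not None:
        env["NCCL_SOCKET_IFNAME"] = interface
    return env


def apply_env(env, target=None):
    """Copy `env` into `target` (os.environ by default) without
    overwriting values that are already set by the launcher."""
    if target is None:
        target = os.environ
    for key, value in env.items():
        target.setdefault(key, value)
    return target


if __name__ == "__main__":
    # Must run on every node before the process group is created.
    apply_env(nccl_debug_env())
    print({k: os.environ[k] for k in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS")})
```

With these set on every node, the NCCL output should show which GPU and network interface each rank picks, which may help narrow down the `nvmlDeviceGetHandleByPciBusId() failed: Not Found` message on rank 0.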