Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Unanswered
Hi! I'M Running Launch_Multi_Mode With Pytorch-Lightning


@<1523701435869433856:profile|SmugDolphin23>
Logs of rank0:

Environment setup completed successfully
 
Starting Task Execution:
 
 
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See: 

ClearML results page: 
 /projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
rank=0
DEVICE_COUNT: 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [gpuvm-01.pv]:29500 (errno: 97 - Address family not supported by protocol).
 
1718702425204 gpuvm-01:gpu3,0 DEBUG ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
 
1718702430310 gpuvm-01:gpu3,0 DEBUG ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
 
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank0]:     main()
[rank0]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank0]:     run(task, current_conf.get('node_rank'), args)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank0]:     trainer.fit(model, datamodule)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
 
1718702435318 gpuvm-01:gpu3,0 DEBUG Process failed, exit code 1

Logs of rank1:

Environment setup completed successfully
 
Starting Task Execution:
 
 
1718702279944 gpuvm-11:gpu0,5 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See: 

ClearML results page: 
 /projects/0eae440b14054464a3f9c808ad6447dd/experiments/342596a41c8344f5b75dfe082002d130/output/log
task id [342596a41c8344f5b75dfe082002d130]
world=4
rank=1
DEVICE_COUNT: 2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [terminal]:29500 (errno: 97 - Address family not supported by protocol).
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
 
1718702435519 gpuvm-11:gpu0,5 DEBUG [rank1]: Traceback (most recent call last):
[rank1]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank1]:     main()
[rank1]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank1]:     run(task, current_conf.get('node_rank'), args)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank1]:     trainer.fit(model, datamodule)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank1]:     self.__setup_profiler()
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank1]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]:                                                                             ^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank1]:     dirpath = self.strategy.broadcast(dirpath)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank1]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank1]:     work = default_pg.broadcast([tensor], opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank1]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank1]: Last error:
[rank1]: socketProgress: Connection closed by remote peer terminal<34282>
 
1718702445543 gpuvm-11:gpu0,5 DEBUG Process failed, exit code 1
  
  
Posted 3 months ago
34 Views
0 Answers
3 months ago
2 months ago