Hi! I'm running launch_multi_node with pytorch-lightning

SmugDolphin23
Logs of rank0:

Environment setup completed successfully
 
Starting Task Execution:
 
 
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:


ClearML results page:

 /projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
rank=0
DEVICE_COUNT: 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [gpuvm-01.pv]:29500 (errno: 97 - Address family not supported by protocol).
 
1718702425204 gpuvm-01:gpu3,0 DEBUG ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
 
1718702430310 gpuvm-01:gpu3,0 DEBUG ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
 
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank0]:     main()
[rank0]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank0]:     run(task, current_conf.get('node_rank'), args)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank0]:     trainer.fit(model, datamodule)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
 
1718702435318 gpuvm-01:gpu3,0 DEBUG Process failed, exit code 1

Logs of rank1:

Environment setup completed successfully
 
Starting Task Execution:
 
 
1718702279944 gpuvm-11:gpu0,5 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:


ClearML results page:

 /projects/0eae440b14054464a3f9c808ad6447dd/experiments/342596a41c8344f5b75dfe082002d130/output/log
task id [342596a41c8344f5b75dfe082002d130]
world=4
rank=1
DEVICE_COUNT: 2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [terminal]:29500 (errno: 97 - Address family not supported by protocol).
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
 
1718702435519 gpuvm-11:gpu0,5 DEBUG [rank1]: Traceback (most recent call last):
[rank1]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank1]:     main()
[rank1]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank1]:     run(task, current_conf.get('node_rank'), args)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank1]:     trainer.fit(model, datamodule)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank1]:     self.__setup_profiler()
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank1]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]:                                                                             ^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank1]:     dirpath = self.strategy.broadcast(dirpath)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank1]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank1]:     work = default_pg.broadcast([tensor], opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank1]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank1]: Last error:
[rank1]: socketProgress: Connection closed by remote peer terminal<34282>
 
1718702445543 gpuvm-11:gpu0,5 DEBUG Process failed, exit code 1
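
The rank0 traceback above suggests rerunning with NCCL_DEBUG=INFO. A minimal sketch of one way to do that from inside the training script, assuming the variables are set before the Trainer is created so that every rank inherits them. The helper name enable_nccl_debug is hypothetical; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and the subsystem list chosen here is only an assumption about what is useful for this error.

```python
import os


def enable_nccl_debug(env=None, subsystems="INIT,NET"):
    """Turn on verbose NCCL logging through environment variables.

    Hypothetical helper, not part of ClearML or Lightning.
    """
    env = os.environ if env is None else env
    # setdefault keeps any value already exported by the agent or the shell.
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("NCCL_DEBUG_SUBSYS", subsystems)
    return {key: env[key] for key in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS")}
```

Calling enable_nccl_debug() near the top of the script, before trainer.fit, should make the next failing run print which device or socket call NCCL could not complete.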

  				
Posted 8 months ago by GrievingKoala83

you could also try using gloo as the backend (it uses CPU) just to check that the subprocesses spawn properly

  				
Posted 8 months ago by SmugDolphin23

I think we need to set more env vars if we are running with multiple GPUs on 1 node.
Can you try setting:

os.environ["NODE_RANK"] = str(current_conf["node_rank"] // gpus)
os.environ["LOCAL_RANK"] = str(current_conf["node_rank"] % gpus)
os.environ["GLOBAL_RANK"] = str(current_conf["node_rank"])
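
The arithmetic in this suggestion can be sketched as a small function, assuming launch_multi_node was called with nodes * gpus so that current_conf["node_rank"] is a flat index over all processes, counted node by node. The name split_flat_rank is hypothetical. Values are converted to strings because os.environ only accepts strings.

```python
def split_flat_rank(flat_rank, gpus):
    """Map a flat process index to Lightning-style rank variables.

    Hypothetical helper: assumes `gpus` processes per node and that
    flat indices are assigned node by node.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "NODE_RANK": str(flat_rank // gpus),  # which node the process is on
        "LOCAL_RANK": str(flat_rank % gpus),  # which GPU slot on that node
        "GLOBAL_RANK": str(flat_rank),        # index across all processes
    }


# Layout for 2 nodes with 2 GPUs each (4 processes in total).
for rank in range(4):
    print(rank, split_flat_rank(rank, gpus=2))
```

In the training script this would be applied with os.environ.update(split_flat_rank(current_conf["node_rank"], gpus)).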

  				
Posted 9 months ago by SmugDolphin23

can you send the full logs of rank0 and rank1 tasks?

  				
Posted 8 months ago by SmugDolphin23

Hi AgitatedDove14
I started an experiment with gpus=2 and node=2 and I have the following logs

  				
Posted 9 months ago by GrievingKoala83

SmugDolphin23 hi! it works! thanks!

  				
Posted 8 months ago by GrievingKoala83

Hi SmugDolphin23 ! I set NODE_RANK in the environment and now

If gpus=2, node=2, task.launch_multi_node(node): three tasks are created; two of them complete and one fails. In this case (gpus*nodes-1) tasks are created, and either some of them crash with an error or all of them do. The behavior is inconsistent.
If gpus=2, node=2, task.launch_multi_node(node*gpus): seven tasks are created. In this case, all tasks fail except the main one. The errors that occur in the first case are shown in the first two screenshots.

  				
Posted 9 months ago by GrievingKoala83

SmugDolphin23 gloo doesn't work for me either

but torch works with nccl and task.launch_multi_node

problems arise specifically with pytorch-lightning

  				
Posted 8 months ago by GrievingKoala83

Hi GrievingKoala83 ! Are you trying to launch 2 nodes, each using 2 GPUs, on only 1 machine? I think that will likely not work because of an NCCL limitation.
Also, I think that you should actually do

current_conf = task.launch_multi_node(nodes)
os.environ["LOCAL_RANK"] = "0"  # this process should fork the other one
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["GLOBAL_RANK"] = str(int(current_conf.get("node_rank", 0)) * gpus)
os.environ["WORLD_SIZE"] = str(nodes * gpus)
os.environ["LOCAL_WORLD_SIZE"] = str(gpus)

This should spawn only 2 tasks, with each task forking processes according to the number of GPUs.
We will investigate further and officially support this once we have something reliable.
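
A minimal sketch of the same settings as a helper, assuming one ClearML task per node and that local rank 0 starts the remaining local processes. The name per_node_env is hypothetical. The multiplication is done on integers and only the result is converted to a string, since multiplying a string by gpus would repeat the text instead.

```python
def per_node_env(node_rank, nodes, gpus):
    """Build environment variables for the first process on each node.

    Hypothetical helper: assumes one task per node and `gpus` processes
    per node, with local rank 0 launching the other local ranks.
    """
    node_rank = int(node_rank)
    if not 0 <= node_rank < nodes:
        raise ValueError(f"node_rank {node_rank} is outside [0, {nodes})")
    return {
        "LOCAL_RANK": "0",
        "NODE_RANK": str(node_rank),
        "GLOBAL_RANK": str(node_rank * gpus),  # first global rank on this node
        "WORLD_SIZE": str(nodes * gpus),
        "LOCAL_WORLD_SIZE": str(gpus),
    }


print(per_node_env(node_rank=1, nodes=2, gpus=2))
```

With nodes=2 and gpus=2, the second node gets GLOBAL_RANK 2 and WORLD_SIZE 4, which matches the "MEMBER: n/4" lines Lightning prints at startup.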

  				
Posted 8 months ago by SmugDolphin23

for example, global rank from failed task in first scenario

  				
Posted 9 months ago by GrievingKoala83

GrievingKoala83 Looks like something inside NCCL now fails, which doesn't allow rank0 to start. Are you running this inside a docker container? What is the output of nvidia-smi inside this container?

  				
Posted 8 months ago by SmugDolphin23

because I think that what you are encountering now is an NCCL error

  				
Posted 8 months ago by SmugDolphin23

SmugDolphin23 it works with gpus=1 and node=2, and only two tasks are created

  				
Posted 9 months ago by GrievingKoala83

SmugDolphin23 Two tasks were created when gpus=2, nodes=2, task.launch_multi_node(node). But they stay in the running state indefinitely, and model training does not begin.

  				
Posted 9 months ago by GrievingKoala83

SmugDolphin23 if task.launch_multi_node(4), then all 4 tasks fail

  				
Posted 8 months ago by GrievingKoala83

SmugDolphin23 Each task shows that the process allocates only 1 GPU out of 2 (all tasks have the same scalar as below)

  				
Posted 8 months ago by GrievingKoala83

SmugDolphin23 I added os.environ["NCCL_SOCKET_IFNAME"] and I managed to run on nccl.
But it seems that the workaround you suggested does not run 2 processes on 2 nodes, but rather 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes*args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["node_rank"] % args.gpus)
And when I set args.nodes=2, args.gpus=2, I have 4 tasks:

first host, global rank = 0, local rank = 0
second host, global rank = 1, local rank = 1
third host, global rank = 2, local rank = 0
fourth host, global rank = 3, local rank = 1
How do I fix this?

  				
Posted 8 months ago by GrievingKoala83

The errors that occur in the second case are shown in these screenshots.

  				
Posted 9 months ago by GrievingKoala83

Hi GrievingKoala83

"Two tasks are created, but the training does not begin, both tasks are in perpetual running."

Can you print something after the task.launch_multi_node(args.nodes) call? I'm assuming the two Tasks are running and are blocked on the "Trainer" class.

"If specified args.gpus=2 and args.nodes=2, three tasks are created."

This is really odd, can you add some prints with the task id and rank after the launch_multi_node call?

print(f"task id [{task.id}] world={os.environ['WORLD_SIZE']} rank={os.environ['RANK']}")
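
A slightly more defensive variant of that print, as a sketch: it reads the variables with .get so that a missing key shows up as <unset> instead of raising KeyError. The helper name describe_rank is hypothetical, and the list of variable names is an assumption based on the names mentioned in this thread.

```python
import os

# Rank-related variables discussed in this thread (an assumed, non-exhaustive list).
RANK_VARS = ("WORLD_SIZE", "RANK", "NODE_RANK", "LOCAL_RANK")


def describe_rank(task_id, env=None):
    """Format the task id together with whatever rank variables are set."""
    env = os.environ if env is None else env
    parts = [f"task id [{task_id}]"]
    for name in RANK_VARS:
        parts.append(f"{name.lower()}={env.get(name, '<unset>')}")
    return " ".join(parts)


# The second argument stands in for os.environ in this example.
print(describe_rank("example-task-id", {"WORLD_SIZE": "4", "RANK": "0"}))
```

In the training script the call would be print(describe_rank(task.id)) placed right after launch_multi_node.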

  				
Posted 9 months ago by AgitatedDove14

GrievingKoala83 did you call task.launch_multi_node(4) or 2? I think the right value is 4 in this case

  				
Posted 8 months ago by SmugDolphin23

does it work running this without clearml? GrievingKoala83

  				
Posted 8 months ago by SmugDolphin23

1 more thing: It's likely that you should do task.launch_multi_node(args.nodes * args.gpus) instead, as I see that the world size set by lightning corresponds to this value

  				
Posted 9 months ago by SmugDolphin23

Hi GrievingKoala83 ! It looks like lightning uses the NODE_RANK env var to get the rank of a node, instead of NODE (which is used by pytorch).
We don't set NODE_RANK yet, but you could set it yourself after launch_multi_node:

import os    
current_conf = task.launch_multi_node(2)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))

Hope this helps

  				
Posted 9 months ago by SmugDolphin23

GrievingKoala83 does it work properly when gpus=1? Also, what are the values found under Initializing distributed: GLOBAL_RANK: , MEMBER: in the 2 scenarios, for each task?

  				
Posted 9 months ago by SmugDolphin23

Hi GrievingKoala83 ! We have released clearml==1.16.3rc1 which should solve the issue now. Just specify task.launch_multi_node(nodes, devices=gpus) . For example:

import sys
import os
from argparse import ArgumentParser

import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets.mnist import MNIST

from clearml import Task


class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()

        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('valid_loss', loss)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parser


if __name__ == '__main__':
    Task.force_store_standalone_script()
    Task.add_requirements("./requirements.txt")
    pl.seed_everything(0)

    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=32, type=int)
    parser.add_argument('--max_epochs', default=3, type=int)
    sys.argv.extend(['--max_epochs', '1'])
    parser = LitClassifier.add_model_specific_args(parser)
    args = parser.parse_args()

    task = Task.init(project_name="examples", task_name="pytorch lightning MNIST")
    task.execute_remotely(queue_name="Eugene2")
    nodes = 2
    gpus = 2
    config = task.launch_multi_node(nodes, devices=gpus, hide_children=True)
    print(os.environ)

    # ------------
    # data
    # ------------
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_test = MNIST('', train=False, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])

    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
    test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

    model = LitClassifier(args.hidden_dim, args.learning_rate)

    ddp = DDPStrategy(process_group_backend="nccl")
    # pass the strategy explicitly so the nccl backend configured above is actually used
    trainer = pl.Trainer(max_epochs=args.max_epochs, devices=gpus, num_nodes=nodes, strategy=ddp)
    trainer.fit(model, train_loader, val_loader)

  				
Posted 8 months ago by SmugDolphin23

SmugDolphin23 yeah, I am running this inside a docker container and cuda is available

  				
Posted 8 months ago by GrievingKoala83

Hi SmugDolphin23 Thank you for your reply!
I use 2 machines.
I set these parameters, but unfortunately, the training has not started.

torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 2/4 clients joined.

  				
Posted 8 months ago by GrievingKoala83

GrievingKoala83 what error are you getting when using gloo? Is it the same one?

  				
Posted 8 months ago by SmugDolphin23

Answers 28