Hi @<1523701205467926528:profile|AgitatedDove14>
I started an experiment with gpus=2 and node=2 and I have the following logs
Hi @<1578555761724755968:profile|GrievingKoala83> ! We have released clearml==1.16.3rc1
which should solve the issue now. Just specify task.launch_multi_node(nodes, devices=gpus)
. For example:
import sys
import os
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets.mnist import MNIST
from clearml import Task
class LitClassifier(pl.LightningModule):
def __init__(self, hidden_dim=128, learning_rate=1e-3):
super().__init__()
self.save_hyperparameters()
self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)
def forward(self, x):
x = x.view(x.size(0), -1)
x = torch.relu(self.l1(x))
x = torch.relu(self.l2(x))
return x
def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = F.cross_entropy(y_hat, y)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = F.cross_entropy(y_hat, y)
self.log('valid_loss', loss)
return loss
def test_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = F.cross_entropy(y_hat, y)
return loss
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
@staticmethod
def add_model_specific_args(parent_parser):
parser = ArgumentParser(parents=[parent_parser], add_help=False)
parser.add_argument('--hidden_dim', type=int, default=128)
parser.add_argument('--learning_rate', type=float, default=0.0001)
return parser
if __name__ == '__main__':
Task.force_store_standalone_script()
Task.add_requirements("./requirements.txt")
pl.seed_everything(0)
parser = ArgumentParser()
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--max_epochs', default=3, type=int)
sys.argv.extend(['--max_epochs', '1'])
parser = LitClassifier.add_model_specific_args(parser)
args = parser.parse_args()
task = Task.init(project_name="examples", task_name="pytorch lightning MNIST")
task.execute_remotely(queue_name="Eugene2")
nodes = 2
gpus = 2
config = task.launch_multi_node(nodes, devices=gpus, hide_children=True)
print(os.environ)
# ------------
# data
# ------------
dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
mnist_test = MNIST('', train=False, download=True, transform=transforms.ToTensor())
mnist_train, mnist_val = random_split(dataset, [55000, 5000])
train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
test_loader = DataLoader(mnist_test, batch_size=args.batch_size)
model = LitClassifier(args.hidden_dim, args.learning_rate)
ddp = DDPStrategy(process_group_backend="nccl")
trainer = pl.Trainer(max_epochs=args.max_epochs, devices=gpus, num_nodes=nodes)
trainer.fit(model, train_loader, val_loader)
@<1523701435869433856:profile|SmugDolphin23> hi! it works! thanks!
you could also try using gloo
as the backend (it uses CPU) just to check that the subprocesses spawn properly
does it work running this without clearml? @<1578555761724755968:profile|GrievingKoala83>
because I think that what you are encountering now is an NCCL error
@<1523701435869433856:profile|SmugDolphin23> I added os.environ["NCCL_SOCKET_IFNAME"
and I managed to run on nccl
But it seems that workaround that you said do not run 2 processes on 2 nodes, but 4 processes on 4 different nodescurrent_conf =
task.launch_multi_node(args.nodes*args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["node_rank"] % args.gpus)
And when I set args.nodes=2, args.gpus=2, I have 4 tasks:
- first host, global rank = 0, local rank = 0
- second host, global rank = 1, local rank = 1
- third host, global rank = 2, local rank = 0
- fourth host, global rank = 3, local rank = 1
How do I fix this?
Hi @<1523701435869433856:profile|SmugDolphin23> ! I set NODE_RANK in the environment and now
- if gpus=2, node=2, task.launch_multi_node(node) : three tasks are created, and two of which are completed, but one is failed. In this case, are created (gpus*nodes-1) of tasks, some of which crashes with an error, or they all fall with an error. the behavior is inconsistent.
- if gpus=2, node=2, task.launch_multi_node(node*gpus) : seven tasks are created.I n this case, all tasks are failed except the main.The errors that occur in the first case are presented in the first two screenshots.
Hi @<1578555761724755968:profile|GrievingKoala83> ! It looks like lightning uses the NODE_RANK
env var to get the rank of a node, instead of NODE
(which is used by pytorch).
We don't set NODE_RANK
yet, but you could set it yourself after launchi_multi_node
:
import os
current_conf = task.launch_multi_node(2)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
Hope this helps
The errors that occur in the second case are presented in this screenshots.
1 more thing: It's likely that you should do task.launch_multi_node(args.nodes * args.gpus)
instead, as I see that the world size set by lightning corresponds to this value
I think we need to set more env var if we are running with multiple gpus on 1 node.
Can you try setting:
os.environ["NODE_RANK"] = current_conf["node_rank"] // gpus
os.environ["LOCAL_RANK"] = current_conf["node_rank"] % gpus
os.environ["GLOBAL_RANK"] = current_conf["node_rank"]
for example, global rank from failed task in first scenario
@<1523701435869433856:profile|SmugDolphin23> it work with gpus=1 and node=2 and there are only two tasks is created
@<1523701435869433856:profile|SmugDolphin23>
Logs of rank0:
Environment setup completed successfully
Starting Task Execution:
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
rank=0
DEVICE_COUNT: 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [gpuvm-01.pv]:29500 (errno: 97 - Address family not supported by protocol).
1718702425204 gpuvm-01:gpu3,0 DEBUG ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
1718702430310 gpuvm-01:gpu3,0 DEBUG ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank0]: main()
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank0]: run(task, current_conf.get('node_rank'), args)
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank0]: trainer.fit(model, datamodule)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank0]: self.__setup_profiler()
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank0]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank0]: dirpath = self.strategy.broadcast(dirpath)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]: broadcast(object_sizes_tensor, src=src, group=group)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]: work = default_pg.broadcast([tensor], opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
1718702435318 gpuvm-01:gpu3,0 DEBUG Process failed, exit code 1
Logs of rank1:
Environment setup completed successfully
Starting Task Execution:
1718702279944 gpuvm-11:gpu0,5 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/342596a41c8344f5b75dfe082002d130/output/log
task id [342596a41c8344f5b75dfe082002d130]
world=4
rank=1
DEVICE_COUNT: 2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [terminal]:29500 (errno: 97 - Address family not supported by protocol).
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
1718702435519 gpuvm-11:gpu0,5 DEBUG [rank1]: Traceback (most recent call last):
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank1]: main()
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank1]: run(task, current_conf.get('node_rank'), args)
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank1]: trainer.fit(model, datamodule)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank1]: self.__setup_profiler()
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank1]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank1]: dirpath = self.strategy.broadcast(dirpath)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=src, group=group)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank1]: work = default_pg.broadcast([tensor], opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank1]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank1]: Last error:
[rank1]: socketProgress: Connection closed by remote peer terminal<34282>
1718702445543 gpuvm-11:gpu0,5 DEBUG Process failed, exit code 1
@<1523701435869433856:profile|SmugDolphin23> if task.aunch_multi_node(4)
, then all 4 tasks are failed
@<1578555761724755968:profile|GrievingKoala83> did you call task.aunch_multi_node(4)
or 2
? I think the right value is 4 in this case
@<1523701435869433856:profile|SmugDolphin23> yeah, I am running this inside a docker container and cuda is available
@<1523701435869433856:profile|SmugDolphin23> gloo doesn't work for me either
but torch work with nccl and task.launch_multi_node
problems arise specifically with pytorch-lightning
Hi @<1578555761724755968:profile|GrievingKoala83>
Two tasks are created, but the training does not begin, both tasks are in perpetual running.
Can you print something after the task.launch_multi_node(args.nodes))
- I'm assuming the two Tasks are running and are blocked on the " Trainer
" class
If specified
args.gpus=2
and args.nodes=2,
three
tasks are created.
This is really odd, can you add some prints with task id and rank after the launch_multi_node
call?
print(f"task id [{task.id}] world={os.environ['WORLD_SIZE']} rank={os.environ['RANK`]}")
@<1523701435869433856:profile|SmugDolphin23> Each task shows that process allocates only 1 gpu out of 2 (all task have the same scalar as below)
@<1523701435869433856:profile|SmugDolphin23> Two tasks were created when gpus=2, nodes=2, task.launch_multi_node(node). But their running status does not end, and model training does not begin.
can you send the full logs of rank0 and rank1 tasks?
Hi @<1578555761724755968:profile|GrievingKoala83> ! Are you trying to launch 2 nodes each using 2 gpus on only 1 machine? Because I think that will likely not work because of nccl limitation
Also, I think that you should actually do
task.launch_multi_node(nodes)
os.environ["LOCAL_RANK"] = 0 # this process should fork the other one
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["GLOBAL_RANK"] = str(current_conf.get("node_rank", "")) * gpus
os.environ["WORLD_SIZE"] = nodes * gpus
os.environ["LOCAL_WORLD_SIZE"] = gpus
This should spawn only 2 tasks, each task being forked based on the number of gpus.
We will investigate further and officially support this once we have something reliable
@<1578555761724755968:profile|GrievingKoala83> Looks like something inside NCCL now fails which doesn't allow rank0 to start. are you running this inside a docker container? what is the output of nvidia-smi
inside of this container?
Hi @<1523701435869433856:profile|SmugDolphin23> Thank you for your reply!
I use 2 machines.
I set these parameters, but unfortunately, the training has not started.
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 2/4 clients joined.
@<1578555761724755968:profile|GrievingKoala83> what error are you getting when using gloo? Is it the same one?
@<1578555761724755968:profile|GrievingKoala83> does it work properly when gpus=1? Also, what are the values found under Initializing distributed: GLOBAL_RANK: , MEMBER:
in the 2 scenarios, for each task?