@<1578555761724755968:profile|GrievingKoala83> did you call task.launch_multi_node(4) or 2? I think the right value is 4 in this case.
Hi @<1523701435869433856:profile|SmugDolphin23> Thank you for your reply!
I use 2 machines.
I set these parameters, but unfortunately, the training has not started.
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 2/4 clients joined.
Hi @<1578555761724755968:profile|GrievingKoala83> ! We have released clearml==1.16.3rc1, which should solve the issue now. Just specify task.launch_multi_node(nodes, devices=gpus). For example:
import sys
import os
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning.strategies.ddp import DDPStrategy
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets.mnist import MNIST
from clearml import Task
class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('valid_loss', loss)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parser


if __name__ == '__main__':
    Task.force_store_standalone_script()
    Task.add_requirements("./requirements.txt")
    pl.seed_everything(0)

    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=32, type=int)
    parser.add_argument('--max_epochs', default=3, type=int)
    sys.argv.extend(['--max_epochs', '1'])
    parser = LitClassifier.add_model_specific_args(parser)
    args = parser.parse_args()

    task = Task.init(project_name="examples", task_name="pytorch lightning MNIST")
    task.execute_remotely(queue_name="Eugene2")
    nodes = 2
    gpus = 2
    config = task.launch_multi_node(nodes, devices=gpus, hide_children=True)
    print(os.environ)

    # ------------
    # data
    # ------------
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_test = MNIST('', train=False, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])
    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
    test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

    model = LitClassifier(args.hidden_dim, args.learning_rate)

    ddp = DDPStrategy(process_group_backend="nccl")
    trainer = pl.Trainer(max_epochs=args.max_epochs, devices=gpus, num_nodes=nodes, strategy=ddp)
    trainer.fit(model, train_loader, val_loader)
Hi @<1578555761724755968:profile|GrievingKoala83> ! It looks like lightning uses the NODE_RANK env var to get the rank of a node, instead of NODE (which is used by pytorch).
We don't set NODE_RANK yet, but you could set it yourself after launch_multi_node:
import os
current_conf = task.launch_multi_node(2)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
Hope this helps
For example, the global rank from the failed task in the first scenario:
@<1523701435869433856:profile|SmugDolphin23> it works with gpus=1 and node=2, and only two tasks are created
Hi @<1578555761724755968:profile|GrievingKoala83> ! Are you trying to launch 2 nodes, each using 2 gpus, on only 1 machine? Because I think that will likely not work due to an NCCL limitation.
Also, I think that you should actually do:
current_conf = task.launch_multi_node(nodes)
os.environ["LOCAL_RANK"] = "0"  # this process should fork the other one
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["GLOBAL_RANK"] = str(current_conf.get("node_rank", 0) * gpus)
os.environ["WORLD_SIZE"] = str(nodes * gpus)
os.environ["LOCAL_WORLD_SIZE"] = str(gpus)
This should spawn only 2 tasks, each task being forked based on the number of gpus.
We will investigate further and officially support this once we have something reliable
I think we need to set more env vars if we are running with multiple gpus on 1 node.
Can you try setting:
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // gpus)
os.environ["LOCAL_RANK"] = str(current_conf["node_rank"] % gpus)
os.environ["GLOBAL_RANK"] = str(current_conf["node_rank"])
@<1523701435869433856:profile|SmugDolphin23> Two tasks were created when gpus=2, nodes=2, task.launch_multi_node(node), but they remain in the running state and model training never begins.
Hi @<1523701435869433856:profile|SmugDolphin23> ! I set NODE_RANK in the environment and now:
- if gpus=2, node=2, task.launch_multi_node(node): three tasks are created, two of which complete but one fails. In this case (gpus*nodes - 1) tasks are created, and either some of them crash with an error or all of them do; the behavior is inconsistent.
- if gpus=2, node=2, task.launch_multi_node(node*gpus): seven tasks are created. In this case, all tasks fail except the main one. The errors that occur in the first case are presented in the first two screenshots.
@<1578555761724755968:profile|GrievingKoala83> does it work properly when gpus=1? Also, what are the values found under "Initializing distributed: GLOBAL_RANK: , MEMBER:" in the 2 scenarios, for each task?
@<1523701435869433856:profile|SmugDolphin23> Each task shows that the process allocates only 1 GPU out of 2 (all tasks have the same scalars as below).
Hi @<1578555761724755968:profile|GrievingKoala83>
Two tasks are created, but the training does not begin; both tasks stay in the running state.
Can you print something after the task.launch_multi_node(args.nodes) call? I'm assuming the two Tasks are running and are blocked on the Trainer class.
If args.gpus=2 and args.nodes=2 are specified, three tasks are created.
This is really odd, can you add some prints with the task id and rank after the launch_multi_node call?
print(f"task id [{task.id}] world={os.environ['WORLD_SIZE']} rank={os.environ['RANK']}")
The errors that occur in the second case are presented in these screenshots.
because I think that what you are encountering now is an NCCL error
you could also try using gloo
as the backend (it uses CPU) just to check that the subprocesses spawn properly
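For reference, a minimal sketch of that check against the example above (only the DDPStrategy backend changes, everything else stays the same):
# try gloo (CPU-based) instead of nccl, just to verify the subprocesses spawn and rendezvous
ddp = DDPStrategy(process_group_backend="gloo")
trainer = pl.Trainer(max_epochs=args.max_epochs, devices=gpus, num_nodes=nodes, strategy=ddp)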
does it work running this without clearml? @<1578555761724755968:profile|GrievingKoala83>
@<1523701435869433856:profile|SmugDolphin23> if task.launch_multi_node(4) is used, then all 4 tasks fail
can you send the full logs of rank0 and rank1 tasks?
1 more thing: It's likely that you should do task.launch_multi_node(args.nodes * args.gpus)
instead, as I see that the world size set by lightning corresponds to this value
@<1523701435869433856:profile|SmugDolphin23> yeah, I am running this inside a docker container and cuda is available
Hi @<1523701205467926528:profile|AgitatedDove14>
I started an experiment with gpus=2 and node=2 and I have the following logs
@<1523701435869433856:profile|SmugDolphin23> I added os.environ["NCCL_SOCKET_IFNAME"] and I managed to run on nccl.
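Roughly along these lines (a sketch; the interface name "eth0" is a placeholder for whatever interface actually connects the nodes):
import os
# NCCL_SOCKET_IFNAME tells NCCL which network interface to use for inter-node traffic;
# set it before the process group is initialized (i.e. before trainer.fit)
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder interface name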
But it seems that the workaround you suggested does not run 2 processes on 2 nodes, but 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes * args.gpus)
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["node_rank"] % args.gpus)
And when I set args.nodes=2, args.gpus=2, I have 4 tasks:
- first host, global rank = 0, local rank = 0
- second host, global rank = 1, local rank = 1
- third host, global rank = 2, local rank = 0
- fourth host, global rank = 3, local rank = 1
How do I fix this?
@<1578555761724755968:profile|GrievingKoala83> Looks like something inside NCCL now fails, which doesn't allow rank0 to start. Are you running this inside a docker container? What is the output of nvidia-smi inside this container?
@<1523701435869433856:profile|SmugDolphin23> gloo doesn't work for me either
but torch works with nccl and task.launch_multi_node
problems arise specifically with pytorch-lightning
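For context, the kind of plain-torch check that passes here is roughly the following sketch (not the exact script used; it assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set in the environment, e.g. by launch_multi_node):
import os
import torch
import torch.distributed as dist

# pin this process to its GPU, then init the nccl process group from the env vars
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # should equal WORLD_SIZE on every rank if NCCL is healthy
print(f"rank={dist.get_rank()} all_reduce result={t.item()}")
dist.destroy_process_group()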
@<1523701435869433856:profile|SmugDolphin23>
Logs of rank0:
Environment setup completed successfully
Starting Task Execution:
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
rank=0
DEVICE_COUNT: 2
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [gpuvm-01.pv]:29500 (errno: 97 - Address family not supported by protocol).
1718702425204 gpuvm-01:gpu3,0 DEBUG ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
1718702430310 gpuvm-01:gpu3,0 DEBUG ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank0]: main()
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank0]: run(task, current_conf.get('node_rank'), args)
[rank0]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank0]: trainer.fit(model, datamodule)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank0]: self.__setup_profiler()
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank0]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank0]: dirpath = self.strategy.broadcast(dirpath)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank0]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]: broadcast(object_sizes_tensor, src=src, group=group)
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]: work = default_pg.broadcast([tensor], opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
1718702435318 gpuvm-01:gpu3,0 DEBUG Process failed, exit code 1
Logs of rank1:
Environment setup completed successfully
Starting Task Execution:
1718702279944 gpuvm-11:gpu0,5 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/342596a41c8344f5b75dfe082002d130/output/log
task id [342596a41c8344f5b75dfe082002d130]
world=4
rank=1
DEVICE_COUNT: 2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [terminal]:29500 (errno: 97 - Address family not supported by protocol).
Missing logger folder: /root/.clearml/venvs-builds/3.11/code/lightning_logs
1718702435519 gpuvm-11:gpu0,5 DEBUG [rank1]: Traceback (most recent call last):
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 419, in <module>
[rank1]: main()
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 414, in main
[rank1]: run(task, current_conf.get('node_rank'), args)
[rank1]: File "/root/.clearml/venvs-builds/3.11/code/lightning_mnist_ddp.py", line 360, in run
[rank1]: trainer.fit(model, datamodule)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 944, in _run
[rank1]: self.__setup_profiler()
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in __setup_profiler
[rank1]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank1]: ^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1249, in log_dir
[rank1]: dirpath = self.strategy.broadcast(dirpath)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
[rank1]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=src, group=group)
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank1]: work = default_pg.broadcast([tensor], opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, remote process exited or there was a network error, NCCL version 2.20.5
[rank1]: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
[rank1]: Last error:
[rank1]: socketProgress: Connection closed by remote peer terminal<34282>
1718702445543 gpuvm-11:gpu0,5 DEBUG Process failed, exit code 1
@<1523701435869433856:profile|SmugDolphin23> hi! it works! thanks!
@<1578555761724755968:profile|GrievingKoala83> what error are you getting when using gloo? Is it the same one?