@<1523701205467926528:profile|AgitatedDove14> thanks for your reply! Do I also need to change the path for docker_pip_cache?
However, nothing is saved at this path.
How do I override the /root/.cache/pip path?
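For reference, both paths are controlled from the agent section of clearml.conf. A minimal sketch, assuming a clearml-agent version that supports docker_internal_mounts (the values shown are the defaults, so adjust as needed):
```
agent {
    # host-side pip cache that the agent mounts into every docker container
    docker_pip_cache: ~/.clearml/pip-cache

    # in-container mount points; pip_cache is what ends up at /root/.cache/pip
    docker_internal_mounts {
        pip_cache: "/root/.cache/pip"
    }
}
```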
No, I start pipelines by cloning them as tasks. It's less visual, but this way I can change all my hyperparameters.
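For context, this is roughly the clone-and-enqueue pattern being described; a minimal sketch, where the project, task, and queue names and the parameter keys are hypothetical:
```
from clearml import Task

# fetch the template task to clone (names here are hypothetical)
template = Task.get_task(project_name="my_project", task_name="train_baseline")
cloned = Task.clone(source_task=template, name="train_baseline (override)")

# hyperparameters use the "<section>/<name>" keys shown in the web UI
cloned.set_parameters({"Args/learning_rate": 0.001, "Args/batch_size": 64})
Task.enqueue(cloned, queue_name="default")
```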
@<1523701435869433856:profile|SmugDolphin23> Two tasks were created with gpus=2, nodes=2, and task.launch_multi_node(node), but their status never leaves "running" and model training does not begin.
@<1523701435869433856:profile|SmugDolphin23> it works with gpus=1 and nodes=2, and only two tasks are created.
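For readers following along: launch_multi_node is called on the master task and spawns one clone of it per node; the returned dict tells each copy who it is. A minimal sketch, assuming the documented return keys (project and task names are hypothetical):
```
from clearml import Task

task = Task.init(project_name="my_project", task_name="multi_node_train")
# spawn clones of this task until 2 nodes are running in total
config = task.launch_multi_node(2)
print(config.get("node_rank"), config.get("master_addr"), config.get("master_port"))
```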
Hi @<1523701435869433856:profile|SmugDolphin23> Thank you for your reply!
I use 2 machines.
I set these parameters, but unfortunately, the training has not started.
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 2/4 clients joined.
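For what it's worth, "2/4 clients joined" means the rendezvous store was created for a world size of 4 (nodes × gpus-per-node) but only 2 ranks ever connected, so the remaining processes never reached init_process_group. A minimal sketch of the env-var rendezvous that is timing out here:
```
import os
import torch.distributed as dist

# with the default env:// rendezvous, every rank must see the same
# MASTER_ADDR/MASTER_PORT and a consistent WORLD_SIZE before the timeout
world_size = int(os.environ["WORLD_SIZE"])  # e.g. 4 = 2 nodes x 2 GPUs
rank = int(os.environ["RANK"])
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```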
@<1523701435869433856:profile|SmugDolphin23> This error occurs when a secondary task is created with launch_multi_node, and it disappears when I add the reuse_last_task_id=False flag when initializing the task. But now I have a new problem: I can't request more than 2 nodes. The training log freezes after several iterations of the first epoch with three workers, and if I request four workers I get this error:
DEBUG Epoch 0: 8%|▊ | 200/2484 [04:43<53:55, 0.71it/s, v_num=...
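For reference, the flag mentioned above goes into Task.init; a minimal sketch with hypothetical project and task names:
```
from clearml import Task

# reuse_last_task_id=False forces a brand-new task instead of
# reusing (and overwriting) the previous run's task
task = Task.init(
    project_name="my_project",     # hypothetical
    task_name="multi_node_train",  # hypothetical
    reuse_last_task_id=False,
)
```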
Hi @<1523701205467926528:profile|AgitatedDove14>
I started an experiment with gpus=2 and nodes=2, and I got the following logs:


The /root/clearml.conf file no longer contains anything.
@<1523701070390366208:profile|CostlyOstrich36> Above, I provided the code for this pipeline. I specify cache_executed_step=True for each pipeline step, but it doesn't work.
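For context, this is where the flag goes in a function-based pipeline; a minimal sketch with hypothetical names and inputs. As far as I understand, a step is only served from cache when its code, inputs, and requirements are all unchanged from a previous run:
```
from clearml import PipelineController

def prepare_data(source_url):  # hypothetical step function
    return "dataset_id"

pipe = PipelineController(name="my_pipeline", project="my_project", version="1.0.0")
pipe.add_function_step(
    name="prepare_data",
    function=prepare_data,
    function_kwargs={"source_url": "s3://bucket/data"},  # hypothetical input
    function_return=["dataset_id"],
    cache_executed_step=True,  # reuse an identical previous execution
)
pipe.start()
```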
@<1523701435869433856:profile|SmugDolphin23> I added os.environ["NCCL_SOCKET_IFNAME"] and managed to run on NCCL.
But it seems that the workaround you suggested does not run 2 processes on 2 nodes, but 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes * args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
`os.environ["LOCAL_RANK"] = str(current_conf["nod...
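Piecing the snippet above together, this is my reading of the workaround, with the truncated LOCAL_RANK line completed as a guess and a hypothetical cluster shape and interface name:
```
import os
from clearml import Task

NODES, GPUS_PER_NODE = 2, 2  # hypothetical cluster shape

task = Task.init(project_name="my_project", task_name="multi_node_train")
# launch one entry per process, i.e. nodes * gpus in total
current_conf = task.launch_multi_node(NODES * GPUS_PER_NODE)

rank = current_conf["node_rank"]  # enumerates processes when launched this way
os.environ["NODE_RANK"] = str(rank // GPUS_PER_NODE)  # machine index
os.environ["LOCAL_RANK"] = str(rank % GPUS_PER_NODE)  # GPU index on that machine (guess at the truncated line)
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # hypothetical network interface
```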
@<1523701435869433856:profile|SmugDolphin23> yeah, I am running this inside a Docker container and CUDA is available.
Hi @<1523701205467926528:profile|AgitatedDove14>
I define the pipeline through functions. I have a lot of parameters, about 40, and it is inconvenient to override them all from the window shown on the screen.
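One pattern that can help with that many parameters is connecting them as a single dictionary, so a cloned run can override them all in one section instead of one field at a time; a minimal sketch with hypothetical names and defaults, assuming a clearml version where connect accepts a section name:
```
from clearml import Task

params = {
    "learning_rate": 0.001,  # hypothetical defaults
    "batch_size": 64,
    # ... the remaining ~40 parameters
}

task = Task.init(project_name="my_project", task_name="train")
# connected values are logged in one section and picked up per-clone
params = task.connect(params, name="General")
```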