Hi @<1523701205467926528:profile|AgitatedDove14>
I define a pipeline through functions. I have a lot of parameters, about 40, and it is inconvenient to override them all from the window shown on the screen.
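Roughly what I mean, as a simplified sketch (step and parameter names are placeholders, and the real pipeline has about 40 parameters instead of these few):

from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset_path"])
def prepare_data(dataset_path: str, val_split: float):
    # ... load and split the data ...
    return dataset_path

@PipelineDecorator.component(return_values=["model_path"])
def train(dataset_path: str, lr: float, batch_size: int, epochs: int):
    # ... training code ...
    return "model.pt"

@PipelineDecorator.pipeline(name="my_pipeline", project="examples", version="0.1")
def run_pipeline(dataset_path="/data", val_split=0.1, lr=1e-3, batch_size=32, epochs=10):
    data = prepare_data(dataset_path, val_split)
    train(data, lr, batch_size, epochs)

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    run_pipeline()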
@<1523701205467926528:profile|AgitatedDove14> thanks for your reply! Do I also need to change the path for docker_pip_cache?
How can I override the /root/.cache/pip path?
If I understand correctly, the pip cache is stored at /root/.cache/pip. How can I change it? Setting the agent.docker_internal_mounts.pip_cache variable in the config does not change anything either.
In the clearml section of values.yaml:
clearml:
  ...
  clearmlConfig: |-
    agent.docker_pip_cache="/mnt/pip_cache"
The /root/clearml.conf file no longer contains anything.
However, nothing is saved to this path.
Hi @<1523701087100473344:profile|SuccessfulKoala55> where can I get examples of REST API requests for creating reports?
@<1523701435869433856:profile|SmugDolphin23> Each task shows that the process allocates only 1 GPU out of 2 (all tasks have the same scalar as below)
@<1523701435869433856:profile|SmugDolphin23> Two tasks were created with gpus=2, nodes=2, task.launch_multi_node(node), but they never leave the running status and model training does not begin.
The errors that occur in the second case are shown in these screenshots.
Hi @<1523701087100473344:profile|SuccessfulKoala55> No, I am using self-hosted ClearML enterprise server
Do I understand correctly that it is impossible to disable the installation of system packages without CLEARML_AGENT_SKIP_PIP_VENV_INSTALL and CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL?
@<1523701435869433856:profile|SmugDolphin23> I added os.environ["NCCL_SOCKET_IFNAME"
and I managed to run with NCCL
But it seems that the workaround you suggested does not run 2 processes on 2 nodes, but rather 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes*args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["nod...
@<1523701435869433856:profile|SmugDolphin23> yeah, I am running this inside a docker container and cuda is available
@<1523701087100473344:profile|SuccessfulKoala55>
@<1523701435869433856:profile|SmugDolphin23> it works with gpus=1 and node=2, and only two tasks are created
@<1523701435869433856:profile|SmugDolphin23>
Logs of rank0:
Environment setup completed successfully
Starting Task Execution:
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
...
Hi @<1523701435869433856:profile|SmugDolphin23> ! I set NODE_RANK in the environment and now
- if gpus=2, node=2, task.launch_multi_node(node): three tasks are created, two of which complete but one fails. In this case (gpus*nodes - 1) tasks are created; some of them crash with an error, or all of them fail with an error. The behavior is inconsistent.
- if gpus=2, node=2, task.launch_multi_node(node*gpus): seven tasks are created. In this case, all tasks are failed except t...
@<1523701435869433856:profile|SmugDolphin23> gloo doesn't work for me either
but torch works with NCCL and task.launch_multi_node
the problems arise specifically with pytorch-lightning
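A sketch of the plain-PyTorch check that does work for me, with no Lightning involved (it assumes launch_multi_node() has set the rendezvous environment variables MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which is how my runs behave; 4 = 2 nodes x 2 GPUs is illustrative):

import torch
import torch.distributed as dist
from clearml import Task

task = Task.init(project_name="examples", task_name="nccl check")
conf = task.launch_multi_node(4)

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
dist.barrier()  # every rank must reach this point if NCCL is configured correctly
print(f"rank {dist.get_rank()} / world {dist.get_world_size()} ok")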
Hi @<1523701435869433856:profile|SmugDolphin23> Thank you for your reply!
I use 2 machines.
I set these parameters, but unfortunately the training did not start.
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 2/4 clients joined.
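As far as I understand, the timeout means only 2 of the 4 expected processes reached the rendezvous. For reference, a sketch of Trainer settings consistent with my setup of 2 machines with 2 GPUs each (values are illustrative):

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,      # GPUs per machine
    num_nodes=2,    # machines participating in DDP
    strategy="ddp",
    max_epochs=10,
)
# trainer.fit(model, datamodule=dm)  # model and datamodule omitted in this sketch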
@<1523701435869433856:profile|SmugDolphin23> hi! it works! thanks!
for example, the global rank from the failed task in the first scenario
@<1523701435869433856:profile|SmugDolphin23> Everything worked after setting the variables --env NCCL_IB_DISABLE=1 --env NCCL_SOCKET_IFNAME=ens192 --env NCCL_P2P_DISABLE=1. Previously these variables were not required for a successful launch. DDP training with two nodes now works for me, but as soon as I increase the number of nodes (nodes > 2), I get the following error.
Traceback (most recent call last):
File "/root/.clearml/venvs-builds/3.11/code/light...
@<1523701435869433856:profile|SmugDolphin23> It is possible to request up to 5 workers in the toy example with a feed-forward network and MNIST, but it is not possible to request more than 2 workers on a real large model
@<1523701435869433856:profile|SmugDolphin23> This error occurs when a secondary task is created with launch_multi_node, and it disappears when I add the reuse_last_task_id=False flag when initializing the task. But now I have a new problem: I can't request more than 2 nodes. The training log freezes after several iterations of the first epoch with three workers, and if I request four workers I get this error:
DEBUG Epoch 0: 8%|▊ | 200/2484 [04:43<53:55, 0.71it/s, v_num=...
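For reference, the Task.init call with the reuse_last_task_id=False flag mentioned above (a sketch; project and task names are placeholders):

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="multi-node ddp",
    reuse_last_task_id=False,  # always create a fresh task instead of reusing the last one
)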
@<1523701435869433856:profile|SmugDolphin23> if task.launch_multi_node(4),
then all 4 tasks fail
Hi @<1523701205467926528:profile|AgitatedDove14>
I started an experiment with gpus=2 and node=2 and I have the following logs
I had similar behavior: the parameters for starting the pipeline cannot be selected in the detailed view, only in the table view