Do I understand correctly that it is impossible to disable the installation of system packages without CLEARML_AGENT_SKIP_PIP_VENV_INSTALL and CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL?
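For context, this is how I currently skip the environment installation when starting the agent; a minimal sketch, assuming the agent is started from this process, with a placeholder interpreter path and queue name:

import os
import subprocess

# Sketch: skip building a fresh venv and reuse an existing interpreter instead.
os.environ["CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL"] = "1"               # skip the python env installation entirely
os.environ["CLEARML_AGENT_SKIP_PIP_VENV_INSTALL"] = "/usr/bin/python3"  # reuse this interpreter (placeholder path)
subprocess.run(["clearml-agent", "daemon", "--queue", "default"])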
Hi @<1523701205467926528:profile|AgitatedDove14>
I started an experiment with gpus=2 and node=2 and I have the following logs
@<1523701435869433856:profile|SmugDolphin23> hi! it works! thanks!
I had similar behavior: the parameters for starting the pipeline cannot be selected in the details view, only in the table view.
No, I start pipelines by cloning them as tasks; it's less visual, but this way I can change all my hyperparameters.
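Concretely, it looks roughly like this; a minimal sketch, where the task id, parameter names, and queue name are placeholders:

from clearml import Task

# Clone an existing (pipeline) task, override a few hyperparameters, and enqueue it.
template = Task.get_task(task_id="<template_task_id>")
cloned = Task.clone(source_task=template, name="pipeline clone with new params")
cloned.set_parameters({"General/learning_rate": 0.001, "General/batch_size": 64})
Task.enqueue(cloned, queue_name="default")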
@<1523701435869433856:profile|SmugDolphin23> I added os.environ["NCCL_SOCKET_IFNAME"] and managed to run with nccl.
But it seems that the workaround you suggested does not run 2 processes on 2 nodes, but rather 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes * args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["nod...
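Spelled out, the mapping I am using looks like this; a sketch with placeholder counts, assuming the dict returned by launch_multi_node carries node_rank as above:

import os
from clearml import Task

nodes, gpus = 2, 2                                    # placeholder counts
task = Task.current_task()
current_conf = task.launch_multi_node(nodes * gpus)   # one ClearML task per process
rank = current_conf["node_rank"]                      # global index of this process
os.environ["NODE_RANK"] = str(rank // gpus)           # which machine the process is on
os.environ["LOCAL_RANK"] = str(rank % gpus)           # which GPU slot on that machine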
Hi @<1523701087100473344:profile|SuccessfulKoala55> where can I get examples of REST API requests for creating reports?
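In case it is useful, this is the generic way I would try it through the SDK's authenticated session; a sketch only: the "reports" service/action names and the payload fields are my assumptions and need to be checked against the server's API reference:

from clearml.backend_api import Session

session = Session()  # picks up credentials from clearml.conf / environment variables
resp = session.send_request(
    service="reports",   # assumed service name, verify against your server's API docs
    action="create",     # assumed action name
    method="post",
    json={"name": "my report", "project": "<project_id>"},  # assumed fields
)
print(resp.status_code, resp.text)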
Hi @<1523701435869433856:profile|SmugDolphin23> ! I set NODE_RANK in the environment and now
- if gpus=2, node=2, task.launch_multi_node(node): three tasks are created; two of them complete, but one fails. In this case (gpus*nodes - 1) tasks are created, and either some of them crash with an error or all of them do; the behavior is inconsistent.
- if gpus=2, node=2, task.launch_multi_node(node*gpus): seven tasks are created. In this case, all tasks fail except t...
The errors that occur in the second case are shown in these screenshots.
For example, the global rank from a failed task in the first scenario:
@<1523701435869433856:profile|SmugDolphin23> it works with gpus=1 and node=2, and only two tasks are created
@<1523701435869433856:profile|SmugDolphin23>
Logs of rank0:
Environment setup completed successfully
Starting Task Execution:
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
...
@<1523701435869433856:profile|SmugDolphin23> if task.launch_multi_node(4) is used, then all 4 tasks fail
@<1523701435869433856:profile|SmugDolphin23> yeah, I am running this inside a docker container and cuda is available
@<1523701435869433856:profile|SmugDolphin23> gloo doesn't work for me either
but torch works with nccl and task.launch_multi_node
problems arise specifically with pytorch-lightning
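For reference, this is roughly how I wire launch_multi_node into the Lightning Trainer; a sketch with placeholder counts, and I am assuming the dict returned by launch_multi_node carries master_addr, master_port, and node_rank:

import os
from clearml import Task
import pytorch_lightning as pl

nodes, gpus = 2, 2                                      # placeholder counts
task = Task.current_task()
config = task.launch_multi_node(nodes)                  # one ClearML task per node
os.environ["MASTER_ADDR"] = str(config["master_addr"])  # assumed dict keys
os.environ["MASTER_PORT"] = str(config["master_port"])
os.environ["NODE_RANK"] = str(config["node_rank"])

trainer = pl.Trainer(
    accelerator="gpu",
    devices=gpus,        # Lightning spawns one process per GPU on each node
    num_nodes=nodes,
    strategy="ddp",
)
# trainer.fit(model, datamodule=dm)  # model / datamodule omitted from this sketch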
@<1523701087100473344:profile|SuccessfulKoala55>
Hi @<1523701087100473344:profile|SuccessfulKoala55> No, I am using a self-hosted ClearML Enterprise server
@<1523701205467926528:profile|AgitatedDove14> thanks for your reply! Do I also need to change the path for docker_pip_cache?
How do I override the /root/.cache/pip path?
Hi @<1523701205467926528:profile|AgitatedDove14>
I define a pipeline through functions. I have a lot of parameters, about 40, and it is inconvenient to override them all from the window shown in the screenshot.
If I understand correctly, the cache for pip is stored at /root/.cache/pip. How can I change it? The agent.docker_internal_mounts.pip_cache variable in the config also does not change anything.
Hi @<1523701435869433856:profile|SmugDolphin23> Thank you for your reply!
I use 2 machines.
I set these parameters, but unfortunately, the training has not started.
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 2/4 clients joined.
kubectl exec -it clearml-agent-85fd8ccc6d-7fdk7 -n clearml bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "k8s-glue" out of: k8s-glue, init-k8s-glue (init)
root@clearml-agent-85fd8ccc6d-7fdk7:~# cat /root/clearml.conf
agent.git_user=gitlab_agent
agent.git_pass=682S-pH9ay1nidsxBGyT
agent.cuda_version=118
#agent.docker_internal_mounts.venv_build=/home/s3_cache/venvs-builds
#agent.do...
in the clearml section in values.yaml:
clearml:
...
clearmlConfig: |-
agent.docker_pip_cache="/mnt/pip_cache"
However, nothing is saved to this path.
I store my data in S3 and ClearML tracks this data. I want to migrate this data from one ClearML instance to another, that is, transfer it to another S3 bucket and have the new ClearML instance track it.
@<1523701435869433856:profile|SmugDolphin23> Each task shows that the process allocates only 1 GPU out of 2 (all tasks show the same scalar as below)
@<1523701435869433856:profile|SmugDolphin23> Two tasks were created when gpus=2, nodes=2, task.launch_multi_node(node), but they stay in the running state and model training does not begin.