
I create a pipeline via PipelineController, adding a step as a function:
pipe = PipelineController(
    name=cfg.clearml.pipeline_name,
    project=cfg.clearml.project_name,
    target_project=True,
    version=cfg.clearml.version,
    add_pipeline_tags=True,
    docker=cfg.clearml.dockerfile,
    docker_args=DefaultMLPLATparam().docker_arg,
    packages=packages,
    retry_on_failure=3,
)
for parameter in cfg.clearml.params:
    pipe.add_...
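For context, a pipeline built this way usually registers its parameters and then its function steps. The sketch below shows that shape; `make_pipeline`, `train_step`, and the `dataset_id` parameter are hypothetical placeholders (they are not in the snippet above), and the clearml import is deferred so the sketch can be read stand-alone:

```python
def make_pipeline(pipeline_name, project_name, version, params):
    """Sketch: build a PipelineController, register parameters,
    and add one function step. Names here are illustrative."""
    from clearml import PipelineController  # deferred; requires a clearml setup

    def train_step(dataset_id):
        # hypothetical step body; runs as its own task when the pipeline executes
        print(f"training on {dataset_id}")
        return dataset_id

    pipe = PipelineController(
        name=pipeline_name,
        project=project_name,
        version=version,
        add_pipeline_tags=True,
    )
    # expose each config value as a pipeline parameter
    for name, default in params.items():
        pipe.add_parameter(name=name, default=default)
    # register the function as a pipeline step; "${pipeline.dataset_id}"
    # is ClearML's syntax for referencing a pipeline parameter
    pipe.add_function_step(
        name="train",
        function=train_step,
        function_kwargs={"dataset_id": "${pipeline.dataset_id}"},
        function_return=["dataset_id"],
    )
    return pipe
```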
@<1523701070390366208:profile|CostlyOstrich36>
@<1523701070390366208:profile|CostlyOstrich36> Any ideas?
@<1523701435869433856:profile|SmugDolphin23> I added os.environ["NCCL_SOCKET_IFNAME"
and I managed to run on NCCL.
But it seems that the workaround you suggested does not run 2 processes on 2 nodes, but rather 4 processes on 4 different nodes:
current_conf = task.launch_multi_node(args.nodes*args.gpus)
os.environ["NODE_RANK"] = str(current_conf.get("node_rank", ""))
os.environ["NODE_RANK"] = str(current_conf["node_rank"] // args.gpus)
os.environ["LOCAL_RANK"] = str(current_conf["nod...
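The rank arithmetic in the snippet above can be checked in isolation. A minimal sketch, assuming launch_multi_node was called with nodes*gpus so it returns a flat rank, and that local_rank is the remainder (the helper name is hypothetical):

```python
def split_rank(flat_rank, gpus_per_node):
    """Map a flat rank in 0..nodes*gpus-1 to (node_rank, local_rank),
    mirroring the integer division used in the snippet above."""
    node_rank = flat_rank // gpus_per_node
    local_rank = flat_rank % gpus_per_node
    return node_rank, local_rank

# 2 nodes x 2 GPUs: flat ranks 0..3
print([split_rank(r, 2) for r in range(4)])
# → [(0, 0), (0, 1), (1, 0), (1, 1)]
```

This is why calling launch_multi_node with nodes alone (rather than nodes*gpus) spawns too few processes: each flat rank corresponds to one process, not one node.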
@<1523701435869433856:profile|SmugDolphin23>
Logs of rank0:
Environment setup completed successfully
Starting Task Execution:
1718702244585 gpuvm-01:gpu3,0 DEBUG InsecureRequestWarning: Certificate verification is disabled! Adding certificate verification is strongly advised. See:
ClearML results page:
/projects/0eae440b14054464a3f9c808ad6447dd/experiments/beaa8c380f3c46f0b6f5a3feab514dc8/output/log
task id [beaa8c380f3c46f0b6f5a3feab514dc8]
world=4
...
Hi @<1523701435869433856:profile|SmugDolphin23> ! I set NODE_RANK in the environment and now:
- if gpus=2, node=2, task.launch_multi_node(node): three tasks are created; two of them complete, but one fails. In this case (gpus*nodes-1) tasks are created, and either some of them crash with an error or they all fail; the behavior is inconsistent.
- if gpus=2, node=2, task.launch_multi_node(node*gpus): seven tasks are created. In this case, all tasks fail except t...
@<1523701435869433856:profile|SmugDolphin23> hi! it works! thanks!
@<1523701435869433856:profile|SmugDolphin23> Two tasks were created when gpus=2, nodes=2, task.launch_multi_node(node), but they stay in the running state and model training never starts.
kubectl exec -it clearml-agent-85fd8ccc6d-7fdk7 -n clearml bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "k8s-glue" out of: k8s-glue, init-k8s-glue (init)
root@clearml-agent-85fd8ccc6d-7fdk7:~# cat /root/clearml.conf
agent.git_user=gitlab_agent
agent.git_pass=682S-pH9ay1nidsxBGyT
agent.cuda_version=118
#agent.docker_internal_mounts.venv_build=/home/s3_cache/venvs-builds
#agent.do...
However, nothing is saved at this path
Hi @<1523701087100473344:profile|SuccessfulKoala55> No, I am using a self-hosted ClearML enterprise server
The /root/clearml.conf file no longer contains anything.
If I understand correctly, the pip cache is stored at /root/.cache/pip. How can I change it? The agent.docker_internal_mounts.pip_cache variable in the config also does not change anything.
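For reference, the relevant clearml.conf settings come in pairs: agent.docker_pip_cache is the host-side directory the agent mounts into the container, while agent.docker_internal_mounts.pip_cache is the path inside the container it is mounted to. A sketch of that section, with an illustrative host path (not taken from the thread):

```
agent {
    # host-side directory mounted into the container as the pip cache
    docker_pip_cache: /home/s3_cache/pip-cache

    docker_internal_mounts {
        # path inside the container where the host cache is mounted
        pip_cache: /root/.cache/pip
    }
}
```

Changing only the internal mount path moves where the cache appears inside the container; the host-side location is governed by docker_pip_cache.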
I store my data in s3 and clearml tracks this data. I want to migrate this data from one ClearML instance to another, that is, transfer it to another s3 and have a new ClearML instance track it
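Migrating tracked data between instances involves both copying the objects to the new S3 location and rewriting the s3:// URIs stored in the metadata so the new ClearML instance resolves them. A minimal sketch of the URI-rewriting half, assuming a plain bucket/prefix swap (bucket names are hypothetical):

```python
def rewrite_s3_uri(uri, old_bucket, new_bucket, old_prefix="", new_prefix=""):
    """Rewrite s3://old_bucket/old_prefix/... to s3://new_bucket/new_prefix/...;
    URIs that point elsewhere are returned unchanged."""
    old_root = f"s3://{old_bucket}/{old_prefix}".rstrip("/")
    new_root = f"s3://{new_bucket}/{new_prefix}".rstrip("/")
    if uri == old_root or uri.startswith(old_root + "/"):
        return new_root + uri[len(old_root):]
    return uri

print(rewrite_s3_uri("s3://old-bucket/models/a.pt", "old-bucket", "new-bucket"))
# → s3://new-bucket/models/a.pt
```

The object copy itself can be done with any S3 tooling (e.g. aws s3 sync); the rewrite step is what keeps the new instance's references consistent.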