@<1523701087100473344:profile|SuccessfulKoala55> Could you give some advice?
Not in docker mode. So I just need to set venvs_cache.path=/home/frank/env? I do not think this works.
Even when I only set agent.venvs_cache.path=/home/frank/env in clearml.conf, the training still fails to start. ClearML always creates a new venv and installs Cython and some other packages, which is weird!
...
Installing collected packages: Cython
Successfully installed Cython-0.29.32
Adding venv into cache: /general-user/frank/.clearml/venvs-builds/3.8
Running task id [42a050853c43445ebf9248bd4aa54091]:
[.]$ /general-user/frank/.clearml/venvs-builds/3.8/bin/python -u tools/train.py
I am conf...
After I enqueue the experiment task into a queue, the ClearML agent always creates a new venv and installs packages.
What I want is for all the experiments to use the same code and the preinstalled Python virtual env.
Hi TimelyMouse69 can you give any advice? Or can somebody else help? Thanks in advance.
Hi TimelyMouse69 I am not trying to cache the virtual env but to reuse a preinstalled one. I have set agent.python_binary to the virtual env's python, but ClearML uses it to create another new virtual env instead of using the preinstalled one. I installed the virtual env before I started using ClearML, so I want to reuse it.
Thank you CostlyOstrich36 What's the difference between the two environment variables?
Shall I set
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/home/bar/env/bin/python3
or
export CLEARML_AGENT_SKIP_PYTHON_VENV_INSTALL=/home/bar/env/bin/python3
or
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/home/bar/env
and
export CLEARML_AGENT_SKIP_PYTHON_VENV_INSTALL=/home/bar/env/bin/python3
?
Hi @<1523701205467926528:profile|AgitatedDove14> . Yes, the agent will execute the cloned task and the Task.init() inside my code, but I don't know which command it uses: python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 or python train.py --batch 64? Another question is how the task name and project name are set, since the WebApp gives the names and Task.init() also gives them.
clearml-agent list
workers:
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: clearml
  id: inspur-dev:gpu1,2
  ip: 10.180.151.125
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b_5347c06242f2445c8af46e2900a02e2a_inspur-dev:gpu1,2
  last_activity_time: '2022-12-13T05:41:54.130672+00:00'
  last_report_time: '2022-12-13T05:41:54.130672+00:00'
  queues:
  - id: 89b7dcf476284004a56974749ef6c405
  register_time: '2022-12-13T02:07:05.424950+00:00'
  register_timeout: 600
  system_tags: ...
- id: 89b7dcf476284004a56974749ef6c405
I got it. The workers were created by user AI, but I was using user frank to stop them. Now I can stop them after switching to user AI. AgitatedDove14 Thank you.
clearml-agent daemon --stop
Could not find a running clearml-agent instance with worker_name=inspur-dev worker_id=
AgitatedDove14 I failed to stop any worker.
The log outputs the evaluation result for epoch 10 and stops there, but it should run for 50 epochs.
Does ClearML have a skip_git_code environment variable to skip cloning new code from the code base? @<1523701087100473344:profile|SuccessfulKoala55>
The official docs give too many options but too few details about how to set them and how they relate to each other.
Yes, I have tested it through another toy example and it reused the first run.
As far as I know, ClearML does not record the whole command python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml. There is also a file path issue, as follows: the cloned and enqueued task on the WebApp didn't pass --data coco.yaml to train.py, so train.py cannot get the data argument! @<1523701205467926528:profile|AgitatedDove14> could you help?
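One way to make a dropped argument like this obvious in the agent's log is to have the script fail early with a clear message when --data is missing. The following is only a minimal sketch: parse_train_args, the default batch size, and the error text are hypothetical and are not taken from the actual train.py.

```python
import argparse


def parse_train_args(argv=None):
    """Parse the two options mentioned in the thread (hypothetical stand-in for train.py)."""
    parser = argparse.ArgumentParser(description="hypothetical stand-in for train.py")
    # The default batch size is an assumption made for this sketch.
    parser.add_argument("--batch", type=int, default=16)
    # No default here, so a missing --data can be detected explicitly.
    parser.add_argument("--data", default=None)
    args = parser.parse_args(argv)
    if args.data is None:
        # parser.error() prints the message to stderr and exits with status 2.
        parser.error("--data was not passed to the script")
    return args


if __name__ == "__main__":
    args = parse_train_args()
    print("batch:", args.batch)
    print("data:", args.data)
```

If the cloned task really starts train.py without --data, the run then stops immediately with that message instead of failing later inside the data loading code.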
Yes, I think you are right. Just set export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/home/bar/env/bin/python3.
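The suggestion above can be sketched as a small helper that checks the interpreter path and prepares the environment and command for starting the agent. This is a sketch under assumptions: the helper names and the queue name "default" are hypothetical, and nothing is executed, the helper only returns what would be run.

```python
import os
import shlex


def is_usable_interpreter(python_bin):
    """Return True if python_bin is an existing, executable file."""
    return os.path.isfile(python_bin) and os.access(python_bin, os.X_OK)


def agent_launch_plan(python_bin, queue, base_env=None):
    """Return (env, command) for starting clearml-agent with an existing interpreter."""
    env = dict(os.environ if base_env is None else base_env)
    # Point the agent at the interpreter of the preinstalled virtual env,
    # as suggested in the thread.
    env["CLEARML_AGENT_SKIP_PIP_VENV_INSTALL"] = python_bin
    # The queue name is supplied by the caller; "default" below is an assumption.
    command = ["clearml-agent", "daemon", "--queue", queue]
    return env, command


if __name__ == "__main__":
    python_bin = "/home/bar/env/bin/python3"
    if not is_usable_interpreter(python_bin):
        print("warning: interpreter not found or not executable:", python_bin)
    env, command = agent_launch_plan(python_bin, "default")
    value = env["CLEARML_AGENT_SKIP_PIP_VENV_INSTALL"]
    print("export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=" + shlex.quote(value))
    print(" ".join(shlex.quote(part) for part in command))
```

The variable has to be set in the environment of the agent process itself, so it should be exported in the same shell (and as the same user) that starts clearml-agent.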
None of them works... CostlyOstrich36 😵💫 😵💫 😵💫
I set the environment variable with export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/general-user/frank/mlflow_ray/env/bin/python3, then cloned a task and ran it. This time there was no reinstallation; the task used /general-user/frank/mlflow_ray/env/bin/python3 to run the experiment. But it got stuck at epoch 10, when the first evaluation finished, and just stops there.
Hi AgitatedDove14 can you give me some help?