Reputation
Badges 1
92 × Eureka!oh ... maybe the bottleneck is augmentation in CPU !
But is it normal that the agent don't detect the GPU count and type properly ?
So we have 3 python package, store in github.com
On the dev machine, the datascientist (DS) will add the local ssh key to his github account as authorized ssh keys, account level.
With the DS can run git clone git@github.com:org/repo1
then install that python package via pip install -e .
Do that for all 3 python packages, each in its own repo1
, repo2
and repo3
. All 3 can be clone using the same key that the DS added to his account.
The DS run a tra...
@<1523701070390366208:profile|CostlyOstrich36>
Yes. I am investigating that route now.
what you mean by different script ?
from the logs, it feels like after git clone, it spend minutes without outputting anything. @<1523701205467926528:profile|AgitatedDove14> Do you know what is the agent suppose to do after git clone ?
I guess a check that all packages is installed ? But then with CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1, what is the agent doing ??
If you are using multi storage place, I don't see any other choice than putting multi credential in the conf file ... Free or Paid Clearml Server ...
I think a proper screenshot of the full log with some information redacted is the way to go. Otherwise we are just guessing in the dark
@<1523701087100473344:profile|SuccessfulKoala55> Should I raise a github issue ?
and in the train.py
, I have task.add_requirements("requirements.txt")
so it's not suppose to say "illegal output destination ..." ?
may be specific to fastai
as I cannot reproduce it with another training using yolov5
While creating the autoscaler instance I did provide my git credentials, i.e my username and Personal Access Token.
How exactly did you do that ?
@<1523701087100473344:profile|SuccessfulKoala55> I managed to make this working by:
concat the existing OS ca bundle and zscaler certificate. And set REQUESTS_CA_BUNDLE
to that bundle file
I don;t think there is a "kill task" code. By principle, in Linux, as a parent process, ClearML agent launch the training process. When a parent process is terminated, the linux kernel will, in most of the case, kill all child processes, including your training process.
There may be some way to resume a task from ClearML agent when it restart, but I don;t think that is the default behavior
@<1523701087100473344:profile|SuccessfulKoala55> it is set to "all" as :
NV_LIBCUBLAS_VERSION=12.2.5.6-1NVIDIA_VISIBLE_DEVICES=allCLRML_API_SERVER_URL=https://<redacted>HOSTNAME=1b6a5b546a6bNVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=qua...
I will try it. But it's a bit random when this happen so ... We will see
If the agent is the one running the experiment, very likely that your task will be killed.
And when the agent come back, immediately or later, probably nothing will happen. It won't resume ...
I can only guess with little information here. You better try to debug with print statement. Is this happening in submodule uncommited changes ?
not sure ... providing Zscaler certificate seems to allow clearml to talk to our clearml server, hosted in azure, Task init worked. But then failed to connect to the storage account (Azure too) ...
(wrong tab sorry :P)
nevermind, all the database files are in data folder
there is a whole discussion about it here: None
@<1523701087100473344:profile|SuccessfulKoala55> Actually it failed now: failed to talked to our storage in Azure:
ClearML Task: created new task id=c47dd71dea2f421db05647a21d78ed26
2024-01-25 21:45:23,926 - clearml.storage - ERROR - Failed uploading: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2024-01-25 21:46:48,877 - clearml.storage - WARNING - Storage helper problem for .clearml.0149daec-7a03-4853-a0cd-a7e2b295...
So I tried:
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/data/hieu/opt/python-venv/fastai/bin/python3.10
clearml-agent daemon --queue no_venv
Then enqueue a cloned task to no_venv
It is still trying to create a venv (and fail):
[...]
tag =
docker_cmd =
entry_point = debug.py
working_dir = apple_ic
created virtual environment CPython3.10.10.final.0-64 in 140ms
creator CPython3Posix(dest=/data/hieu/deleteme/clearml-agent/venvs-builds/3.10, clear=False, no_vcs_ignore=False, gl...
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/path/to/my/vemv/bin/python3.12 clearml-agent bla
Set that env var in the terminal before running the agent ?