What do you mean by a different script ?
If you are using multiple storage places, I don't see any other choice than putting multiple credentials in the conf file ... Free or Paid ClearML Server ...
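Something along these lines in clearml.conf, e.g. for S3 buckets (bucket names and keys below are just placeholders):
sdk {
    aws {
        s3 {
            # one entry per bucket / storage place
            credentials: [
                { bucket: "bucket-a", key: "ACCESS_KEY_A", secret: "SECRET_A" },
                { bucket: "bucket-b", key: "ACCESS_KEY_B", secret: "SECRET_B" },
            ]
        }
    }
}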
I think a proper screenshot of the full log with some information redacted is the way to go. Otherwise we are just guessing in the dark
@<1523701087100473344:profile|SuccessfulKoala55> Should I raise a github issue ?
so it's not supposed to say "illegal output destination ..." ?
may be specific to fastai
as I cannot reproduce it with another training run using yolov5
While creating the autoscaler instance I did provide my git credentials, i.e my username and Personal Access Token.
How exactly did you do that ?
@<1523701087100473344:profile|SuccessfulKoala55> I managed to make this work by:
concatenating the existing OS CA bundle and the Zscaler certificate, and setting REQUESTS_CA_BUNDLE to that bundle file
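Roughly like this (paths and file names are just placeholders):
# concatenate the OS CA bundle with the Zscaler root certificate
cat /etc/ssl/certs/ca-certificates.crt zscaler_root.pem > /opt/certs/combined-ca-bundle.pem
# point python requests (and therefore clearml) at the combined bundle
export REQUESTS_CA_BUNDLE=/opt/certs/combined-ca-bundle.pem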
I don't think there is any "kill task" code. By principle, on Linux, the ClearML agent launches the training process as its child process. When the parent process is terminated, the Linux kernel will, in most cases, kill all child processes, including your training process.
There may be some way to resume a task from the ClearML agent when it restarts, but I don't think that is the default behavior
@<1523701087100473344:profile|SuccessfulKoala55> it is set to "all" as:
NV_LIBCUBLAS_VERSION=12.2.5.6-1
NVIDIA_VISIBLE_DEVICES=all
CLRML_API_SERVER_URL=https://<redacted>
HOSTNAME=1b6a5b546a6b
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=qua...
I will try it. But it's a bit random when this happens so ... We will see
If the agent is the one running the experiment, it's very likely that your task will be killed.
And when the agent comes back, immediately or later, probably nothing will happen. It won't resume ...
I can only guess with so little information here. You'd better try to debug with print statements. Is this happening with uncommitted changes in a submodule ?
not sure ... providing the Zscaler certificate seems to allow ClearML to talk to our ClearML server hosted in Azure; Task init worked. But it then failed to connect to the storage account (also in Azure) ...
(wrong tab sorry :P)
never mind, all the database files are in the data folder
there is a whole discussion about it here: None
@<1523701087100473344:profile|SuccessfulKoala55> Actually it failed now: it failed to talk to our storage in Azure:
ClearML Task: created new task id=c47dd71dea2f421db05647a21d78ed26
2024-01-25 21:45:23,926 - clearml.storage - ERROR - Failed uploading: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)
2024-01-25 21:46:48,877 - clearml.storage - WARNING - Storage helper problem for .clearml.0149daec-7a03-4853-a0cd-a7e2b295...
So I tried:
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/data/hieu/opt/python-venv/fastai/bin/python3.10
clearml-agent daemon --queue no_venv
Then enqueued a cloned task to no_venv
It is still trying to create a venv (and failing):
[...]
tag =
docker_cmd =
entry_point = debug.py
working_dir = apple_ic
created virtual environment CPython3.10.10.final.0-64 in 140ms
creator CPython3Posix(dest=/data/hieu/deleteme/clearml-agent/venvs-builds/3.10, clear=False, no_vcs_ignore=False, gl...
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/path/to/my/venv/bin/python3.12 clearml-agent bla
Set that env var in the terminal before running the agent ?
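I.e. something like this (the venv path and queue name are just placeholders)?
export CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/path/to/my/venv/bin/python3.12
clearml-agent daemon --queue no_venv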
Clear. Thanks @<1523701070390366208:profile|CostlyOstrich36> !
if you are using a self-hosted ClearML server spun up with docker-compose, then you can just mount your NAS to /opt/clearml/fileserver
on the host machine, prior to starting the ClearML server with docker-compose up
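Roughly (assuming NFS; the NAS address, export path and compose file location are just placeholders):
# mount the NAS share over the fileserver data folder on the host
sudo mount -t nfs nas.example.com:/export/clearml /opt/clearml/fileserver
# then bring the server up as usual
docker-compose -f /opt/clearml/docker-compose.yml up -d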
from what I understand, docker mode was designed for apt-based images and for running as root inside the container.
We have containers that are not apt-based and that do not run as root
We also do some "start up" steps that fetch credentials from Key Vault prior to running the agent
no. I set api.files_server to None in both the remote agent's clearml.conf and my local clearml.conf
In which case, whether the code is run locally or remotely, metrics will be stored in cloud storage
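Roughly what I mean in clearml.conf (a sketch, not verbatim from my conf; the Azure URL is just a placeholder):
api {
    # don't use the built-in fileserver as the default destination
    files_server: ""
}
sdk {
    development {
        # artifacts / debug samples go to our own cloud storage instead
        default_output_uri: "azure://<storage-account>.blob.core.windows.net/<container>"
    }
}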
@<1523701087100473344:profile|SuccessfulKoala55> I can confirm that v1.8.1rc2 fixed the issue in our case. I managed to reproduce it (rough commands sketched after the list):
- Do a local commit without pushing
- Create task and queue it
- The queued task fails as expected, as the commit is only local
- Push your local commit
- Requeue the task
- Expecting the task to succeed since the commit is now available: but it fails, as the vcs seems to be in a weird state from the previous failure
- Now with v1.8.1rc2 the issue is solved
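Roughly, in commands (script name is just a placeholder):
git commit -am "wip"   # local commit, not pushed
python train.py        # create the task, then clone + enqueue it -> fails, commit not on the remote
git push               # commit is now available on the remote
# requeue the same task: before v1.8.1rc2 it still failed (vcs in a weird state), with v1.8.1rc2 it succeeds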