I don't have it, so I don't know how things are set up and how to pass on credentials in this case
@<1523701087100473344:profile|SuccessfulKoala55> it is set to "all" as :
NV_LIBCUBLAS_VERSION=12.2.5.6-1
NVIDIA_VISIBLE_DEVICES=all
CLRML_API_SERVER_URL=https://<redacted>
HOSTNAME=1b6a5b546a6b
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=qua...
in my case using self-hosted and agent inside a docker container:
47:45 : task foo pulled
[ git clone, pip install, check that all requirements satisfied, and nothing is downloaded]
48:16 : start training
i need to do a git clone
You need to do it to test if it works. clearml-agent will run it itself when it takes in a task
In summary:
Spin down the local server
Backup the data folder
In the cloud, extract the data backup
Spin up the cloud server
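The backup and extract steps above could be sketched in Python roughly like this. It is only a sketch under assumptions: the function names are made up for illustration, the data folder location depends on your own server setup, and the server should already be spun down before the backup runs.

```python
import tarfile
from pathlib import Path


def backup_data_folder(data_dir: str, archive_path: str) -> str:
    """Pack data_dir into a gzip-compressed tar archive and return the archive path.

    Hypothetical helper: write the archive outside data_dir so it does not
    end up inside its own backup.
    """
    data = Path(data_dir)
    if not data.is_dir():
        raise FileNotFoundError(f"data folder not found: {data}")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the folder name, not the full path
        tar.add(data, arcname=data.name)
    return archive_path


def restore_data_folder(archive_path: str, target_parent: str) -> Path:
    """Extract the archive under target_parent and return the restored folder."""
    parent = Path(target_parent)
    parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # The first member is the top-level folder that was archived
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(parent)
    return parent / top_level
```

You would call backup_data_folder on the local machine, copy the archive to the cloud machine by whatever means you normally use, then call restore_data_folder there before spinning the server up.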
Set that env var in the terminal before running the agent ?
because when I was running both agents on my local machine everything was working perfectly fine
This is probably because you (or someone) had set up an SSH public key with your git repo sometime in the past
please share your .service content too, as there are a lot of ways to "spawn" in systemd
most people probably won't even know what that does
I don't think agents are aware of each other. Which means that you can have as many agents as you want and, depending on your task usage, they will be fighting for CPU and GPU usage ...
not sure how it works for debug samples and scalars ....
But theoretically, with the above, one should be able to fully reproduce a run
Based on this : it feels like S3 is supported
How are you using the function update_output_model?
@<1523701087100473344:profile|SuccessfulKoala55> Should I raise a github issue ?
Should I put that in the clearml.conf file?
To me the whole point of having a pipeline is to have a system that "knows" the previous state and makes "smart" decisions on what should run and what should not. If it's just about if/then/else, then code already handles all that.
And what I struggle with a bit is finding docs on how it determines the existing state and how it decides what to run. Thus the initial question
while the other may need to be 1 instead of true
inside the script that launches the agent, I set all the env vars needed (aka disable installation with the var above)
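For illustration, a launcher script along those lines might look like the sketch below. The helper name run_with_env and the variable EXAMPLE_AGENT_FLAG are placeholders I made up, not real agent settings; swap in the actual variable names and your own agent command line.

```python
import os
import subprocess
import sys


def run_with_env(command, overrides):
    """Run command with the current environment plus overrides.

    Environment values must be strings, so each override is converted
    with str(); an integer 1 is passed to the child process as "1".
    """
    env = os.environ.copy()
    env.update({key: str(value) for key, value in overrides.items()})
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Placeholder child process that just echoes the variable back;
    # replace it with the real command that starts the agent.
    result = run_with_env(
        [sys.executable, "-c", "import os; print(os.environ['EXAMPLE_AGENT_FLAG'])"],
        {"EXAMPLE_AGENT_FLAG": 1},
    )
    print(result.stdout.strip())
```

Copying os.environ and passing the copy to the child keeps the overrides scoped to the agent process, so nothing else in the terminal session is affected.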
thanks for all the pointers! I will try to have a good play around
interesting, the issue happens with a mamba venv. Now I use a native Python venv and it is detecting correctly
Are you running within a zero-trust environment like ZScaler ?
Feels like your issue is not ClearML itself, but an issue with HTTPS/SSL and certificates from your zero-trust system
About the caching: how does it work? Does ClearML maintain its own cache and monitor if any of your code changes? Even code that gets changed inside an import?