This doesn't interrupt jobs, but it slows them down, and it takes a long time to quit (adds ~2 hours for the process to end)
It's a simple training loop that trains models for 2-3 epochs, for a total of 200-300 iterations, saves a few checkpoints along the way, and saves a final model at the end
@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081
No, back then it was fixed by restarting clearml and some services. But currently we've given up and we use debug=True, so we don't use the services queue
I set it up like this: clearml-agent daemon --detached --gpus 0,1,2 --queue single-gpu-24 --docker
but when I create the session: clearml-session --docker xyz --git-credentials
and I run nvidia-smi
I only see one gpu
I tried that earlier - that checks out, it matches the s3 path I provide in the conf
Thanks, I can have docker + poetry execution modes then?
Thanks! So it seems like the key is Task.connect
and bubbling up the params to the original task, correct?
@<1537605940121964544:profile|EnthusiasticShrimp49> , now that I have run the task on remote, can I copy the artefacts/files it creates back to my local fs?
Let's say the artefacts are something like artefacts = [checkpoint.pth, dvc.lock, some_other_dynamically_generated_file]
Hmmm, my only issue there is that not all of my "artefacts" are clearml artefacts.
The files I need are models and other locally modified files that get generated by the clearml task on remote
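A minimal sketch of one way this could be handled, assuming the extra files can be bundled into a single archive on the remote side and then registered as one clearml artefact. The helper name bundle_outputs and the archive name are hypothetical, and only the standard library is used here:

```python
import zipfile
from pathlib import Path


def bundle_outputs(paths, archive_path="task_outputs.zip"):
    """Zip whichever of the given files exist; return (archive, missing)."""
    archive = Path(archive_path)
    missing = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in map(Path, paths):
            if p.is_file():
                # Store by file name only, so the archive unpacks flat.
                # Files sharing a name in different folders would collide.
                zf.write(p, arcname=p.name)
            else:
                # Dynamically generated files may not exist on every run.
                missing.append(str(p))
    return archive, missing
```

On the remote side the resulting archive could then be passed to task.upload_artifact(...), and fetched locally with something like Task.get_task(task_id=...).artifacts[name].get_local_copy(). Whether that fits depends on how large the files are and where the task's output storage points.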
I do change both the task name and the project name. The task name change works fine, but the project name change silently fails
Hey @<1577106212921544704:profile|WickedSquirrel54> , I would definitely be interested in this. A gist would be cool too
Would I also be able to change the task name from within the subprocess?
It worked. The env variables definitely do not work! I had to use clearml.conf along with use_credential_chain=True
Also @<1523701070390366208:profile|CostlyOstrich36> - are these actions available for on prem OSS clearml-server deployments too?
I've also overridden CLEARML_FILES_HOST=None, and configured it in the clearml.conf file. I don't know where it's picking up 8081 from 😕
I'm using clearml 1.9.3 client side
This is the issue
Setting up connection to remote session
Starting SSH tunnel to root@192.168.1.185, port 10022
SSH tunneling failed, retrying in 3 seconds
So the 192.xxxx address is on the physical network, not on the tailscale network
In the end I forked the clearml-session library and removed mechanisms to access the interactive terminal. I added ipc=host.
There's one identifiable issue with clearml-session+tailscale though - while it does launch the daemon properly, it registers the wrong ip address to the task (sometimes the external ip address even when --external is not passed). At the end of the day, if we know which machine it was launched on, we're able to replace that ip address with a tailscale equivalent and st...
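The address swap described above could be sketched as a simple lookup. This assumes a hand-maintained mapping from each machine's LAN address to its tailscale address. The mapping contents, the 100.64.0.12 address and the function name are hypothetical, and reading or writing the address on the task itself is not shown:

```python
import ipaddress

# Hypothetical mapping: LAN address of a worker -> its tailscale address.
LAN_TO_TAILSCALE = {
    "192.168.1.185": "100.64.0.12",
}

# Tailscale hands out addresses from the CGNAT range 100.64.0.0/10.
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def to_tailscale_ip(registered_ip, mapping=LAN_TO_TAILSCALE):
    """Return the tailscale address for a registered IP, if one is known."""
    ip = ipaddress.ip_address(registered_ip)
    if ip in TAILSCALE_NET:
        # Already a tailscale address, nothing to replace.
        return str(ip)
    # Fall back to the registered address when the machine is not in the mapping.
    return mapping.get(str(ip), str(ip))
```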
I need to mock it - because I'm writing some unittests
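A rough sketch of what that could look like, assuming the thing being mocked is the task object that the code under test receives. The function log_final_metric is a hypothetical stand-in for the real code, and unittest.mock.MagicMock replaces the task so no clearml server is needed:

```python
from unittest import mock


def log_final_metric(task, value):
    """Hypothetical code under test: reports one scalar via the task's logger."""
    task.get_logger().report_scalar(
        title="eval", series="accuracy", value=value, iteration=0
    )


def test_log_final_metric():
    # MagicMock accepts any attribute or call, so it can stand in for a Task.
    fake_task = mock.MagicMock()
    log_final_metric(fake_task, 0.9)
    # get_logger() returns the same child mock every time, so the call can be checked on it.
    fake_task.get_logger.return_value.report_scalar.assert_called_once_with(
        title="eval", series="accuracy", value=0.9, iteration=0
    )
```

Passing the task in as an argument keeps the test free of patching. If the code calls Task.init itself, mock.patch on the name as imported in the module under test would be the usual alternative.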