No, it was fixed back then by restarting ClearML and some of the services. But currently we've given up and we use debug=True, so we don't use the services queue
@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081
As mentioned above, I've tried both (env and clearml.conf). Here are my configs (I've blacked out urls and creds)
conf file
api {
    web_server:
    api_server:
    files_server:
    credentials {
        "access_key" = "xyz"
        "secret_key" = "xyz"
    }
}
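In case it helps anyone reading this later: a minimal sketch of doing the same setup programmatically instead of via clearml.conf / env vars (the host URLs and keys below are placeholders, and this has to run before Task.init):

    from clearml import Task

    # Placeholder hosts and keys; sets the connection details in code before Task.init()
    Task.set_credentials(
        api_host="https://api.example.com",
        web_host="https://app.example.com",
        files_host="https://files.example.com",
        key="xyz",
        secret="xyz",
    )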
Relevant log (it uploads to S3, I can see the artefact fine on clearml's experiment tracker, but it still causes the job to hang)
2023-12-11 16:06:44,008 - clearml.sto...
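For reference, this is roughly how I point the default uploads at S3 instead of the files server (the bucket path is a placeholder, just a sketch):

    from clearml import Task

    # Placeholder bucket; artifacts and models default to S3 rather than the :8081 files server
    task = Task.init(
        project_name="my-project",
        task_name="my-task",
        output_uri="s3://my-bucket/clearml",
    )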
this doesn't interrupt jobs, but it slows them down, and it takes a long time to quit (adds ~2 hours before the process ends)
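What I'm trying in order to bound the shutdown time, roughly (just a sketch; the exact wait behaviour may differ between versions):

    from clearml import Task

    task = Task.current_task()
    # Wait explicitly for pending uploads, then close, instead of relying on the atexit handler
    task.flush(wait_for_uploads=True)
    task.close()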
is it in the OSS version too?
where is it persisted? if I have multiple sessions I want to persist, is that possible?
Thanks, I can have docker + poetry execution modes then?
With respect to unstructured data, do hyperdatasets work well with audio data (and associated metadata) ?
it worked. The env variables definitely do not work! Had to use clearml.conf along with use_credential_chain=True
I need to mock it - because I'm writing some unittests
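Roughly what the unittests look like (a sketch; run_training is a made-up entry point):

    from unittest import mock

    # Patch Task.init so the tests never talk to a ClearML server
    with mock.patch("clearml.Task.init") as mock_init:
        mock_init.return_value = mock.MagicMock()
        run_training()  # hypothetical function under test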
Thanks! So it seems like the key is Task.connect, and letting it bubble the params up to the original task, correct?
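i.e. something like this, if I understood correctly (a sketch; the param names are placeholders):

    from clearml import Task

    task = Task.init(project_name="my-project", task_name="my-task")
    params = {"lr": 1e-3, "epochs": 3}
    # connect() registers the dict on the task; values overridden remotely are written back into it
    params = task.connect(params)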
can I combine docker and poetry mode?
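For anyone searching later, the combination I'm trying is roughly this: clearml.conf on the agent machine selects poetry as the package manager, and docker mode comes from starting the daemon with the --docker flag (treat this as a sketch):

    agent {
        package_manager {
            type: poetry
        }
    }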
I want the script to be agnostic to whether it is run using clearml or not, with a particular queue or not
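The kind of thing I mean, as a sketch (USE_CLEARML is just a flag I made up):

    import os
    from clearml import Task

    # Hypothetical env flag; when it's unset the script runs without any ClearML tracking
    task = Task.init(project_name="my-project", task_name="run") if os.getenv("USE_CLEARML") else None

    if task is not None:
        task.connect({"lr": 1e-3})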
@<1523701087100473344:profile|SuccessfulKoala55> Could you elaborate? I believe both the ips are visible to the container.
This is making things slightly complicated, because now I have to introduce a jumphost for people who aren't on the same physical network but are on the same Tailscale network
How does it work with k8s? How can I request the two pods to sit on the same GPU?
It's a simple training loop that trains models for 2-3 epochs, for a total of 200-300 iterations, saves a few checkpoints, and saves a final model at the end
Also @<1523701070390366208:profile|CostlyOstrich36> - are these actions available for on prem OSS clearml-server deployments too?
I've also overridden CLEARML_FILES_HOST=None, and configured it in the clearml.conf file. Don't know where it's picking up 8081 😕
so the 192.xxxx network is the physical network, and not the Tailscale network
I do change the task and the project name, the task name change works fine but the project name change silently fails
This is the issue
Setting up connection to remote session
Starting SSH tunnel to root@192.168.1.185, port 10022
SSH tunneling failed, retrying in 3 seconds
I set it up like this: clearml-agent daemon --detached --gpus 0,1,2 --queue single-gpu-24 --docker
but when I create the session: clearml-session --docker xyz --git-credentials and I run nvidia-smi, I only see one GPU
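To narrow it down I run this inside the session container, just to see which GPUs the runtime actually exposed (diagnostic sketch only):

    import os
    import subprocess

    # NVIDIA_VISIBLE_DEVICES is set by the NVIDIA container runtime for the container
    print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)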