Sorry false alarm
This doesn't interrupt jobs, but it slows them down, and it takes a long time to quit (adds ~2 hours for the process to end)
This is the issue:
Setting up connection to remote session
Starting SSH tunnel to root@192.168.1.185, port 10022
SSH tunneling failed, retrying in 3 seconds
As mentioned above, I've tried both (env vars and clearml.conf). Here are my configs (I've blacked out URLs and creds).
conf file:
api {
    web_server:
    api_server:
    files_server:
    credentials {
        "access_key" = "xyz"
        "secret_key" = "xyz"
    }
}
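For reference, the env-var form I tried is roughly the following (a sketch: the URLs are placeholders for the blacked-out values above, and the variables have to be set before ClearML loads its config):

```python
import os

# Same settings as the clearml.conf above, via environment variables.
# URLs and keys are placeholders for the values I blacked out.
os.environ["CLEARML_WEB_HOST"] = "https://app.example.com"
os.environ["CLEARML_API_HOST"] = "https://api.example.com"
os.environ["CLEARML_FILES_HOST"] = "https://files.example.com"
os.environ["CLEARML_API_ACCESS_KEY"] = "xyz"
os.environ["CLEARML_API_SECRET_KEY"] = "xyz"

# Import after the env vars are in place so ClearML picks them up.
from clearml import Task

task = Task.init(project_name="debug", task_name="env-config-check")
```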
Relevant log (it uploads to S3 and I can see the artefact fine in ClearML's experiment tracker, but it still causes the job to hang):
2023-12-11 16:06:44,008 - clearml.sto...
@<1523701087100473344:profile|SuccessfulKoala55> Could you elaborate? I believe both the IPs are visible to the container.
This is making things slightly complicated, because now I have to introduce a jumphost for people who aren't on the same physical network but are on the same Tailscale network.
is the agent execution dependent on some CMD in my Dockerfile?
I need to mock it because I'm writing some unit tests
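A minimal sketch of what I mean, assuming the code under test goes through Task.init (project/task names here are made up):

```python
import unittest
from unittest import mock


class TestWithMockedClearML(unittest.TestCase):
    # Patch Task.init so the test never talks to a real ClearML server.
    @mock.patch("clearml.Task.init")
    def test_task_init_is_mocked(self, mock_init):
        mock_init.return_value = mock.MagicMock()  # stands in for the Task

        # In a real test this would be my own training code that calls Task.init.
        from clearml import Task
        task = Task.init(project_name="p", task_name="t")  # returns the mock

        task.get_logger().report_scalar("loss", "train", value=0.1, iteration=0)
        mock_init.assert_called_once()


if __name__ == "__main__":
    unittest.main()
```

(ClearML's offline mode, Task.set_offline(True), might also avoid the need to mock at all.)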
so the 192.xxxx network is the physical network, and not on the Tailscale network
No, it was fixed back then by restarting ClearML and some of the services. But currently we've given up and use debug=True, so we don't use the services queue.
So I am deploying clearml-server on an on-prem machine, and the checkpoints etc. for the experiments I will run are quite large.
Instead, I want to periodically upload / back up this data to S3 and free up local disk space. Is that something that is supported?
I see that in my docker-compose installation, most of the big files are in /opt/clearml/data
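To make it concrete, this is the kind of periodic job I'm imagining (my own sketch, not something I'm claiming ClearML supports; the bucket name and retention window are placeholders, and naively deleting under /opt/clearml/data would presumably break the fileserver's links, hence the question):

```python
import time
from pathlib import Path

import boto3  # assumes AWS credentials are available in the environment

BUCKET = "my-clearml-backup"         # placeholder bucket
DATA_ROOT = Path("/opt/clearml/data")
RETENTION_DAYS = 30                  # placeholder retention window

s3 = boto3.client("s3")
now = time.time()

for path in DATA_ROOT.rglob("*"):
    if not path.is_file():
        continue
    # Mirror the local directory layout as S3 keys.
    s3.upload_file(str(path), BUCKET, str(path.relative_to(DATA_ROOT)))
    # Free local disk for anything older than the retention window.
    if now - path.stat().st_mtime > RETENTION_DAYS * 86400:
        path.unlink()
```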
Where is it persisted? If I have multiple sessions I want to persist, is that possible?
It worked! The env variables definitely do not work; I had to use clearml.conf along with use_credential_chain=True.
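For reference, that setting lives in clearml.conf under sdk.aws.s3 (the docs spell it use_credentials_chain), in the same style as the api block above:

sdk {
    aws {
        s3 {
            use_credentials_chain: true
        }
    }
}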
I set it up like this: clearml-agent daemon --detached --gpus 0,1,2 --queue single-gpu-24 --docker
but when I create the session: clearml-session --docker xyz --git-credentials
and when I run nvidia-smi, I only see one GPU
Found out the commands swap singular and plural. It's --gpus 0 and --gpu 0,1,2.
That makes sense, but that would mean that each client/user has to manage the upload themselves, right?
(I'm trying to use clearml to create an abstraction over the compute / cloud)
I've also overridden CLEARML_FILES_HOST=None, and configured it in the clearml.conf file. Don't know where it's picking up 8081 😕
is it in the OSS version too?