As mentioned above, I've tried both (env vars and clearml.conf). Here are my configs (I've blacked out URLs and creds):
conf file:
api {
    web_server:
    api_server:
    files_server:
    credentials {
        "access_key" = "xyz"
        "secret_key" = "xyz"
    }
}
Relevant log (it uploads to S3 and I can see the artefact fine in ClearML's experiment tracker, but it still causes the job to hang):
2023-12-11 16:06:44,008 - clearml.sto...
So I am deploying clearml-server on an on-prem server, and the checkpoints etc. for the experiments I will run are quite large.
I want to periodically upload / back up this data to S3 and free up local disk space instead. Is that something that is supported?
I see that in my docker-compose installation, most of the big files are in /opt/clearml/data
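In case it's useful context, one thing I'm considering is pointing new uploads straight at S3 instead of the local files server, via clearml.conf on the client side (a sketch; the bucket/prefix is hypothetical, and this only redirects new artifacts, it doesn't migrate what's already in /opt/clearml/data):
sdk {
    development {
        # hypothetical bucket/prefix; new artifacts and models get uploaded
        # here instead of the server's local files server
        default_output_uri: "s3://my-clearml-bucket/artifacts"
    }
}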
It worked. The env variables definitely do not work! I had to use clearml.conf along with use_credential_chain=True.
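For anyone hitting the same thing, roughly what my working conf section looks like (a sketch; note that in the ClearML versions I've checked the key is spelled use_credentials_chain, under sdk.aws.s3):
sdk {
    aws {
        s3 {
            # let boto3 resolve credentials (env vars, ~/.aws/credentials, IAM role)
            # instead of hardcoding access/secret keys here
            use_credentials_chain: true
        }
    }
}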
I set it up like this: clearml-agent daemon --detached --gpus 0,1,2 --queue single-gpu-24 --docker
but when I create the session (clearml-session --docker xyz --git-credentials) and run nvidia-smi, I only see one GPU
Also @<1523701070390366208:profile|CostlyOstrich36> - are these actions available for on-prem OSS clearml-server deployments too?
Nice! I was wondering whether we can trigger it from the UI, e.g. on publishing an experiment.
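Something like this is what I'm imagining (a minimal sketch using clearml.automation.TriggerScheduler, assuming it can fire on status changes; the project and queue names are hypothetical):
from clearml.automation import TriggerScheduler

def on_published(task_id):
    # called by the scheduler with the id of the task that was just published
    print(f"task {task_id} was published, kicking off follow-up work")

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_task_trigger(
    name="on-publish-trigger",
    schedule_function=on_published,      # alternatively schedule_task_id + schedule_queue
    trigger_project="my-project",        # hypothetical project to watch
    trigger_on_status=["published"],     # fire when an experiment is published
)
scheduler.start_remotely(queue="services")  # typically runs on the services queue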
feels like a typo somewhere
I'm thinking of using s3fs on the entire /opt/clearml/data folder. What do you think?
I tried that earlier - that checks out, it matches the S3 path I provide in the conf
where is it persisted? if I have multiple sessions I want to persist, is that possible?
@<1523701087100473344:profile|SuccessfulKoala55> Could you elaborate? I believe both of the IPs are visible to the container.
This is making things slightly complicated, because now I have to introduce a jumphost for people who aren't on the same physical network but are on the same Tailscale network.
I want the script to be agnostic to whether it is run via ClearML or not, and to whether a particular queue is used or not.
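Roughly this pattern (a sketch; the project/task names are hypothetical, and the queue comes from an optional CLI arg so the same script runs with or without ClearML):
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--queue", default=None, help="optional ClearML queue")
parser.add_argument("--no-clearml", action="store_true")
args = parser.parse_args()

task = None
if not args.no_clearml:
    task = Task.init(project_name="my-project", task_name="training")
    if args.queue:
        # re-enqueue this script on an agent and exit the local process
        task.execute_remotely(queue_name=args.queue)

# ...the actual training code below runs identically in every mode...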
With respect to unstructured data, do hyperdatasets work well with audio data (and associated metadata) ?
This is the issue:
Setting up connection to remote session
Starting SSH tunnel to root@192.168.1.185, port 10022
SSH tunneling failed, retrying in 3 seconds
This doesn't interrupt jobs, but it slows them down, and they take a long time to quit (it adds ~2 hours before the process ends).
is it in the OSS version too?
No, at the time it was fixed by restarting ClearML and some of the services. But we've since given up and use debug=True, so we don't use the services queue.
Would I also be able to change the task name from within the subprocess?
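i.e. something like this (a sketch, assuming Task.current_task() resolves to the parent's task inside the subprocess):
from clearml import Task

task = Task.current_task()          # picks up the task created by the parent process
if task is not None:
    task.set_name("new task name")  # renames the experiment in the UI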
Hmmm, my only issue there is that not all of my "artefacts" are ClearML artefacts.
The files I need are models and other locally modified files that get generated by the ClearML task on the remote machine.
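One option I'm considering is registering those files explicitly at the end of the run, roughly like this (a sketch; the artifact name and file path are hypothetical):
from clearml import Task

task = Task.current_task()
# upload an arbitrary locally generated file so it's retrievable like any other artefact
task.upload_artifact(name="postprocessed-output", artifact_object="outputs/results.json")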
Thanks! So it seems like the key is Task.connect, which bubbles the params up to the original task, correct?
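i.e. something like this (a sketch; the params dict is hypothetical):
from clearml import Task

task = Task.init(project_name="my-project", task_name="child")
params = {"learning_rate": 0.01, "batch_size": 32}
# connect() registers the dict on the task; when an agent runs the task,
# values edited in the UI (or set by the controller) are pushed back into it
params = task.connect(params)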
We have some scenarios where a group of ClearML experiments represents one logical experiment. We then want to use all the trained models in a pipeline to generate some output.
With that output, we probably want to use some third party like Mechanical Turk to do some custom evaluations, sometimes more than once. We then want to connect (and present) these evaluations along with the ClearML experiments.
We have various services internally to do this --> however, we have to manually link it up w...