Reputation
Badges 1
46 × Eureka!No, it was fixed by restarting clearml then and some services. But currently, we gave up and we use debug=True so we dont use the services queue
That makes sense, but that would mean that each client/user has to manage the upload themselves, right?
(I'm trying to use clearml to create an abstraction over the compute / cloud)
As mentioned above, I've tried both (env and clearml.conf). Here are my configs (I've blacked out urls and creds)
conf file
api {
web_server:
api_server:
files_server:
credentials {
"access_key" = "xyz"
"secret_key" = "xyz"
}
}
Relevant log (it uploads to S3, I can see the artefact fine on clearml's experiment tracker, but it still causes the job to hang)
2023-12-11 16:06:44,008 - clearml.sto...
nice! I was wondering whether we can trigger it by the UI, like "on publishing" an experiment
@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081
Hey @<1577106212921544704:profile|WickedSquirrel54> , I would definitely be interested in this. A gist would be cool too
where is it persisted? if I have multiple sessions I want to persist, is that possible?
I do change the task and the project name, the task name change works fine but the project name change silently fails
Hmmm, my only issue there is that not all of my "artefacts" are clearml artefacts.
The files I need are models and other locally modified files that get generated by the clearml task on remote
@<1537605940121964544:profile|EnthusiasticShrimp49> , now that I have run the task on remote, can I copy the artefacts/files it creates back to my local fs?
Lets say the artefacts are something likeartefacts = [checkpoint.pth, dvc.lock, some_other_dynamically_generated_file]
I've also overriden CLEARML_FILES_HOST= None , and configured it in clearml.conf file. Don't know where its picking 8081 😕
it worked. The env variables definitely do not work! Had to use clearml.conf along with use_credential_chain=True
Sorry false alarm
Thanks, I can have docker
+ poetry
execution modes then?
Its a simple training loop that trains models for 2-3 epochs for a total of 200-300 iterations, saves a few checkpoints and saves a final model at the end of it
this doesn't interrupt jobs, but it slows it down, and it takes a lot of time to quit (adds ~2 hours for the process to end)
found out the command swaps singular and plural. It's --gpus 0 and --gpu 0,1,2
I tried that earlier - that checks out , it matches the s3 path I provide in the conf
can I combine docker and poetry mode?
So I am deploying clearml-server on an on-prem server, and the checkpoints etc. are quite large for the experiments I will do.
Instead I want to periodically upload / back up this data to s3, and free up local disk space. Is that something that is supported?
I see that in my docker-compose installation, most of the big files are in /opt/clearml/data
is it in the OSS version too?
feels like a typo somewhere