I need to mock it - because I'm writing some unittests
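To be concrete, a minimal sketch of the kind of mocking I mean, assuming the code under test calls clearml.Task.init (run_training here is just a stand-in for my real entry point):

import unittest
from unittest import mock

from clearml import Task


def run_training():
    # stand-in for the real training entry point under test
    task = Task.init(project_name="unit-test", task_name="dummy")
    task.connect({"lr": 0.001})
    return "done"


class TestTrainingWithoutServer(unittest.TestCase):
    @mock.patch("clearml.Task.init")
    def test_runs_without_a_clearml_server(self, mock_init):
        mock_init.return_value = mock.MagicMock()  # no server calls are made
        self.assertEqual(run_training(), "done")
        mock_init.assert_called_once()


if __name__ == "__main__":
    unittest.main()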
feels like a typo somewhere
because otherwise it becomes a bit of a chicken-and-egg problem
- update code
- git push
- docker build and push on CI
- use new docker sha for task execution
- update code
- git push
- repeat?
Hmmm, my only issue there is that not all of my "artefacts" are clearml artefacts.
The files I need are models and other locally modified files that get generated by the clearml task on remote
Thanks! So it seems like the key is Task.connect,
and that bubbles the params back up to the original task, correct?
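For my own notes, a minimal sketch of how I understand that flow, assuming a plain dict of hyperparameters (names and values are placeholders):

from clearml import Task

task = Task.init(project_name="my_project", task_name="my_task")  # placeholder names

params = {"lr": 0.001, "epochs": 3}
params = task.connect(params)  # when running remotely, values overridden in the UI are written back into this dict
print(params["lr"])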
I do change both the task name and the project name; the task name change works fine, but the project name change silently fails.
Would I also be able to change the task name from within the subprocess?
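For reference, this is roughly the shape of what I mean by changing the name from within the subprocess; it's a sketch assuming Task.current_task() picks up the parent task and that set_name is the right call (the new name is a placeholder):

from clearml import Task

task = Task.current_task()         # the task created in the parent process
if task is not None:
    task.set_name("renamed-task")  # placeholder; rename the task from inside the subprocess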
@<1537605940121964544:profile|EnthusiasticShrimp49> , now that I have run the task on remote, can I copy the artefacts/files it creates back to my local fs?
Let's say the artefacts are something like
artefacts = [checkpoint.pth, dvc.lock, some_other_dynamically_generated_file]
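For context, here is roughly how I'd expect the round trip to work, assuming the remote task registers the files with upload_artifact (artifact names and the task id are placeholders):

# on the remote side, inside the running task
from clearml import Task

task = Task.current_task()
task.upload_artifact(name="checkpoint", artifact_object="checkpoint.pth")
task.upload_artifact(name="dvc_lock", artifact_object="dvc.lock")

# back on my local machine, after the remote run finishes
finished = Task.get_task(task_id="<remote_task_id>")  # placeholder id
local_path = finished.artifacts["checkpoint"].get_local_copy()
print(local_path)  # local copy of the file downloaded from the remote run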
I've also overridden CLEARML_FILES_HOST=None and configured it in the clearml.conf file. Don't know where it's picking up 8081 😕
I tried that earlier - that checks out, it matches the S3 path I provide in the conf
It worked. The env variables definitely do not work! Had to use clearml.conf along with use_credential_chain=True
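For anyone hitting the same thing, the relevant part of my clearml.conf looks roughly like this (region is a placeholder, and double-check the exact key name against your clearml.conf template):

sdk {
    aws {
        s3 {
            region: "us-east-1"            # placeholder
            use_credentials_chain: true    # pick up credentials from the environment / instance profile
        }
    }
}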
This doesn't interrupt jobs, but it slows them down, and it takes a long time to quit (adds ~2 hours for the process to end).
As mentioned above, I've tried both (env vars and clearml.conf). Here are my configs (I've blacked out URLs and creds):
conf file:
api {
    web_server:
    api_server:
    files_server:
    credentials {
        "access_key" = "xyz"
        "secret_key" = "xyz"
    }
}
Relevant log (it uploads to S3, I can see the artefact fine on ClearML's experiment tracker, but it still causes the job to hang)
2023-12-11 16:06:44,008 - clearml.sto...
@<1523701070390366208:profile|CostlyOstrich36> , as written above, I've done that. It still tries to send to 8081
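In case it helps narrow it down, this is roughly how the task is initialised; my understanding is that output_uri is what decides where uploads go, and without it they default to the files server on port 8081 (project/task names and the bucket path are placeholders):

from clearml import Task

task = Task.init(
    project_name="my_project",            # placeholder
    task_name="train",                    # placeholder
    output_uri="s3://my-bucket/clearml",  # placeholder; without this, uploads fall back to the files_server (port 8081)
)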
It's a simple training loop that trains models for 2-3 epochs for a total of 200-300 iterations, saves a few checkpoints, and saves a final model at the end.
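Roughly, the loop looks like this (dummy model and data just to show the shape; the real one is larger):

import torch
from torch import nn, optim

model = nn.Linear(10, 1)                           # stand-in for the real model
optimizer = optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(8, 10) for _ in range(100)]    # ~100 iterations per epoch

for epoch in range(3):
    for batch in data:
        loss = model(batch).pow(2).mean()          # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    torch.save(model.state_dict(), f"checkpoint_{epoch}.pth")  # a few intermediate checkpoints

torch.save(model.state_dict(), "final_model.pth")  # final model saved at the end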