I am not an expert on this, just started using torchmetrics.
maybe not at the top but in the Task.init description
@<1523701087100473344:profile|SuccessfulKoala55> I am using it as follows:
after calling clearml.Task.init() I create an object:
` cache = Cache('/scidata/marek/diskcache') `
and then in the loading function I do:
` if cache_arg in load_and_crop.cache:
    return load_and_crop.cache[cache_arg] ... `
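to give a fuller picture, this is roughly the pattern I have in mind (a simplified sketch; the PIL-based loading, the `crop_box` argument and the names here are just illustrative stand-ins for my actual code):
` from diskcache import Cache
from PIL import Image

cache = Cache('/scidata/marek/diskcache')

def load_and_crop(path, crop_box):
    # key built from the arguments; diskcache pickles keys, so a tuple is fine
    cache_arg = (path, crop_box)
    if cache_arg in load_and_crop.cache:
        return load_and_crop.cache[cache_arg]
    # expensive part we want to skip on repeated calls
    image = Image.open(path)
    cropped = image.crop(crop_box)
    load_and_crop.cache[cache_arg] = cropped
    return cropped

# the cache is attached as a function attribute, as in my snippet above
load_and_crop.cache = cache `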
We have a training template that is a k8s job definition (yaml) which creates env variables inside the docker image used for training, and those env variables are the ClearML credentials. Since they are taken from k8s secrets, they are the same for every user.
I can create secrets for every new user and set the env variables accordingly, but perhaps you see a better way out?
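just to illustrate the wiring I mean (a sketch; it assumes the per-user secret is exposed as the standard CLEARML_API_HOST / CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY variables, and the project/task names are made up):
` import os
from clearml import Task

# credentials injected by the k8s secret as env variables (per-user in this scheme);
# ClearML also picks these standard variables up on its own, this just makes it explicit
Task.set_credentials(
    api_host=os.environ['CLEARML_API_HOST'],
    key=os.environ['CLEARML_API_ACCESS_KEY'],
    secret=os.environ['CLEARML_API_SECRET_KEY'],
)

task = Task.init(project_name='my-project', task_name='training') `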
to avoid loading and cropping a big image
@<1523701435869433856:profile|SmugDolphin23> I have checked that when setting auto_connect_frameworks=False it works, but disabling just joblib is not enough.
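for reference, this is the per-framework form I tried for disabling only joblib (project/task names here are just placeholders):
` from clearml import Task

# disable only the joblib binding, leave the rest of the automatic framework logging on
task = Task.init(
    project_name='my-project',
    task_name='debug-diskcache',
    auto_connect_frameworks={'joblib': False},
) `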
@<1523701087100473344:profile|SuccessfulKoala55> I have the same problem with diskcache
Is there a place where I can find details about this approach?
I circumvented the problem by putting a timestamp in the task name, but I don't think this should be necessary.
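the workaround looks roughly like this (a sketch; the project name and name format are just what I happen to use, and `reuse_last_task_id=False` is what I would expect to make the timestamp unnecessary):
` from datetime import datetime
from clearml import Task

task = Task.init(
    project_name='my-project',
    # unique name per run, so no two runs collide
    task_name=f"training-{datetime.now():%Y%m%d-%H%M%S}",
    # explicitly opt out of reusing the previous task
    reuse_last_task_id=False,
) `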
From the documentation https://github.com/allegroai/clearml-agent :
` Two K8s integration flavours
- Spin ClearML-Agent as a long-lasting service pod
  - use clearml-agent docker image
  - map docker socket into the pod (soon replaced by podman)
  - allow the clearml-agent to manage sibling dockers
  - benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
  - downside: Sibling containers `
and in the future I do want to have an Agent on the k8s cluster, but then this should not be a problem I guess, as the user is set during Task.init, right?
but I do agree that some kind of autoconnect may be the issue
ok, but do you know why it tried to reuse the task in the first place?
there is a broken symlink in the original repository
ok, understood. It was probably my fault: I was messing with the services container and probably got the pipeline task interrupted, so the subtasks themselves finished, but the pipeline task was not alive when it happened
I created my own docker image with a newer python and the error disappeared
they are universal; I thought there was some interface to them in ClearML, but probably not
thanks! is this documented? (I am wondering whether I could have avoided bothering you with my question in the first place)
my code snippet:
` from clearml import Task
import os

# attach to the existing task by simulating remote execution
clearml_task_id = os.environ['CLEARML_TASK_ID']
Task.debug_simulate_remote_task(clearml_task_id)

clearml_task = Task.init(auto_connect_arg_parser=False, auto_resource_monitoring=False)
print(clearml_task.id)
clearml_task.logger.report_scalar(series='s', value=123, iteration=2, title='title')
clearml_task.logger.report_text("some text") `
I see the text on my screen
but seriously, I am very thankful you were willing to spend so much time helping me, I am super impressed by your response time and helpfulness!
which is probably why it does not work for me, right?
I am only getting one user for some reason, even though 4 are in the system
@<1523701435869433856:profile|SmugDolphin23> will send later today
@<1523701435869433856:profile|SmugDolphin23> let me know if you need any help in reproducing