
I can pass any crazy value I want.. it doesn't matter. However, I can use --output-uri=s3://blabla and then at least I get an error that it cannot use that bucket.
This is now in my Python script:
Hello, I'm still not able to save ClearML models. They are generated and registered okay, but they are not on the fileserver. I now have Task.init(output_uri=True) and I also pass --skip-task-init on the clearml-task command line so that it doesn't overwrite the Task.init call.
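For reference, this is roughly what I'd expect the minimal version of my script to look like (just a sketch, not my actual training code; the task name and model are made up). With output_uri=True the saved model should go to the default files_server from clearml.conf instead of staying at a local path:

from clearml import Task
from tensorflow import keras

task = Task.init(
    project_name='examples',
    task_name='output-uri-test',      # made-up name, only for this sketch
    output_uri=True,                  # True = upload models to the default fileserver
    reuse_last_task_id=False,
)

# tiny throwaway model, just enough to produce a saved file
model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# ClearML's Keras binding should pick up this save and, with output_uri set,
# upload the file instead of registering a local /tmp URI
model.save('my_model.h5')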
It's as if the line is not there.
For comparison: this is when I use --output-uri.
The model has this information ... the /tmp paths look like local URIs, suggesting that it doesn't even try to upload them.
AgitatedDove14 your trick seems to work (I had to change the URL to reflect the fact that I run on k8s).
It was to test whether reuse_last_task_id had any effect (I have the impression it doesn't).
Well, it made a difference (the code for the init() is not added anymore), but it still didn't take my output_uri.
This is the output of the training. It doesn't try to upload (note that this is my second try, so it already found a model with that name, but on my first try it didn't work either).
But I still think the same should be possible using Task.init.
I'm still trying to understand why it was needed in our case. I have the NVIDIA GPU Operator installed with mostly the default values on our on-prem cluster. I found there is an option CONTAINERD_SET_AS_DEFAULT in the operator which, when enabled, sets the NVIDIA runtime as the default for all pods. We didn't enable that option; maybe if we had enabled it, it would have worked.
Don't know.. but I see, for instance, that when using clearml-task I can put any (even nonsensical) values in Task.init.
I set reuse_last_task_id to False to force the creation of a new task in all cases.
And ... clearml-task takes a --project and a --name argument that are mandatory, so these are never taken from Task.init.
No, I don't think so; I rather think Task.init is only used when running outside of an agent.
from clearml import Task
task = Task.init(project_name='examples', task_name='moemwap', output_uri=True, reuse_last_task_id=False)
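A quick sanity check I could add right after that line (just a sketch) is to print what the running task actually ended up with:

# if output_uri was taken, I'd expect the fileserver URL (or True) here, not None
print('task id:', task.id)
print('output_uri:', task.output_uri)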
I sniffed the traffic.
It's a relatively fresh deployment.
I think I found it. We had to replace Elasticsearch after installing ClearML, and then I guess the ClearML migrations didn't rerun.
This seems to be confirmed by the documentation: "If you have not changed the default runtime on your GPU nodes, you must explicitly request the NVIDIA runtime by setting runtimeClassName: nvidia in the Pod spec."
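For illustration only, here is a sketch of explicitly requesting the NVIDIA runtime class, written with the Kubernetes Python client instead of raw YAML (pod name, namespace and image are made up; the only relevant part is runtime_class_name):

from kubernetes import client, config

config.load_kube_config()

# hypothetical test pod that just runs nvidia-smi on one GPU
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-runtime-test"),
    spec=client.V1PodSpec(
        runtime_class_name="nvidia",  # equivalent of runtimeClassName: nvidia in the Pod spec
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)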
This is the script shown by the ClearML UI, so the Task.init call looks right.
(Same for the environment variable.)
I did this as a workaround:
curl -XPUT "None" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "metric":  { "type": "text", "fielddata": true },
    "variant": { "type": "text", "fielddata": true }
  }
}'
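The same workaround as a Python sketch, mainly to make explicit what goes where (the Elasticsearch host and index name below are placeholders, since the real URL is omitted above):

import requests

ES_HOST = "http://localhost:9200"   # placeholder, not my real host
INDEX = "my-events-index"           # placeholder, not the real index name

# same mapping body as the curl above: enable fielddata on the text fields
mapping = {
    "properties": {
        "metric":  {"type": "text", "fielddata": True},
        "variant": {"type": "text", "fielddata": True},
    }
}

resp = requests.put(f"{ES_HOST}/{INDEX}/_mapping", json=mapping)
print(resp.status_code, resp.text)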
But this workaround should not be needed, right? Is this a compatibility issue, or was my Elasticsearch not properly initialized?
This is my command line: clearml-task --name hla --requirements requirements.txt --project examples --output-uri http://clearml-fileserver:8081 --queue aws-instances --script keras_tensorboard.py
And when I try to use --output-uri, I can't pass True because obviously I can't pass a boolean, only strings.