Hi TimelyPenguin76 ,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py. It seems it didn't use the GPU; however, when I ssh into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and run nvidia-smi, it gives an output.
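A minimal sanity check I could drop into base_template_keras_simple.py (assuming it is a TensorFlow/Keras script) to see whether the experiment itself sees the GPU inside the pod:
import tensorflow as tf
# an empty list here would mean the training code does not see the GPU,
# even though nvidia-smi on the pod shows the device
print("GPUs visible to TF:", tf.config.list_physical_devices('GPU'))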
Our main goal, which maybe I should have stated earlier: we are data scientists who need an MLOps environment to track and also run our experiments..
More than the documentation, my main issue was that the name executed is far too vague.. Maybe something like executed_task_id, or something along those lines, would be more appropriate. 👍
Just to add on, I am using minikube now.
I had this issue too.. I realised that the k8s glue wasn't using the GPU resource, compared to running it as a clearml-agent. TimelyPenguin76 suggested using the latest CUDA 11.0 images, but that didn't work either.
Okay.. I was checking in the forum (in case anyone knows anything) before asking them..
AgitatedDove14
Just figured it out..
node.base_task_id is the base task, which will always be in draft mode. Instead, we should use node.executed, which references the task that was actually executed for that node.
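A minimal sketch of what I mean, assuming node is a PipelineController step node (e.g. the one handed to a post_execute_callback):
from clearml import Task

def step_completed(pipeline, node):
    template_task = Task.get_task(task_id=node.base_task_id)  # the template, always stays in draft
    executed_task = Task.get_task(task_id=node.executed)      # the clone that actually ran
    print(template_task.status, executed_task.status)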
Hi AgitatedDove14, this isn't the issue. With or without specifying the queue, I get this error when I use the "Create" version, as compared to the "Init" version.
I wonder whether this is some issue with using the "Create" version together with execute_remotely()..
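For reference, these are the two patterns I am comparing (the project, queue, and script names below are just placeholders, not my actual values):
from clearml import Task

# "Init" version - works for me:
task = Task.init(project_name='examples', task_name='remote run')
task.execute_remotely(queue_name='default')  # stops the local run and enqueues the task

# "Create" version - the one that gives the error when combined with execute_remotely():
task = Task.create(project_name='examples', task_name='remote run', script='train.py')
task.execute_remotely(queue_name='default')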
sure, I'll post some questions once I wrap my mind around it..
Thanks JuicyFox94 .
I'm not really from a devops background, so let me try to digest this.. 🙏
Hi Martin, I just un-templated the chart: helm template clearml-server-chart-0.17.0+1.tgz
I found these lines inside:
- name: CLEARML_AGENT_DOCKER_HOST_MOUNT
  value: /opt/clearml/agent:/root/.clearml
Upon ssh-ing into both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files in those folders.. So the mounting worked, it seems.
I am not sure I understand your answer. Should I change the values to something else?
Thanks
For the clearml-agent deployment file, I updated this line:
python3 -m pip install clearml-agent==0.17.2rc4
and restarted the deployment. However, the clearml.conf file is still empty.
Should I also update clearml-agent-services in the clearml-agent-services deployment file?
Yup, I used the values file for the agent. However, I manually edited the one for agent-services (as there was no example for it in the GitHub repo).. Also, I am not sure what CLEARML_HOST_IP should be (I left it empty).
Nothing changed.. the clearml.conf is still as is (empty)
This is from my k8s cluster. Using the ClearML Helm package, I was able to set this up.
Yeah, I restarted the deployment and sshed into the host machine also.. (Img below)
Ah okay, the ---laptop:0 worker is gone now.. But with regard to our original question, I can see the agent (worker) in the clearml-server UI..
I just checked the /root/clearml.conf file and it just contains:
sdk { }
Yup, tried that.. Same error also
Hi AgitatedDove14, I also fiddled around by changing this line and restarted the deployment, but this just causes it to revert back to 0.17.2rc4 again:
python3 -m pip install clearml-agent==0.17.2rc3
Something is weird.. It is showing workers that are not running anymore...
I did update it to clearml-agent 0.17.2; however, the issue still persists for this long-lasting service pod.
However, this issue goes away when dynamically allocating pods using the Kubernetes Glue (k8s_glue_example.py).
I just changed the clearml-agent YAML file to get it to start with the above line:
python3 -m pip install clearml-agent==0.17.2rc4
Hi, another question:
dataset_upload_task = Task.get_task(task_id=args['dataset_task_id'])
iris_pickle = dataset_upload_task.artifacts['dataset'].get_local_copy()
How would I replicate the above for a Dataset? i.e., how do I get the iris_pickle file? I did some hacking like below:
ds.get_mutable_local_copy(target_folder='data')
Subsequently, I also have to load the file by name. I wonder whether there is a more elegant way.
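This is roughly what I am doing now, as a minimal sketch (the dataset id and the 'iris.pkl' file name are placeholders for my actual values):
import os
import pickle
from clearml import Dataset

ds = Dataset.get(dataset_id=args['dataset_task_id'])
local_folder = ds.get_local_copy()  # read-only cached copy of the dataset folder
# unlike task.artifacts['dataset'].get_local_copy(), this gives me a folder,
# so I still have to pick the file out by its name:
with open(os.path.join(local_folder, 'iris.pkl'), 'rb') as f:
    iris = pickle.load(f)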