Yeah, I restarted the deployment and SSHed into the host machine also.. (Img below)
Yeah, currently we are evaluating Seldon.. But was wondering whether the ClearML enterprise version would do something similar?
So now you don't have any failures, but a GPU usage issue?
I didn't run hyper_parameter_optimizer.py, as I was thinking that if there is already a problem with the base, there is no point in running the series of experiments.
How about running the ClearML agent in docker mode?
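For reference, starting the agent in docker mode would look roughly like this (just a sketch; the queue name and base image here are placeholders, not necessarily what we use):
# run the agent so each task executes inside a docker container
clearml-agent daemon --queue default --docker nvidia/cuda:11.0-base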
Previously, we had our clearml-agent running on the bare-metal machine instead of in docker mode. There wasn't any issue.. Though I haven't tried that with the 0.17.2 version.
Using clearml-task, I am able to pass in the exact requirements.txt file; I am not sure how we can accomplish that when using the Python train_it.py and execute_remotely() option.
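What I'm doing with clearml-task looks roughly like this (a sketch; the project and queue names are just placeholders from my setup):
# create a task from the local script and point it at an explicit requirements file
clearml-task --project MyProject --name train_it --script train_it.py --requirements requirements.txt --queue default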
AgitatedDove14
It is like generating a report at the Task level (especially for training jobs).. It's like packaging a report per training job..
This is from my k8s cluster. Using the ClearML Helm chart, I was able to set this up.
For the clearml-agent deployment file, I updated this line:
python3 -m pip install clearml-agent==0.17.2rc4
and restarted the deployment. However the conf file is still empty.
Should I also update clearml-agent-services in its deployment file?
I just changed the YAML file of clearml-agent to get it to start with the above line:
python3 -m pip install clearml-agent==0.17.2rc4
Hi AgitatedDove14, I also fiddled around by changing this line and restarted the deployment, but this just causes it to revert back to 0.17.2rc4 again:
python3 -m pip install clearml-agent==0.17.2rc3
Nothing changed.. the clearml.conf is still as is (empty)
Could it be another application's "elasticsearch-pv" and not ClearML's?
I just checked the /root/clearml.conf file and it just contains:
sdk { }
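For comparison, my understanding is that a populated clearml.conf would normally at least have an api section, roughly like this (all values below are placeholders):
api {
    # server endpoints (default ClearML server ports)
    web_server: http://<server-ip>:8080
    api_server: http://<server-ip>:8008
    files_server: http://<server-ip>:8081
    # credentials generated from the ClearML web UI
    credentials {
        access_key: "<ACCESS_KEY>"
        secret_key: "<SECRET_KEY>"
    }
}
sdk { }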
Just figured it out..
Seems like the docker image below didn't have the tensorflow package..
tensorflow/tensorflow:latest-devel-gpu
I should have checked prior... My bad..
Thanks for the help
Essentially, while running on k8s_glue, I want to pull the docker image/container, then pip install the additional requirements.txt into it...
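If I'm reading the agent config template correctly, one way to get something like that might be the extra_docker_shell_script option in the agent section of clearml.conf, so the install runs inside the container before the task starts (just a guess on my side; the requirements.txt path inside the container is an assumption):
agent {
    # shell lines executed inside the docker container before the task is launched
    extra_docker_shell_script: ["pip install -r requirements.txt"]
}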
Is this some sort of polling?
At the end of the day, we are just worried whether this will hog resources compared to a webhook? Any ideas?
Hi TimelyPenguin76 ,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py instead.. It seems that it didn't use the GPU; however, when I SSH into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and run nvidia-smi, it gives an output.
Okie.. now I get it.. I set up the clearml-agent on an EC2 instance, and it works now.
Thanks
sure, I'll post some questions once I wrap my mind around it..
For me too, I had this issue.. I realised that the k8s glue wasn't using the GPU resource, compared to running it as a clearml-agent.. TimelyPenguin76 suggested using the latest CUDA 11.0 images, though that also didn't work.
Nice.. this looks a bit friendlier.. Let me try it.. Thanks
Our main goal, which maybe I should have stated earlier: we are data scientists who need an MLOps environment to track and also run our experiments..
Hi SuccessfulKoala55, okie..
1) Actually, now I am using AWS. I am trying to set up the ClearML server in k8s. However, the clearml-agents will just be another EC2 instance/docker image.
2) For phase 2, I will try the ClearML AWS AutoScaler service.
3) At this point, I think I will have a crack at JuicyFox94's solution as well.
We have to do it on-premise.. Cloud providers are not allowed for the final implementation. Of course, for now we use the cloud to test out our ideas.
Yup, I used the values file for the agent. However, I manually edited it for the agentservices (as there was no example for it in the GitHub repo).. Also, I am not sure what CLEARML_HOST_IP is (I left it empty).