Yeah, I restarted the deployment and also SSHed into the host machine. (Img below)
So now you don't have any failures, but there's a GPU usage issue?
I didn't run hyper_parameter_optimizer.py, as I figured that if there is already a problem with the base task, there's no point running the series of experiments.
How about running the ClearML agent in docker mode?
Previously, we had our clearml-agent running on the bare-metal machine instead of in docker mode, and there wasn't any issue. Though I haven't tried with the 0.17.2 version.
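If I try it, I guess the docker-mode launch would look roughly like this (a sketch; the queue name and CUDA image below are placeholders):
clearml-agent daemon --queue default --docker nvidia/cuda:11.0-base --gpus all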
Using clearml-task, I am able to pass in the exact requirements.txt file. I am not sure how we can accomplish that when using python train_it.py with the execute_remotely() option.
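Something like this might work on the Task side (a rough sketch; I'm assuming Task.force_requirements_env_freeze is available in our clearml version, and the project/queue names are placeholders):
from clearml import Task

# Call this before Task.init so the agent installs from the pinned requirements file
# instead of the auto-detected packages (assumption: supported in recent clearml versions).
Task.force_requirements_env_freeze(force=True, requirements_file="requirements.txt")

task = Task.init(project_name="my_project", task_name="train_it")  # placeholder names
task.execute_remotely(queue_name="default")  # placeholder queue name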
AgitatedDove14
It is like generating a report per Task (especially for training jobs), essentially packaging a report for each training job.
This is from my k8s cluster. Using the ClearML Helm chart, I was able to set this up.
For the clearml-agent deployment file, I updated this line: python3 -m pip install clearml-agent==0.17.2rc4
and restarted the deployment. However, the conf file is still empty.
Should I also update clearml-agent-services in the clearml-agent-services deployment file?
I just changed the clearml-agent yaml file to get it to start with the above line: python3 -m pip install clearml-agent==0.17.2rc4
Hi AgitatedDove14, I also fiddled around by changing this line to python3 -m pip install clearml-agent==0.17.2rc3 and restarted the deployment, but this just causes it to revert back to 0.17.2rc4 again.
Nothing changed; the clearml.conf is still as-is (empty).
I just checked the /root/clearml.conf file and it just contains: sdk { }
Just figured it out...
Seems like the docker image below didn't have the tensorflow package 😮: tensorflow/tensorflow:latest-devel-gpu
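A quick sanity check I could have run beforehand (assuming docker is available locally; it errors out if the package is missing):
docker run --rm tensorflow/tensorflow:latest-devel-gpu python3 -c "import tensorflow as tf; print(tf.__version__)"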
I should have checked beforehand... My bad.
Thanks for the help
Essentially, while running on the k8s glue, I want to pull the docker image/container, then pip install the additional requirements.txt into it...
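On the agent side, I'm wondering if something like this in clearml.conf would do it (just a sketch; I'm not sure it applies the same way under the k8s glue, and the requirements path is a placeholder):
agent {
    # commands executed inside the docker container before the task starts
    extra_docker_shell_script: [
        "python3 -m pip install -r /path/to/extra_requirements.txt",
    ]
}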
Hi TimelyPenguin76,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py. It seems it didn't use the GPU; however, when I SSHed into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and ran nvidia-smi, it gave an output.
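A quick way to check whether TF itself sees the GPU (as opposed to nvidia-smi on the pod) would be something like this inside the script (assuming TF 2.x):
import tensorflow as tf

# An empty list here means TensorFlow cannot see any GPU inside the container,
# even if nvidia-smi works on the pod/host.
print(tf.config.list_physical_devices("GPU"))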
Sure, I'll post some questions once I wrap my mind around it.
I had this issue too. I realised that the k8s glue wasn't using the GPU resource, compared to running it as a clearml-agent. TimelyPenguin76 suggested using the latest CUDA 11.0 images, though that also didn't work.
Maybe I should have stated our main goal earlier: we are data scientists who need an MLOps environment to track and also run our experiments.
Hi SuccessfulKoala55, kkie..
1) Actually, right now I am using AWS. I am trying to set up the ClearML server in k8s. However, the clearml-agents will just be another EC2 instance/docker image.
2) For phase 2, I will try the ClearML AWS AutoScaler service.
3) At this point, I think I will have a crack at JuicyFox94's solution as well.
We have to do it on-premise; cloud providers are not allowed for the final implementation. Of course, for now we use the cloud to test out our ideas.
Yup, I used the values file for the agent. However, I manually edited it for the agentservices (as there was no example for it in the GitHub repo). Also, I am not sure what CLEARML_HOST_IP is (I left it empty).
TimelyPenguin76: Yup, that's what I do now. However, I should configure it to use some distributed storage later.
Yes, I am already using a Pipeline.
2. I have another project built using the Pipeline. The pipeline always loads the last committed dataset from the above Dataset project and runs a few other steps.
I'm just not sure how to make the Pipeline listen to changes in the Dataset project.
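For now I'm considering a simple polling script running in the services queue, roughly like this (just a sketch; all project/task/queue names below are placeholders):
import time
from clearml import Dataset, Task

last_seen_id = None
while True:
    # Grab the latest completed dataset version from the Dataset project (placeholder names)
    ds = Dataset.get(dataset_project="my_dataset_project", dataset_name="my_dataset", only_completed=True)
    if ds.id != last_seen_id:
        last_seen_id = ds.id
        # Re-launch the pipeline controller by cloning its last run and enqueuing it
        controller = Task.get_task(project_name="my_pipelines", task_name="pipeline_controller")
        new_run = Task.clone(source_task=controller, name="pipeline_controller (auto-trigger)")
        Task.enqueue(new_run, queue_name="services")
    time.sleep(300)  # poll every 5 minutes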
AgitatedDove14 We too self-host (on-prem) the Helm charts in our local k8s ecosystem.
Triggering - will be a nice feature indeed; currently we are using clearml.monitors to address this.
Is it the UI presenting the entire workflow? - This portion would also be nice. (Let's say someone uses 1) a ClearML Dataset -> 2) a Pipeline Controller (containing preprocessing, training, hyperparameter tuning) -> 3) clearml-serving.) It would be great if they could see the entire thing in one flow.
We are using seldon f...
Hi guys,
I filled in the default_output_uri in the conf file, but it doesn't get reflected in the ClearML UI.
Disclaimer: ClearML is set up as a k8s pod using the Helm charts.
sdk {
    development {
        # Default Task output_uri. If output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: ""
    }
}