This is my example. The iteration count is 10, so there are 10 runs. Looking at the 4th run, it reports 60% of the jobs, 91% iteration, 94% time. What does that mean?
Is this some sort of polling?
At the end of the day, we are just worried about whether this will hog resources compared to a webhook. Any ideas?
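For context, this is roughly how the controller is set up; a minimal sketch assuming the standard HyperParameterOptimizer arguments (the base_task_id, queue name, and budget values below are placeholders rather than my real settings). My understanding is that the percentages are usage against these budgets:
```python
from clearml import Task
from clearml.automation import HyperParameterOptimizer, RandomSearch, UniformIntegerParameterRange

task = Task.init(project_name="examples", task_name="HPO controller",
                 task_type=Task.TaskTypes.optimization)

optimizer = HyperParameterOptimizer(
    base_task_id="<base_experiment_task_id>",   # template experiment to clone for each run
    hyper_parameters=[
        UniformIntegerParameterRange("General/epochs", min_value=5, max_value=15, step_size=5),
    ],
    objective_metric_title="accuracy",
    objective_metric_series="validation",
    objective_metric_sign="max",
    optimizer_class=RandomSearch,
    execution_queue="clearmlQueue",
    # The budgets the progress report seems to be measured against:
    total_max_jobs=10,           # "jobs" percentage
    max_iteration_per_job=1000,  # "iterations" percentage
    compute_time_limit=120,      # "time" percentage (compute-time budget, in minutes I believe)
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```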
```
Could not load dynamic library 'libcupti.so.11.0'; dlerror: libcupti.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09:11:17.368793: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09...
```
Hi guys,
I filled in the default_output_uri in the conf file, but it doesn't get reflected in the ClearML UI.
Disclaimer: ClearML is set up as a k8s pod using the Helm charts.
```
sdk {
    development {
        # Default Task output_uri. If output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: ""
    }
}
```
Let me run the clearml-agent outside the k8s system and get back to you.
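In the meantime, a minimal sketch of the per-task workaround I can also test for the output destination, assuming the standard Task.init signature (project/task names and the URI below are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="output-uri check",
    # Overrides sdk.development.default_output_uri for this task only:
    output_uri="s3://my-bucket/clearml-artifacts",
)
```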
What does a control plane do? I can't understand this.
Is it like the serving engine, which will get the user input, preprocess it, run inference, and send back the results?
Hi SuccessfulKoala55, okay.
1) Actually, I am now using AWS. I am trying to set up the ClearML server in K8s. However, the clearml-agents will just be separate EC2 instances/Docker images.
2) For phase 2, I will try the ClearML AWS AutoScaler Service.
3) At this point, I think I will have a crack at JuicyFox94's solution as well.
Hi, is there a reference for the values.yaml, especially if we want to assign more memory to the webserver service, etc.? I tried googling around but so far no luck.
Maybe more of a data repository than a model repository...
Using clearml-task, I am able to pass in the exact requirements.txt file. I am not sure how we can accomplish that when using the Python train_it.py and execute_remotely() option.
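What I have in mind for the execute_remotely() route is roughly this; a rough sketch assuming Task.add_requirements behaves as documented (queue, package names, and versions are placeholders):
```python
from clearml import Task

# Must be called before Task.init. Newer SDK versions reportedly also accept a path,
# e.g. Task.add_requirements("requirements.txt"), but per-package pinning is the safe bet:
Task.add_requirements("tensorflow", "2.4.1")
Task.add_requirements("pandas")

task = Task.init(project_name="examples", task_name="train_it")
task.execute_remotely(queue_name="clearmlQueue", exit_process=True)

# ... the rest of train_it.py runs on the agent ...
```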
AgitatedDove14
So now you don't have any failures, but a GPU usage issue?
I didn't run hyper_parameter_optimizer.py, as I was thinking that if there is already a problem with the base, there is no use in running the series of experiments.
How about running the ClearML agent in docker mode?
Previously, we had our clearml-agent running on the bare-metal machine instead of in Docker mode, and there wasn't any issue. Though I haven't tried it with version 0.17.2.
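For what it's worth, this is how I would pin the container image for a task when the agent runs in docker mode; just a sketch, assuming the single-string set_base_docker signature (the image tag is only an example, not a recommendation):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="train_it")
# The docker-mode agent should spin this image up for the task:
task.set_base_docker("nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04")
task.execute_remotely(queue_name="clearmlQueue", exit_process=True)
```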
Mostly DL, but I suppose there could be ML use cases also
Yeah, that worked. I was running the agent on a different machine, as our deployment of ClearML was in k8s.
The above screenshot is from my local settings. My agents run in the k8s system (i.e., as pods).
Hi AgitatedDove14, attached is my create version compared to the init version.
When I enqueue both the init and create versions into my clearmlQueue, it seems the create version doesn't execute at all.
It just prints "2021-05-26 16:02:13,053 - clearml - WARNING - Terminating local execution process" and says it has completed successfully.
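For reference, simplified sketches of the two versions; rough code assuming the standard Task.init / Task.create signatures (script, project, and queue names are placeholders, not my actual ones):
```python
from clearml import Task

# --- "init version" (inside train_it.py itself) ---
# The training script calls Task.init and ships itself to the queue:
task = Task.init(project_name="examples", task_name="train_it (init)")
task.execute_remotely(queue_name="clearmlQueue", exit_process=True)
# ... training code continues on the agent ...

# --- "create version" (a separate launcher script) ---
# Registers train_it.py as a task and enqueues it:
created = Task.create(
    project_name="examples",
    task_name="train_it (create)",
    script="train_it.py",
    requirements_file="requirements.txt",
)
Task.enqueue(created, queue_name="clearmlQueue")
```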
We have k8s on EC2 instances in the cloud. I'll try it there tomorrow and report back.
We have to do it on-premises; cloud providers are not allowed for the final implementation. Of course, for now we use the cloud to test out our ideas.
Hi TimelyPenguin76 ,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py. It seems that it didn't use the GPU; however, when I ssh'd into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and ran nvidia-smi, it gave output.
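To rule things out, here is a quick check I can drop into base_template_keras_simple.py, using standard TensorFlow calls, to see whether TF inside the pod actually sees the GPU (nvidia-smi only proves the driver is visible):
```python
import tensorflow as tf

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```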
Maybe I should have stated our main goal earlier: we are data scientists who need an MLOps environment to track and also run our experiments.
More than the documentation, my main issue was that the name executed is far too vague. Maybe something like executed_task_id, or something along that line, would be more appropriate.
Just to add on, I am using minikube now.
I had this issue too. I realised that the k8s glue wasn't using the GPU resource, compared to running it as a plain clearml-agent. TimelyPenguin76 suggested using the latest CUDA 11.0 images, though that also didn't work.
Okay, I was checking in the forum (in case anyone knows anything) before asking them.
AgitatedDove14
Just figured it out.
node.base_task_id is the base task, which will always be in draft mode. Instead, we should use node.executed, which references the task that was actually executed for that node.
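For anyone hitting the same thing, this is roughly how I read the executed task now; a small sketch assuming a recent SDK and the (pipeline, node) post_execute_callback signature (project, step, and queue names are placeholders):
```python
from clearml import Task
from clearml.automation import PipelineController

def step_completed(pipeline, node):
    # node.base_task_id -> the draft template; node.executed -> the task that actually ran
    executed_task = Task.get_task(task_id=node.executed)
    print(f"step {node.name}: executed task id = {node.executed}, "
          f"status = {executed_task.get_status()}")

pipe = PipelineController(name="my-pipeline", project="examples", version="0.0.1")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train_it",
    post_execute_callback=step_completed,
)
pipe.start(queue="clearmlQueue")
```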
Hi AgitatedDove14, this isn't the issue. With or without specifying the queue, I get this error when I use the "create version", as compared to the "init version".
I wonder whether this is some issue with using the create version together with execute_remotely().