It is like generating a report per Task (especially for Training jobs), packaging a report out per Training job.
Okay, now I get it. I set up the clearml-agent on an EC2 instance and it works now.
Thanks
RoughTiger69
So the Prefect tasks:
1. Load data into clearml-data
2. Run training in ClearML
3. Publish model (manual trigger required, so the user publishes the model) and return the model URL
4. Seldon deploys the model (model URL passed in)
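The hand-off between those steps can be sketched with plain-Python stand-ins (all function names, IDs, and the URL below are made up for illustration; they are not real Prefect/ClearML/Seldon APIs):

```python
# Hypothetical sketch of the flow above: each function is a stand-in for
# one Prefect task, showing what gets passed from step to step.

def load_data():
    """Stand-in for loading data into clearml-data; returns a dataset id."""
    return "dataset-id-001"  # placeholder id

def run_training(dataset_id):
    """Stand-in for the ClearML training task."""
    return {"dataset": dataset_id, "model": "model-artifact"}

def publish_model(training_result):
    """Stand-in for the manual publish step; returns the model URL."""
    return "https://files.example.com/models/" + training_result["model"]

def deploy(model_url):
    """Stand-in for the Seldon deployment step (model URL passed in)."""
    return "deployed:" + model_url

# Wire the steps together, threading the model URL through:
result = deploy(publish_model(run_training(load_data())))
print(result)
```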
Hi, sorry for the delayed response. Btw, all the pods are running fine.
Hi TimelyPenguin76 ,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py. It seems that it didn't use the GPU; however, when I ssh into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and run nvidia-smi, it gives an output.
Yeah, that worked. I was running the agent on a different machine, as our ClearML deployment is in k8s.
Yeah, currently we are evaluating Seldon. But I was wondering whether the ClearML enterprise version would do something similar?
Okay, I was checking in the forum (in case anyone knows anything) before asking them.
So now you don't have any failures, but a GPU usage issue?
I didn't run hyper_parameter_optimizer.py; I figured that if there is already a problem with the base, there's no point running the series of experiments.
How about running the ClearML agent in docker mode?
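Something like this (the queue name and docker image are just example values, not anything from this thread):

```shell
# Example only: run the agent in docker mode, using the given image as
# the default container for tasks pulled from the "default" queue.
clearml-agent daemon --queue default --docker nvidia/cuda:11.0-base
```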
Previously, we ran our clearml-agent on a bare-metal machine instead of in docker mode, and there wasn't any issue. Though I haven't tried it with the 0.17.2 version.
Hi AgitatedDove14, imho links are definitely better, unless someone decides to archive their Tasks. Just wondering about the possibility.
Ah, so in the future, we can add non-clearml code as a step in the pipeline controller.
Ah okay, the ---laptop:0 worker is gone now. But regarding our original question, I can see the agent (worker) in the clearml-server UI.
What does a control plane do? I can't understand this.
Like the serving engine: it will get the user input, preprocess it, run inference, and send back the results.
For the clearml-agent deployment file, I updated this line: python3 -m pip install clearml-agent==0.17.2rc4, and restarted the deployment. However, the conf file is still empty.
Should I also update clearml-agent-services in the clearml-agent-services deployment file?
Yeah, within ClearML we use the PipelineController. We are now mainly looking for a single tool to stitch together other products.
But of course, we will give first precedence to tools which work best with ClearML. Hence asking if anyone has had similar experience setting up such systems.
We have k8s on EC2 instances in the cloud. I'll try it there tomorrow and report back.
Hi, I will proceed to close this thread. We found some issue with the underlying docker on our machines. We've now shifted to another k8s cluster of EC2 instances in AWS.
Essentially, while running on k8s_glue, I want to pull the docker image/container, then pip install the additional requirements.txt into it...
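Roughly, the manual equivalent of what I want the glue to do (the image name and file names are just examples, assuming the task code sits in the current directory):

```shell
# Illustrative sketch only: pull the base image, mount the task code, then
# install the extra requirements inside the container before running it.
docker pull tensorflow/tensorflow:latest-gpu
docker run --rm -v "$PWD":/workspace -w /workspace \
  tensorflow/tensorflow:latest-gpu \
  bash -c "python3 -m pip install -r requirements.txt && python3 train.py"
```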
Maybe more of a data repository than a model repository...
One use case now :
1. Load data from Label Studio (manager to manually approve)
2. Push data to clearml-data
3. Run training (manager to manually publish)
4. Push model URI to the next step
5. Seldon deploys it
Later, if Seldon detects a data drift, it will automatically re-run steps 2-5.
At this point, we haven't drilled all of it down yet.
Just figured it out.
Seems like the docker image below didn't have the tensorflow package: tensorflow/tensorflow:latest-devel-gpu. I should have checked beforehand... my bad.
Thanks for the help
Nice, this looks a bit more friendly. Let me try it. Thanks!
Just to add on, I am using minikube now.
Could not load dynamic library 'libcupti.so.11.0'; dlerror: libcupti.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09:11:17.368793: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09...
Thanks.
AgitatedDove14 We too self-host (on-prem) the helm charts in our local k8s ecosystem.
Triggering - will be a nice feature indeed; currently we are using clearml.monitors to address this.
Is it the UI presenting the entire workflow? - This portion will also be nice. (Let's say someone uses 1) a ClearML Dataset -> 2) a PipelineController (containing preprocessing, training, hyperparameter tuning) -> 3) clearml-serving.) It would be great if they could see the entire thing in one flow.
We are using seldon f...

