I just changed the yaml file of clearml-agent to get it to start with this line:
python3 -m pip install clearml-agent==0.17.2rc4
Hi AgitatedDove14, just regarding your reply on https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045. Basically, as jobs are pulled by order, they are pushed into the k8s; then if we hit the max k8s instance limit, we stop pulling jobs until a k8s job is completed, then we continue. For this scenario:
k8s has an instance limit of 10 (let's say)
I run Optimization (it has about 100 jobs) but only the first 10 will be pulled into k8s. After this, I run a single Deep Learning (DL)...
Yeah, within clearml we use the PipelineController. We are now mainly looking for a single tool to stitch together other products.
But of course, we will give precedence to tools which work best with clearml. Hence the question, whether anyone has had similar experience setting up such systems.
Hi AgitatedDove14, this isn't the issue. With or without specifying the queue, I get this error when I use the "Create" version, as compared to the "Init" version.
I wonder whether this is some issue with using the Create version together with execute_remotely()...
Essentially, while running on k8s_glue, I want to pull the docker image/container, then pip install the additional requirements.txt into it...
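A rough sketch of what I had in mind (my own example, not confirmed in the thread; it assumes a clearml version whose Task.set_base_docker accepts the docker_setup_bash_script argument, and the image name, requirements path and queue name are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="train with extra requirements")
# Ask the agent to pull this image and run the extra pip install inside it
task.set_base_docker(
    docker_image="nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04",
    docker_setup_bash_script=["pip install -r /opt/extra/requirements.txt"],
)
# Enqueue for the k8s glue to pick up
task.execute_remotely(queue_name="glue_q")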
Ok, now I get it.. I set up the clearml-agent on an EC2 instance and it works now.
Thanks
Ah ok, the ---laptop:0 worker is gone now. But with regard to our original question, I can see the agent (worker) in the clearml-server UI.
Using clearml-task, I am able to pass in the exact requirements.txt file, but I am not sure how we can accomplish that when using the python train_it.py and execute_remotely() option.
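For reference, a minimal sketch of how I imagine the Task.init route could pick up the same file (assuming the installed clearml version lets Task.add_requirements take a path to a requirements file; the project and queue names are placeholders):

# train_it.py
from clearml import Task

# Must be called before Task.init so the agent installs exactly these packages
Task.add_requirements("requirements.txt")

task = Task.init(project_name="examples", task_name="train_it")
# Stop local execution and enqueue the task for the agent / k8s glue
task.execute_remotely(queue_name="glue_q", exit_process=True)

# ... actual training code, executed on the agent ...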
AgitatedDove14
Yeah, that worked. I was running the agent on a different machine, as our deployment of clearml was in k8s.
Hi AgitatedDove14, I also fiddled around by changing this line and restarted the deployment, but this just causes it to revert back to 0.17.2rc4 again:
python3 -m pip install clearml-agent==0.17.2rc3
This is where I downloaded the log. Seems like some docker issue, though I can't seem to figure it out. As an alternative, I spawned a clearml-agent outside the k8s environment and it was able to execute well.
Nice tutorial.. Though personally, I prefer a more clean-cut presentation (without the Yays and muaks or the turtle). 😄 But usually, as long as the content is there, it shouldn't matter...
Yup, tried that.. Same error as well.
Mostly DL, but I suppose there could be ML use cases also
The above screenshot is from my local settings... My agents run in the k8s system (like in a pod)
Hi guys,
I filled in the default_output_uri in the conf file, but it doesn't get reflected in the clearml UI.
Disclaimer: ClearML is set up as a k8s pod using the Helm charts.
sdk {
    development {
        # Default Task output_uri. If output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: " "
    }
}
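Just to illustrate what I expect (my own sketch, not from the thread): per the comment in the conf, default_output_uri is only the fallback, and passing output_uri directly to Task.init takes precedence (the bucket path below is a placeholder):

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="output-uri check",
    # Takes precedence over sdk.development.default_output_uri in clearml.conf
    output_uri="s3://my-bucket/clearml-artifacts",
)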
Yeah, I restarted the deployment and SSHed into the host machine as well (image below).
Maybe more of a data repository than a model repository...
For me too, I had this issue.. I realised that the k8s glue wasn't using the GPU resource, compared to running it as clearml-agent. TimelyPenguin76 suggested using the latest CUDA 11.0 images, though that also didn't work.
I did update it to clearml-agent 0.17.2, however the issue still persists for this long-lasting service pod.
However, this issue goes away when dynamically allocating pods using the Kubernetes Glue (k8s_glue_example.py).
The use case is: let's say I run python k8s_glue_example.py --queue glue_q, and someone pushes a hyperparameter optimization job with over 100 experiments to glue_q; one minute later, I push a simple training job to glue_q, but I will be forced to wait for the 100 experiments to finish.
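One workaround I can think of (my own sketch, nothing confirmed in the thread; queue names are placeholders) is to keep a second queue with its own glue instance, e.g. python k8s_glue_example.py --queue glue_q_fast, and send quick jobs there so they are not stuck behind the 100-experiment run:

from clearml import Task

task = Task.init(project_name="examples", task_name="simple training job")
# Enqueue onto the second glue instance's queue instead of the busy glue_q
task.execute_remotely(queue_name="glue_q_fast")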


