Essentially, while running on k8s_glue, I want to pull the docker image/container and then pip install the additional requirements.txt into it...
Ah okay, now I get it. I set up the clearml-agent on an EC2 instance, and it works now.
Thanks
Ah okay, the ---laptop:0 worker is gone now. But regarding our original question, I can see the agent (worker) in the clearml-server UI.
Using clearml-task, I am able to pass in the exact requirements.txt file, but I am not sure how to accomplish that when using the python train_it.py and execute_remotely() approach.
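For reference, a rough sketch of the SDK-side approach I had in mind, assuming Task.force_requirements_env_freeze accepts an explicit requirements file (the project, task and queue names below are just placeholders):

from clearml import Task

# Sketch only: point the task at an explicit requirements file before Task.init,
# so the agent installs exactly these packages instead of the auto-detected ones.
Task.force_requirements_env_freeze(force=True, requirements_file="requirements.txt")

task = Task.init(project_name="examples", task_name="train_it")  # placeholder names

# Stop local execution and enqueue the task for the remote agent / k8s glue worker.
task.execute_remotely(queue_name="glue_q", exit_process=True)

# ... training code below this point only runs on the remote worker ...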
AgitatedDove14
Yeah, that worked. I was running the agent on a different machine, since our deployment of ClearML was in k8s.
Hi AgitatedDove14, I also fiddled around by changing this line and restarted the deployment, but this just causes it to revert back to 0.17.2rc4 again.

python3 -m pip install clearml-agent==0.17.2rc3
This is where I downloaded the log. Seems like some docker issue, though I can't seem to figure it out. As an alternative, I spawned a clearml-agent outside the k8s environment and it was able to execute well.
Nice tutorial.. Though personally, I prefer a more clean-cut presentation (without the Yays and muaks or the turtle). 😄 But usually, as long as the content is there, it shouldn't matter...
Mostly DL, but I suppose there could be ML use cases also
Hi guys,
I filled in the default_output_uri in the conf file, but it doesn't get reflected in the ClearML UI.
Disclaimer: ClearML is set up as a k8s pod using the Helm charts.

sdk {
  development {
    # Default Task output_uri. If output_uri is not provided to Task.init, default_output_uri will be used instead.
    default_output_uri: " "
  }
}
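As a side note, the same destination can also be set per task via Task.init's output_uri argument instead of the conf file; a minimal sketch (the project name, task name and bucket path are placeholders):

from clearml import Task

# Sketch only: pass output_uri directly to Task.init as a per-task override,
# instead of relying on sdk.development.default_output_uri from clearml.conf.
task = Task.init(
    project_name="examples",                # placeholder
    task_name="default-output-uri-check",   # placeholder
    output_uri="s3://my-bucket/clearml",    # placeholder destination
)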
Yeah, I restarted the deployment and SSHed into the host machine as well.. (Img below)
Maybe more of a data repository than a model repository...
For me too, I had this issue.. I realised that the k8s glue wasn't using the GPU resource, compared to running it as a plain clearml-agent.. TimelyPenguin76 suggested using the latest CUDA 11.0 images, though that also didn't work.
I did update it to clearml-agent 0.17.2; however, the issue still persists for this long-lasting service pod.
However, this issue goes away when dynamically allocating pods using the Kubernetes Glue (k8s_glue_example.py).
The use case is: let's say I run

python k8s_glue_example.py --queue glue_q

and someone pushes a hyperparameter optimization job with over 100 experiments to glue_q. One minute later, I push a simple training job to glue_q, but I will be forced to wait for the 100 experiments to finish.
Hi AgitatedDove14, thanks for the explanation.

python k8s_glue_example.py --queue high_priority_q --ports-mode --num-of-services 10
python k8s_glue_example.py --queue low_priority_q --ports-mode --num-of-services 2

Would the above be a good way to simulate the below?

clearml-agent daemon --queue high_priority_q low_priority_q
RoughTiger69
So the Prefect tasks:
1. Load data into clearml-data
2. Run training in ClearML
3. Publish the model (manual trigger required, so the user publishes the model) and return the model URL
4. Seldon deploys the model (model URL passed in)
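Roughly what I have in mind, as a sketch only, assuming Prefect 2.x and the clearml SDK; the function names, project/dataset names, data folder and the Seldon hand-off are placeholders:

from prefect import flow, task
from clearml import Dataset, Task as ClearmlTask

@task
def load_data() -> str:
    # Version the raw data with clearml-data (project/name are placeholders).
    ds = Dataset.create(dataset_project="demo", dataset_name="raw-data")
    ds.add_files("data/")  # assumed local data folder
    ds.upload()
    ds.finalize()
    return ds.id

@task
def run_training(dataset_id: str) -> str:
    # Track the training run in ClearML; real training code would go here.
    t = ClearmlTask.init(project_name="demo", task_name="train")
    t.connect({"dataset_id": dataset_id})
    t.close()
    return t.id

@task
def get_model_url(task_id: str) -> str:
    # The model is published manually in the UI; here we only read back its URL.
    t = ClearmlTask.get_task(task_id=task_id)
    return t.models["output"][-1].url

@task
def deploy_to_seldon(model_url: str) -> None:
    # Placeholder: hand the model URL over to the Seldon deployment tooling.
    print(f"deploying {model_url} to Seldon")

@flow
def train_and_deploy():
    dataset_id = load_data()
    task_id = run_training(dataset_id)
    deploy_to_seldon(get_model_url(task_id))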