
When I push a job to an agent node, I get this error:
"Error response from daemon: network None not found"
However, I am able to get it to work if I launch a clearml-agent outside the Kubernetes ecosystem.
It'd be good if there were a YAML file to deploy clearml-agents into the k8s system.
Hi, sorry for the delayed response. Btw, all the pods are running fine.
Hi AgitatedDove14, just updated that flag, but the problem continues..
```
agent.package_manager.system_site_packages = true
.....
Environment setup completed successfully
Starting Task Execution:
ClearML results page: files_server:
Traceback (most recent call last):
  File "base_template_keras_simple.py", line 15, in <module>
    import tensorflow as tf  # noqa: F401
  File "/root/.clearml/venvs-builds/3.6/lib/python3.6/site-packages/clearml/binding/import_bind.py", line 59, in __pat...
```
Just figured it out..
Seems like the docker image below didn't have the tensorflow package.. 😮
`tensorflow/tensorflow:latest-devel-gpu`
I should have checked beforehand... my bad..
Thanks for the help
Essentially, while running on k8s_glue, I want to pull the docker image/container, then pip install the additional requirements.txt into it...
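Btw, here's roughly what I mean, a minimal sketch from the task side assuming `Task.add_requirements` works the way I think it does (package names are just placeholders); the agent should then pip-install these on top of the pulled image:
```python
from clearml import Task

# Hypothetical example: declare extra packages the agent should pip-install
# on top of the docker image it pulls. Must be called *before* Task.init().
Task.add_requirements("tensorflow")        # latest available version
Task.add_requirements("pandas", "1.2.4")   # or pin a specific version

task = Task.init(project_name="examples", task_name="train_it")
```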
This is where I downloaded the log. Seems like some docker issue, though I can't seem to figure it out. As an alternative, I spawned a clearml-agent outside the k8s environment and it was able to execute well.
We also might have some other steps incorporated for other tools. We intend to have Label Studio upstream.. so we definitely need some orchestrator tool.
Hi AgitatedDove14 ,
At this point, showing the URL of the ClearML task might be sufficient, unless in the future someone wants it customised.
But the bigger question is whether there is a tool to aid with this workflow building? We are currently experimenting with airflow/prefect.
Ah, so in the future, we can add non-clearml code as a step in the pipeline controller.
One use case now:
1. Load data from Label Studio (manager to manually approve)
2. Push data to ClearML Data
3. Run training (manager to manually publish)
4. Push the model URI to the next step
5. Seldon deploys it
Later, if Seldon detects a data drift, it will automatically re-run steps 2-5..
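Something like this is what I have in mind, a rough sketch of how these steps could hang off ClearML's PipelineController, assuming each step already exists as a template task (all project/task names below are made up):
```python
from clearml.automation import PipelineController

# Hypothetical pipeline; each base task is assumed to already exist
# as a (draft) template task in the ClearML UI.
pipe = PipelineController(
    name="labelstudio-to-seldon", project="examples", version="0.0.1"
)
pipe.add_step(
    name="ingest_data",
    base_task_project="examples",
    base_task_name="pull from label studio",
)
pipe.add_step(
    name="train",
    parents=["ingest_data"],
    base_task_project="examples",
    base_task_name="train model",
)
pipe.add_step(
    name="deploy",
    parents=["train"],
    base_task_project="examples",
    base_task_name="seldon deploy",
)
pipe.start(queue="clearmlQueue")  # the controller itself runs on this queue
```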
At this point, we haven't drilled all of it down yet.
Sure, I'll post some questions once I wrap my mind around it..
Hi SuccessfulKoala55, okie..
1) Actually, I am now using AWS. I am trying to set up the ClearML server in k8s. However, the clearml-agents will just be another EC2 instance/docker image.
2) For phase 2, I will try the ClearML AWS AutoScaler service.
3) At this point, I think I will have a crack at JuicyFox94's solution as well.
Nice.. this looks a bit more friendly.. 🙂 Let me try it.. Thanks
Just to add on, I am using minikube now.
We have to do it on-premise.. Cloud providers are not allowed for the final implementation. Of course, for now we use the cloud to test out our ideas.
Thanks JuicyFox94 .
I'm not really from a devops background, so let me try to digest this.. 🙏
Our main goal, which maybe I should have stated earlier: we are data scientists who need an MLOps environment to track and also run our experiments..
Hi guys,
I filled in the default_output_uri in the conf file, but it doesn't get reflected in the ClearML UI.
Disclaimer: ClearML is set up as a k8s pod using the Helm charts.
```
sdk {
    development {
        # Default Task output_uri. If output_uri is not provided to Task.init,
        # default_output_uri will be used instead.
        default_output_uri: ""
    }
}
```
Yup, I updated this in my local clearml.conf... or should I be updating this elsewhere as well?
Yeah, that worked.. I was running the agent on a different machine, as our deployment of clearml was in k8s.
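As an aside, one way to sidestep the per-machine clearml.conf is to pass the output URI straight into Task.init; a minimal sketch, with a placeholder destination URL:
```python
from clearml import Task

# The conf-file default_output_uri is only read from the clearml.conf on the
# machine where the code actually runs, so passing output_uri explicitly
# avoids that pitfall.
task = Task.init(
    project_name="examples",
    task_name="train_it",
    output_uri="s3://my-bucket/clearml-artifacts",  # placeholder URL
)
```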
I just downloaded the logs from the failed task. It seems I have set agent.package_manager.system_site_packages: true in the agent as well.
Using clearml-task, I am able to pass in the exact requirements.txt file; I am not sure how we can accomplish that when using the python train_it.py and execute_remotely() option.
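One possibility I came across, a sketch assuming the Task.force_requirements_env_freeze classmethod is available in our clearml version (file path is a placeholder):
```python
from clearml import Task

# Hypothetical: point the task at an exact requirements.txt instead of the
# auto-detected imports. Must be called *before* Task.init().
Task.force_requirements_env_freeze(requirements_file="requirements.txt")

task = Task.init(project_name="examples", task_name="train_it")
task.execute_remotely(queue_name="clearmlQueue")  # stop locally, run via agent
```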
AgitatedDove14
The above screenshot is from my local settings... My agents run in the k8s system (like in a pod)
Hi AgitatedDove14, attached is my create version compared to the init version..
When I enqueue both the init and create versions into my clearmlQueue, it seems the create version doesn't execute at all.
It just mentions "2021-05-26 16:02:13,053 - clearml - WARNING - Terminating local execution process" and says it has completed successfully.
Hi AgitatedDove14, this isn't the issue. With or without specifying the queue, I get this error with the "create version" as compared to the "init version".
I wonder whether this is some issue with using the Create version together with execute_remotely() ..
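For reference, this is roughly how I understand the two variants (a sketch; names and paths are placeholders, and I'm assuming Task.enqueue is the right way to queue the create version, since execute_remotely() only makes sense inside a running script):
```python
from clearml import Task

def init_version():
    # The running script registers itself, then re-launches on an agent.
    task = Task.init(project_name="examples", task_name="train_it")
    task.execute_remotely(queue_name="clearmlQueue")  # local process stops here
    # ... training code below only executes on the agent ...

def create_version():
    # Register train_it.py as a task without running it locally, then enqueue it.
    task = Task.create(
        project_name="examples",
        task_name="train_it",
        script="train_it.py",  # placeholder path
    )
    Task.enqueue(task, queue_name="clearmlQueue")
```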