Hi AgitatedDove14, just updated that flag, but the problem continues..
`agent.package_manager.system_site_packages = true`
.....
Environment setup completed successfully
Starting Task Execution:
ClearML results page: files_server:
Traceback (most recent call last):
File "base_template_keras_simple.py", line 15, in <module>
import tensorflow as tf # noqa: F401
File "/root/.clearml/venvs-builds/3.6/lib/python3.6/site-packages/clearml/binding/import_bind.py", line 59, in __pat...
Just figured it out..
Seems like the docker image below didn't have the tensorflow package.. 😮
`tensorflow/tensorflow:latest-devel-gpu`
I should have checked beforehand... My bad..
Thanks for the help
It'll be good if there was a YAML file to deploy clearml-agents into the k8s system.
This is where I downloaded the log. Seems like some docker issue, though I can't seem to figure it out. As an alternative, I spawned a clearml-agent outside the k8s environment and it was able to execute fine.
Hi, will proceed to close this thread. We found some issue with the underlying docker on our machines. We have now shifted to another k8s cluster of EC2 instances on AWS.
When I push a job to an agent node, I get this error:
"Error response from daemon: network None not found"
Hi, here's a workaround I thought of.. Btw, I haven't tried it yet. AnxiousSeal95, your comments?
1) Attach a clearml-task id to each new dataset-id.
So in the future, when new data comes in, get the last data commit from the project (Dataset) and get the clearml-task for it. Then clone the clearml-task and pass in the new data. The only downside is the need to clone the clearml-task (see the sketch after option 2 below).
Or alternatively
2) Attach a gitsha-id of the processing code to each new dataset-id.
This can't give the exact code ...
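For illustration, a minimal sketch of option (1) in Python, untested; the project/task/queue names and the `General/dataset_id` parameter are placeholders, not from an actual setup:
```python
# Sketch of workaround (1): find the task behind the last dataset commit,
# clone it, and point the clone at the new data. All names are placeholders.
from clearml import Task, Dataset

# latest committed dataset in the data project
latest_ds = Dataset.get(dataset_project="PROJ_1", dataset_name="my_dataset")

# the processing task attached to the previous commit (assumed findable by name)
base_task = Task.get_task(project_name="PROJ_1", task_name="data_processing")

# clone the task and pass in the new dataset id as a hyperparameter
new_task = Task.clone(source_task=base_task, name="data_processing (new commit)")
new_task.set_parameter("General/dataset_id", latest_ds.id)

# push the clone to an agent queue for execution
Task.enqueue(new_task, queue_name="glue_q")
```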
No, the agent can be on any machine.
But the agent has to be running on the machine with the GPU.
nice... we need moarrrrrrrr !!!!!!!!
It would be really helpful if you could do the next episode on setting up ClearML in Kubernetes.. 😇
Anyway, keep up the good work for the community!
Yup, I used the values file for the agent. However, I manually edited it for the agentservices (as there was no example for it in the GitHub repo).. Also, I am not sure what the CLEARML_HOST_IP should be (left it empty).
Hi FriendlySquid61, the clearml-agent section got filled in from the values.yaml file. However, the agentservices section was empty, so I filled it in manually..
Yup, tried that.. Same error as well.
Hi Martin, I just un-templated the chart with `helm template clearml-server-chart-0.17.0+1.tgz` and found these lines inside:
```
- name: CLEARML_AGENT_DOCKER_HOST_MOUNT
  value: /opt/clearml/agent:/root/.clearml
```
Upon ssh-ing into the folders on both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there.. So the mounting worked, it seems.
I am not sure I get your answer. Should I change the values to something else?
Thanks
I did update it to clearml-agent 0.17.2, however the issue still persists for this long-lasting service pod.
However, this issue goes away when dynamically allocating pods using the Kubernetes Glue (k8s_glue_example.py).
I just checked the /root/clearml.conf file and it just contains `sdk { }`
Yeah, I restarted the deployment and sshed into the host machine also.. (Img below)
Nice tutorial.. Though personally, I prefer a more clean-cut presentation (without the Yays and muaks or the turtle). 😄 But usually, as long as the content is there, it shouldn't matter...
Hi, sorry for the delayed response. Btw, all the pods are running fine.
Hi, using the pipeline examples, with step1_dataset_artifact.py, step2_data_processing.py, step3_train_model.py ==> pipeline_controller.py
In the above example, the pipeline_controller is stringing together 3 python files; instead, could it string together 3 containers? Of course, we could manually compile each into a docker image, but does ClearML have a similar approach baked in? (See the sketch below.)
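For context, this is roughly how the example's pipeline_controller.py strings the three steps together. My assumption (not confirmed) is that each base task can carry its own docker image, e.g. via `Task.set_base_docker`, so an agent running in docker mode would execute each step in its own container; the queue name and image below are placeholders:
```python
# Based on the pipeline_controller.py example; task names follow the example scripts.
from clearml import Task
from clearml.automation.controller import PipelineController

# Assumption: a per-step docker image can be attached to each base task,
# so an agent in docker mode runs each step in its own container.
step1 = Task.get_task(project_name="examples", task_name="pipeline step 1 dataset artifact")
step1.set_base_docker("python:3.6")  # hypothetical per-step image

pipe = PipelineController(default_execution_queue="glue_q", add_pipeline_tags=False)
pipe.add_step(name="stage_data", base_task_project="examples",
              base_task_name="pipeline step 1 dataset artifact")
pipe.add_step(name="stage_process", parents=["stage_data"],
              base_task_project="examples",
              base_task_name="pipeline step 2 process dataset")
pipe.add_step(name="stage_train", parents=["stage_process"],
              base_task_project="examples",
              base_task_name="pipeline step 3 train model")
pipe.start()
pipe.wait()
pipe.stop()
```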
The use case is: let's say I run `python k8s_glue_example.py --queue glue_q`
And someone pushes a hyperparameter-optimization job with over 100 experiments to glue_q; one minute later, I push a simple training job to glue_q.. But I will be forced to wait for the 100 experiments to finish.
Hi AgitatedDove14
I am still not very clear on using this, even after looking at k8s_glue_example.py's code.
Is it possible to give a sample usage of how this works? `python k8s_glue_example.py --ports-mode --num-of-services`
Another question: I am still not sure how this resolves my original question.
https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045
How will imposing an instance limit prevent or allow the --order-fairness feature, for example, which ex...
Okie.. I have two different projects under the ClearML web server.
The first project stores datasets only, using clearml-data (PROJ_1). The second project is a clearml-pipeline project (PROJ_2), which pulls the latest committed dataset from PROJ_1 and does a few other steps... Right now, I manually start PROJ_2 when I know the dataset is updated in PROJ_1. (A polling sketch follows below.)
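Something like this crude polling trigger is what I had in mind, just an untested sketch; the dataset/task names and the services queue are placeholders:
```python
# Poll PROJ_1 for a new dataset commit; when one appears, clone the
# PROJ_2 pipeline controller task and enqueue it. Names are placeholders.
import time
from clearml import Task, Dataset

last_seen_id = None
while True:
    ds = Dataset.get(dataset_project="PROJ_1", dataset_name="my_dataset")  # latest commit
    if ds.id != last_seen_id:
        last_seen_id = ds.id
        controller = Task.get_task(project_name="PROJ_2", task_name="pipeline controller")
        new_run = Task.clone(source_task=controller, name="pipeline (auto-trigger)")
        Task.enqueue(new_run, queue_name="services")
    time.sleep(300)  # check for a new commit every 5 minutes
```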
AgitatedDove14 I am confused now.. Isn't this feature unavailable in the k8s glue? Or is it going to be implemented?