Reputation
Badges 1
121 × Eureka!Just figured out..
Seems like the docker image below, didnt have tensorflow package.. ๐ฎtensorflow/tensorflow:latest-devel-gpu
I shld have checked prior... My Bad..
Thanks for the help
We have k8s on ec2 instances in the cloud. I'll try it there 2morrow and report back..
let me run the clearml-agent outside the k8 system.. and get back to u
Hi martin, i just untemplate-ed thehelm template clearml-server-chart-0.17.0+1.tgz
I found this lines inside.- name: CLEARML_AGENT_DOCKER_HOST_MOUNT value: /opt/clearml/agent:/root/.clearml
Upon ssh-ing into the folders in the both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there.. So the mounting worked, it seems.
I am not sure, I get your answer. Should i change the values to something else ?
Thanks
When I push a job to an agent node, i got this error.
"Error response from daemon: network None not found"
HI another qn,dataset_upload_task = Task.get_task(task_id=args['dataset_task_id'])
iris_pickle = dataset_upload_task.artifacts['dataset'].get_local_copy()
How would I replicate the above for Dataset ? Like how to get the iris_pickle file. I did some hacking likewise below.ds.get_mutable_local_copy(target_folder='data')
Subesequently, I have to load the file by name also.I wonder whether there is more elegant way
Hi, will proceed to close this thread. We found some issue with the underlying docker in our machines. We've have not shifted to another k8 of ec2 instances in AWS.
I just downloaded the logs from the Failed task. Seem I have set the agent.package_manager.system_site_packages: true
in the agent as well.
Hi AgitatedDove14 , Attached my create version compared to init version..
When I enqueue both the init and create version into my clearmlQueue, it seems the create version doesnt execute at all.
It just mentions "2021-05-26 16:02:13,053 - clearml - WARNING - Terminating local execution process" and says it has completed successfully.
AgitatedDove14 Not creating but more for orchestrating...
Currently, we manually push a dataset to cleaml-dataset .
Have a pipeline controller Task which (takes in data from clearml-dataset, runs preprocessing, runs training) and Publishes a model (if certain threshold is met).
We have clearml monitor which will monitor all Published models .It will push the uri of the published model to a rabbitmq.
We have a subscriber (python code) listening to the rabbitmq. This takes in the uri from t...
Hi AgitatedDove14 , Just updated that flag, but the problem continues..
` agent.package_manager.system_site_packages = true
.....
Environment setup completed successfully
Starting Task Execution:
ClearML results page: files_server:
Traceback (most recent call last):
File "base_template_keras_simple.py", line 15, in <module>
import tensorflow as tf # noqa: F401
File "/root/.clearml/venvs-builds/3.6/lib/python3.6/site-packages/clearml/binding/import_bind.py", line 59, in __pat...
kkie.. was checking in the forum (if anyone knows anything) before asking them..
TimelyPenguin76 :from clearml import Dataset ds = Dataset.get(dataset_project="PROJ_1", dataset_name="dataset")
hi FriendlySquid61 , The clearml-agent got filled up due to values.yaml file. However, agentservices was empty so I filled it up manually..
Ah, so in the future, we can add non-clearml code as a step in the pipeline controller.
So now you donโt have any failures but gpu usage issue?
I didnt run the hyper_parameter_optimzer.py, as I was thinking if there is already a problem with the base, no use with running the series of experiments
How about running the ClearML agent in docker mode?
Prev, we had our clearml-agent run in the bare-metal machine instead in docker formation. There wasnt any issue.. Though I havent tried with 0.17.2 version
Hi AgitatedDove14 ,
At this point, Showing the url of the cleamltask might be sufficient. Unless in the future, someone wants it to be customised.
But the bigger question is if there is tool to aid with this workflow building ? We are currently experimenting with airflow/prefect.
Hi AgitatedDove14 , Just your reply on https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045Basically as jobs are pulled by order, they are pushed into the k8s, then if we hit the max k8s instance limit, we stop pulling jobs until a k8s job is completed, then we continue.
For this scenario,
k8s has an instance limit of 10 (let's say)
I run Optimization (it has about 100 jobs) but only the first 10 will be pulled in k8. After this, I run a single Deep Learning (DL)...
Hi AgitatedDove14
I am still not very clear on using this, even after looking at k8s_glue_example.py 's code
Is it possible to give a sample usage of how this works ?python k8s_glue_example.py --ports-mode --num-of-services
Another question, I am still not sure , how this resolves my original question.
https://github.com/allegroai/clearml-agent/issues/50#issuecomment-811554045
How will imposing an instance limit , prevent or allow --order-fairness feature for example, which ex...
Using clearml-task, I am able to pass in the exact requirements.txt file, I am not sure how we can accomplish that when you using the Python train_it.py and execute_remotely() option.
AgitatedDove14
It'll be good if there was yaml file to deploy clearml-agents into the k8 system.
Yeah within clearml , we use the PipelineController. We are now mainly looking for a single tool to stitch together other products.
But of course, will give first precedence to tools which will work best with clearml. Thus asking, if anyone has had similar experience on setting up such systems.
Is there any documentation on how, we can use this ports mode ? I didnt seem to find any.. Tks
Ah kk, it is ---laptop:0 worker is no more now.. But wrt to our original qn, I can see the agent(worker) in the clearml-server UI ..
kkie..now I get it.. I set up the clearml-agent on an EC2 instance. and it works now.
Thanks