I want to clarify:
I was asking if such a feature exists (one that limits the number of simultaneous service tasks that can be brought up when using services mode) and, if so, how I can utilize it.
Hi SuccessfulKoala55 , just for clarification, running
clearml-agent daemon --services-mode 5 --detached --cpu-only --queue cpu_queue --docker
would allow at most 5 concurrent services, right?
with a self-hosted clearml server
I am not sure what you mean. This is text; when I grab it from the artifact via Python and print it, the newlines are printed as expected.
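For reference, this is roughly how I read it (the task ID and the artifact name "text" are placeholders):
from clearml import Task

task = Task.get_task(task_id="<task-id>")   # the task that holds the artifact
text = task.artifacts["text"].get()         # retrieve the artifact object (a plain str here)
print(text)                                 # newlines come out as expected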
Well, for the first task it grabs, it opens a separate WORKER:gpu0 worker entry as expected, while the agent itself stays on WORKER:dgpu0,1,2,3.
But the other tasks in the queue won't start, and once the first task completes, the following ones are not run on WORKER:gpu0 but on WORKER:dgpu0,1,2,3 instead, using only 1 GPU (even though the task execution says it runs on WORKER:gpu0).
SuccessfulKoala55
Well, I've removed the requirement altogether and it no longer fails on that (TF is provided by the image anyway, AFAIK), but now I get the following:
Any ideas?
*Needless to say, when running locally this works with no problem. Also, the nvcr.io/nvidia/tensorflow:21.02-tf2-py3 image is able to run TRT.
SuccessfulKoala55 On another note, I'm also getting
ERROR: Could not find a version that satisfies the requirement pandas==1.3.4 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23....
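If it helps, one workaround I'm considering is loosening the auto-detected pin from the script itself. A minimal sketch, assuming Task.add_requirements accepts a version specifier and is called before Task.init (the ">=1.0" specifier and the project/task names are placeholders, not recommendations):
from clearml import Task

# Override the auto-detected "pandas==1.3.4" pin with a looser specifier
# so the agent can install whatever the index/image actually provides.
Task.add_requirements("pandas", ">=1.0")   # must be called before Task.init()
task = Task.init(project_name="examples", task_name="requirements override")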
SuccessfulKoala55, in the meantime, while trying that, I've encountered something weird. I am using a clearml agent with the following:
clearml-agent daemon --detached --docker --gpus 0,1,2,3 --dynamic-gpus --queue kenny_1_gpu_queue=1
But for some reason, although all the GPUs are free and no other agent is running on the machine, only one task is executed at a time instead of 4. Why is that?
I am also running from an NVIDIA container and I get:
ERROR: No matching distribution found for tensorflow==2.4.0+nv
clearml_agent: ERROR: Could not install task requirements!
The docker image is nvcr.io/nvidia/tensorflow:21.10-tf2-py3
What should I do?
Well, the requirements were filled in automatically, not by me.
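If dropping the auto-detected entry is acceptable, this is roughly what I had in mind. A sketch, assuming Task.ignore_requirements does what I think it does and that the TensorFlow preinstalled in the NVIDIA image should be used (project/task names are placeholders):
from clearml import Task

# Drop the auto-detected "tensorflow==2.4.0+nv" entry so the agent does not try to
# pip-install it, and the TensorFlow shipped inside the NVIDIA image is used instead.
Task.ignore_requirements("tensorflow")     # must be called before Task.init()
task = Task.init(project_name="examples", task_name="tf from container image")
I assume the agent may also need agent.package_manager.system_site_packages enabled in its clearml.conf so the venv sees the image's TF, but I'm not certain.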
SuccessfulKoala55 I've tried manually changing the TF version, but it fails. I get:
import tensorflow as tf
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/init.py", line 435, in <module>
_ll.load_library(_main_dir)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 153, in load_library
py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/py...
Great, it is quite important for my use case. If you could also allow task.get_reported_console_output() to accept a log level (or a minimal log level) as input, I'd be grateful.
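In the meantime, a rough client-side workaround I had in mind (the task ID is a placeholder, and filtering on the "WARNING" substring is just an illustration, not an existing API option):
from clearml import Task

task = Task.get_task(task_id="<task-id>")
reports = task.get_reported_console_output(number_of_reports=10)   # last 10 console reports
# Filter by level on the client side, since the call takes no log-level argument (yet)
warnings_only = [line for line in reports if "WARNING" in line]
print("\n".join(warnings_only))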
Hi SuccessfulKoala55 ,
failed. I read in the docs that I can use mark_failed().
How should I use it correctly together with task.close()?
Or should task.close() appear before task.mark_failed()?
TimelyPenguin76
Wouldn't task.mark_failed() followed by task.close() work?
Yes, fail it and then close it
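So, in a failure handler, something along these lines (a minimal sketch; run_experiment() and the project/task names are hypothetical stand-ins):
from clearml import Task

task = Task.init(project_name="examples", task_name="mark failed example")
try:
    run_experiment()                        # hypothetical stand-in for the real workload
except Exception as e:
    task.mark_failed(status_reason=str(e))  # mark the task as failed first...
    task.close()                            # ...then close it
    raise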
Try making two tasks, both with the same project name (where the project name contains '//'), and you will get the same error.
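For reference, a minimal repro of what I mean (the project and task names are placeholders):
from clearml import Task

# Two tasks sharing a project name that contains '//' reproduce the error
t1 = Task.init(project_name="parent//child", task_name="first", reuse_last_task_id=False)
t1.close()
t2 = Task.init(project_name="parent//child", task_name="second", reuse_last_task_id=False)
t2.close()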