SuccessfulKoala55
Well, I've removed the requirement altogether and now it won't fail on this anymore (TF is provided anyway AFAIK via the image) but now I get the following:
Any ideas?
*Needless to say, this works with no problem when running locally. Also, the http://nvcr.io/nvidia/tensorflow:21.02-tf2-py3 image is able to run TRT.
A task can also have plots, for example 2D scatter plots and histograms
But this is not the data I want
It should be possible somehow, as they are attached to the Task and displayed in the Task's results tab
I want to access their data
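A minimal sketch of one way to get at that data, under assumptions: recent ClearML SDK versions expose Task.get_reported_plots(), which, as far as I know, returns a list of plot events whose "plot_str" field holds a Plotly figure serialized as JSON. The helper extract_traces and the sample event below are hypothetical names made up for illustration, and the event layout should be checked against your SDK version.

```python
import json


def extract_traces(plot_event):
    """Return (name, x, y) for every trace in one reported plot event.

    Assumes the event is a dict whose "plot_str" value is a Plotly
    figure serialized as JSON (an assumption about the event layout).
    """
    figure = json.loads(plot_event["plot_str"])
    traces = []
    for trace in figure.get("data", []):
        # Histograms typically carry only "x"; missing keys come back as None.
        traces.append((trace.get("name"), trace.get("x"), trace.get("y")))
    return traces


# Hypothetical event, shaped like the assumed layout above.
sample_event = {
    "metric": "scatter",
    "variant": "train",
    "plot_str": json.dumps(
        {"data": [{"type": "scatter", "name": "points", "x": [1, 2], "y": [3, 4]}]}
    ),
}

print(extract_traces(sample_event))

# With a real task (not run here; needs a configured ClearML server):
# from clearml import Task
# task = Task.get_task(task_id="<your task id>")
# for event in task.get_reported_plots():
#     print(event.get("metric"), extract_traces(event))
```

The parsing is kept separate from the SDK call so it can be tried on a saved event without a server connection.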
clearml-agent daemon --detached --gpus 0,1,2 --dynamic-gpus --queue 2_gpu_queue=2 --docker --stop
Well, the requirements were filled in automatically, not by me
We think we fixed it.
The problem seemed to be a path containing //, which clearml did not handle well
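As a possible client-side workaround (a sketch only, not how clearml handles this internally), the project name can be cleaned up before it is passed to the SDK. The helper normalize_project_name and the example name are hypothetical; whether the leading and trailing slashes should also be stripped is an assumption.

```python
import re


def normalize_project_name(name):
    """Collapse runs of '/' into a single '/' and drop leading/trailing slashes."""
    return re.sub(r"/{2,}", "/", name).strip("/")


# Hypothetical project name containing a double slash.
print(normalize_project_name("team//experiments/run_01/"))
```

The cleaned name would then be used as the project_name argument when creating the task.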
SuccessfulKoala55 On another note, I'm also getting
ERROR: Could not find a version that satisfies the requirement pandas==1.3.4 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23....
Well, for the first task it grabs, it opens a separate WORKER:gpu0 worker entry as expected, while the agent stays as WORKER:dgpu0,1,2,3,
but the other tasks in the queue won't start, and once the first task completes, the following ones run not on WORKER:gpu0 but on WORKER:dgpu0,1,2,3, using only 1 GPU (the task execution says it runs on WORKER:gpu0)
SuccessfulKoala55 I've tried manually changing the TF version, but it fails. I get:
import tensorflow as tf
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/__init__.py", line 435, in <module>
_ll.load_library(_main_dir)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 153, in load_library
py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/py...
project name is: RemoteStorage06/saips06/rdekel/hackathon_baselines/DATA_DIR/
No strange characters as far as I can tell
If possible, I'd like a command line, like the one I just sent, that identifies the specific worker that was brought up this way and kills only it
AgitatedDove14, could it be that the GitHub repo is not synchronized? I can only find versions up to 1.2.0.rc3 there.
Try making two tasks, both with the same project name (where the project name contains '//'), and you will get the same error.
My own agent.
I want to clarify:
I was asking whether such a feature exists (one that limits the number of simultaneous service tasks that can be brought up when using services mode) and, if so, how I can use it.
Hi SuccessfulKoala55 , just for clarification, running
clearml-agent daemon --services-mode 5 --detached --cpu-only --queue cpu_queue --docker
would allow at most 5 concurrent services, right?
with a self-hosted clearml server
Yes, fail it and then close it
TimelyPenguin76
Wouldn't
task.mark_failed()
task.close()
work?