Answered

Hi,
I have a task that's running on a docker container.
Now, there are a bunch of other Docker containers (namely, Nvidia's TF 21.1 to 21.10) for which I want to run the task.
How can I do this using agents / remote execution?
Thanks

  
  
Posted 3 years ago

Answers 15


Well, when the agent grabs the first task, it opens a separate WORKER:gpu0 worker entry as expected, while the agent itself stays as WORKER:dgpu0,1,2,3.
But the other tasks in the queue won't start, and once the first task completes, the following tasks run not on WORKER:gpu0 but on WORKER:dgpu0,1,2,3, using only 1 GPU (the task execution says it runs on WORKER:gpu0).

  
  
Posted 3 years ago

SuccessfulKoala55
Well, I've removed the requirement altogether and now it won't fail on this anymore (TF is provided anyway AFAIK via the image) but now I get the following:

Any ideas?

*Needless to say, this works without any problem when run locally. Also, the http://nvcr.io/nvidia/tensorflow:21.02-tf2-py3 image is able to run TRT.

  
  
Posted 3 years ago

You should adjust the tensorflow version to match the image... Edit the task requirements for that

  
  
Posted 3 years ago

SuccessfulKoala55 On another note, I'm also getting
ERROR: Could not find a version that satisfies the requirement pandas==1.3.4 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5)
ERROR: No matching distribution found for pandas==1.3.4
clearml_agent: ERROR: Could not install task requirements!

even though pandas==1.3.4 is readily available from PyPI.

  
  
Posted 3 years ago

SuccessfulKoala55, while I try that, I've run into something weird. I am running a ClearML agent with the following command:
clearml-agent daemon --detached --docker --gpus 0,1,2,3 --dynamic-gpus --queue kenny_1_gpu_queue=1
But for some reason, although all the GPUs are free and no other agent is running on the machine, only one task is executed at a time instead of 4. Why is that?

  
  
Posted 3 years ago

Well, assuming you have a ClearML Agent daemon running in docker mode, you should simply:
1. Clone your current task (right-click on the task in the tasks list and choose Clone)
2. Go to the Execution tab
3. Edit the Container section and set whatever docker image you need
4. Enqueue the task to the queue watched by the daemon
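
The UI steps above can also be scripted when there are many images to try. The following is a minimal sketch, assuming the `clearml` SDK is installed and configured; the task ID is a placeholder, the `image_for` and `clone_and_enqueue` helpers are hypothetical names, and the two tags are simply the ones mentioned elsewhere in this thread. The exact `set_base_docker` signature has changed between SDK versions, so check it against the version you have installed.

```python
def image_for(tag):
    """Build an NVIDIA TensorFlow image reference from a release tag (helper is illustrative)."""
    return f"nvcr.io/nvidia/tensorflow:{tag}-tf2-py3"


def clone_and_enqueue(task_id, images, queue_name):
    """Clone one task per docker image and enqueue each clone for an agent in docker mode."""
    # Imported lazily so the helper above can be used without clearml installed.
    from clearml import Task

    source = Task.get_task(task_id=task_id)
    for image in images:
        # Each clone is an editable draft copy of the original task.
        cloned = Task.clone(source_task=source, name=f"{source.name} [{image}]")
        # Same effect as editing the Container section in the Execution tab.
        cloned.set_base_docker(image)
        Task.enqueue(cloned, queue_name=queue_name)


# Tags taken from this thread; extend the list with whichever releases you need.
images = [image_for(tag) for tag in ("21.02", "21.10")]

# Example call (placeholder task ID, queue name as used later in this thread):
# clone_and_enqueue("<your-task-id>", images, "kenny_1_gpu_queue")
```

Each clone keeps the original code and parameters, so only the container differs between runs.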

  
  
Posted 3 years ago

I am also running from an NVIDIA container and I get
ERROR: No matching distribution found for tensorflow==2.4.0+nv
clearml_agent: ERROR: Could not install task requirements!

docker image is
http://nvcr.io/nvidia/tensorflow:21.10-tf2-py3
What should I do?

  
  
Posted 3 years ago

Well, the requirements were filled in automatically, not by me.

  
  
Posted 3 years ago

SuccessfulKoala55
Exactly

  
  
Posted 3 years ago

SuccessfulKoala55 I've tried manually changing the TF version, but it fails. I get:
import tensorflow as tf
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/__init__.py", line 435, in <module>
_ll.load_library(_main_dir)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 153, in load_library
py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2021032411string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE
2021-11-14 18:09:10
Process failed, exit code 1

I assume every task you run on this container (Nvidia's latest TF container) will reproduce the same error when importing tensorflow.

  
  
Posted 3 years ago

It does match the image.

tensorflow==2.4.0+nv should be changed to tensorflow==2.4.0
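
The `+nv` suffix is a local version label, which public package indexes do not serve, so pip cannot resolve that pin outside the container's own environment. As a rough illustration of the suggested edit, the sketch below strips such a label from a pinned requirement line; `strip_local_version` is a hypothetical helper, not part of ClearML or pip, and it only handles simple `==` pins.

```python
import re

# Matches "==<version>" followed by a "+<local label>" such as "+nv".
_LOCAL_VERSION = re.compile(r"(==\s*[^\s+;,]+)\+[A-Za-z0-9.]+")


def strip_local_version(requirement):
    """Return the requirement line with any local version label removed."""
    return _LOCAL_VERSION.sub(r"\1", requirement)


print(strip_local_version("tensorflow==2.4.0+nv"))
```

Note that the plain version pip then installs is a different build from the one shipped in the image, so this changes which TensorFlow the task actually imports.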

But for some reason, although all the GPUs are free and no other agent is running on the machine, only one task is executed at a time instead of 4. Why is that?

Can you make sure the agent did receive the GPUs 0 through 3?

  
  
Posted 3 years ago

ImmensePenguin78 this is probably for a different python version ...
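
To make the Python-version explanation concrete: pip only offers releases whose declared minimum Python version is met by the interpreter it runs under. The sketch below assumes pandas 1.3.4 declares a minimum of Python 3.7.1 (its published requirement, as far as I know); the interpreter versions are made-up examples, and `parse_version` and `supports` are hypothetical helpers.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.7.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def supports(interpreter, minimum):
    """True if the interpreter version meets the package's minimum Python version."""
    return parse_version(interpreter) >= parse_version(minimum)


# Assumption: minimum Python version declared by pandas 1.3.4.
PANDAS_1_3_4_MIN_PYTHON = "3.7.1"

for interpreter in ("3.6.9", "3.8.10"):
    print(interpreter, supports(interpreter, PANDAS_1_3_4_MIN_PYTHON))
```

The candidate list in the error above stops at 1.1.5, which is consistent with pip running under an interpreter older than 3.7.1.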

  
  
Posted 3 years ago

Will try. Thanks.

  
  
Posted 3 years ago

Hi ImmensePenguin78, do you mean you want to run other copies of the same task using different docker images?

  
  
Posted 3 years ago

It does match the image.

  
  
Posted 3 years ago