Answered

Hi,
I have a task that's running on a docker container.
Now - there are a bunch of other docker containers (namely, Nvidia's TF 21.1 to 21.10) for which I want to run the task.
How can I do this using agents / remote execution?
Thanks

  
  
Posted 2 years ago

Answers 15


Will try. Thanks.

  
  
Posted 2 years ago

SuccessfulKoala55 On another note, I'm also getting
ERROR: Could not find a version that satisfies the requirement pandas==1.3.4 (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.5)
ERROR: No matching distribution found for pandas==1.3.4
clearml_agent: ERROR: Could not install task requirements!

while pandas==1.3.4 is readily available on PyPI.

  
  
Posted 2 years ago

SuccessfulKoala55
Exactly

  
  
Posted 2 years ago

> It does match the image.

tensorflow==2.4.0+nv should be changed to tensorflow==2.4.0

> But for some reason, although all the GPUs are free and no other agent is on the machine, only one task is executed at a time instead of 4. Why is that?

Can you make sure the agent did receive GPUs 0 through 3?

  
  
Posted 2 years ago

SuccessfulKoala55
Well, I've removed the requirement altogether and now it no longer fails on this (TF is provided by the image anyway, AFAIK), but now I get the following:

Any ideas?

Needless to say, when running locally this works with no problem. Also, the http://nvcr.io/nvidia/tensorflow:21.02-tf2-py3 image is able to run TRT.

  
  
Posted 2 years ago

Well, assuming you have a ClearML Agent daemon running in docker mode, you should simply:
1. Clone your current task (right-click the task in the tasks list and choose Clone)
2. Go to the Execution tab
3. Edit the Container section and set whatever docker image you need
4. Enqueue the task to the queue watched by the daemon
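For many images, the same flow can be scripted with the ClearML SDK instead of the UI. A minimal sketch, assuming a recent clearml package; the task ID, queue name, and image list below are placeholders, not values from this thread:

# Sketch: clone a template task once per NVIDIA TF image and enqueue the clones
# for a docker-mode agent to pick up.
from clearml import Task

TEMPLATE_TASK_ID = "<your_task_id>"   # the task that already runs locally
QUEUE_NAME = "default"                # the queue your docker-mode agent watches

# extend this list to cover the 21.01-21.10 tags you need
images = [
    "nvcr.io/nvidia/tensorflow:21.02-tf2-py3",
    "nvcr.io/nvidia/tensorflow:21.10-tf2-py3",
]

template = Task.get_task(task_id=TEMPLATE_TASK_ID)
for image in images:
    cloned = Task.clone(source_task=template, name=f"{template.name} [{image}]")
    cloned.set_base_docker(image)     # first positional arg is the docker image/command, kept positional for compatibility across clearml versions
    Task.enqueue(cloned, queue_name=QUEUE_NAME)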

  
  
Posted 2 years ago

You should adjust the tensorflow version to match the image... Edit the task requirements for that
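If you prefer to do that from code rather than the UI, a sketch along these lines should work, assuming a clearml SDK version that provides Task.set_packages; the task ID is a placeholder:

# Sketch: override the auto-detected requirements on a draft/cloned task so the
# pinned tensorflow matches what the container already ships.
from clearml import Task

task = Task.get_task(task_id="<cloned_task_id>")
task.set_packages(["tensorflow==2.4.0"])   # instead of tensorflow==2.4.0+nv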

  
  
Posted 2 years ago

I am also running from an NVIDIA container and I get
ERROR: No matching distribution found for tensorflow==2.4.0+nv
clearml_agent: ERROR: Could not install task requirements!

The docker image is
http://nvcr.io/nvidia/tensorflow:21.10-tf2-py3
What should I do?

  
  
Posted 2 years ago

SuccessfulKoala55, in the meantime, while trying that, I've encountered something weird. I am using a ClearML agent with the following:
clearml-agent daemon --detached --docker --gpus 0,1,2,3 --dynamic-gpus --queue kenny_1_gpu_queue=1
But for some reason, although all the GPUs are free and no other agent is on the machine, only one task is executed at a time instead of 4. Why is that?

  
  
Posted 2 years ago

Well, on the first task it grabs, it opens a separate WORKER:gpu0 worker entry as expected, while the agent itself stays as WORKER:dgpu0,1,2,3,
but the other tasks in the queue won't start, and after the first task completes the following ones are not run on WORKER:gpu0 but on WORKER:dgpu0,1,2,3 instead, using only 1 GPU (the task execution says it runs on WORKER:gpu0).

  
  
Posted 2 years ago

ImmensePenguin78 this is probably because of a different Python version...
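For reference, the version list in that error tops out at pandas 1.1.5, which is typical of a Python 3.6 environment (pandas 1.3.4 requires Python >= 3.7.1). A quick check, as a sketch, to run inside the environment the agent builds:

# If this prints 3.6.x, pip will never offer pandas 1.3.4, which matches the
# version list shown in the error above.
import sys
print(sys.version)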

  
  
Posted 2 years ago

Well, the requirements were filled in automatically, not by me

  
  
Posted 2 years ago

It does match the image.

  
  
Posted 2 years ago

Hi ImmensePenguin78 , you mean you want to run other copies of the same task using different docker images?

  
  
Posted 2 years ago

SuccessfulKoala55 I've tried manually changing the TF version, but it fails. I get:
import tensorflow as tf
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/init.py", line 435, in <module>
_ll.load_library(_main_dir)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 153, in load_library
py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2021032411string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE
2021-11-14 18:09:10
Process failed, exit code 1

I assume every task you run on this container (Nvidia's latest TF container) will reproduce the same error upon importing tensorflow.
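One way to narrow this down (a suggestion, not something from the thread): the traceback shows the import resolving through the agent's venv under /root/.clearml/venvs-builds/3.8/... while the failing .so lives in the container's /usr/local/lib/python3.8/dist-packages, which hints that a pip-installed tensorflow wheel is mixing with the NVIDIA-built one from the image. A small check to run both locally and through the agent to compare what each environment imports:

# Sketch: report which tensorflow build is actually imported. If __file__ points
# at the agent venv rather than /usr/local/lib/python3.8/dist-packages, the
# pip-installed wheel is shadowing the NVIDIA build shipped in the image.
import tensorflow as tf

print("version:", tf.__version__)
print("location:", tf.__file__)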

  
  
Posted 2 years ago