This is odd because the screen grab points to CUDA 10.2 ...
and the machine I have is 10.2.
I also tried nvidia/cuda:10.2-base-ubuntu18.04, which is the latest
that is because my own machine has 10.2 (not the docker, the machine the agent is on)
No, that has nothing to do with it, the CUDA version is the one inside the container. I'm referring to this image https://allegroai-trains.slack.com/archives/CTK20V944/p1593440299094400?thread_ts=1593437149.089400&cid=CTK20V944
Assuming this is the output from your code running inside the docker, it points to CUDA version 10.2
Am I missing something ?
I really don't know; as you can see in my last screenshot, I've configured my base image to be 10.1
Hmmm could you attach the entire log?
Remove any info that you feel is too sensitive :)
By the way, just from inspecting it, the CUDA version in the output of nvidia-smi
matches the driver installed on the host, not the container - look at the image below
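For what it's worth, here's a quick way to see both numbers from inside the container (just a sketch, not from the thread; the /usr/local/cuda/version.txt path is an assumption that holds for the nvidia/cuda 10.x images):
```python
# Sketch: compare the CUDA toolkit baked into the image with what nvidia-smi reports.
import subprocess

def run(cmd):
    """Run a command and return its stdout, or an error marker on failure."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError) as exc:
        return f"<failed: {exc}>"

# CUDA toolkit actually installed in the image (e.g. "CUDA Version 10.1.243")
print("container toolkit:", run(["cat", "/usr/local/cuda/version.txt"]))

# The "CUDA Version" in the nvidia-smi header comes from the host driver (e.g. 10.2)
smi = run(["nvidia-smi"])
print("\n".join(line for line in smi.splitlines() if "CUDA Version" in line))
```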
But I'm naive enough to believe that 10.2 is compatible with 10.1, as it is a minor upgrade
https://hub.docker.com/layers/nvidia/cuda/10.1-cudnn7-runtime-ubuntu18.04/images/sha256-963696628c9a0d27e9e5c11c5a588698ea22eeaf138cc9bff5368c189ff79968?context=explore
the docker image is missing cuDNN, which is a must for TF to work 🙂
replace the base-docker-image and it should work fine 🙂
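If it helps, here's a quick sanity check to run inside the container (a sketch, assuming TF 2.x; tf.sysconfig.get_build_info() only exists on newer builds). When cuDNN is missing, TF logs "Could not load dynamic library 'libcudnn.so.7'" on import and lists no GPUs:
```python
# Sketch: check whether TensorFlow can actually use the GPU inside the container.
import tensorflow as tf

print("built with CUDA :", tf.test.is_built_with_cuda())
print("visible GPUs    :", tf.config.list_physical_devices("GPU"))  # empty if cuDNN can't be loaded

# Newer TF builds also report the CUDA/cuDNN versions TF was compiled against
try:
    print("build info      :", tf.sysconfig.get_build_info())
except AttributeError:
    pass
```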
but remember, it also didn't work with the default one (nvidia/cuda)
We might need to change the default base docker image, but I remember it was there... Let me check again
well, cuDNN is actually missing from the base image...
I guess there aren't many tensorflowers running agents around here, if this wasn't brought up already