Answered

Hi everyone,
I tried to launch experiments using conda with different cuda versions. I tried to comment out these fields in the trains.conf file on the remote machine:
#cuda_version: 10.1
#cudnn_version: 7.0
but it seems that when I comment them out (like above), trains sets the versions by default to
agent.cuda_version = 102
agent.cudnn_version = 0
(this is taken from the logs of the run)
and then installed cudatoolkit 10.2, which broke the experiments.

Is there a way to make trains go with the cudatoolkit that exists in the python environment that I executed the training script with?
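For context, this is roughly the relevant part of my trains.conf before I commented the fields out (the surrounding agent { } block is just how I understand the file is structured, so treat it as a sketch):

agent {
    # versions I had set explicitly, matching the toolkit in my environment
    cuda_version: 10.1
    cudnn_version: 7.0
}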

  
  
Posted 3 years ago

Answers 17


Is there a guide regarding the configuration required for dockers?

Yes we do have a guide: https://github.com/allegroai/trains-agent#starting-the-trains-agent-in-docker-mode

You can also specify the image for the docker; in the example the image is nvidia/cuda, but you can put a specific one for your needs (maybe nvidia/cuda:10.1-runtime-ubuntu18.04?).
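For example, starting the agent in docker mode with that image could look something like this (a sketch based on the guide above; adjust the queue name and image to your setup):

trains-agent daemon --queue default --docker nvidia/cuda:10.1-runtime-ubuntu18.04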

I can give it a shot (I’m using conda now). What is the overhead of going into dockers, given that I don’t have “docker hands-on experience”?

You don’t really need “docker hands-on experience”.

Is the flow using dockers more supported than conda?

It's the same flow, but running inside a docker image.

  
  
Posted 3 years ago

Is it something that I can configure from the call to Task.init? (my goal is that I won't be required to change it manually)

  
  
Posted 3 years ago

When my system was "clean" I installed cuda 10.1 (never installed cuda 10.2), hope I'm not mistaken.

  
  
Posted 3 years ago

You changed the version from 10.2 to 10.1 and the nvidia-smi output is the same? Did you do a restart after the change?

  
  
Posted 3 years ago

Didn't use it so far, but I will start 🙂

  
  
Posted 3 years ago

How do you clone the tasks? With Task.clone? If so, you can use cloned_task.set_base_docker(<VALUE FOR BASE DOCKER IMAGE>).
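Something along these lines (a rough sketch; the task id, image and queue name are placeholders to replace with your own):

from trains import Task

# clone an existing experiment and run it inside a specific docker image
template = Task.get_task(task_id="<ORIGINAL_TASK_ID>")
cloned_task = Task.clone(source_task=template, name="clone with cuda 10.1 docker")
cloned_task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04")  # base docker image for the agent
Task.enqueue(cloned_task, queue_name="default")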

  
  
Posted 3 years ago

Is the flow using dockers more supported than conda? Is there a guide regarding the configuration required for dockers?

  
  
Posted 3 years ago

The version of the cudatoolkit is 10.1 inside the experiment, and trains tries to work with 10.2, probably for the same reason it displays 10.2 in nvidia-smi.

  
  
Posted 3 years ago

BTW, what about running trains-agent in docker mode? That can solve all your cuda issues

  
  
Posted 3 years ago

Hi TimelyPenguin76
you are right, it shows cuda version 10.2 (even though I installed only cuda 10.1, weird)
do you know why it's 10.2?
and do you know why trains relies on that? (instead of looking at the python environment of the executed script?)

  
  
Posted 3 years ago

What do you mean by "change"?

  
  
Posted 3 years ago

Actually you can: when you clone an experiment, in the EXECUTION section you can change the BASE DOCKER IMAGE to the image you'd like the experiment to run with. This way you can use different docker images for different experiments.

You can use the same queue :)

  
  
Posted 3 years ago

I can give it a shot (I'm using conda now). What is the overhead of going into dockers, given that I don't have "docker hands-on experience"?

  
  
Posted 3 years ago

Hi RattySeagull0 ,

If not specified, the cuda_version value is taken from the nvidia-smi output. Can you share your output for nvidia-smi?

  
  
Posted 3 years ago

Weird, I will try to find out why that is.

  
  
Posted 3 years ago

Got it, thanks!
Is it possible to use different dockers (containing different cuda versions) in different experiments?
Or do I have to open different queues for that? (or something like that)

  
  
Posted 3 years ago

Ohhh I thought you changed it from 10.2 to 10.1, my mistake.

What do you get for nvcc --version?

  
  
Posted 3 years ago