Hey guys, I'm experiencing seemingly random problems with the experiments. There are 4 GPUs and 8 workers (2 workers per GPU), and sometimes experiments randomly fail (or complete) in the middle of the epoch without any additional info in the logs. What

Answered

Hey guys, I'm experiencing seemingly random problems with the experiments. There are 4 GPUs and 8 workers (2 workers per GPU), and sometimes experiments randomly fail (or complete) in the middle of an epoch without any additional info in the logs. What would be the best way to find the root cause?

  				
Posted 4 years ago by DilapidatedParrot58


Answers 8

example of the failed experiment

  				
Posted 4 years ago by DilapidatedParrot58

They run in docker mode, and all of them have different worker IDs.

  				
Posted 4 years ago by DilapidatedParrot58

nice idea, thanks

  				
Posted 4 years ago by DilapidatedParrot58

Hi DilapidatedDucks58,
Just making sure: do all 8 workers have different worker IDs? (You should see 8 on the workers page in the UI.)
Also, are they running in docker or venv mode?

  				
Posted 4 years ago by AgitatedDove14

Could you verify you have 8 subfolders named 'venv.X' in the cache folder ~/.trains?
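A minimal sketch of that check, in case listing the folder by hand is tedious. The `~/.trains` location and the `venv.` prefix are taken from the question above; the exact cache layout may differ between agent versions, and the function name is just illustrative.

```python
from pathlib import Path


def find_worker_venvs(cache_dir, prefix="venv."):
    """Return the sorted names of subfolders of cache_dir that start with prefix."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        # Cache folder is missing entirely, so there is nothing to count.
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name.startswith(prefix)
    )


if __name__ == "__main__":
    # Assumption: the cache lives in ~/.trains, as mentioned in this thread.
    found = find_worker_venvs("~/.trains")
    print(f"{len(found)} venv folders found: {found}")
    if len(found) != 8:
        print("Expected 8 (one per worker); some workers may be sharing a folder.")
```

If fewer than 8 folders show up, two workers sharing one environment folder would be a plausible source of the random failures.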

  				
Posted 4 years ago by AgitatedDove14

It might be that there is not enough space on our SSD; the experiments cache a lot of preprocessed data during the first epoch...

  				
Posted 4 years ago by DilapidatedParrot58

  				
Posted 4 years ago by DilapidatedDucks58

If that's the case, check the experiment's resource monitoring; the free disk space is logged there in GB.
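To cross-check the logged value on the machine itself, here is a rough sketch using Python's standard `shutil.disk_usage`. The path and the 50 GB threshold are hypothetical placeholders, not values from this thread; point it at whatever disk holds the preprocessed-data cache.

```python
import shutil


def free_space_gb(path="."):
    """Return the free space, in GiB, on the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def has_enough_space(path, required_gb):
    """Return True if at least required_gb GiB are free on the disk holding path."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    cache_path = "."    # hypothetical: replace with the dataset cache location
    required_gb = 50    # hypothetical threshold for one experiment's cache
    print(f"Free space: {free_space_gb(cache_path):.1f} GiB")
    if not has_enough_space(cache_path, required_gb):
        print("Disk is nearly full; caching during the first epoch may fail.")
```

Calling something like this at the start of training, and again after the first epoch, would show whether the cache is what fills the SSD.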

  				
Posted 4 years ago by AgitatedDove14
