Hey there, for a while now I have often found experiments getting stuck while training a model. It seems to happen randomly and I have not found a reproducible scenario so far, but it happens often enough to be annoying (I'd say 1 out of 5 experiments). The symptoms...


Any chance this is reproducible?

Unfortunately not at the moment, I could not find a reproducible scenario. If I clone a task that was stuck and start it, it might not get stuck.

How many processes do you see running (e.g. ps -Af | grep python)?
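
For anyone who would rather script that check, here is a minimal sketch, assuming a Linux machine where `ps -Af` is available. The helper names `filter_python_processes` and `list_python_processes` are made up for illustration and are not part of any library.

```python
import subprocess
from typing import List


def filter_python_processes(ps_output: str) -> List[str]:
    """Return the rows of `ps -Af` output that mention python."""
    lines = ps_output.splitlines()
    # The first row is the column header (UID PID PPID ...), so skip it.
    return [line for line in lines[1:] if "python" in line]


def list_python_processes() -> List[str]:
    """Run `ps -Af` and keep only the rows that mention python."""
    result = subprocess.run(
        ["ps", "-Af"], capture_output=True, text=True, check=True
    )
    return filter_python_processes(result.stdout)


# Hypothetical usage while an experiment looks stuck:
#   rows = list_python_processes()
#   print(len(rows), "python processes")
```

Comparing the count from a stuck run with the count from a healthy run should show whether any processes are missing or left over.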

I will check that the next time one gets blocked 👍

What is the training framework? Is it multiprocess? How are you launching the process itself? Is it running on Linux? Is it running inside a specific container?

I train with PyTorch (1.11) and Ignite (0.4.8), using multiprocessing (via the dataloader with n_workers=8) on Linux, not running inside a Docker container.
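
Since the hang is not reproducible on demand, one option is to have the training script dump its Python stack traces when it appears stuck. The sketch below uses only the standard library `faulthandler` module. The function name `enable_hang_diagnostics` and the 30 minute default are assumptions made for illustration, and nothing here comes from the original thread. It only shows where the process that calls it is waiting. It does not by itself explain the cause.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s: float = 1800.0, file=None) -> bool:
    """Arm stack-trace dumps for the current process.

    `file` must stay open for as long as the dumps are armed.
    Returns True if the SIGUSR1 handler was registered (Unix only).
    """
    if file is None:
        file = sys.stderr
    registered = False
    # On Unix, `kill -USR1 <pid>` will then print the stacks of all threads.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
        registered = True
    # Also dump the stacks every `timeout_s` seconds without killing the process.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file, exit=False)
    return registered
```

Calling `enable_hang_diagnostics()` once at the top of the training script would be enough. When the next run blocks, sending `kill -USR1 <pid>` to the main process prints the line each thread is waiting on.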

  				
Posted 2 years ago by JitteryCoyote63
