Thanks For Releasing This Awesome Experiment Manager! I Was Logging A Single Training Session On Multiple Gpus (Using Detectron2), And Torch.Mp Is Called For Each Gpu. This Creates A Separate Task In Trains For Each Gpu, And Only One Of The Tasks Has The

Answered

Thanks for releasing this awesome experiment manager! I was logging a single training session on multiple GPUs (using Detectron2), and torch.mp is called for each GPU. This creates a separate task in TRAINS for each GPU, and only one of the tasks has the plotting outputs. I also have to manually save the hyperparameters (not a huge deal) due to the placement of Task.init(...), and add "-W ignore" to the python call to bypass this warning:
UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
Has anyone else encountered this or have any suggestions? Right now, it's just blowing up the number of tasks in my archive (can we delete from the archive yet?).

  				
Posted 5 years ago by VexedKangaroo32


Answers 8

Thanks VexedKangaroo32, this is great news :)

  				
Posted 5 years ago by AgitatedDove14

Meant to get back to you a bit sooner, but I can report that I no longer have duplicate tasks after updating to 0.13.4rc0 and putting Task.init in those two places. The job hasn't run to completion, so I can't report if it ends cleanly or not.

No manual logging attempted.
Tensorboard, terminal outputs, and argparser log properly.
No longer need the "-W ignore" arguments.

  				
Posted 5 years ago by VexedKangaroo32

Hi VexedKangaroo32, there is now an RC with a fix:
pip install trains==0.13.4rc0
Let me know if it solved the problem.

  				
Posted 5 years ago by AgitatedDove14

The way it will work is that you will also need a Task.init call in the main process (the one using the launch function), in addition to the same Task.init call in main_func. This signals the subprocesses to use the main process's task, so they all report to the same task. Obviously, to test it you will need to wait for the RC (after the weekend :)
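To make the idea concrete, here is a toy model of the behavior described above. It is only a sketch and not the TRAINS implementation: `init_task`, `simulate_launch`, `simulate_without_main_init`, and the `DEMO_MAIN_TASK_ID` variable are hypothetical names invented for illustration, and a plain dict stands in for the environment that subprocesses inherit from the main process.

```python
import uuid

# Hypothetical variable name, used only in this sketch.
SHARED_TASK_ENV = "DEMO_MAIN_TASK_ID"


def init_task(environ):
    """Return the id of the task this process should report to.

    If an id was already published in `environ`, reuse it; otherwise
    create a new id and publish it for processes started later.
    """
    task_id = environ.get(SHARED_TASK_ENV)
    if task_id is None:
        task_id = uuid.uuid4().hex
        environ[SHARED_TASK_ENV] = task_id
    return task_id


def simulate_launch(num_workers):
    """Main process calls init_task first, then each worker calls it too."""
    env = {}  # stands in for the main process environment
    main_id = init_task(env)
    # Each worker inherits a copy of the main environment.
    worker_ids = [init_task(dict(env)) for _ in range(num_workers)]
    return main_id, worker_ids


def simulate_without_main_init(num_workers):
    """Only the workers call init_task, so nothing is shared between them."""
    env = {}
    return [init_task(dict(env)) for _ in range(num_workers)]
```

With the call in the main process, every simulated worker resolves to the main task id. Without it, each worker creates its own id, which mirrors the one-task-per-GPU behavior reported in the question.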

  				
Posted 5 years ago by AgitatedDove14

I'm not using torch launch, but the launch function in https://github.com/facebookresearch/detectron2/blob/master/detectron2/engine/launch.py. I placed Task.init(...) inside the "main_func" that gets called in mp.spawn.

  				
Posted 5 years ago by VexedKangaroo32

BTW, VexedKangaroo32, are you using torch launch?

  				
Posted 5 years ago by AgitatedDove14

Since this fix is all about synchronizing different processes, we wanted to be extra careful with the release. That said, I think that what we have now should be quite stable. The plan is to have the RC available right after the weekend.

  				
Posted 5 years ago by AgitatedDove14

Hi VexedKangaroo32, funnily enough this is one of the fixes we will be releasing soon. There is a release scheduled for later this week; right after that, I'll post a link here to an RC containing a fix for this exact issue.

  				
Posted 5 years ago by AgitatedDove14
