Hi Everyone, I Have A Training Job Task Which Was Using Gpu That Went To

Answered

Hi Everyone,
I have a training task that was using a GPU and went to failed status because of a CUDA out-of-memory error. However, when I look at the worker view, I can see that a worker is still holding GPU resources tied to this experiment. Why would the resources not be freed, and what is the right way to clean up the worker?

  				
Posted 3 years ago by ObedientToad56


Answers 6

Yeah, GPU utilization was at 100%. I cleaned it up by using nvidia-smi and killing the process, but I was expecting the cleanup to happen automatically since the process failed.
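For reference, the manual cleanup described here can be sketched in Python. This is a minimal sketch, assuming `nvidia-smi` is on the PATH of the worker machine and that you have permission to signal the process. The helper names `parse_compute_apps`, `list_gpu_processes`, and `terminate` are made up for illustration and are not part of any agent API.

```python
import os
import signal
import subprocess

# nvidia-smi query for the compute processes currently holding GPU memory.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_memory_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = line.split(",")
        if len(parts) < 3:
            continue  # skip blank or unexpected lines
        pid = parts[0].strip()
        mem = parts[-1].strip()
        name = ",".join(parts[1:-1]).strip()
        if not pid.isdigit():
            continue
        # Memory can be reported as a non-numeric placeholder; keep it as None.
        rows.append((int(pid), name, int(mem) if mem.isdigit() else None))
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes that hold GPU memory."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


def terminate(pid, force=False):
    """Send SIGTERM (or SIGKILL when force=True) to a leftover process."""
    os.kill(pid, signal.SIGKILL if force else signal.SIGTERM)
```

A cautious way to use it is to call `list_gpu_processes()` first, confirm which PID belongs to the failed task, and only then call `terminate(pid)`, falling back to `terminate(pid, force=True)` if the process ignores SIGTERM.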

  				
Posted 3 years ago by ObedientToad56

Sure, thanks SuccessfulKoala55. I'm not sure whether it was a one-off event. I will try to reproduce it.

  				
Posted 3 years ago by ObedientToad56

Well, the agent is supposed to kill the task's process. Didn't it?

  				
Posted 3 years ago by SuccessfulKoala55

No, it didn't kill the process.

  				
Posted 3 years ago by ObedientToad56

Well, if you have any relevant debugging info I would appreciate it, or any hints on how to reproduce 🙂

  				
Posted 3 years ago by SuccessfulKoala55

Hi ObedientToad56, I guess the training code somehow left the GPU resources in an unstable state? Is the worker currently running anything?

  				
Posted 3 years ago by SuccessfulKoala55
