Hi, I Have A Worker On A Machine Using Gpus 0,1 And Another Worker On The Same Machine Using Gpus 0,1,2,3,4,5. A Worker Ran A Task On Gpus 0,1 But For Some Reason The Second Worker Started Additional Task In Queue On Gpus 0,1,2,3,4,5, Which Caused Both Of


Hi, I have a worker on a machine using GPUs 0,1 and another worker on the same machine using GPUs 0,1,2,3,4,5. The first worker uses a queue called 2_gpu and the second worker uses a queue called 6_gpu. The first worker ran a task on GPUs 0,1, but for some reason the second worker then started another queued task on GPUs 0,1,2,3,4,5, which caused both workers to fail. I was expecting the second worker to wait until the first one finished, given that the GPUs were taken. How can I use trains-agent to handle such cases? I think a few behaviors are possible here. One is to wait until all six GPUs are free before starting the 6-GPU task, while 2-GPU workers keep running additional tasks in the meantime. Another is to wait for the current 2-GPU task to finish while blocking all new 2-GPU tasks until the 6-GPU task finishes.
Could you maybe implement such automatic logic so we could choose between them?
Anything would be better than the current state, haha.

  				
Posted 4 years ago by SmarmySeaurchin8


Answers 7

I see, will keep that in mind. Thanks Martin!

  				
Posted 4 years ago by SmarmySeaurchin8

"you mean in the enterprise"

Enterprise, with the smarter GPU scheduler. This is an inherent problem of sharing resources and there is no perfect solution: you either have fairness, but then you get idle GPUs, or you have races, where you can get starvation.

  				
Posted 4 years ago by AgitatedDove14

When you say I can still get race/starvation cases, you mean in the enterprise or regular version?

  				
Posted 4 years ago by SmarmySeaurchin8

BTW: you can still get race/starvation cases... but at least there is no crash.

  				
Posted 4 years ago by AgitatedDove14

This is part of a more advanced set of scheduler features, but it is only available in the enterprise edition 🙂

  				
Posted 4 years ago by AgitatedDove14

I am aware this is the current behavior, but could it be changed to something more intelligent? 😇

  				
Posted 4 years ago by SmarmySeaurchin8

If you spin up two agents on the same GPUs, they are not aware of one another, so this is expected behavior.
Does that make sense?
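Since the agents do not coordinate, one practical safeguard is to check the planned GPU assignments for overlap before starting the workers. The following is a minimal sketch in plain Python, not part of trains-agent. The helper names `parse_gpus` and `overlapping_workers` are hypothetical, and the `plan` dictionary simply mirrors the queue names and GPU lists from the question.

```python
from itertools import combinations


def parse_gpus(spec):
    """Turn a comma-separated GPU list such as "0,1" into a set of ints."""
    return {int(part) for part in spec.split(",") if part.strip()}


def overlapping_workers(assignments):
    """Return (queue_a, queue_b, shared_gpus) for every pair of workers
    whose GPU lists intersect.

    `assignments` maps a queue name to the GPU list given to its worker.
    This is a hypothetical pre-flight check, not a trains-agent feature.
    """
    conflicts = []
    for (queue_a, gpus_a), (queue_b, gpus_b) in combinations(assignments.items(), 2):
        shared = parse_gpus(gpus_a) & parse_gpus(gpus_b)
        if shared:
            conflicts.append((queue_a, queue_b, sorted(shared)))
    return conflicts


# The setup described in the question: both workers claim GPUs 0 and 1.
plan = {"2_gpu": "0,1", "6_gpu": "0,1,2,3,4,5"}
for queue_a, queue_b, shared in overlapping_workers(plan):
    print(f"{queue_a} and {queue_b} both use GPUs {shared}")
```

With the setup from the question, the check reports that both workers claim GPUs 0 and 1. A plan with disjoint GPU lists, for example one worker on 0,1 and another on 2,3,4,5, produces no conflicts, at the cost of never running a 6-GPU task on that machine.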

  				
Posted 4 years ago by AgitatedDove14
