Answered

Hi, I have a worker on a machine using GPUs 0,1 and another worker on the same machine using GPUs 0,1,2,3,4,5. The first worker uses a queue called 2_gpu and the second worker uses a queue called 6_gpu. The first worker was running a task on GPUs 0,1, but for some reason the second worker started another queued task on GPUs 0,1,2,3,4,5, which caused both workers to fail. I was expecting the second worker to wait until the first finishes, given that the GPUs are taken. How can I use trains-agent to handle such cases? I think a few behaviors are possible here. One is to wait until all GPUs are free before starting the 6-GPU task, while 2-GPU workers keep running additional tasks in the meantime. Another is to wait for the running 2-GPU task to finish while blocking all new 2-GPU tasks until the 6-GPU task finishes.
Could you maybe implement such automatic logic so we could choose between them?
Anything would be better than the current state, haha.

  
  
Posted 4 years ago

Answers 7


I am aware this is the current behavior, but could it be changed to something more intelligent? 😇

  
  
Posted 4 years ago

When you say I can still get race/starvation cases, do you mean in the enterprise or the regular version?

  
  
Posted 4 years ago

BTW: you can still get race/starvation cases... but at least no crashes.

  
  
Posted 4 years ago

you mean in the enterprise

Enterprise, with the smarter GPU scheduler. This is an inherent problem of sharing resources and there is no perfect solution: you either have fairness, but then you get idle GPUs, or you have races, where you can get starvation.
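
To make the trade-off concrete, here is a toy simulation. It is only a sketch under assumptions and does not reflect how trains-agent or the enterprise scheduler works: the `simulate` function, the task names, and the one-task-per-step arrival pattern are all hypothetical. It compares a strict first-come-first-served policy with a greedy "start whatever fits" policy on a 6-GPU machine.

```python
from collections import deque


def simulate(arrivals, total_gpus, strict_fifo, steps):
    """Toy discrete-time GPU scheduler (hypothetical, for illustration only).

    arrivals: list of (arrival_time, name, gpus_needed, duration).
    Returns (start_times, wasted), where wasted counts GPUs left free
    at the end of a step although some queued task would have fit.
    """
    queue = deque()
    running = []      # (end_time, gpus) for every task currently running
    start_times = {}
    wasted = 0
    for t in range(steps):
        # Release the GPUs of tasks that have finished by time t.
        running = [(end, g) for end, g in running if end > t]
        # Enqueue tasks arriving at time t, in list order.
        for at, name, gpus, dur in arrivals:
            if at == t:
                queue.append((name, gpus, dur))
        free = total_gpus - sum(g for _, g in running)
        remaining = deque()
        blocked = False
        while queue:
            name, gpus, dur = queue.popleft()
            if not blocked and gpus <= free:
                running.append((t + dur, gpus))
                free -= gpus
                start_times[name] = t
            else:
                remaining.append((name, gpus, dur))
                # Strict FIFO: nothing may overtake a waiting task.
                if strict_fifo:
                    blocked = True
        queue = remaining
        if any(gpus <= free for _, gpus, _ in queue):
            wasted += free
    return start_times, wasted


# One 2-GPU task arrives every step; a single 6-GPU task arrives at t=1.
arrivals = [(t, f"small{t}", 2, 2) for t in range(10)] + [(1, "big", 6, 2)]

greedy_starts, greedy_wasted = simulate(arrivals, 6, strict_fifo=False, steps=10)
fifo_starts, fifo_wasted = simulate(arrivals, 6, strict_fifo=True, steps=10)

print("greedy: big started?", "big" in greedy_starts, "wasted:", greedy_wasted)
print("fifo:   big started at", fifo_starts.get("big"), "wasted:", fifo_wasted)
```

In this setup the greedy policy never leaves a usable GPU idle, but the 6-GPU task is starved for the whole run because small tasks keep taking the GPUs as they free up. The strict policy does start the 6-GPU task, at the cost of holding GPUs idle while small tasks that would have fit wait behind it.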

  
  
Posted 4 years ago

I see, will keep that in mind. Thanks Martin!

  
  
Posted 4 years ago

If you spin up two agents on the same GPU, they are not aware of one another, so this is expected behavior.
Does that make sense?
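
Since the agents do not coordinate, one possible workaround is to have the tasks themselves coordinate. The following is a minimal sketch, not a trains-agent feature: the `reserve_gpus` helper and the lock directory are hypothetical, and it assumes a Unix machine (it relies on `fcntl.flock`) where every task wraps its training code in this context manager. A task then blocks until all of its GPUs are free, which avoids the crash but not starvation.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical location for the per-GPU lock files (an assumption).
LOCK_DIR = os.path.join(tempfile.gettempdir(), "gpu_locks")


@contextmanager
def reserve_gpus(gpu_ids, lock_dir=LOCK_DIR):
    """Block until every GPU in gpu_ids is exclusively locked, then yield."""
    os.makedirs(lock_dir, exist_ok=True)
    handles = []
    try:
        # Lock in sorted order so two processes cannot deadlock each other.
        for gpu in sorted(set(gpu_ids)):
            fh = open(os.path.join(lock_dir, f"gpu{gpu}.lock"), "w")
            handles.append(fh)
            # Blocks while another process holds the lock on this GPU.
            fcntl.flock(fh, fcntl.LOCK_EX)
        yield
    finally:
        # Release in reverse order; closing the file also drops the lock.
        for fh in reversed(handles):
            fcntl.flock(fh, fcntl.LOCK_UN)
            fh.close()


if __name__ == "__main__":
    # A task from the 2_gpu queue would reserve GPUs 0 and 1 like this;
    # a 6_gpu task calling reserve_gpus(range(6)) would wait until it exits.
    with reserve_gpus([0, 1]):
        print("GPUs 0,1 reserved, training would run here")
```

The locks are released automatically if the process dies, because the operating system drops `flock` locks when the file is closed.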

  
  
Posted 4 years ago

This is part of a more advanced set of scheduler features, but it is only available in the enterprise edition 🙂

  
  
Posted 4 years ago