Answered
Hi Everyone, I Have a Training Job Task Which Was Using GPU That Went to Failed Status

Hi everyone,
I have a training job task which was using GPU that went to failed status because of CUDA out of memory. However, when I look at the worker view, I can see that a worker is still clogging up GPU resources tied to this experiment. Why would the resources not be freed up, and what would be the right way to clean up the worker?

Posted 2 years ago

Answers 6


No, it didn't kill the process.

Posted 2 years ago

Well, the agent is supposed to kill the task's process - didn't it?

Posted 2 years ago

Yeah, GPU utilization was 100%. I cleaned it up using nvidia-smi and killing the process, but I was expecting the cleanup to happen automatically since the process failed.
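For reference, the manual cleanup roughly amounts to listing the compute processes still holding GPU memory and killing the stale one. A minimal sketch (not the exact commands I ran; the kill is commented out, and you should confirm the PID really belongs to the failed task first):

```python
import os
import signal
import subprocess

# List the compute processes that still hold GPU memory.
out = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_gpu_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for row in out.strip().splitlines():
    pid, name, mem = [field.strip() for field in row.split(",")]
    print(f"PID {pid} ({name}) holds {mem}")
    # Only kill after confirming the PID belongs to the failed task:
    # os.kill(int(pid), signal.SIGKILL)
```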

Posted 2 years ago

Sure, thanks SuccessfulKoala55. Not sure if it is a one-off event. I will try to reproduce it.

Posted 2 years ago

Hi ObedientToad56, I guess somehow the training code left the GPU resources in an unstable state? Is the worker currently running anything?
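If the training code itself is what's holding on to memory when a run fails, a defensive guard around the loop can help. A minimal sketch, assuming PyTorch (the thread doesn't say which framework) and an illustrative run_training helper:

```python
import gc
import torch  # assumption: PyTorch; the framework isn't named in the thread


def run_training(model, loader, optimizer):
    try:
        for batch in loader:
            optimizer.zero_grad()
            loss = model(batch).mean()
            loss.backward()
            optimizer.step()
    finally:
        # On failure (e.g. CUDA OOM), drop this scope's references and return the
        # caching allocator's blocks to the driver. The CUDA context itself is
        # only released once the process exits, so a lingering process will still
        # show some GPU memory in nvidia-smi.
        del model, optimizer
        gc.collect()
        torch.cuda.empty_cache()
```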

Posted 2 years ago

Well, if you have any relevant debugging info I would appreciate it, or any hints on how to reproduce 🙂

Posted 2 years ago