Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Hi Everyone, I Have A Training Job Task Which Was Using Gpu That Went To

Hi Everyone,
I have a training job task which was using GPU that went to failed status because of CUDA Out of memory . However when i look at the worker view, i can see that a worker is still clogging up GPU resources which are tied to this experiment. Why would the resources not be freed up and what would be the right way to cleanup the worker ?

  
  
Posted one year ago
Votes Newest

Answers 6


Hi ObedientToad56 , I guess somehow the training code left the GPU resources in an unstable state? Is the worker currently running anything?

  
  
Posted one year ago

Yeah GPU utilization was 100% . I cleaned it up using nvidia-smi and killing the process. But i was expecting the clean up to happen automatically since the process failed.

  
  
Posted one year ago

sure Thanks SuccessfulKoala55 Not sure if is a one off event. I will try to reproduce it.

  
  
Posted one year ago

Well, the agent is supposed to kill the task's process - didn't it?

  
  
Posted one year ago

Well, if you have any relevant debugging info I would appreciate it, or any hints on how to reproduce 🙂

  
  
Posted one year ago

no , it didn't kill the process.

  
  
Posted one year ago