Answered

Thanks for releasing this awesome experiment manager! I was logging a single training session on multiple GPUs (using Detectron2), and torch.mp is called for each GPU. This creates a separate task in TRAINS for each GPU, and only one of the tasks has the plotting outputs. I also have to manually save the hyperparameters (not a huge deal) due to the placement of Task.init(...), and add "-W ignore" to the python call to bypass this warning:
UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
Has anyone else encountered this, or does anyone have suggestions? Right now, it's just blowing up the number of tasks in my archive (can we delete from the archive yet?).

  
  
Posted 4 years ago

Answers 8


BTW, VexedKangaroo32, are you using torch launch?

  
  
Posted 4 years ago

Since this fix is all about synchronizing different processes, we wanted to be extra careful with the release. That said, I think what we have now should be quite stable. The plan is to have the RC available right after the weekend.

  
  
Posted 4 years ago

Hi VexedKangaroo32, funnily enough this is one of the fixes we will be releasing soon. There is a release scheduled for later this week; right after that I'll post a link here to an RC containing a fix for this exact issue.

  
  
Posted 4 years ago

Thanks VexedKangaroo32, this is great news :)

  
  
Posted 4 years ago

The way it will work is that you need a Task.init call in the main process (the one calling the launch function) and the same Task.init call in main_func. This signals the subprocesses to use the main process's task, so they all report to the same task. Obviously, to test it you will need to wait for the RC (after the weekend :)
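A minimal sketch of that layout, assuming the detectron2 launch function discussed in this thread and a trains version that includes the fix. The project name, task name, GPU count, and the init_task helper are placeholders invented for illustration, not taken from the thread.

```python
# Sketch only: these names and the GPU count are hypothetical placeholders.
PROJECT_NAME = "detectron2-experiments"
TASK_NAME = "multi-gpu-training"
NUM_GPUS = 2


def init_task():
    """Call Task.init with identical arguments, whichever process we are in."""
    # Imported inside the function so the module loads even without trains.
    from trains import Task
    return Task.init(project_name=PROJECT_NAME, task_name=TASK_NAME)


def main_func():
    # Runs once per GPU, inside each process started by launch().
    # Per the answer above, this call should attach to the main process task
    # instead of creating a new one.
    task = init_task()
    # ... build the detectron2 config and trainer here, then train ...
    return task


def main():
    from detectron2.engine import launch

    # Main process: create the task before the workers are spawned.
    task = init_task()
    launch(main_func, num_gpus_per_machine=NUM_GPUS, args=())
    return task
```

To try it, call main() from an `if __name__ == "__main__":` guard in the training script. Routing both calls through one helper keeps the Task.init arguments identical in the main process and the workers.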

  
  
Posted 4 years ago

Hi VexedKangaroo32, there is now an RC with a fix:
pip install trains==0.13.4rc0
Let me know if it solves the problem.

  
  
Posted 4 years ago

I'm not using torch launch, but the launch function in https://github.com/facebookresearch/detectron2/blob/master/detectron2/engine/launch.py. I placed Task.init(...) inside the "main_func" that gets called in mp.spawn.

  
  
Posted 4 years ago

Meant to get back to you a bit sooner, but I can report that I no longer have duplicate tasks after updating to 0.13.4rc0 and putting Task.init in those two places. The job hasn't run to completion, so I can't report whether it ends cleanly.

No manual logging attempted
TensorBoard, terminal outputs, and argparser log properly
No longer need the "-W ignore" arguments

  
  
Posted 4 years ago