Answered
Hi, I Am Switching From Wandb To Clearml In My Pytorch Ddp Training Script. With Wandb I Used To Have Worker Nr 1 Handle Logging To Wandb And Initiating The Connection. If I Simply Exchange Wandb Calls With Clearml Calls, Worker Nr 1, Which Handles The Co

Hi, I am switching from WandB to ClearML in my PyTorch DDP training script. With WandB I used to have worker nr 1 handle logging to WandB and initiate the connection. If I simply exchange the WandB calls with ClearML calls, worker nr 1, which handles the connection, starts straggling behind in the validation and training loop, causing a massive slowdown of the training session, up to a factor of 5. Is this normal, and if not, how should ClearML be used when logging in a PyTorch multi-GPU DDP training environment?

  
  
Posted 7 months ago

Answers 5


Hi Eugen
Thanks for the response!
Let me know if you need a running example; that will take some effort to produce.

  
  
Posted 7 months ago

It seems that calling Task.init() once in the main process, before spawning the worker processes, and then calling Task.current_task() inside the rank 0 process works with no slowdown.
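
For reference, here is a minimal sketch of that pattern. It rests on several assumptions: the project and task names are placeholders, the training loop is a dummy, `should_log` is a hypothetical helper, and the standard-library `multiprocessing` module stands in for `torch.multiprocessing` so the sketch stays small. It is not a tested DDP setup, only an illustration of where the two ClearML calls go.

```python
import multiprocessing as mp


def should_log(rank: int) -> bool:
    """Hypothetical helper: only rank 0 reports metrics."""
    return rank == 0


def train_worker(rank: int, world_size: int) -> None:
    # Imported inside the function so each spawned process loads clearml itself.
    from clearml import Task

    logger = None
    if should_log(rank):
        # Reuse the task created in the main process; no Task.init() here.
        task = Task.current_task()
        logger = task.get_logger() if task is not None else None

    for step in range(3):  # placeholder for the real training/validation loop
        loss = 1.0 / (step + 1)  # dummy value, not a real metric
        if logger is not None:
            logger.report_scalar(
                title="loss", series="train", value=loss, iteration=step
            )


def main(world_size: int = 2) -> None:
    from clearml import Task

    # Create the single task once, before any worker process exists.
    # Project and task names below are placeholders.
    Task.init(project_name="ddp-example", task_name="single-task-logging")

    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=train_worker, args=(rank, world_size))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

To run it, call `main()` from under an `if __name__ == "__main__":` guard, which the "spawn" start method requires.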

  
  
Posted 7 months ago

Hi @StormySeaturtle98! Do you have a sample snippet that could help us diagnose this problem?

  
  
Posted 7 months ago

That makes sense. You should generally have only one task, initialized in the master process. The other subprocesses will inherit this task, which should speed things up.

  
  
Posted 7 months ago

It just puzzles me that the massive slowdown happens even when a single subprocess creates the task, so that only one task is active and it is only accessed by the worker that created it.

  
  
Posted 7 months ago