Answered

Hi! I am currently using ClearML (with remote execution) to train an object detection model with https://github.com/facebookresearch/detectron2 . It was working well in a single-GPU setting, with the TensorBoard logs auto-magically displayed on the ClearML dashboard.

However, when I moved to a multi-GPU setting (still a single machine), the TensorBoard logs are no longer displayed on the ClearML dashboard, although the TensorBoard logs are still being written by detectron2. Note that detectron2 does multi-GPU training in a https://pytorch.org/tutorials/intermediate/ddp_tutorial.html style (i.e., a process is spawned for each GPU) through https://github.com/facebookresearch/detectron2/blob/master/detectron2/engine/launch.py . Is anyone able to help with this issue?
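For context, the failure mode described above can be reproduced without detectron2 or ClearML at all: a freshly started interpreter (as in DDP's one-process-per-GPU launch) does not inherit state set up in the parent process. A minimal stdlib-only sketch, with illustrative names:

```python
import subprocess
import sys

# State set up in the parent process, analogous to the auto-logging
# hooks that ClearML installs when a task is initialized.
PARENT_STATE = "task-initialized"

# A fresh child interpreter, like each per-GPU worker process spawned
# by detectron2's launch(), starts with a clean module namespace.
child_code = "print('PARENT_STATE' in globals())"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # → False: the child never saw PARENT_STATE
```

This is why logging that is wired up only in the launching process silently disappears in the per-GPU workers.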

  
  
Posted 2 years ago

Answers 19


Hi AgitatedDove14, so sorry, I have to re-open this, as the same issue is still happening when I incorporate ClearML into my detectron2 training in our setup. We are using the K8s-glue agent, and I am sending training jobs to be executed remotely. For single-GPU training everything works as intended: TensorBoard graphs show up auto-magically on the ClearML dashboard.

However, when training with multi-GPU (same machine), the TensorBoard graphs do not show up on the ClearML dashboard. Everything else still trains correctly, and the TensorBoard logs written in the K8s container are correct as well. The console logging also shows up normally on the ClearML dashboard, which suggests that the training process is "connected" to ClearML. Also, when I explicitly report scalars in the training process, they do not show up either.

I've attached a zip file which contains 2 folders (single-gpu, multi-gpu). They contain the respective codes and logs (as well as screenshots of the clearml dashboard).

Thank you so much! Looking forward to your reply.

  
  
Posted 2 years ago

Yup, I could view the TensorBoard logs through a local TensorBoard, with all the metrics in them.

  
  
Posted 2 years ago

K8s-glue agent

  
  
Posted 2 years ago

AgitatedDove14 I see! I will try adding Task.current_task() and see how it goes.

That said, I already have a Task.get_task() in the main function which each subprocess runs. Is that not enough to trigger clearml? https://github.com/levan92/det2_clearml/blob/2634d2c6f898f8946f5b3379dba929635d81d0a9/trainer.py#L206

  
  
Posted 2 years ago

Oh! Thank you for pointing that out! I didn't notice that. Yes, it turns out that in my requirements.txt I had pinned that version. Once I changed it to the latest version of clearml, the TensorBoard graphs show up in the dashboard.
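For anyone hitting the same thing, the fix amounts to updating the pin in requirements.txt. The version numbers come from this thread (0.17.5 was the old pin, 1.1.1 the latest at the time); a sketch of the change:

```text
# requirements.txt
# clearml==0.17.5    <- old pin that exhibited the missing-graphs issue
clearml>=1.1.1
```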

  
  
Posted 2 years ago

AgitatedDove14 you can ignore my last question, I've tried it out on a minimal example here: https://github.com/levan92/clearml_test_mp

I've ascertained that I need Task.current_task() in order to trigger ClearML ( Task.get_task() is not enough). Thank you!

  
  
Posted 2 years ago

Hi NonchalantDeer14
In multi-GPU, can you still see the logs on the local TensorBoard?
Are you running manually or with an agent?

  
  
Posted 2 years ago

NonchalantDeer14
I think the issue is that the way it spins up the subprocesses is not with fork but with Popen, so clearml is not "loaded" into the subprocesses, hence no logging.
The easiest fix is to call Task.current_task() inside the actual code (somewhere near where it starts); it should trigger clearml.
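A minimal sketch of where such a call could go in a detectron2-style per-GPU worker. The function name and structure here are illustrative, not detectron2's actual entry points; Task.current_task() is the real ClearML API, and the guarded import only exists so the sketch stays runnable without clearml installed:

```python
try:
    from clearml import Task  # real ClearML API
except ImportError:  # keep the sketch importable without clearml installed
    Task = None

def main_worker(rank):
    """Illustrative entry point executed in each spawned per-GPU process."""
    # Re-attach this subprocess to the parent's ClearML task so the
    # TensorBoard auto-logging hooks get installed here as well.
    # Returns None when no task exists (e.g. running outside ClearML).
    task = Task.current_task() if Task is not None else None

    # ... build the model and data loaders, run the training loop ...
    return task
```

The point is simply that the call happens inside the code each worker process executes, not only in the launching process.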

  
  
Posted 2 years ago

👍

  
  
Posted 2 years ago

I submitted the job through the bash script "train_coco.sh", which basically runs the Python script "train_net_clearml.py" with various arguments.

  
  
Posted 2 years ago

Thanks NonchalantDeer14 !
BTW: how do you submit the multi-GPU job? Is it multi-GPU or multi-node?

  
  
Posted 2 years ago

Okay, let me check the code and come back with follow-up questions.

  
  
Posted 2 years ago

clearml - WARNING - Could not retrieve remote configuration named 'hyperparams'

What's the clearml-server version you are working with?

In both logs I see (even in the single GPU log, it seems you "see" two GPUs, is that correct?)
GPU 0,1 Tesla V100-SXM2-32GB (arch=7.0)

Last question: this is using a relatively old clearml version (0.17.5); can you test with the latest version (1.1.1)?

  
  
Posted 2 years ago

Sorry about that, thank you for your help :)

  
  
Posted 2 years ago

it's multi-gpu, single node!

  
  
Posted 2 years ago

TimelyPenguin76 AgitatedDove14 so sorry for pressing, just bumping this up; do you have any ideas why this happens? Otherwise I will have to proceed with using the ClearML task logging to manually report the metrics.
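For reference, the manual fallback mentioned here would look roughly like this. The metric names are placeholders; report_scalar is ClearML's real Logger API, and the guarded import only keeps the sketch runnable without clearml installed:

```python
try:
    from clearml import Task  # real ClearML API
except ImportError:  # keep the sketch runnable without clearml installed
    Task = None

def report_metrics(iteration, loss_value):
    """Explicitly push a scalar to the ClearML dashboard from a worker."""
    task = Task.current_task() if Task is not None else None
    if task is None:
        return False  # not attached to a ClearML task; nothing reported
    task.get_logger().report_scalar(
        title="train",        # graph title on the dashboard
        series="total_loss",  # series within that graph (placeholder name)
        value=loss_value,
        iteration=iteration,
    )
    return True
```

This bypasses the TensorBoard auto-logging entirely, so it works even when the subprocess hooks were never installed, at the cost of instrumenting the training loop by hand.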

  
  
Posted 2 years ago

Hi AgitatedDove14, sorry for the late reply. Yes, the pod does get allocated 2 GPUs. The "script path" is "train_net_clearml.py".

  
  
Posted 2 years ago

Just verifying the Pod does get allocated 2 GPUs, correct?
What do you have under the "script path" in the Task?

  
  
Posted 2 years ago