AgitatedDove14 you can ignore my last question, I've tried it out on a minimal example here: https://github.com/levan92/clearml_test_mp
I've ascertained that I need Task.current_task() in order to trigger clearml (Task.get_task() is not enough). Thank you!
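For anyone landing here later, a minimal sketch of that pattern, assuming each training subprocess enters through a function like `worker_main` (a hypothetical name, as is the `task_cls` parameter, which exists only so the sketch can be exercised without a clearml server). `Task.current_task()` and `Logger.report_scalar()` are real clearml calls; the scalar title and series names are made up.

```python
def worker_main(rank, task_cls=None):
    """Hypothetical entry point that each training subprocess runs."""
    if task_cls is None:
        # Import inside the subprocess so clearml is loaded in this process.
        from clearml import Task as task_cls

    # Per the thread, this call is what triggers clearml in the subprocess;
    # Task.get_task() alone was reported not to be enough.
    task = task_cls.current_task()
    if task is None:
        # No task is attached to this process, so there is nothing to report to.
        return False

    # Explicit scalar report, just to confirm the subprocess is connected.
    task.get_logger().report_scalar(
        title="debug", series=f"rank{rank}", value=0.0, iteration=0
    )
    return True
```

In real training code the call would sit at the very start of whatever function the launcher runs per GPU, before any TensorBoard writer is created.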
Yup, I could view the tensorboard logs through a local tensorboard, with all the metrics in.
Thanks NonchalantDeer14 !
BTW: how do you submit the multi GPU job? Is it multi-gpu or multi node ?
Okay, let me check the code and come back with follow-up questions
Hi AgitatedDove14 , so sorry, I have to re-open this issue as the same issue is still happening when I incorporate clearml in my detectron2 training in our setup. In our setup, we are using K8s-glue agent, and I am sending training jobs to be executed remotely. For single gpu training, everything works as intended, tensorboard graphs show up auto-magically on clearml dashboard.
However, when I train with multi-gpu (same machine), the tensorboard graphs do not show up on the clearml dashboard. Everything else still trains correctly, and the tensorboard logs written in the k8s container are correct as well. The console logging also shows up normally on the clearml dashboard, which shows that the training process is "connected" to clearml. Also, when I explicitly report scalars in the training process, they do not show up either.
I've attached a zip file which contains 2 folders (single-gpu, multi-gpu). They contain the respective codes and logs (as well as screenshots of the clearml dashboard).
Thank you so much! Looking forward to your reply.
NonchalantDeer14
I think the issue is that the subprocess is spun up with Popen rather than fork, so clearml is not "loaded" into the subprocess, hence no logging.
The easiest fix is to call Task.current_task() inside the actual code (somewhere near where it starts); it should trigger clearml.
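To illustrate the fork vs. Popen point with plain Python (no clearml involved): a small sketch in which a marker set in the parent process stands in for "clearml is loaded". The marker name and function names are hypothetical, and the fork half only works on POSIX systems.

```python
import os
import subprocess
import sys

FLAG = "_demo_logging_ready"  # hypothetical marker, stands in for "clearml is loaded"


def seen_by_forked_child() -> bool:
    """Fork and report whether the child still sees the parent's marker."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # POSIX only
    if pid == 0:
        # Child: starts as a copy of the parent's memory, marker included.
        os.close(read_fd)
        os.write(write_fd, b"1" if hasattr(sys, FLAG) else b"0")
        os._exit(0)
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"


def seen_by_popen_child() -> bool:
    """Start a fresh interpreter with Popen and ask it the same question."""
    code = f"import sys; print(int(hasattr(sys, {FLAG!r})))"
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    out, _ = proc.communicate()
    return out.strip() == "1"


def demo():
    """Return (seen by forked child, seen by Popen child)."""
    setattr(sys, FLAG, True)  # "initialise" something in the parent process
    try:
        return seen_by_forked_child(), seen_by_popen_child()
    finally:
        delattr(sys, FLAG)
```

The forked child inherits the parent's in-memory state, while the Popen child is a brand-new interpreter that has to set things up itself, which is why an explicit call at the start of the subprocess code is needed.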
I submitted the job through the bash script "train_coco.sh", which basically runs the python script "train_net_clearml.py" with various arguments.
TimelyPenguin76 AgitatedDove14 so sorry for pressing, just bumping this up, do you all have any ideas why this happens? Otherwise I will have to proceed with using the clearml task logging to manually report the metrics
Hi AgitatedDove14 sorry for the late reply. Yes, pod does get allocated 2 gpus. "script path" is "train_net_clearml.py"
AgitatedDove14 I see! I will try adding Task.current_task() and see how it goes.
That said, I already have a Task.get_task() in the main function which each subprocess runs. Is that not enough to trigger clearml? https://github.com/levan92/det2_clearml/blob/2634d2c6f898f8946f5b3379dba929635d81d0a9/trainer.py#L206
If it helps, here is my training code: https://github.com/levan92/det2_clearml/blob/master/train_net_clearml.py
Oh! Thank you for pointing that out! Didn’t notice that. Yes, it turns out I had pinned that version in my requirements.txt. Once I changed it to the latest version of clearml, the tensorboard graphs show up in the dashboard.
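A quick way to spot this kind of stale pin, as a rough sketch: the helper below (a hypothetical name) scans requirements text for an exact `==` pin of a given package. It only handles the simple `name==version` form, not extras, ranges, or environment markers.

```python
import re


def pinned_version(requirements_text, package):
    """Return the version pinned with '==' for `package`, or None if not pinned."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*([^\s#;]+)", re.IGNORECASE
    )
    for line in requirements_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


# Example with the old version mentioned in this thread.
print(pinned_version("torch\nclearml==0.17.5\n", "clearml"))
```

Running it against the contents of requirements.txt before submitting a remote job makes it obvious which version the agent will install.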
Hi NonchalantDeer14
In multi-gpu, can you still see the logs on the local Tensorboard ?
Are you running manually or with an agent ?
clearml - WARNING - Could not retrieve remote configuration named 'hyperparams'
What's the clearml-server version you are working with ?
In both logs I see the following (even in the single GPU log, it seems you "see" two GPUs, is that correct?): GPU 0,1 Tesla V100-SXM2-32GB (arch=7.0)
Last question, this is using a relatively old clearml version (0.17.5), can you test with the latest version (1.1.1)?
Just verifying the Pod does get allocated 2 gpus, correct ?
What do you have under the "script path" in the Task?
Sorry about that, thank you for your help :)