You can take a look at the log, that's what I see on the UI
Sorry, I'll try to give you a toy example when I have the time.
TimelyPenguin76, env info can be found in the logs. Thanks!
If it helps, here is my training code: https://github.com/levan92/det2_clearml/blob/master/train_net_clearml.py
Oh! Thank you for pointing that out! Didn't notice that. Yes, it turns out I had pinned that version in my requirements.txt. Once I changed it to the latest version of clearml, the TensorBoard graphs show up in the dashboard.
AgitatedDove14, you can ignore my last question, I've tried it out on a minimal example here: https://github.com/levan92/clearml_test_mp
I've ascertained that I need Task.current_task() in order to trigger clearml (Task.get_task() is not enough). Thank you!
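For anyone hitting the same thing, here is a minimal sketch of the pattern (a toy illustration, not the exact code from the repo above; the project/task names are made up): initialize the task once in the parent process, then call Task.current_task() inside each spawned worker so ClearML hooks TensorBoard in that process.
```
# Minimal sketch: Task.init() once in the parent, Task.current_task() in each
# spawned worker so ClearML's auto-logging is active in that subprocess.
import torch.multiprocessing as mp
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

def worker(rank):
    # Task.current_task() attaches this subprocess to the already-initialized
    # task; Task.get_task(...) alone just fetches a task object and was not
    # enough to trigger the auto-logging here.
    Task.current_task()
    writer = SummaryWriter(log_dir=f"runs/rank{rank}")
    for step in range(10):
        writer.add_scalar("loss", 1.0 / (step + 1), step)
    writer.close()

if __name__ == "__main__":
    Task.init(project_name="debug", task_name="mp_tensorboard_test")
    mp.spawn(worker, nprocs=2)
```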
Hi AgitatedDove14, sorry for the late reply. Yes, the pod does get allocated 2 GPUs. The "script path" is "train_net_clearml.py".
AgitatedDove14 I see! I will try adding Task.current_task() and see how it goes.
That said, I already have a Task.get_task() in the main function which each subprocess runs. Is that not enough to trigger clearml? https://github.com/levan92/det2_clearml/blob/2634d2c6f898f8946f5b3379dba929635d81d0a9/trainer.py#L206
Hi AgitatedDove14, so sorry, I have to re-open this issue as the same issue is still happening when I incorporate clearml into my detectron2 training in our setup. We are using the K8s-glue agent, and I am sending training jobs to be executed remotely. For single-GPU training, everything works as intended: TensorBoard graphs show up auto-magically on the clearml dashboard.
However, when training with multi-GPU (same machine), the TensorBoard graphs do not show up on the clearml dashboard…
Yup, I could view the TensorBoard logs through a local TensorBoard, with all the metrics in it.
K8s-glue agent
I submitted the job through the bash script "train_coco.sh", which basically runs the Python script "train_net_clearml.py" with various arguments.
It's multi-GPU, single node!
TimelyPenguin76 AgitatedDove14, so sorry for pressing, just bumping this up: do you all have any ideas why this happens? Otherwise, I will have to proceed with using clearml's task logging to manually report the metrics.
My current workaround is this: https://github.com/levan92/mmdet_clearml/blob/0028b89a4bc337087b58337f19d226dc0acc8074/tools/torchrun.py#L688-L690
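In case it helps anyone, the manual reporting part of the workaround looks roughly like this (a simplified sketch, not the exact lines from torchrun.py; the metric names and values are made up):
```
# Fallback: report scalars explicitly through the ClearML logger instead of
# relying on TensorBoard auto-logging.
from clearml import Task

task = Task.current_task()   # task was already initialized / attached earlier
logger = task.get_logger()

# report_scalar(title, series, value, iteration) mirrors a TensorBoard scalar
logger.report_scalar(title="loss", series="train", value=0.123, iteration=100)
```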
Sorry about that, thank you for your help :)
I suspect the issue stems from this: https://github.com/open-mmlab/mmcv/blob/2f023453d6fc419e6ed3a8720fcf601d3863b42b/mmcv/runner/checkpoint.py#L703-L705. Does ClearML expect the 2nd argument to torch.save to be a filename? In this case, it is a BytesIO object instead.
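For context, the linked mmcv lines do roughly the following (simplified: mmcv actually writes the bytes out through its FileClient, and the filename here is made up), which is why torch.save never sees a path:
```
# Simplified version of the mmcv checkpoint write: torch.save gets a BytesIO,
# and the bytes are flushed to the destination file in a separate step, so a
# hook on torch.save that expects a filename as the 2nd argument won't see one.
import io
import torch

checkpoint = {"state_dict": {"w": torch.zeros(3)}}

with io.BytesIO() as f:
    torch.save(checkpoint, f)              # 2nd argument is a BytesIO here
    with open("epoch_1.pth", "wb") as out:
        out.write(f.getvalue())            # file write happens outside torch.save

# By contrast, a direct save gives the hook an actual path to pick up:
torch.save(checkpoint, "epoch_1.pth")
```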