TimelyPenguin76 , env info can be found in the logs. Thanks!
I suspect the issue stems from this: https://github.com/open-mmlab/mmcv/blob/2f023453d6fc419e6ed3a8720fcf601d3863b42b/mmcv/runner/checkpoint.py#L703-L705 . Does ClearML expect the 2nd argument to torch.save to be a filename? In this case it is a BytesIO object instead.
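To illustrate why that call path would slip past a filename-based save hook (this is a sketch; the helper name below is hypothetical, not ClearML's actual code): mmcv serializes the checkpoint into an in-memory buffer first and only writes the bytes to disk afterwards, so anything inspecting torch.save's second argument for a path sees a file-like object instead.

```python
import io
import os

def extract_checkpoint_filename(f):
    """Hypothetical helper: a save hook that registers checkpoints
    by filename can only do so when the second argument to
    torch.save is a path, not a file-like object."""
    if isinstance(f, (str, os.PathLike)):
        return os.fspath(f)
    return None  # BytesIO (mmcv's case): no filename to register

# Direct save to a path: the hook would see a usable filename.
assert extract_checkpoint_filename("model.pth") == "model.pth"

# mmcv-style save: serialize to memory first, write bytes later.
# The hook sees a BytesIO and gets no filename at all.
buffer = io.BytesIO()
assert extract_checkpoint_filename(buffer) is None
```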
sorry, i'll try to give you a toy example when i have the time to
AgitatedDove14 I see! I will try adding Task.current_task() and see how it goes.
That said, I already have a Task.get_task() in the main function which each subprocess runs. Is that not enough to trigger clearml? https://github.com/levan92/det2_clearml/blob/2634d2c6f898f8946f5b3379dba929635d81d0a9/trainer.py#L206
Yup i could view the tensorboard logs through a local tensorboard with all the metrics in
if it helps, here is my training code: https://github.com/levan92/det2_clearml/blob/master/train_net_clearml.py
Hi AgitatedDove14 , so sorry, I have to re-open this issue as the same issue is still happening when I incorporate clearml into my detectron2 training in our setup. We are using a K8s-glue agent, and I am sending training jobs to be executed remotely. For single-GPU training everything works as intended: tensorboard graphs show up auto-magically on the clearml dashboard.
However, when training with multi-GPU (same machine), the tensorboard graphs do not show up on the clearml dashboard...
TimelyPenguin76 AgitatedDove14 so sorry for pressing, just bumping this up: do you all have any ideas why this happens? Otherwise I will have to proceed with using the clearml task logger to manually report the metrics.
Hi AgitatedDove14 sorry for the late reply. Yes, pod does get allocated 2 gpus. "script path" is "train_net_clearml.py"
Sorry about that, thank you for your help :)
it's multi-gpu, single node!
Oh! Thank you for pointing that out! Didn't notice that. Yes, it turns out I had pinned that version in my requirements.txt. Once I changed it to the latest version of clearml, the tensorboard graphs show up in the dashboard.
AgitatedDove14 you can ignore my last question, I've tried it out on a minimal example here: https://github.com/levan92/clearml_test_mp
I've ascertained that I need Task.current_task() in order to trigger clearml ( Task.get_task() is not enough). Thank you!
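To make the distinction concrete, here is a sketch of the per-GPU worker entry point (the function name and rank argument are illustrative, and clearml is imported lazily so the snippet can be read without it installed): Task.get_task() merely fetches a task handle from the server by id/name, while Task.current_task() returns the task already attached to the process tree, which is what triggers clearml's auto-logging inside each spawned subprocess.

```python
def worker_main(rank):
    """Per-GPU subprocess entry point (illustrative sketch)."""
    from clearml import Task  # deferred: only needed at training time

    # Task.get_task(...) only fetches a task handle from the server;
    # it does not attach the task to this subprocess, so the
    # tensorboard auto-logging hooks are never installed here.
    #
    # Task.current_task() returns the task attached to the process
    # tree created by Task.init() in the launcher -- this is what
    # triggers clearml's logging inside each spawned worker.
    task = Task.current_task()
    if task is not None:
        task.get_logger().report_text(f"worker {rank} attached to task")
```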
I submitted the job through the bash script "train_coco.sh", which basically runs the python script "train_net_clearml.py" with various arguments.
K8s-glue agent
My current workaround is this: https://github.com/levan92/mmdet_clearml/blob/0028b89a4bc337087b58337f19d226dc0acc8074/tools/torchrun.py#L688-L690
you can take a look at the log, that's what I see on the UI