You can take a look at the log, that's what I see on the UI
Sorry, I'll try to give you a toy example when I have the time.
TimelyPenguin76, env info can be found in the logs. Thanks!
If it helps, here is my training code: https://github.com/levan92/det2_clearml/blob/master/train_net_clearml.py
Oh! Thank you for pointing that out! Didn't notice that. Yes, it turns out I had pinned that version in my requirements.txt. Once I changed it to the latest version of clearml, the TensorBoard graphs show up in the dashboard.
AgitatedDove14, you can ignore my last question, I've tried it out on a minimal example here: https://github.com/levan92/clearml_test_mp
I've ascertained that I need Task.current_task() in order to trigger clearml (Task.get_task() is not enough). Thank you!
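For anyone hitting the same thing, here is a minimal sketch of the pattern (a toy illustration, not the exact code from the repo above; the project/task names are made up): initialize the task once in the parent process, then call Task.current_task() inside each spawned worker so ClearML hooks TensorBoard in that process.
```
# Minimal sketch: Task.init() once in the parent, Task.current_task() in each
# spawned worker so ClearML's auto-logging is active in that subprocess.
import torch.multiprocessing as mp
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

def worker(rank):
    # Task.current_task() attaches this subprocess to the already-initialized
    # task; Task.get_task(...) alone just fetches a task object and was not
    # enough to trigger the auto-logging here.
    Task.current_task()
    writer = SummaryWriter(log_dir=f"runs/rank{rank}")
    for step in range(10):
        writer.add_scalar("loss", 1.0 / (step + 1), step)
    writer.close()

if __name__ == "__main__":
    Task.init(project_name="debug", task_name="mp_tensorboard_test")
    mp.spawn(worker, nprocs=2)
```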
Hi AgitatedDove14, sorry for the late reply. Yes, the pod does get allocated 2 GPUs. The "script path" is "train_net_clearml.py".
AgitatedDove14 I see! I will try adding Task.current_task() and see how it goes.
That said, I already have a Task.get_task() in the main function which each subprocess runs. Is that not enough to trigger clearml? https://github.com/levan92/det2_clearml/blob/2634d2c6f898f8946f5b3379dba929635d81d0a9/trainer.py#L206
Hi AgitatedDove14, so sorry, I have to re-open this issue as the same issue is still happening when I incorporate clearml into my detectron2 training in our setup. We are using the K8s-glue agent, and I am sending training jobs to be executed remotely. For single-GPU training, everything works as intended: TensorBoard graphs show up auto-magically on the clearml dashboard.
However, when training with multi-GPU (same machine), the TensorBoard graphs do not show up on the clearml dashboard…
Yup, I could view the TensorBoard logs through a local TensorBoard, with all the metrics in it.
K8s-glue agent
I submitted the job through the bash script "train_coco.sh", which basically runs the Python script "train_net_clearml.py" with various arguments.
It's multi-GPU, single node!
TimelyPenguin76 AgitatedDove14, so sorry for pressing, just bumping this up: do you all have any ideas why this happens? Otherwise, I will have to proceed with using clearml's task logging to manually report the metrics.
My current workaround is this: https://github.com/levan92/mmdet_clearml/blob/0028b89a4bc337087b58337f19d226dc0acc8074/tools/torchrun.py#L688-L690
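In case it helps anyone, the manual reporting part of the workaround looks roughly like this (a simplified sketch, not the exact lines from torchrun.py; the metric names and values are made up):
```
# Fallback: report scalars explicitly through the ClearML logger instead of
# relying on TensorBoard auto-logging.
from clearml import Task

task = Task.current_task()   # task was already initialized / attached earlier
logger = task.get_logger()

# report_scalar(title, series, value, iteration) mirrors a TensorBoard scalar
logger.report_scalar(title="loss", series="train", value=0.123, iteration=100)
```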
Sorry about that, thank you for your help :)
I suspect the issue stems from this: https://github.com/open-mmlab/mmcv/blob/2f023453d6fc419e6ed3a8720fcf601d3863b42b/mmcv/runner/checkpoint.py#L703-L705. Does ClearML expect the 2nd argument to torch.save to be a filename? In this case, it is a BytesIO object instead.
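For context, the linked mmcv lines do roughly the following (simplified: mmcv actually writes the bytes out through its FileClient, and the filename here is made up), which is why torch.save never sees a path:
```
# Simplified version of the mmcv checkpoint write: torch.save gets a BytesIO,
# and the bytes are flushed to the destination file in a separate step, so a
# hook on torch.save that expects a filename as the 2nd argument won't see one.
import io
import torch

checkpoint = {"state_dict": {"w": torch.zeros(3)}}

with io.BytesIO() as f:
    torch.save(checkpoint, f)              # 2nd argument is a BytesIO here
    with open("epoch_1.pth", "wb") as out:
        out.write(f.getvalue())            # file write happens outside torch.save

# By contrast, a direct save gives the hook an actual path to pick up:
torch.save(checkpoint, "epoch_1.pth")
```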