Answered

Hi! I am currently using ClearML (with remote execution) to train an object detection model with https://github.com/facebookresearch/detectron2 . It was working well in a single-GPU setting, with the TensorBoard logs auto-magically displayed on the ClearML dashboard.

However, when I moved to a multi-GPU setting (still a single machine), the TensorBoard logs are no longer displayed on the ClearML dashboard, although the TensorBoard logs are still being written by detectron2. Note that detectron2 does multi-GPU training in a https://pytorch.org/tutorials/intermediate/ddp_tutorial.html style (i.e., a process is spawned for each GPU) through https://github.com/facebookresearch/detectron2/blob/master/detectron2/engine/launch.py . Is anyone able to help with this issue?
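For context, the failure mode described above can be reproduced without detectron2 or ClearML at all: a freshly started interpreter (as in DDP's one-process-per-GPU launch) does not inherit state set up in the parent process. A minimal stdlib-only sketch, with illustrative names:

```python
import subprocess
import sys

# State set up in the parent process, analogous to the auto-logging
# hooks that ClearML installs when a task is initialized.
PARENT_STATE = "task-initialized"

# A fresh child interpreter, like each per-GPU worker process spawned
# by detectron2's launch(), starts with a clean module namespace.
child_code = "print('PARENT_STATE' in globals())"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # → False: the child never saw PARENT_STATE
```

This is why logging that is wired up only in the launching process silently disappears in the per-GPU workers.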

  
  
Posted 2 years ago

Answers 19


Hi AgitatedDove14, so sorry, I have to re-open this, as the same issue is still happening when I incorporate ClearML into my detectron2 training in our setup. We are using the K8s-glue agent, and I am sending training jobs to be executed remotely. For single-GPU training everything works as intended: TensorBoard graphs show up auto-magically on the ClearML dashboard.

However, when training with multi-GPU (same machine), the TensorBoard graphs do not show up on the ClearML dashboard. Everything else still trains correctly, and the TensorBoard logs written in the K8s container are correct as well. The console logging also shows up normally on the ClearML dashboard, which suggests that the training process is "connected" to ClearML. Also, when I explicitly report scalars in the training process, they do not show up either.

I've attached a zip file which contains 2 folders (single-gpu, multi-gpu). They contain the respective codes and logs (as well as screenshots of the clearml dashboard).

Thank you so much! Looking forward to your reply.

  
  
Posted 2 years ago

Yup, I could view the TensorBoard logs through a local TensorBoard, with all the metrics in them.

  
  
Posted 2 years ago

K8s-glue agent

  
  
Posted 2 years ago

AgitatedDove14 I see! I will try adding Task.current_task() and see how it goes.

That said, I already have a Task.get_task() in the main function which each subprocess runs. Is that not enough to trigger clearml? https://github.com/levan92/det2_clearml/blob/2634d2c6f898f8946f5b3379dba929635d81d0a9/trainer.py#L206

  
  
Posted 2 years ago

Oh! Thank you for pointing that out! I didn't notice that. Yes, it turns out that in my requirements.txt I had pinned that version. Once I changed it to the latest version of clearml, the TensorBoard graphs show up in the dashboard.
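For anyone hitting the same thing, the fix amounts to updating the pin in requirements.txt. The version numbers come from this thread (0.17.5 was the old pin, 1.1.1 the latest at the time); a sketch of the change:

```text
# requirements.txt
# clearml==0.17.5    <- old pin that exhibited the missing-graphs issue
clearml>=1.1.1
```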

  
  
Posted 2 years ago

AgitatedDove14 you can ignore my last question, I've tried it out on a minimal example here: https://github.com/levan92/clearml_test_mp

I've ascertained that I need Task.current_task() in order to trigger ClearML ( Task.get_task() is not enough). Thank you!

  
  
Posted 2 years ago

Hi NonchalantDeer14
In multi-GPU, can you still see the logs on the local TensorBoard?
Are you running manually or with an agent?

  
  
Posted 2 years ago

NonchalantDeer14
I think the issue is that the way it spins up the subprocesses is not with fork but with Popen, so clearml is not "loaded" into the subprocesses, hence no logging.
The easiest fix is to call Task.current_task() inside the actual code (somewhere near where it starts); it should trigger clearml.
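A minimal sketch of where such a call could go in a detectron2-style per-GPU worker. The function name and structure here are illustrative, not detectron2's actual entry points; Task.current_task() is the real ClearML API, and the guarded import only exists so the sketch stays runnable without clearml installed:

```python
try:
    from clearml import Task  # real ClearML API
except ImportError:  # keep the sketch importable without clearml installed
    Task = None

def main_worker(rank):
    """Illustrative entry point executed in each spawned per-GPU process."""
    # Re-attach this subprocess to the parent's ClearML task so the
    # TensorBoard auto-logging hooks get installed here as well.
    # Returns None when no task exists (e.g. running outside ClearML).
    task = Task.current_task() if Task is not None else None

    # ... build the model and data loaders, run the training loop ...
    return task
```

The point is simply that the call happens inside the code each worker process executes, not only in the launching process.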

  
  
Posted 2 years ago

👍

  
  
Posted 2 years ago

I submitted the job through the bash script "train_coco.sh", which basically runs the Python script "train_net_clearml.py" with various arguments.

  
  
Posted 2 years ago

Thanks NonchalantDeer14 !
BTW: how do you submit the multi-GPU job? Is it multi-GPU or multi-node?

  
  
Posted 2 years ago

Okay, let me check the code and come back with follow-up questions.

  
  
Posted 2 years ago

clearml - WARNING - Could not retrieve remote configuration named 'hyperparams'

What's the clearml-server version you are working with?

In both logs I see (even in the single GPU log, it seems you "see" two GPUs, is that correct?)
GPU 0,1 Tesla V100-SXM2-32GB (arch=7.0)

Last question: this is using a relatively old clearml version (0.17.5); can you test with the latest version (1.1.1)?

  
  
Posted 2 years ago

Sorry about that, thank you for your help :)

  
  
Posted 2 years ago

it's multi-gpu, single node!

  
  
Posted 2 years ago

TimelyPenguin76 AgitatedDove14 so sorry for pressing, just bumping this up; do you have any ideas why this happens? Otherwise I will have to proceed with using the ClearML task logging to manually report the metrics.
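For reference, the manual fallback mentioned here would look roughly like this. The metric names are placeholders; report_scalar is ClearML's real Logger API, and the guarded import only keeps the sketch runnable without clearml installed:

```python
try:
    from clearml import Task  # real ClearML API
except ImportError:  # keep the sketch runnable without clearml installed
    Task = None

def report_metrics(iteration, loss_value):
    """Explicitly push a scalar to the ClearML dashboard from a worker."""
    task = Task.current_task() if Task is not None else None
    if task is None:
        return False  # not attached to a ClearML task; nothing reported
    task.get_logger().report_scalar(
        title="train",        # graph title on the dashboard
        series="total_loss",  # series within that graph (placeholder name)
        value=loss_value,
        iteration=iteration,
    )
    return True
```

This bypasses the TensorBoard auto-logging entirely, so it works even when the subprocess hooks were never installed, at the cost of instrumenting the training loop by hand.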

  
  
Posted 2 years ago

Hi AgitatedDove14, sorry for the late reply. Yes, the pod does get allocated 2 GPUs. The "script path" is "train_net_clearml.py".

  
  
Posted 2 years ago

Just verifying the Pod does get allocated 2 GPUs, correct?
What do you have under the "script path" in the Task?

  
  
Posted 2 years ago