ClearML with PyTorch-based distributed training

Answered

Hi everyone! Is the combination of ClearML with torch.distributed.launch or torchrun actively supported? A brief search in this channel for “distributed” showed some threads but wasn’t clear on whether this completely works or not.

What I’ve discovered is that the training somehow gets stuck when I have it create a ClearML Task (even if CLEARML_OFFLINE_MODE=True is set). I can show some outputs, but I couldn’t derive much from them.

Also, I’ve seen that the example at https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_distributed_example.py does not seem to use torchrun or torch.distributed.launch but rather starts a set of sub-processes itself. Is that the intended way to get distributed training working with ClearML?

  				
Posted 2 years ago by ScantChimpanzee51

Hi ScantChimpanzee51,
How are you launching the code?
The easiest way is to follow the example you just mentioned.
Can this issue be reproduced?

  				
Posted 2 years ago by AgitatedDove14

I’m launching my own repo with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
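
To narrow down a hang with launchers like the ones above, it can help to first confirm that every worker receives the rank information it expects, before any ClearML code runs. The following is a minimal sketch, not something from this thread: it reads the environment variables that torchrun exports to each worker (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The names DistEnv and read_dist_env are hypothetical, and the fallback values for a plain single-process run are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    """Rank information for one worker process (hypothetical helper type)."""
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the variables torchrun sets for each worker.

    The defaults describe a single-process run and are an assumption
    made for this sketch, so the script also works without a launcher.
    """
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


if __name__ == "__main__":
    # Each worker prints its own view; flush so lines are not lost on a hang.
    print(f"pid={os.getpid()} {read_dist_env()}", flush=True)
```

Launching this file with the same torchrun arguments should print one line per worker, each with a different rank. If that works but the real script still hangs, the problem is more likely in the training or task-creation code than in the launcher setup.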

  				
Posted 2 years ago by ScantChimpanzee51

AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ... ? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distributed launches?

  				
Posted 2 years ago by ScantChimpanzee51

Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments?

Yes, it is supported and should work.

If so, would it be started via python ... or via torchrun ... ?

Yes, it should; hence the request for a code snippet to reproduce the issue you are experiencing.

What about remote runs, how will they support the parallel execution?

Supported. You should see in the "script entry" something like "-m -m torch.distributed.launch --nproc_per_node 2 ...".

To go even deeper, what about the machines started via ClearML Autoscaler?

Should work out of the box. This is considered a single Job/Task, so there is no need to spin up multiple agents for that.

  				
Posted 2 years ago by AgitatedDove14

Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported rather than the (now deprecated but still usable) torch.distributed.launch ?

  				
Posted 2 years ago by ScantChimpanzee51

It should actually work the same. If you find that it fails to register properly, let me know (and then I guess a GitHub issue is the next step).

  				
Posted 2 years ago by AgitatedDove14

Hi AgitatedDove14, I’ve managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run, multiple ClearML tasks are created and visible in the UI. The logs and scalars are not aggregated; instead, the task of each rank reports its own.

If however I branch out via torch.multiprocessing like below, everything works as expected. The “script path” just shows the single python script, all logs and scalars from all ranks are aggregated into a single task.

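    import torch
    import torch.multiprocessing as mp
    from clearml import Task
    # Create the task once in the parent process; demo_basic comes from the PyTorch tutorial.
    task = Task.init(project_name="Glass-ClearML Demo", task_name="Distributed basic mp.spawn, simple model, v3")
    n_gpus = torch.cuda.device_count()
    world_size = n_gpus
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)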

All code is taken from the PyTorch tutorial; I just added a ClearML Task to it as shown above.
ClearML version is 1.7.1.
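
For the torchrun case described above, where every rank ends up with its own task, one possible workaround is to create the task only on global rank 0. This is a sketch under assumptions and was not confirmed in this thread: it relies on the RANK environment variable that torchrun sets, and the helper names is_main_rank and init_task_if_main_rank are hypothetical. The trade-off is that only rank 0 reports to ClearML, so output from the other ranks is not aggregated the way it is with mp.spawn.

```python
import os
from typing import Optional


def is_main_rank() -> bool:
    """True for global rank 0, or when RANK is not set at all."""
    return int(os.environ.get("RANK", "0")) == 0


def init_task_if_main_rank(project_name: str, task_name: str) -> Optional[object]:
    """Create a ClearML task on rank 0 only; other ranks get None."""
    if not is_main_rank():
        return None
    # Imported here so that non-zero ranks skip the import in this helper.
    from clearml import Task
    return Task.init(project_name=project_name, task_name=task_name)
```

Callers then need to guard any explicit reporting, for example with `if task is not None:`, because the non-zero ranks have no task object.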

  				
Posted 2 years ago by ScantChimpanzee51

Results of a bit more investigation:

The ClearML example does use the PyTorch dist package but none of the DistributedDataParallel functionality; instead, it reduces gradients “manually”. This script is also not prepared for torchrun, as it launches more processes itself (without using the multiprocessing of Python or PyTorch).
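
To illustrate what reducing gradients “manually” usually amounts to, here is a small simulation in plain Python. It assumes the common pattern of summing each gradient across ranks and dividing by the world size; in real code that would be a torch.distributed.all_reduce call followed by a division, which DistributedDataParallel performs automatically during the backward pass. Lists stand in for tensors, and the function name all_reduce_mean is hypothetical.

```python
from typing import List, Sequence


def all_reduce_mean(per_rank_grads: Sequence[Sequence[float]]) -> List[float]:
    """Simulate sum-then-divide gradient averaging across ranks.

    per_rank_grads[r][i] is the gradient of parameter i as computed on rank r.
    """
    world_size = len(per_rank_grads)
    if world_size == 0:
        raise ValueError("need at least one rank")
    # Element-wise sum over ranks, as an all-reduce with a SUM op would produce.
    summed = [sum(values) for values in zip(*per_rank_grads)]
    # Divide by the number of ranks so every rank ends up with the mean.
    return [s / world_size for s in summed]


# Two ranks, two parameters each.
print(all_reduce_mean([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0]
```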

When running a simple example (code attached below) that includes artifact uploads to S3 and launching processes via torch.multiprocessing , the training hangs at the end - any idea where to investigate closer?

ClearML Task: created new task id=f070414bfb84402baa597a0167d1a21e
2023-01-26 17:34:22,564 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page:


Running basic DDP on rank 2.
Running basic DDP on rank 0.
Running basic DDP on rank 1.
saving...
2023-01-26 17:34:35,507 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:35,510 - clearml.Task - INFO - Waiting to finish uploads
saved
2023-01-26 17:34:37,042 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_olqpu7no.tmp => glass-clearml/Glass-ClearML Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:37,048 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:37,550 - clearml.Task - INFO - Completed model upload to

 Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:44,129 - clearml.Task - INFO - Finished uploading
2023-01-26 17:34:45,926 - clearml.Task - INFO - Finished uploading
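
One generic way to find out where a process is stuck after the last log line is Python's built-in faulthandler module, which can dump the stack of every thread after a timeout. The sketch below is not specific to ClearML and was not suggested in this thread; the helper names and the 120-second default are assumptions to adjust.

```python
import faulthandler
import sys
from typing import TextIO


def arm_hang_watchdog(timeout_s: float = 120.0, file: TextIO = sys.stderr) -> None:
    """Dump all thread tracebacks every timeout_s seconds until disarmed."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump, e.g. once the process exits normally."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_hang_watchdog() at the start of both the parent process and the worker function means that, if the run stalls after "Finished uploading", the periodic dumps show which call each thread is blocked in.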

  				
Posted 2 years ago by ScantChimpanzee51

When running on our bigger research repository which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a Killed message for the main process (I do not abort the main process manually):

2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:22,133 - clearml.storage - INFO - Uploading: 5.02MB / 18.77MB @ 1.69MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/manuel/venv/real-esr/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
2023-01-26 17:37:24,405 - clearml.model - INFO - Selected model id: 31f67a1ac95643d4aa12af9eb52ed032
2023-01-26 17:37:25,318 - clearml.storage - INFO - Uploading: 10.02MB / 18.77MB @ 1.57MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Loading model from: /home/manuel/venv/real-esr/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-01-26 17:37:25,832 - clearml.model - INFO - Selected model id: 108e1a350bf1457da94f408cde9cfd82
2023-01-26 17:37:27,589 - clearml.storage - INFO - Uploading: 15.02MB / 18.77MB @ 2.20MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
2023-01-26 17:37:30,226 - clearml.Task - INFO - Completed model upload to

 Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:57,508 INFO: Validation validation
	 # ssim: 0.1691	Best: 0.1691 @ 11 iter
	 # lpips: 0.7296	Best: 0.7296 @ 11 iter

2023-01-26 17:38:39,719 INFO: Validation train-val
	 # ssim: 0.1691	Best: 0.1691 @ 11 iter
	 # lpips: 0.7296	Best: 0.7296 @ 11 iter



2023-01-26 17:38:56,935 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
Killed

  				
Posted 2 years ago by ScantChimpanzee51

Sorry that these issues are quite deep and chaotic. We would appreciate any help or ideas you can think of!

  				
Posted 2 years ago by ScantChimpanzee51
