Hello, does anyone know about this problem?
Hello. Does anyone know how to add "-m torch.distributed.launch" to the command for distributed training, like in this document?
I tried something like this:
clearml-task --project AAAA --name rtmdet-ins_l_8xb32-300e_coco.py --script tools/train.py --args config=configs/rtmdet/rtmdet-ins_l_8xb32-300e_coco.py launcher=pytorch m=torch.distributed.launch nproc_per_node=4 nnodes=1 node_rank=0 master_addr=127.0.0.1 master_port=29500 --docker mmdet-3.0 --docker_args="--network=host --gpus all" --queue default
But it complained that the [RANK, WORLD_SIZE] environment variables were not set, which also means that "-m torch.distributed.launch" was not actually passed to the python command line.
If I manually add [RANK, WORLD_SIZE] values, the training times out and fails. The only values that work are [0, 1], but in that case only the first available GPU is used for training.
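For context, here is my rough understanding of what "-m torch.distributed.launch" does under the hood, as a minimal single-node sketch (the config path, GPU count, and train.py arguments just mirror my command above, so treat them as assumptions). It spawns one worker process per GPU and sets the rendezvous environment variables for each one, which is why I think hard-coding RANK=0 / WORLD_SIZE=1 can only ever drive a single GPU:

```python
# Rough, non-ClearML illustration of what
#   python -m torch.distributed.launch --nproc_per_node=4 tools/train.py ...
# effectively does: spawn one process per GPU and set RANK / WORLD_SIZE / etc.
# for each of them. Setting RANK=0, WORLD_SIZE=1 by hand means only one
# process joins the group, so only one GPU trains.
import os
import subprocess
import sys

NPROC_PER_NODE = 4            # number of GPUs on this node (assumption for the example)
MASTER_ADDR = "127.0.0.1"
MASTER_PORT = "29500"

procs = []
for local_rank in range(NPROC_PER_NODE):
    env = os.environ.copy()
    env.update({
        "RANK": str(local_rank),        # global rank (single node, so same as local rank)
        "LOCAL_RANK": str(local_rank),
        "WORLD_SIZE": str(NPROC_PER_NODE),
        "MASTER_ADDR": MASTER_ADDR,
        "MASTER_PORT": MASTER_PORT,
    })
    # One training process per GPU, each seeing its own rank via the env vars.
    procs.append(subprocess.Popen(
        [sys.executable, "tools/train.py",
         "configs/rtmdet/rtmdet-ins_l_8xb32-300e_coco.py",
         "--launcher", "pytorch"],
        env=env,
    ))

for p in procs:
    p.wait()
```

So my question is really whether clearml-task can make the remote worker run the script through that kind of launcher, instead of plain "python tools/train.py".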