Hello, does anyone know about this problem?
Hello. Does anyone know how to add "-m torch.distributed.launch" to the command for distributed training, like in this document?
I tried something like this:
clearml-task --project AAAA --name rtmdet-ins_l_8xb32-300e_coco.py --script tools/train.py --args config=configs/rtmdet/rtmdet-ins_l_8xb32-300e_coco.py launcher=pytorch m=torch.distributed.launch nproc_per_node=4 nnodes=1 node_rank=0 master_addr=127.0.0.1 master_port=29500 --docker mmdet-3.0 --docker_args="--network=host --gpus all" --queue default
But it complained that the [RANK, WORLD_SIZE] environment variables were not set, which also means that "-m torch.distributed.launch" was not actually passed to the python command line.
If I manually add [RANK, WORLD_SIZE] values, the training times out and fails. The only values that work are [0, 1], but in that case only the first available GPU is used for training.
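For context, here is my rough understanding of what "-m torch.distributed.launch" does under the hood, as a minimal single-node sketch (the config path, GPU count, and train.py arguments just mirror my command above, so treat them as assumptions). It spawns one worker process per GPU and sets the rendezvous environment variables for each one, which is why I think hard-coding RANK=0 / WORLD_SIZE=1 can only ever drive a single GPU:

```python
# Rough, non-ClearML illustration of what
#   python -m torch.distributed.launch --nproc_per_node=4 tools/train.py ...
# effectively does: spawn one process per GPU and set RANK / WORLD_SIZE / etc.
# for each of them. Setting RANK=0, WORLD_SIZE=1 by hand means only one
# process joins the group, so only one GPU trains.
import os
import subprocess
import sys

NPROC_PER_NODE = 4            # number of GPUs on this node (assumption for the example)
MASTER_ADDR = "127.0.0.1"
MASTER_PORT = "29500"

procs = []
for local_rank in range(NPROC_PER_NODE):
    env = os.environ.copy()
    env.update({
        "RANK": str(local_rank),        # global rank (single node, so same as local rank)
        "LOCAL_RANK": str(local_rank),
        "WORLD_SIZE": str(NPROC_PER_NODE),
        "MASTER_ADDR": MASTER_ADDR,
        "MASTER_PORT": MASTER_PORT,
    })
    # One training process per GPU, each seeing its own rank via the env vars.
    procs.append(subprocess.Popen(
        [sys.executable, "tools/train.py",
         "configs/rtmdet/rtmdet-ins_l_8xb32-300e_coco.py",
         "--launcher", "pytorch"],
        env=env,
    ))

for p in procs:
    p.wait()
```

So my question is really whether clearml-task can make the remote worker run the script through that kind of launcher, instead of plain "python tools/train.py".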