The first stage is a rank0 PyTorch script. The downstream stages are rankN scripts, and they are waiting for the IP address of the first stage. But the first stage doesn't return; it simply waits for the rankN scripts to connect to it. The rankN scripts, in turn, never start, because they only start once the first stage has finished. So it's probably necessary to have just a single stage.
If I were to start a single rank0 task and then separate rankN tasks, it would be rather messy on the ClearML dashboard. Best to have either a single ClearML application or a single ClearML pipeline do this.
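To illustrate why a single stage avoids the deadlock, here is a minimal sketch using only the Python standard library. It is not `torch.distributed` and not ClearML code: the plain TCP socket is a stand-in for the rendezvous, and the names `rank0`, `rank_n`, and `run_single_stage` are hypothetical. The point is only that rank0 blocks until every other rank has connected, so all ranks have to be launched together from one place.

```python
import socket
import threading


def rank0(server, world_size, joined):
    # Rank 0 blocks in accept() until every other rank has connected.
    # It cannot "return" its address first and then wait.
    for _ in range(world_size - 1):
        conn, _addr = server.accept()
        with conn:
            joined.append(int(conn.recv(16).decode()))


def rank_n(rank, host, port):
    # Rank N needs rank 0's address up front, then connects to it.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(str(rank).encode())


def run_single_stage(world_size=3):
    # Bind the listening socket before starting any rank, so the address
    # is known without waiting for rank 0 to finish.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(world_size)
    server.settimeout(5)  # fail instead of hanging forever
    host, port = server.getsockname()

    joined = []
    threads = [threading.Thread(target=rank0, args=(server, world_size, joined))]
    threads += [
        threading.Thread(target=rank_n, args=(rank, host, port))
        for rank in range(1, world_size)
    ]
    # All ranks start concurrently inside the same stage.
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    server.close()
    return sorted(joined)


if __name__ == "__main__":
    print(run_single_stage(3))
```

If `rank0` were run to completion before any `rank_n` was started, as happens with sequential stages, it would sit in `accept()` until the timeout, which mirrors the hang described above.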