This task is picked up by first agent; it runs DDP launch script for itself and then creates clones of itself with task.create_function_task() and passes its address as argument to the function
Hi UnevenHorse85
Interesting use case, just for my understanding, the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
passes its address as argument to the function
This seems like a great solution.
the queue is polluted with lots of cloned tasks that have to be aborted manually, and the whole job only requires only ...
I wouldn't say the queue pollution is the issue (or the multiple copies of the cloned Tasks), I think the main issue here is that the allocated nodes have to wait until all nodes are allocated, no?
Regrading Task pollution, when the master node is done, it can delete all child/cloned Tasks so it is easier on the eyes. This way if something goes wrong in one of the nodes, you have full visibility, but when everything works, you end up with a clean single copy.
wdyt?