Hi folks! Can someone advise or share examples of how to properly combine Hydra and ClearML when working with hyperparameters and DDP? I tried to follow the documentation, but it behaves somewhat strangely: the hyperparameters are passed through, but the number of instances launched is whatever is hard-coded in train.py.
For example:
Here I would like to train on 4 k8s nodes:
python3 train.py trainer.max_epochs=6 trainer=ddp trainer.devices=1 trainer.num_nodes=4 ++logger.mlflow.tracking_uri=
+logger.mlflow.experiment_name="debug-exp"
but only 3 nodes are spawned, as hard-coded in train.py:
task.launch_multi_node(total_num_nodes=3, port=29500, queue='default', wait=True, addr=None)
As a result, the training hangs indefinitely (it never actually starts) because it is waiting for the fourth node/instance to join.
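What I would have expected to work is something like the sketch below, where the node count comes from the Hydra config instead of being hard-coded. This is only an illustration: `launch_nodes_from_config` and `FakeTask` are hypothetical names I made up, and I am assuming the `trainer.num_nodes` override is already resolved in `cfg` at the point where `launch_multi_node` is called, which may not hold when the task is re-executed by a ClearML agent.

```python
from types import SimpleNamespace


def launch_nodes_from_config(task, cfg, port=29500, queue="default"):
    """Hypothetical helper: launch as many nodes as the Hydra config asks for."""
    # cfg.trainer.num_nodes is the value set by the `trainer.num_nodes=4` override
    num_nodes = int(cfg.trainer.num_nodes)
    # Same call as in train.py, but without the hard-coded 3
    return task.launch_multi_node(
        total_num_nodes=num_nodes, port=port, queue=queue, wait=True, addr=None
    )


class FakeTask:
    """Stand-in for the ClearML task, so the call can be shown without a server."""

    def launch_multi_node(self, **kwargs):
        # Record the arguments instead of actually enqueuing nodes
        self.kwargs = kwargs
        return kwargs


# Stand-in for the Hydra cfg object after `trainer.num_nodes=4` is applied
cfg = SimpleNamespace(trainer=SimpleNamespace(num_nodes=4, devices=1))
fake_task = FakeTask()
launch_nodes_from_config(fake_task, cfg)
```

In the real train.py, `task` would be the ClearML task object and `cfg` the config passed to the Hydra entry point, so the fake objects above would not be needed.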
I would appreciate any help!
P.S. The other hyperparameters, such as the number of epochs, are always applied as specified, e.g. trainer.max_epochs=6 runs the training for 6 epochs.
Thanks in advance!