
Awesome! Thanks for the quick help!
IMHO, ClearML would just start the execution on multiple hosts. Keep in mind that the hosts need to be on the same LAN and have very high bandwidth between them.
What you are looking for is called "DistributedDataParallel". Maybe this tutorial gives you a starting point:
None
ClearML usually just moves the execution down to the nodes, so I'm not sure what role ClearML is playing in your issue.
Yes, that makes sense. I solved it by actively reading them via Task.parameters. That works now; I just had to adjust the config parser a bit.
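Roughly what that reading step could look like (a sketch only; the project name, section name, and the merge into the YAML-based config are assumptions, not the poster's actual code):

```python
from clearml import Task
import yaml

# attach to the running task, or create one when run standalone (names are hypothetical)
task = Task.current_task() or Task.init(project_name="examples", task_name="yaml-config-run")

# load the original YAML config, then pull the parameters set on the task by the UI/optimizer
with open("config.yaml") as f:  # hypothetical config file
    config = yaml.safe_load(f)

# parameters come back grouped by section ("General" is an assumption here);
# depending on the clearml version the values may be returned as strings
overrides = task.get_parameters_as_dict().get("General", {})
config.update(overrides)  # let the task parameters win over the YAML defaults
```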
I would recommend you start getting familiar with the distributed training modes (for example DDP in PyTorch). There are some important concepts you need in order to train across multiple GPUs and devices.
Before you start with a sophisticated model, I'd recommend trying this training setup with a baseline model and checking that data, gradients, weights, metrics, etc. are synced correctly.
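For reference, a minimal single-machine DDP sanity check could look roughly like this (toy model and data, "gloo" backend so it also runs without GPUs; this is an illustration, not taken from the tutorial linked above):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))  # parameters broadcast from rank 0, gradients all-reduced
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(5):
        x, y = torch.randn(8, 10), torch.randn(8, 1)  # toy data, different on each rank
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()  # DDP averages gradients across ranks here
        optimizer.step()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    # sanity check: after the synced updates, weights must be identical on every rank
    w = model.module.weight.detach().clone()
    dist.broadcast(w, src=0)
    assert torch.allclose(w, model.module.weight), "weights diverged between ranks"
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```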
The optimizer part works out of the box, yes. But my training usually consumes a YAML file with the parameters rather than argparse arguments; that is the part I had to adjust.
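One way that adjustment can look (a sketch of the idea, not the poster's code; file name, project name, and parameter keys are made up): load the YAML into a dict and connect it as task hyperparameters, so the optimizer can override individual values the same way it would argparse arguments.

```python
from clearml import Task
import yaml

task = Task.init(project_name="examples", task_name="train-from-yaml")  # hypothetical names

with open("train_config.yaml") as f:  # hypothetical YAML file, e.g. {"lr": 0.001, "batch_size": 32}
    params = yaml.safe_load(f)

# register the dict as hyperparameters; when an agent / optimizer runs the task,
# connect() hands back the dict with the overridden values filled in
params = task.connect(params, name="General")

# pass `params` to the existing config parser / training code unchanged
print(params["lr"], params["batch_size"])
```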
Another workaround that did the trick for me was to fix the version of urllib3 in your requirements.txt: urllib3==1.26.15