Reputation
Badges 1
3 × Eureka!ClearML is usually just moving the execution down to the nodes. I'm unsure what role ClearML is playing in your issue
IMHO ClearML would just start the execution on multiple hosts. Keep in mind that the hosts need to be on the same LAN and have a very high bandwidth.
What you are looking for is called "DistributedDataParallel". Maybe this tutorial gives you a starting point:
None
I would recommend you start getting familiar with the distributed training modes (for example DDP in PyTorch). There are some important concepts that are required to train multi-GPU and multi-devices.
Before you start with a sophisticated model, I'd recommend to try this training setup with a baseline model, check that data, gradients, weights, metrics, etc. are synced correctly.
Another workaround that did the trick for me was to fix the version of urllib3 in your requirements.txturllib3==1.26.15
Yes that makes sense. I solved it by actively reading them via Task.parameters. Now that works, I just had to adjust the config parser a bit
The optimizer part works out of the box, yes. But my training is usually consuming a yaml file with parameters and not via argparse. This is the part I had to adjust
Awesome! Thanks for the quick help!