@<1523701205467926528:profile|AgitatedDove14> We want to use Ray for distributed training, where multiple nodes will be running Ray and ClearML and training the model, with one node acting as the controller, similar to torch distributed training
Hi @<1658281093108862976:profile|EncouragingPenguin15>
Should work, I'm assuming multiple nodes are running agents? Or are you saying Ray spins up the jobs and ClearML logs them?
Should work out of the box; the only thing to notice is that you will get a separate Task for every local_rank 0 process (i.e. one per node)
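To make that concrete, here is a minimal sketch of that pattern with Ray Train. It assumes a recent Ray (2.7+, where `ray.train.get_context()` is available) and a working clearml setup; the project/task names are placeholders, and the Task gating is shown explicitly here for illustration:
```python
# Minimal sketch, assuming Ray >= 2.7 and a configured clearml install;
# project/task names are placeholders.
from clearml import Task
from ray.train import ScalingConfig, get_context
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    ctx = get_context()
    # Only the local_rank 0 process on each node calls Task.init,
    # which is why you end up with one ClearML Task per node.
    if ctx.get_local_rank() == 0:
        task = Task.init(
            project_name="ray-distributed-demo",  # placeholder
            task_name=f"node-{ctx.get_node_rank()}",
        )
    # ... regular torch training code goes here ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```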
Does that make sense?
We're using Ray for hyperparameter search for a non-CV model successfully on ClearML
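For reference, a minimal sketch of that kind of Ray Tune + ClearML combination; the search space, metric, and project/task names below are illustrative only, not the actual setup:
```python
# Minimal sketch of Ray Tune trials logging to ClearML; search space,
# metric, and names are made up for illustration.
from clearml import Task
from ray import tune


def objective(config):
    # Each Tune trial runs in its own process, so Task.init here
    # creates one ClearML Task per trial.
    task = Task.init(
        project_name="ray-tune-demo",  # placeholder
        task_name=f"trial-lr-{config['lr']:.5f}",
    )
    task.connect(config)  # record the sampled hyperparameters
    score = 1.0 - abs(config["lr"] - 0.01)  # stand-in for real training
    task.close()
    return {"score": score}  # returned dict is reported as the trial result


tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="score", mode="max", num_samples=8),
)
results = tuner.fit()
print(results.get_best_result().config)
```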