Let me describe our situation in more detail. Our company is considering running the main ClearML server on a single node and ClearML Agents on our other GPU servers. In that case, can we use ClearML Agent scheduling for multi-node, multi-GPU distributed training? So far, the ClearML documentation seems to cover only single-node execution with ClearML Agent. Basically, it automatically schedules jobs onto unoccupied resources, but it doesn't support multi-node distributed training with scheduling and orchestration from ClearML Agent, right?
I mean, is there any integration with Horovod or another multi-node distributed training framework?
Hi DiminutiveBaldeagle77,
Yes - https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_kubernetes_helm/ If you already have a K8s cluster, it is beneficial, since you get scheduling capabilities that are not normally present in K8s.
Hi CostlyOstrich36. I'm Steve, and I work with Ivan. Our company has several GPU servers. For example, there are 4 GPU server nodes, each with two RTX 3090 GPUs, so 8 GPUs in total. We are wondering how to train a single machine learning model using all 8 GPUs across the different nodes. Does ClearML support this functionality? If so, where can I find the related documentation?
Thanks.
CostlyOstrich36 I appreciate your answer! However, I did not fully understand it. I think my question was not clear enough.
As far as I understand, ClearML on K8s is advantageous for cluster management and scheduling, right?
Then, for a multi-host cluster setup, does ClearML not support a native distributed mode?