Does ClearML Have The Ability To Run A Single Experiment Across Multiple Nodes/GPUs In A K8s Cluster?
It seems like each task is set up to run on a single pod/node, based on attributes like gpu memory, os, num of cores, and worker.
BoredHedgehog47 of course, you can scale across multiple nodes.
The way to do that is to create a k8s YAML with replicas, where each pod runs the exact same code with the exact same setup. Note that inside the code itself, the DL framework processes need to be able to communicate with one another, and by design only the "master" process does all the reporting.
That said, from the ClearML perspective you will see a single Task.
I'm not sure you will be able to see the WORLD_SIZE value in the Info, but at least in theory you should.
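To make the "same code in every pod, only the master reports" idea concrete, here is a minimal sketch in plain Python. It assumes the launcher exports RANK and WORLD_SIZE as environment variables (PyTorch's torchrun does this; with bare k8s replicas you would have to set them yourself). The names DistributedContext, read_context, and report are hypothetical helpers made up for illustration and are not part of ClearML.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributedContext:
    """Position of this process within the distributed job."""
    rank: int
    world_size: int

    @property
    def is_master(self) -> bool:
        # Convention assumed here: rank 0 is the "master" process.
        return self.rank == 0


def read_context(env=None) -> DistributedContext:
    """Read RANK / WORLD_SIZE, defaulting to a single-process run."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return DistributedContext(rank=rank, world_size=world_size)


def report(ctx: DistributedContext, metrics: dict) -> bool:
    """Emit metrics from the master only; return True if something was reported."""
    if not ctx.is_master:
        return False
    # Placeholder for real experiment reporting (assumption: a print is enough here).
    print(f"[rank {ctx.rank} of {ctx.world_size}] {metrics}")
    return True


if __name__ == "__main__":
    # Every replica runs this same entry point; only rank 0 prints.
    context = read_context()
    report(context, {"loss": 0.0})
```

In a real job, the print call would be replaced by whatever reporting your setup uses, and the worker replicas would simply skip it, which is why a single Task shows up on the ClearML side.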