Hi BoredHedgehog47, I think there was some confusion here - you can run on a pod with multiple GPUs, but you can't run a single task on multiple nodes
Okay, so basically the DL framework manages the master/worker relationship. I just need to use pod replicas for my k8s agents.
Hi BoredHedgehog47 , yes it can. You would obviously need to set it up first 🙂
Maybe SuccessfulKoala55 might have more insight on setting up the K8s integration 🙂
AgitatedDove14 How do I set up a master task to do all the reporting?
When I click on a task's details -> Info tab, it seems like each task is set up to run on a single pod/node, based on attributes like gpu memory, os, num of cores, worker
Actually, this is the default for any multi-node training framework (torch DDP, OpenMPI, etc.)
SuccessfulKoala55 Darn, so I can only scale vertically?
it seems like each task is setup to run on a single pod/node based on the attributes like gpu memory, os, num of cores, worker
BoredHedgehog47 of course you can scale across multiple nodes.
The way to do that is to create a k8s YAML with replicas. Each pod runs the exact same code with the exact same setup. Note that inside the code itself, the DL framework needs to be able to communicate across the pods, and by definition only the "master" process does all the reporting.
That said, from the ClearML perspective you are seeing a single Task
I'm not sure you will be able to see the WORLD_SIZE value in the Info tab, but at least in theory you should
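To make the "only the master reports" idea above a bit more concrete, here is a minimal sketch. It assumes the launcher exports torchrun-style `RANK` and `WORLD_SIZE` environment variables to every replica; the helper names (`read_dist_env`, `is_master`, `report`) are hypothetical, and `print` just stands in for whatever reporting you actually use.

```python
import os


def read_dist_env(env=None):
    """Read this replica's rank and the world size.

    Assumption: the launcher sets RANK and WORLD_SIZE (torchrun-style names).
    Falls back to a single-process setup when they are missing.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return rank, world_size


def is_master(rank):
    """By convention, rank 0 is treated as the master process."""
    return rank == 0


def report(rank, message):
    """Emit a message only from the master; other replicas stay silent.

    Returns True if the message was emitted, False otherwise.
    """
    if is_master(rank):
        print(f"[master] {message}")
        return True
    return False


if __name__ == "__main__":
    # Every pod replica runs this same script; only rank 0 prints.
    rank, world_size = read_dist_env()
    report(rank, f"training with world_size={world_size}")
```

Every replica runs the same script, so the rank check is the only thing that decides which pod does the reporting.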
AWS, I've set up the shared memory between k8s nodes