Answered
Does ClearML have the ability to run a single experiment across multiple nodes/GPUs in a k8 cluster?

Does ClearML have the ability to run a single experiment across multiple nodes/GPUs in a k8 cluster?

  
  
Posted one year ago

Answers 14


Okay, so basically the DL framework manages the master/worker relationship. I just need to use pod replicas for my k8 agents.

  
  
Posted one year ago

When I open a task's details -> Info tab, it seems like each task is set up to run on a single pod/node, based on attributes like gpu memory, os, num of cores, and worker.

  
  
Posted one year ago

Hi BoredHedgehog47, yes it can. You would obviously need to set it up first 🙂

  
  
Posted one year ago

Is your K8s on-prem or over cloud?

  
  
Posted one year ago

"it seems like each task is setup to run on a single pod/node based on the attributes like gpu memory, os, num of cores, worker"

BoredHedgehog47 of course you can scale across multiple nodes.
The way to do that is to create a k8s YAML with replicas. Each pod runs the exact same code with the exact same setup. Note that inside the code itself the DL framework processes need to be able to communicate with one another, and by definition only the "master" one does all the reporting.
That said, from the ClearML perspective you are seeing a single Task.
I'm not sure you will be able to see the WORLD_SIZE value in the Info tab, but at least in theory you should.
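
To make the "only the master reports" idea concrete, here is a minimal sketch, not ClearML's own implementation. It assumes the launcher exports RANK and WORLD_SIZE as environment variables (torchrun does this for torch DDP); the helper names and the in-memory `sink` list are hypothetical stand-ins for whatever logger you actually use.

```python
import os
from typing import List, Mapping, NamedTuple, Tuple


class DistEnv(NamedTuple):
    rank: int        # index of this process among all replicas
    world_size: int  # total number of processes in the job


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read rank and world size; fall back to a single-process job."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def is_reporting_process(dist: DistEnv) -> bool:
    """By convention only rank 0 (the "master") reports."""
    return dist.rank == 0


def report_scalar(
    dist: DistEnv, name: str, value: float, sink: List[Tuple[str, float]]
) -> None:
    """Record a metric only on the master; other ranks do nothing.

    `sink` is a hypothetical placeholder for a real logger.
    """
    if is_reporting_process(dist):
        sink.append((name, value))


if __name__ == "__main__":
    dist = read_dist_env()
    metrics: List[Tuple[str, float]] = []
    report_scalar(dist, "loss", 0.42, metrics)
    print(f"rank {dist.rank} of {dist.world_size} recorded {len(metrics)} metric(s)")
```

Every replica runs this same script; the ranks differ only in the environment they are launched with, which is why a single Task shows up on the ClearML side.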

  
  
Posted one year ago

Hi BoredHedgehog47, I think there was some confusion here: you can run on a pod with multiple GPUs, but you can't run a single task on multiple nodes

  
  
Posted one year ago

Actually, this is the default behavior for any multi-node training framework (torch DDP, OpenMPI, etc.).

  
  
Posted one year ago

Maybe SuccessfulKoala55 might have more insight on setting K8s integration 🙂

  
  
Posted one year ago

AgitatedDove14 How do I set up a master task to do all the reporting?

  
  
Posted one year ago

As they are singular not plural

  
  
Posted one year ago

In addition to an EFS mount

  
  
Posted one year ago

Exactly!

  
  
Posted one year ago

AWS. I've set up the shared memory between k8 nodes

  
  
Posted one year ago

SuccessfulKoala55 Darn, so I can only scale vertically?

  
  
Posted one year ago
680 Views