Answered
Does ClearML Have The Ability To Run A Single Experiment Across Multiple Nodes/GPUs In A K8s Cluster?

Does ClearML have the ability to run a single experiment across multiple nodes/GPUs in a k8 cluster?

  
  
Posted 2 years ago

Answers 14


Is your K8s on-prem or over cloud?

  
  
Posted 2 years ago

AgitatedDove14 How do I set up a master task to do all the reporting?

  
  
Posted 2 years ago

Okay, so basically the DL framework manages the master/worker relationship. I just need to use pod replicas for my k8 agents.

  
  
Posted 2 years ago

Maybe SuccessfulKoala55 has more insight on setting up the K8s integration 🙂

  
  
Posted 2 years ago

"it seems like each task is set up to run on a single pod/node based on the attributes like gpu memory, os, num of cores, worker"

BoredHedgehog47 of course you can scale across multiple nodes.
The way to do that is to create a K8s YAML with replicas. Each pod runs the exact same code with the exact same setup. Note that inside the code itself the DL framework needs to be able to communicate across the pods, and by definition only the "master" one does all the reporting.
That said, from the ClearML perspective you are seeing a single Task.
I'm not sure you will be able to see the WORLD_SIZE value in the Info tab, but at least in theory you should.
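To illustrate the "only the master reports" idea, here is a minimal sketch in plain Python. It assumes the replicas are started by a launcher such as torchrun, which exports the RANK and WORLD_SIZE environment variables to every process. The names ReplicaInfo, read_replica_info and report_scalar are hypothetical helpers made up for this example; they are not part of ClearML or PyTorch, and the print call only stands in for whatever reporting call you actually use.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ReplicaInfo:
    """Hypothetical container for the identity of one replica."""
    rank: int        # global index of this process across all pods
    world_size: int  # total number of processes in the job

    @property
    def is_master(self) -> bool:
        # By convention, rank 0 is the "master" that does the reporting.
        return self.rank == 0


def read_replica_info(env: Optional[Mapping[str, str]] = None) -> ReplicaInfo:
    """Read RANK / WORLD_SIZE; fall back to a single-process job if unset."""
    env = os.environ if env is None else env
    return ReplicaInfo(
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def report_scalar(info: ReplicaInfo, name: str, value: float) -> Optional[str]:
    """Report a metric from the master only; other replicas stay silent."""
    if not info.is_master:
        return None
    line = f"[rank 0/{info.world_size}] {name}={value:.4f}"
    print(line)  # placeholder for the real reporting call
    return line
```

Every pod runs this same code; only the process whose RANK is 0 produces output, which matches the single Task you end up seeing.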

  
  
Posted 2 years ago

Hi BoredHedgehog47 , I think there was some confusion here - you can run on a pod with multiple GPUs, but you can't run a single task on multiple nodes

  
  
Posted 2 years ago

In addition to an EFS mount

  
  
Posted 2 years ago

When I click on a task's details -> Info tab, it seems like each task is set up to run on a single pod/node, based on attributes like gpu memory, os, num of cores, worker

  
  
Posted 2 years ago

AWS, I've set up the shared memory between K8s nodes

  
  
Posted 2 years ago

Actually, this is the default behavior for any multi-node training framework (torch DDP, OpenMPI, etc.).
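As a rough illustration of that shared convention: torchrun exports RANK and WORLD_SIZE to each process, and Open MPI's mpirun exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE. The sketch below checks both pairs. detect_rank is a hypothetical helper written for this example, not an API of either framework, and the single-process fallback is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple

# (rank variable, world-size variable) pairs, checked in order.
_RANK_VARS = (
    ("RANK", "WORLD_SIZE"),                            # torchrun / torch DDP
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI mpirun
)


def detect_rank(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Return (rank, world_size) from the first launcher whose vars are set."""
    env = os.environ if env is None else env
    for rank_var, size_var in _RANK_VARS:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    # Assumption: no launcher variables means a plain single-process run.
    return 0, 1
```

In both cases the process with rank 0 is the one that would act as the "master".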

  
  
Posted 2 years ago

SuccessfulKoala55 Darn, so I can only scale vertically?

  
  
Posted 2 years ago

Exactly!

  
  
Posted 2 years ago

As they are singular not plural

  
  
Posted 2 years ago

Hi BoredHedgehog47 , yes it can. You would obviously need to set it up first 🙂

  
  
Posted 2 years ago