Answered
Hi all,
I have a question regarding multi-node training using the clearml-agent. What is the recommended setup in this case? Say I have 3 nodes with 3 agents running on them. How do I make sure they all run the same job?

  
  
Posted 2 years ago

Answers 11


The problem is not really getting the agents to wait (that is easily solved with an additional high-priority queue). The problem is whether you will have a "free" agent... you see my point?

  
  
Posted 2 years ago

So in a simple "all-or-nothing"

Actually this is the only solution unless preemption is supported, i.e. aborting a running Task to free up an agent...
There is no "magic" solution for complex multi-node scheduling, even SLURM will essentially do the same ...

  
  
Posted 2 years ago

Let's start with a simple setup: multi-node DDP in PyTorch.

  
  
Posted 2 years ago

pytorch DDP

with what backend ? gloo ? nvcc ? openmpi ?

  
  
Posted 2 years ago

So in theory you can clone yourself 2 extra times and push the clones into an execution queue, but the issue might be actually making sure the resources are available. What did you have in mind?

  
  
Posted 2 years ago

Hi ExcitedFish86
Good question. How do you "connect" the 3 nodes? (i.e. what framework are you using?)

  
  
Posted 2 years ago

I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single node tasks (high priority + wait) or multi-node tasks (wait for a time when there are enough available agents and only then allocate the resource).

I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter- and intra- node
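
For the inter-node case, a minimal sketch of the per-node settings, assuming one training process per node and PyTorch's default `env://` rendezvous (which reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK`). The helper name `ddp_env_for_node` and the address `10.0.0.1` are made up for illustration:

```python
def ddp_env_for_node(node_rank, num_nodes, master_addr, master_port=29500):
    """Build the environment variables one node's process would need.

    Assumes a single process per node, so the global rank equals the node rank.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    return {
        "MASTER_ADDR": master_addr,       # node 0 acts as the rendezvous point
        "MASTER_PORT": str(master_port),  # any free port reachable from all nodes
        "WORLD_SIZE": str(num_nodes),
        "RANK": str(node_rank),
    }


# Hypothetical 3-node setup; 10.0.0.1 stands in for node 0's address.
envs = [ddp_env_for_node(rank, 3, "10.0.0.1") for rank in range(3)]
```

With these variables exported, each process would call `torch.distributed.init_process_group(backend="nccl")` before wrapping the model in DDP.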

  
  
Posted 2 years ago

I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived.
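
To make the barrier idea concrete, here is a rough single-machine simulation using Python's `threading.Barrier`, with threads standing in for agents. It is not ClearML code; `run_gang` and its parameters are hypothetical, and a real multi-node version would need the counter to live somewhere shared between machines:

```python
import threading


def run_gang(required, arriving, timeout=5.0):
    """Simulate `arriving` agents waiting at a barrier sized for `required`.

    Returns the sorted ids of the agents that got past the barrier.
    """
    if arriving > required:
        raise ValueError("this sketch handles a single gang only")
    barrier = threading.Barrier(required)
    started = []
    lock = threading.Lock()

    def agent(agent_id):
        try:
            # Block until `required` agents have arrived, or give up on timeout.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            return  # not enough agents showed up, so nobody starts
        with lock:
            started.append(agent_id)

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(arriving)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(started)


all_present = run_gang(required=3, arriving=3)
one_missing = run_gang(required=3, arriving=2, timeout=0.2)
```

When only two of the three agents arrive, the timeout breaks the barrier and neither of them starts, which is the "all-or-nothing" behaviour discussed in this thread.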

  
  
Posted 2 years ago

not really... what do you mean by "free" agent?

  
  
Posted 2 years ago

nvcc

  
  
Posted 2 years ago

available agent, i.e. not running anything else.
I mean how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words, what happens if all the nodes/agents are working and we still "need" an additional instance?
This is basically like "pre-allocating" the nodes, only they wait in real-time until the additional node joins them.
Agent A pulls the 3-node Task. The Task clones itself (Task B) and enqueues the clone on a "very high priority" queue.
Task A waits until Task B is running.
Agent B picks up Task B and starts running it.
Task A "talks" to Task B.
This is the equivalent of "allocating 2 agents" (basically you have to hold on to one and wait for the other to become available).
BTW: Is nvcc multi Node or multi GPU ? (I thought it is a single node multi-gpu)
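
The clone-and-wait flow above could be sketched roughly as follows. This is only an illustration of the control flow: `gather_peers`, `clone_and_enqueue` and `is_running` are hypothetical names, and the two callables are injected so the sketch stays independent of any SDK. With ClearML they would presumably wrap `Task.clone` / `Task.enqueue` and a status check on the cloned task:

```python
import time


def gather_peers(clone_and_enqueue, is_running, num_nodes,
                 poll_interval=1, timeout=600, sleep=time.sleep):
    """Launch num_nodes - 1 clones and block until all of them are running.

    clone_and_enqueue(rank) -> peer id (should enqueue on the high-priority queue)
    is_running(peer_id)     -> True once an agent has picked the clone up
    Returns the peer ids, or raises TimeoutError if the gang never forms.
    """
    peer_ids = [clone_and_enqueue(rank) for rank in range(1, num_nodes)]
    waited = 0
    while True:
        if all(is_running(peer) for peer in peer_ids):
            return peer_ids
        if waited >= timeout:
            # The caller should abort the clones here to release the agents.
            raise TimeoutError("peers did not start within %s seconds" % timeout)
        sleep(poll_interval)
        waited += poll_interval
```

The timeout matters: without it, Task A would hold its agent forever when no other agent becomes free, which is the starvation concern raised in this thread.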

  
  
Posted 2 years ago