
Hi all,
I have a question regarding multi-node training using the clearml-agent. What is the recommended setup in this case? Say I have 3 nodes with 3 agents running on them. How do I make sure they all run the same job?

  
  
Posted 2 years ago

Answers 11


Let's start with a simple setup: multi-node DDP in PyTorch.
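For reference, a minimal sketch of what the per-node entry point could look like (the script contents are placeholders, and the assumption is that every node launches the same script, e.g. via torchrun with a shared rendezvous endpoint):

    # launched the same way on every node, e.g.:
    #   torchrun --nnodes=3 --nproc_per_node=1 \
    #            --rdzv_backend=c10d --rdzv_endpoint=<node0-ip>:29500 train_ddp.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # placeholder model; gradients are synchronized across all 3 nodes
        model = DDP(torch.nn.Linear(8, 8).cuda(local_rank), device_ids=[local_rank])
        # ... training loop goes here ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()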

  
  
Posted 2 years ago

Hi ExcitedFish86
Good question, how do you "connect" the 3 nodes? (i.e. what framework are you using?)

  
  
Posted 2 years ago

The problem is not really for the agents to wait (this is easily solved by an additional high-priority queue); the problem is whether you will have a "free" agent... you see my point?

  
  
Posted 2 years ago

nvcc

  
  
Posted 2 years ago

An available agent, i.e. one that is not running anything else.
I mean, how long would instance 1 wait until instance 2 of the experiment is up and running?
In other words, what happens if all the nodes/agents are working and we still "need" an additional instance?
This is basically like "pre-allocating" the nodes, only they wait in real time until the additional node joins them.
Agent A pulls the 3-node Task; the Task clones itself (Task B) and enqueues the clone on the "very high priority queue".
Task A waits until Task B is running; Agent B picks up Task B and starts running it; Task A then "talks" to Task B.
This is the equivalent of "allocating 2 agents" (basically you have to reserve one and wait for the other to become available).
BTW: is nvcc multi-node or multi-GPU? (I thought it was single-node, multi-GPU.)
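A rough sketch of that clone-and-enqueue pattern with the ClearML SDK (the queue name and polling interval are assumptions, not an official recipe):

    import time
    from clearml import Task

    # running inside the "rank 0" Task that Agent A pulled
    task_a = Task.current_task()

    # clone ourselves as Task B and push the clone to a dedicated high-priority queue
    task_b = Task.clone(source_task=task_a, name=task_a.name + " (worker)")
    Task.enqueue(task_b, queue_name="very-high-priority")  # assumed queue name

    # wait until Agent B actually picks up Task B and starts running it
    while task_b.get_status() not in ("in_progress", "completed", "failed"):
        time.sleep(15)

    # from here Task A can "talk" to Task B, e.g. exchange the master address/port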

  
  
Posted 2 years ago

I see what you mean. So in a simple "all-or-nothing" solution I have to choose between potentially starving either the single-node tasks (high priority + wait) or the multi-node tasks (waiting until enough agents are available and only then allocating the resources).

I actually meant NCCL. nvcc is the CUDA compiler 😅
NCCL communication can be both inter-node and intra-node.
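In practice the backend choice is a one-liner; a small sketch (assuming the usual environment variables, e.g. as set by torchrun, with gloo as a CPU-only fallback):

    import torch
    import torch.distributed as dist

    # NCCL handles GPU collectives both within a node (NVLink/PCIe) and
    # across nodes (network); gloo is the usual CPU-only fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")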

  
  
Posted 2 years ago

So in theory you can clone yourself 2 extra times and push the clones into an execution queue, but the issue might actually be making sure the resources are available. What did you have in mind?

  
  
Posted 2 years ago

I thought some sort of gang-scheduling scheme should be implemented on top of the job.
Maybe the agents should somehow go through a barrier with a counter and wait there until enough agents have arrived.
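A hypothetical sketch of such a counting barrier, assuming a Redis instance reachable from all agents (the host, key and world size are placeholders; this is not something ClearML provides out of the box):

    import time
    import redis  # assumes a Redis server reachable from every node

    WORLD_SIZE = 3                      # agents expected for this job
    BARRIER_KEY = "job-1234/arrived"    # hypothetical per-job counter key

    r = redis.Redis(host="scheduler-host", port=6379)

    # each agent registers itself with an atomic increment, then waits
    # until all expected agents have reached the barrier
    my_slot = r.incr(BARRIER_KEY)       # returns 1, 2, 3, ...
    print(f"agent {my_slot}/{WORLD_SIZE} reached the barrier")

    while int(r.get(BARRIER_KEY)) < WORLD_SIZE:
        time.sleep(5)

    # all agents present: safe to start the multi-node job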

  
  
Posted 2 years ago

pytorch DDP

With what backend? gloo? nvcc? openmpi?

  
  
Posted 2 years ago

So in a simple "all-or-nothing"

Actually this is the only solution unless preemption is supported, i.e. aborting a running Task to free up an agent...
There is no "magic" solution for complex multi-node scheduling; even SLURM will essentially do the same...

  
  
Posted 2 years ago

not really... what do you mean by "free" agent?

  
  
Posted 2 years ago