Unanswered
Hi, if I've ClearML agents installed on several servers, each with a single GPU, how can I train a GPT2 model that would require multiple GPUs?


Thanks. The challenge we encountered is that we only expose our Devs to the ClearML queues, so users have no idea what's beyond the queue except that it will offer them the resources associated with the queue. In the backend, each queue is associated with more than one host.

So what we tried is as follows.
We created a train.py script much like what Tobias shared above. In this script, we use the socket library to pull the IP address.

import socket
# Resolve this machine's hostname to an IP address
hostname = socket.gethostname()
ipaddr = socket.gethostbyname(hostname)

The above script is then used to generate a ClearML Task.

Then we create a ClearML pipeline that looks as follows, with all steps created from the same task.

            |-- taskslave1
taskmaster--|-- taskslave2
            |-- taskslave3

The IP address from the master task is expected to be retrieved and passed to the slave tasks as an argument.
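For illustration, the receiving side of that hand-off could look roughly like the sketch below. It assumes the training code uses torch.distributed with the env:// initialization, which reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment. The argument names (--master-addr, --master-port, --rank, --world-size), the default port and the helper configure_distributed_env are hypothetical and not taken from our actual train.py.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the (hypothetical) arguments the pipeline would pass to each task."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--master-addr", default="127.0.0.1")
    parser.add_argument("--master-port", type=int, default=29500)  # arbitrary default
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world-size", type=int, default=1)
    return parser.parse_args(argv)


def configure_distributed_env(args, environ=os.environ):
    """Export the variables that torch.distributed's env:// init method reads."""
    environ["MASTER_ADDR"] = args.master_addr
    environ["MASTER_PORT"] = str(args.master_port)
    environ["RANK"] = str(args.rank)
    environ["WORLD_SIZE"] = str(args.world_size)
    return environ
```

A slave task would call parse_args() and configure_distributed_env(args) before initializing the process group, so the only value that has to travel through the pipeline is the master's address.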

Two problems come up when running the pipeline:

  • Taskmaster is actually waiting to sync with the configured number of nodes, so it's not returning, and in turn the IP address cannot be passed on to the slave nodes.
  • The IP address pulled is actually the Docker container's IP, which cannot be pinged from another host.
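On the second problem, a common alternative to gethostbyname(hostname) is to ask the OS which local address it would use for outbound traffic. The sketch below does that with a UDP socket; the function name get_outbound_ip and the probe address 8.8.8.8 are illustrative choices. Note the assumption: inside a container on Docker's default bridge network this would still return the container's address, so it only helps if the agent's container runs with host networking or the host IP is supplied some other way.

```python
import socket


def get_outbound_ip(probe_host="8.8.8.8", probe_port=80):
    """Return the local IPv4 address the OS would route outbound traffic from.

    Falls back to 127.0.0.1 if no route to probe_host is available.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route,
        # which fixes the local address reported by getsockname().
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()
```

Unlike resolving the hostname, this does not depend on what /etc/hosts inside the container maps the hostname to.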
  
  
Posted one year ago