[Distributed Training] Hi, i have a ClearML setup with K8SGlue that spins up pods of 4 GPUs when picking tasks off the clearml queue. We would now want to proceed with multi-node training, and some of the examples we are trying are here.

We have yet to try this, but my understanding is that all the logs and scalars are consolidated at RANK0, and ClearML simply pulls them from RANK0. My questions are as follows:
1. For the torch.distributed.launch and torchrun examples above, how should we launch the master and each worker via ClearML queues?
2. If we managed to launch the above, how would we know the IP addresses? This information is required a priori, and since ClearML is launching K8s pods, we won't have publicly addressable IP addresses.
3. The same question applies to the mpirun example: how do we do the above with ClearML queues and without knowledge of the IP addresses?
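For context on why the address is needed up front, a typical multi-node torchrun launch looks something like the sketch below. Every node, master and workers alike, must be given the same rendezvous endpoint on its command line before the job starts, which is exactly the a-priori IP requirement in the question. The host, port, and script name here are placeholders, not values from our actual setup:

```python
import shlex

def build_torchrun_cmd(master_addr, master_port, node_rank,
                       nnodes, nproc_per_node, script):
    """Build the torchrun command line for one node of a multi-node job.

    Every node must point at the same --rdzv_endpoint (the master's
    address) at launch time, so the master IP has to be known before
    any worker pod starts.
    """
    return (
        f"torchrun --nnodes={nnodes} "
        f"--nproc_per_node={nproc_per_node} "
        f"--node_rank={node_rank} "
        f"--rdzv_backend=c10d "
        f"--rdzv_endpoint={master_addr}:{master_port} "
        f"{shlex.quote(script)}"
    )

# Hypothetical values for illustration only
print(build_torchrun_cmd("10.0.0.5", 29500, node_rank=0,
                         nnodes=2, nproc_per_node=4,
                         script="train.py"))
```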

Posted 10 months ago


Hi SubstantialElk6 , maybe SuccessfulKoala55 might have more input on this 🙂

Posted 10 months ago