Unanswered
Hi folks! Can someone advise/share examples on how to properly combine Hydra and ClearML for working with hyperparameters and DDP?

Hi folks! Can someone advise/share examples on how to properly combine Hydra and ClearML for working with hyperparameters and DDP? I tried to follow the documentation, but it behaves somewhat strangely: the hyperparameters are passed, but the number of instances launched is the one hard-coded in train.py.
For example, here I would like to train on 4 k8s nodes:

python3 train.py trainer.max_epochs=6 trainer=ddp trainer.devices=1 trainer.num_nodes=4 ++logger.mlflow.tracking_uri=
 +logger.mlflow.experiment_name="debug-exp"

but only 3 nodes are spawned, as written in train.py:

task.launch_multi_node(total_num_nodes=3, port=29500, queue='default', wait=True, addr=None)

As a result, the training hangs indefinitely (it never actually starts) because it waits for the fourth node/instance to be present.
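
To make the mismatch concrete: the command line asks for trainer.num_nodes=4, while launch_multi_node is called with a literal 3. One possible direction, shown here only as an untested sketch, is to read the node count from the config and pass it on. The helper name launch_nodes_from_cfg is hypothetical, and the sketch assumes that cfg is the config object Hydra hands to the training function and that the override is already applied by the time the call is made.

```python
def launch_nodes_from_cfg(task, cfg, port=29500, queue="default"):
    """Launch as many nodes as the config requests (hypothetical helper).

    `task` is assumed to be the ClearML Task used in train.py, and `cfg`
    any mapping-like config, e.g. the one passed to a Hydra entry point.
    """
    # Take the node count from the config instead of hard-coding 3,
    # so that trainer.num_nodes=4 on the command line yields 4 nodes.
    num_nodes = int(cfg["trainer"]["num_nodes"])

    # Same arguments as in the question, except for total_num_nodes.
    return task.launch_multi_node(
        total_num_nodes=num_nodes,
        port=port,
        queue=queue,
        wait=True,
        addr=None,
    )
```

With this, the number of spawned instances and the number the trainer waits for come from the same value, so they cannot drift apart. Whether the override is visible at that point in a remotely executed task is something I have not verified.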

I would appreciate any help!

P.S. The other hyperparameters, such as the number of epochs, are always applied as specified; e.g. trainer.max_epochs=6 runs the training for 6 epochs.
Thanks in advance!

  
  
Posted 23 days ago