Unanswered
[ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With


Results of a bit more investigation:

The ClearML example does use the PyTorch torch.distributed package, but none of the DistributedDataParallel functionality; instead, it reduces gradients “manually”. The script is also not prepared for torchrun, since it launches additional processes itself (without using the multiprocessing of Python or PyTorch).
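For readers unfamiliar with the distinction, here is a minimal sketch of what “manual” gradient reduction with torch.distributed can look like, as opposed to wrapping the model in DistributedDataParallel. This is not the ClearML example's code: the names average_gradients, train_step and run_single_process_demo are hypothetical, and the single-process gloo setup with a file-based rendezvous is only an assumption so that the snippet runs on one CPU machine.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


def average_gradients(model: nn.Module) -> None:
    """Sum each gradient across all ranks, then divide by the world size."""
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is None:
            continue
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size


def train_step(model, optimizer, inputs, targets) -> float:
    """One optimisation step with an explicit gradient all-reduce."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # DistributedDataParallel would synchronise gradients during backward();
    # without it, the reduction has to be called by hand before the update.
    average_gradients(model)
    optimizer.step()
    return loss.item()


def run_single_process_demo() -> float:
    """Hypothetical demo: world size 1, gloo backend, file-based rendezvous."""
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        torch.manual_seed(0)
        model = nn.Linear(4, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
        return train_step(model, optimizer, inputs, targets)
    finally:
        dist.destroy_process_group()
```

With more than one rank, every process would call train_step on its own shard of the data, and average_gradients would leave all ranks with identical gradients before the optimizer update.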

When I run a simple example (code attached below) that uploads artifacts to S3 and launches processes via torch.multiprocessing, the training hangs at the end. Any idea where to investigate further?

ClearML Task: created new task id=f070414bfb84402baa597a0167d1a21e
2023-01-26 17:34:22,564 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: 

Running basic DDP on rank 2.
Running basic DDP on rank 0.
Running basic DDP on rank 1.
saving...
2023-01-26 17:34:35,507 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:35,510 - clearml.Task - INFO - Waiting to finish uploads
saved
2023-01-26 17:34:37,042 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_olqpu7no.tmp => glass-clearml/Glass-ClearML Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:37,048 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:37,550 - clearml.Task - INFO - Completed model upload to 
 Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:44,129 - clearml.Task - INFO - Finished uploading
2023-01-26 17:34:45,926 - clearml.Task - INFO - Finished uploading
  
  
Posted one year ago