Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
[Clearml With Pytorch-Based Distributed Training} Hi Everyone! Is The Combination Of Clearml With

[ClearML with Pytorch-based distributed training}
Hi everyone! Is the combination of ClearML with torch.distributed.launch or torchrun actively supported? A brief search in this channel for “distributed” showed some threads but wasn’t clear on whether this completely works or not.

What I’ve discovered is that the training gets somehow stuck when I have it create a ClearML Task (event if CLEARML_OFFLINE_MODE=True) is set. I can show some outputs but I couldn’t derive much from that.

Also, I’ve seen that the https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_distributed_example.py does not seem to use torchrun or distributed.launch but rather starts a set of sub-processes itself - is that the intended way to get distributed working with ClearML?

  
  
Posted one year ago
Votes Newest

Answers 10


When running on our bigger research repository which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a Killed message for the main process (I do not abort the main process manually):

2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:22,133 - clearml.storage - INFO - Uploading: 5.02MB / 18.77MB @ 1.69MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/manuel/venv/real-esr/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
2023-01-26 17:37:24,405 - clearml.model - INFO - Selected model id: 31f67a1ac95643d4aa12af9eb52ed032
2023-01-26 17:37:25,318 - clearml.storage - INFO - Uploading: 10.02MB / 18.77MB @ 1.57MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Loading model from: /home/manuel/venv/real-esr/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-01-26 17:37:25,832 - clearml.model - INFO - Selected model id: 108e1a350bf1457da94f408cde9cfd82
2023-01-26 17:37:27,589 - clearml.storage - INFO - Uploading: 15.02MB / 18.77MB @ 2.20MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
2023-01-26 17:37:30,226 - clearml.Task - INFO - Completed model upload to 
 Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:57,508 INFO: Validation validation
	 # ssim: 0.1691	Best: 0.1691 @ 11 iter
	 # lpips: 0.7296	Best: 0.7296 @ 11 iter

2023-01-26 17:38:39,719 INFO: Validation train-val
	 # ssim: 0.1691	Best: 0.1691 @ 11 iter
	 # lpips: 0.7296	Best: 0.7296 @ 11 iter



2023-01-26 17:38:56,935 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
Killed
  
  
Posted one year ago

Sorry that these issues go quite deep and chaotic - we would appreciate any help or ideas you can think of!

  
  
Posted one year ago

It should actually work the same, if you find out it fails to properly register let me know (and then I guess a github issue is the next step)

  
  
Posted one year ago

Results of a bit more investigation:

The ClearML example does use the Pytorch dist package but none of the DistributedDataParallel functionality, instead, it reduces gradients “manually”. This script is also not prepared for torchrun as it launches more processes itself (w/o using the multiprocessing of Python or Pytorch.)

When running a simple example (code attached below) that includes artifact uploads to S3 and launching processes via torch.multiprocessing , the training hangs at the end - any idea where to investigate closer?

ClearML Task: created new task id=f070414bfb84402baa597a0167d1a21e
2023-01-26 17:34:22,564 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: 

Running basic DDP on rank 2.
Running basic DDP on rank 0.
Running basic DDP on rank 1.
saving...
2023-01-26 17:34:35,507 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:35,510 - clearml.Task - INFO - Waiting to finish uploads
saved
2023-01-26 17:34:37,042 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_olqpu7no.tmp => glass-clearml/Glass-ClearML Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:37,048 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:37,550 - clearml.Task - INFO - Completed model upload to 
 Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:44,129 - clearml.Task - INFO - Finished uploading
2023-01-26 17:34:45,926 - clearml.Task - INFO - Finished uploading
  
  
Posted one year ago

Hi @<1523701205467926528:profile|AgitatedDove14> , so I’ve managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run then multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated but the task of each rank reports its own.

If however I branch out via torch.multiprocessing like below, everything works as expected. The “script path” just shows the single python script, all logs and scalars from all ranks are aggregated into a single task.

    task = Task.init("Glass-ClearML Demo", "Distributed basic mp.spawn, simple model, v3")
    n_gpus = torch.cuda.device_count()
    world_size = n_gpus
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)

All code is taken from the Pytorch tutorial , I just add a ClearML Task into it as shown above.
ClearML version is 1.7.1
image

  
  
Posted one year ago

AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ... ? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distributed launches?

  
  
Posted one year ago

So my own repo I’m launching with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m http://my_folder.my _script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m http://my_folder.my _script --some_option

  
  
Posted one year ago

Hi ScantChimpanzee51
How are you launching the code ?
Basically the easiest way is to do so with the example you just mentioned,
Can this issue be reproduced ?

  
  
Posted one year ago

Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported rather than the (now deprecated but still usable) torch.distributed.launch ?

  
  
Posted one year ago

Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments?Yes it is suported, and should work
If so, would it be started via python ... or via torchrun ... ?Yes it should, hence the request for a code snippet to reproduce the issue you are experiencing
What about remote runs, how will they support the parallel execution?Supported, You should see in the "script entry" something like "-m -m torch.distributed.launch --nproc_per_node 2 ..."

To go even deeper, what about the machines started via ClearML Autoscaler?

Should work out of the box, this is considered a single Job/Task no need to spin multiple agents for that

  
  
Posted one year ago