When running on our bigger research repository, which includes saving checkpoints and uploading them to S3, the training ends with the errors shown below and a Killed message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:22,133 - clearml.storage - INFO - Uploading: 5.02MB / 18.77MB @ 1.69MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/manuel/venv/real-esr/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
2023-01-26 17:37:24,405 - clearml.model - INFO - Selected model id: 31f67a1ac95643d4aa12af9eb52ed032
2023-01-26 17:37:25,318 - clearml.storage - INFO - Uploading: 10.02MB / 18.77MB @ 1.57MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Loading model from: /home/manuel/venv/real-esr/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-01-26 17:37:25,832 - clearml.model - INFO - Selected model id: 108e1a350bf1457da94f408cde9cfd82
2023-01-26 17:37:27,589 - clearml.storage - INFO - Uploading: 15.02MB / 18.77MB @ 2.20MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
2023-01-26 17:37:30,226 - clearml.Task - INFO - Completed model upload to glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:57,508 INFO: Validation validation
# ssim: 0.1691 Best: 0.1691 @ 11 iter
# lpips: 0.7296 Best: 0.7296 @ 11 iter
2023-01-26 17:38:39,719 INFO: Validation train-val
# ssim: 0.1691 Best: 0.1691 @ 11 iter
# lpips: 0.7296 Best: 0.7296 @ 11 iter
2023-01-26 17:38:56,935 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
Killed
Results of a bit more investigation:
The ClearML example does use the PyTorch dist package, but none of the DistributedDataParallel functionality; instead, it reduces gradients "manually". This script is also not prepared for torchrun, since it launches additional processes itself (without using the multiprocessing facilities of Python or PyTorch).
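To illustrate the distinction, here is a minimal sketch (my paraphrase, not the actual code of the ClearML example) of "manual" gradient reduction with dist versus letting DistributedDataParallel synchronize gradients itself:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def average_gradients_manually(model: torch.nn.Module) -> None:
    # Manual approach: after backward(), all-reduce every gradient
    # across ranks and divide by the world size.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# What DDP does instead: wrapping the model makes backward() perform the
# same all-reduce automatically, so no manual averaging step is needed:
# ddp_model = DDP(model, device_ids=[rank])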
When running a simple example (code attached below) that includes artifact uploads to S3 and launches processes via torch.multiprocessing, the training hangs at the end. Any idea where to investigate further?
ClearML Task: created new task id=f070414bfb84402baa597a0167d1a21e
2023-01-26 17:34:22,564 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page:
Running basic DDP on rank 2.
Running basic DDP on rank 0.
Running basic DDP on rank 1.
saving...
2023-01-26 17:34:35,507 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:35,510 - clearml.Task - INFO - Waiting to finish uploads
saved
2023-01-26 17:34:37,042 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_olqpu7no.tmp => glass-clearml/Glass-ClearML Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:37,048 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:37,550 - clearml.Task - INFO - Completed model upload to glass-clearml/Glass-ClearML Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:44,129 - clearml.Task - INFO - Finished uploading
2023-01-26 17:34:45,926 - clearml.Task - INFO - Finished uploading
AgitatedDove14, maybe to come at this from a broader angle:
Is ClearML combined with DistributedDataParallel officially supported, i.e. should it work without many adjustments? If so, would it be started via python ... or via torchrun ...? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via the ClearML Autoscaler? Can they run multiple agents and/or start remote distributed launches?
So I'm launching my own repo with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my
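For context, both launchers spawn one worker process per GPU and pass the rendezvous information through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). A minimal sketch of a compatible entry point (hypothetical, not the actual contents of my repo):

import os
import torch.distributed as dist

def main():
    # With no arguments, init_process_group uses the env:// rendezvous,
    # i.e. it reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, which
    # torchrun (and recent versions of torch.distributed.launch) set for
    # every worker it spawns.
    dist.init_process_group(backend="gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"worker up: rank={dist.get_rank()} local_rank={local_rank} "
          f"world_size={dist.get_world_size()}")
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()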
Is ClearML combined with DistributedDataParallel officially supported / should that work without many adjustments?
Yes, it is supported and should work.
If so, would it be started via python ... or via torchrun ...?
Yes it should, hence the request for a code snippet to reproduce the issue you are experiencing.
What about remote runs, how will they support the parallel execution?
Supported. You should see in the "script entry" something like "-m torch.distributed.launch --nproc_per_node 2 ...".
To go even deeper, what about the machines started via ClearML Autoscaler?
Should work out of the box; this is considered a single Job/Task, so there is no need to spin up multiple agents for it.
Hi AgitatedDove14, so I’ve managed to reproduce a bit more.
When I run very basic code via torch.distributed.run, multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated; instead, the task of each rank reports its own.
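As an aside, one workaround I can think of (my own assumption, not an official ClearML recipe) would be to guard Task.init so that only rank 0 creates the task, since under torch.distributed.run every rank executes the whole script:

import os
from clearml import Task

# Hypothetical guard: torchrun sets RANK for every worker, so only let
# the rank-0 process create the ClearML task.
task = None
if int(os.environ.get("RANK", "0")) == 0:
    task = Task.init("Glass-ClearML Demo", "Distributed basic torchrun")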
If, however, I branch out via torch.multiprocessing as below, everything works as expected: the “script path” just shows the single Python script, and all logs and scalars from all ranks are aggregated into a single task.
task = Task.init("Glass-ClearML Demo", "Distributed basic mp.spawn, simple model, v3")
n_gpus = torch.cuda.device_count()
world_size = n_gpus
mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
All code is taken from the PyTorch DDP tutorial; I just add a ClearML Task to it as shown above.
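For completeness, here is a self-contained sketch of roughly what the full script looks like (reconstructed from the PyTorch DDP tutorial plus the Task.init call above; the S3 output_uri is a placeholder, not necessarily the real configuration):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from clearml import Task

class ToyModel(nn.Module):
    # The toy model from the PyTorch DDP tutorial.
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    print(f"Running basic DDP on rank {rank}.")
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss_fn(outputs, torch.randn(20, 5).to(rank)).backward()
    optimizer.step()

    if rank == 0:
        print("saving...")
        # ClearML hooks torch.save and uploads the checkpoint to the
        # task's output_uri, matching the log output above.
        torch.save(ddp_model.state_dict(), "checkpoint.pth")
        print("saved")
    dist.destroy_process_group()

if __name__ == "__main__":
    # output_uri is a placeholder bucket name.
    task = Task.init("Glass-ClearML Demo",
                     "Distributed basic mp.spawn, simple model, v3",
                     output_uri="s3://glass-clearml")
    world_size = torch.cuda.device_count()
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)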
ClearML version is 1.7.1