When running on our bigger research repository, which includes saving checkpoints and uploading to S3, the training ends with the errors shown below and a Killed message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:22,133 - clearml.storage - INFO - Uploading: 5.02MB / 18.77MB @ 1.69MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/manuel/venv/real-esr/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
2023-01-26 17:37:24,405 - clearml.model - INFO - Selected model id: 31f67a1ac95643d4aa12af9eb52ed032
2023-01-26 17:37:25,318 - clearml.storage - INFO - Uploading: 10.02MB / 18.77MB @ 1.57MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Loading model from: /home/manuel/venv/real-esr/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-01-26 17:37:25,832 - clearml.model - INFO - Selected model id: 108e1a350bf1457da94f408cde9cfd82
2023-01-26 17:37:27,589 - clearml.storage - INFO - Uploading: 15.02MB / 18.77MB @ 2.20MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
2023-01-26 17:37:30,226 - clearml.Task - INFO - Completed model upload to
Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:57,508 INFO: Validation validation
# ssim: 0.1691 Best: 0.1691 @ 11 iter
# lpips: 0.7296 Best: 0.7296 @ 11 iter
2023-01-26 17:38:39,719 INFO: Validation train-val
# ssim: 0.1691 Best: 0.1691 @ 11 iter
# lpips: 0.7296 Best: 0.7296 @ 11 iter
2023-01-26 17:38:56,935 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
Killed
Sorry that these issues are quite deep and chaotic. We would appreciate any help or ideas you can think of!
It should actually work the same. If you find that it fails to register properly, let me know (and then I guess a GitHub issue is the next step).
Results of a bit more investigation:
The ClearML example does use the PyTorch dist package but none of the DistributedDataParallel functionality; instead, it reduces gradients “manually”. This script is also not prepared for torchrun, as it launches more processes itself (without using the multiprocessing of Python or PyTorch).
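For readers unfamiliar with what reducing gradients “manually” means here, the following is a minimal sketch of the general pattern, not a copy of the ClearML example. The function name average_gradients is an assumption for illustration, and the process group is assumed to be initialized elsewhere.

```python
import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across ranks by hand instead of wrapping the model in DDP.

    Assumes dist.init_process_group(...) has already been called in this process.
    """
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is None:
            continue  # parameter did not take part in the backward pass
        # Sum this gradient over all ranks (in place), then divide to get the mean.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad.div_(world_size)
```

Such a helper would be called after loss.backward() and before optimizer.step() on every rank. DistributedDataParallel performs an equivalent reduction automatically during the backward pass.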
When running a simple example (code attached below) that includes artifact uploads to S3 and launching processes via torch.multiprocessing, the training hangs at the end. Any idea where to investigate closer?
ClearML Task: created new task id=f070414bfb84402baa597a0167d1a21e
2023-01-26 17:34:22,564 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page:
Running basic DDP on rank 2.
Running basic DDP on rank 0.
Running basic DDP on rank 1.
saving...
2023-01-26 17:34:35,507 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:35,510 - clearml.Task - INFO - Waiting to finish uploads
saved
2023-01-26 17:34:37,042 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_olqpu7no.tmp => glass-clearml/Glass-ClearML Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:37,048 - clearml.Task - INFO - Waiting to finish uploads
2023-01-26 17:34:37,550 - clearml.Task - INFO - Completed model upload to
Demo/Distributed basic mp.spawn, S3 upload.f070414bfb84402baa597a0167d1a21e/models/checkpoint.pth
2023-01-26 17:34:44,129 - clearml.Task - INFO - Finished uploading
2023-01-26 17:34:45,926 - clearml.Task - INFO - Finished uploading
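One generic way to narrow down where a process is stuck at exit is Python's built-in faulthandler module, which can dump the stack of every thread after a timeout. This is only a debugging sketch and says nothing about the actual cause of the hang; the helper names and the 120-second default are assumptions.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s: float = 120.0, file=None) -> None:
    """Dump all thread tracebacks every timeout_s seconds until disarmed.

    The target file must stay open until disarm_hang_watchdog() is called.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the process has finished its work."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_hang_watchdog() at the top of the main script and again at the start of each spawned worker function would show, for every process that is still alive after the timeout, which call it is blocked in.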
Hi @AgitatedDove14, so I’ve managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run, multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated; instead, the task of each rank reports its own.
If, however, I branch out via torch.multiprocessing like below, everything works as expected. The “script path” just shows the single Python script, and all logs and scalars from all ranks are aggregated into a single task.
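import torch
import torch.multiprocessing as mp
from clearml import Task
# demo_basic is the per-rank worker function from the PyTorch DDP tutorial
task = Task.init("Glass-ClearML Demo", "Distributed basic mp.spawn, simple model, v3")
n_gpus = torch.cuda.device_count()
world_size = n_gpus
mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)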
All code is taken from the PyTorch tutorial; I just added a ClearML Task to it as shown above.
ClearML version is 1.7.1
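A relevant difference between the two launch styles is that torchrun starts every worker as a separate Python process that runs the whole script from the top and receives its rank through the RANK, LOCAL_RANK and WORLD_SIZE environment variables, whereas with mp.spawn the script runs once and only the worker function runs per rank. The sketch below is just a diagnostic helper for printing which process is which; the function name describe_rank is an assumption, and whether this difference explains the separate tasks is not confirmed here.

```python
import os


def describe_rank(env=None) -> dict:
    """Read the rank variables that torchrun exports to each worker process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        # True only when a launcher has populated the environment.
        "launcher_env_present": "RANK" in env and "WORLD_SIZE" in env,
    }


if __name__ == "__main__":
    print(f"pid={os.getpid()} {describe_rank()}")
```

Printing this at the very top of the script, before Task.init, shows how many processes execute that line under each launcher.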
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ...? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distributed launches?
So my own repo I’m launching with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
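To compare the two launch commands side by side, here is a small sketch that builds both argument lists from the same settings. The helper build_launch_cmd is hypothetical; it only uses the flags that appear in the commands above.

```python
from typing import List, Sequence


def build_launch_cmd(launcher: str, module: str, nproc_per_node: int = 2,
                     master_addr: str = "127.0.0.1", master_port: int = 29500,
                     script_args: Sequence[str] = ()) -> List[str]:
    """Return the argv list for launching `module` with the chosen launcher."""
    if launcher == "torchrun":
        prefix = ["torchrun", "--standalone"]
    elif launcher == "launch":
        prefix = ["python3", "-m", "torch.distributed.launch"]
    else:
        raise ValueError(f"unknown launcher: {launcher!r}")
    common = ["--nproc_per_node", str(nproc_per_node),
              "--master_addr", master_addr,
              "--master_port", str(master_port)]
    # Both launchers take -m to run the target as a module, followed by its options.
    return prefix + common + ["-m", module] + list(script_args)


if __name__ == "__main__":
    for name in ("torchrun", "launch"):
        print(" ".join(build_launch_cmd(name, "my_folder.my_script", script_args=["--some_option"])))
```

The resulting list can be passed to subprocess.run. One difference to keep in mind: torch.distributed.launch by default passes the local rank to the script as a command-line argument, while torchrun provides it through the LOCAL_RANK environment variable.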
Hi ScantChimpanzee51
How are you launching the code?
Basically the easiest way is to do so with the example you just mentioned.
Can this issue be reproduced?
Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported rather than the (now deprecated but still usable) torch.distributed.launch?
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments?
Yes, it is supported and should work.
If so, would it be started via python ... or via torchrun ...?
Yes, it should, hence the request for a code snippet to reproduce the issue you are experiencing.
What about remote runs, how will they support the parallel execution?
Supported. You should see in the "script entry" something like "-m -m torch.distributed.launch --nproc_per_node 2 ..."
To go even deeper, what about the machines started via ClearML Autoscaler?
Should work out of the box; this is considered a single Job/Task, so there is no need to spin up multiple agents for that.