Unanswered
ClearML With PyTorch-Based Distributed Training
Hi everyone! Is the combination of ClearML with
When running on our bigger research repository, which includes saving checkpoints and uploading them to S3, the training ends with the errors shown below and a "Killed" message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:22,133 - clearml.storage - INFO - Uploading: 5.02MB / 18.77MB @ 1.69MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/manuel/venv/real-esr/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
2023-01-26 17:37:24,405 - clearml.model - INFO - Selected model id: 31f67a1ac95643d4aa12af9eb52ed032
2023-01-26 17:37:25,318 - clearml.storage - INFO - Uploading: 10.02MB / 18.77MB @ 1.57MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Loading model from: /home/manuel/venv/real-esr/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-01-26 17:37:25,832 - clearml.model - INFO - Selected model id: 108e1a350bf1457da94f408cde9cfd82
2023-01-26 17:37:27,589 - clearml.storage - INFO - Uploading: 15.02MB / 18.77MB @ 2.20MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
2023-01-26 17:37:30,226 - clearml.Task - INFO - Completed model upload to
Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:57,508 INFO: Validation validation
# ssim: 0.1691 Best: 0.1691 @ 11 iter
# lpips: 0.7296 Best: 0.7296 @ 11 iter
2023-01-26 17:38:39,719 INFO: Validation train-val
# ssim: 0.1691 Best: 0.1691 @ 11 iter
# lpips: 0.7296 Best: 0.7296 @ 11 iter
2023-01-26 17:38:56,935 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
Killed
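For anyone trying to reproduce this, here is a minimal, hypothetical sketch of the kind of setup described above: a checkpoint save that only the main process performs. It is not the actual repository code. It assumes the job is launched with a torchrun-style launcher that exports a RANK environment variable, and it uses pickle as a stand-in for torch.save so the snippet runs without PyTorch. The helper names (get_rank, is_main_process, save_latest_checkpoint) are made up for illustration; only the file name net_g_latest.pth is taken from the log.

```python
import os
import pickle
import tempfile
from pathlib import Path


def get_rank() -> int:
    """Global rank of this process.

    Assumption: a torchrun-style launcher exports RANK; a plain
    single-process run has no RANK set and is treated as rank 0.
    """
    return int(os.environ.get("RANK", "0"))


def is_main_process() -> bool:
    """True only for the process that should write and upload checkpoints."""
    return get_rank() == 0


def save_latest_checkpoint(state: dict, directory: Path,
                           name: str = "net_g_latest.pth"):
    """Write the latest checkpoint from rank 0 only.

    Other ranks return None without touching the filesystem. pickle is a
    stand-in here; a real training loop would call torch.save instead.
    """
    if not is_main_process():
        return None

    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name

    # Write to a temporary file first, then replace the target in one step,
    # so a process killed mid-write does not leave a truncated checkpoint.
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as tmp:
        pickle.dump(state, tmp)
        tmp_path = Path(tmp.name)
    os.replace(tmp_path, target)
    return target
```

Calling save_latest_checkpoint({"iter": 11}, Path("experiments/models")) on rank 0 returns the path of the written file, and on every other rank it returns None. Whether the failing run guards its checkpoint and upload code this way is not known from the log above.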