Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Unanswered
[Clearml With Pytorch-Based Distributed Training} Hi Everyone! Is The Combination Of Clearml With


When running on our bigger research repository which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a Killed message for the main process (I do not abort the main process manually):

2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:22,133 - clearml.storage - INFO - Uploading: 5.02MB / 18.77MB @ 1.69MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/manuel/venv/real-esr/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
2023-01-26 17:37:24,405 - clearml.model - INFO - Selected model id: 31f67a1ac95643d4aa12af9eb52ed032
2023-01-26 17:37:25,318 - clearml.storage - INFO - Uploading: 10.02MB / 18.77MB @ 1.57MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
Loading model from: /home/manuel/venv/real-esr/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
2023-01-26 17:37:25,832 - clearml.model - INFO - Selected model id: 108e1a350bf1457da94f408cde9cfd82
2023-01-26 17:37:27,589 - clearml.storage - INFO - Uploading: 15.02MB / 18.77MB @ 2.20MBs from /tmp/.clearml.upload_model_cvqpor8r.tmp
2023-01-26 17:37:30,226 - clearml.Task - INFO - Completed model upload to 
 Demo/[Lambda] FMEN distributed check, v10 fileserver upload.5af23077a8d2481ebd904f749af7ee51/models/net_g_latest.pth
2023-01-26 17:37:57,508 INFO: Validation validation
	 # ssim: 0.1691	Best: 0.1691 @ 11 iter
	 # lpips: 0.7296	Best: 0.7296 @ 11 iter

2023-01-26 17:38:39,719 INFO: Validation train-val
	 # ssim: 0.1691	Best: 0.1691 @ 11 iter
	 # lpips: 0.7296	Best: 0.7296 @ 11 iter



2023-01-26 17:38:56,935 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
Killed
  
  
Posted one year ago
144 Views
0 Answers
one year ago
one year ago