Hello, has anyone had problems where their experiments are deleted and tasks killed for no apparent reason?
Multiple times now I have had my experiments aborted and completely wiped from ClearML, including images, checkpoints, and everything else. I can't figure out what is causing this.

All I see in the logs is that a command was issued. How can I track down where the command came from? About 5-6 experiments have been lost so far.

Logs from one of the experiments just before the purge:
Epoch 249: 100%|█| 571/571 [13:58<00:00, 1.47s/it, loss_classifier=0.0152, loss_box_reg=0.00758, loss_mask=0.0635, loss_keypoint=0.420, loss_objectness=2.91e-5
Validation DataLoader 0: 3%|███▍ | 5/143 [00:37<17:26,
Validation DataLoader 0: 4%|███▋ | 6/143 [00:45<17:28, 7.66s/it]
Validation DataLoader 0: 52%|██████████████████████████████████████████████████▊ | 75/143 [08:44<07:55,
Epoch 249: 100%|█| 571/571 [30:34<00:00, 3.21s/it, loss_classifier=0.0152, loss_box_reg=0.00758, loss_mask=0.0635, loss_keypoint=0.420, loss_objectnes
Epoch 250: 0%| | 0/571 [00:00<?, ?it/s, loss_classifier=0.0152, loss_box_reg=0.00758, loss_mask=0.0635, loss_keypoint=0.420, loss_objectness=2.91e-5,
2023-10-06 08:01:18,923 - clearml.Task - INFO - Completed model upload to file:///mnt/data/furniture-detection/bathroom-test.a2f1af5b23f6438a92680c64821d01e4/models/last.ckpt
Epoch 250: 27%|▎| 154/571 [03:47<10:16, 1.48s/it, loss_classifier=0.0108, loss_box_reg=0.00754, loss_mask=0.0591, loss_keypoint=0.319, loss_objectnes
2023-10-06 08:05:02,397 - clearml.Task - WARNING - Task a2f1af5b23f6438a92680c64821d01e4 was reset! if state is consistent we shall terminate.
Epoch 250: 27%|▎| 155/571 [03:49<10:15, 1.48s/it, loss_classifier=0.00384, loss_box_reg=0.00789, loss_mask=0.0587, loss_keypoint=0.261, loss_objectne
2023-10-06 08:05:04,414 - clearml.Task - WARNING - Task a2f1af5b23f6438a92680c64821d01e4 was reset! if state is consistent we shall terminate.
Epoch 250: 27%|▎| 156/571 [03:50<10:13, 1.48s/it, loss_classifier=0.00389, loss_box_reg=0.00578, loss_mask=0.0587, loss_keypoint=0.256, loss_objectne
2023-10-06 08:05:06,430 - clearml.Task - WARNING - Task a2f1af5b23f6438a92680c64821d01e4 was reset! if state is consistent we shall terminate.
Epoch 250: 28%|▎| 158/571 [03:53<10:10, 1.48s/it, loss_classifier=0.0032, loss_box_reg=0.00515, loss_mask=0.0574, loss_keypoint=0.223, loss_objectnes
2023-10-06 08:05:08,477 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - RESET ###
Epoch 250: 28%|▎| 162/571 [03:59<10:05, 1.48s/it, loss_classifier=0.00522, loss_box_reg=0.00985, loss_mask=0.0646, loss_keypoint=0.315, loss_objectne
2023-10-06 08:05:14,525 - clearml.reporter - WARNING - Event reporting sub-process lost, switching to thread based reporting
2023-10-06 08:05:14,526 - clearml.log - WARNING - Event reporting sub-process lost, switching to thread based reporting
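
To narrow down when the reset actually arrived, it may help to pull the ClearML log records out of the interleaved progress-bar output and note the first "was reset" timestamp, which can then be compared against whatever was happening on the server or in other users' sessions at that moment. The following is only a rough sketch under the assumption that the records keep the "timestamp - logger - LEVEL - message" layout seen in the log above. The names `extract_clearml_records` and `first_reset` are made up for illustration, and the code only parses text; it does not query ClearML or reveal who issued the command.

```python
import re

# Assumed record layout, taken from the log above:
# "YYYY-MM-DD HH:MM:SS,mmm - clearml.<logger> - LEVEL - message"
RECORD = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>clearml\.[\w.]+) - (?P<level>[A-Z]+) - (?P<msg>.*)"
)


def extract_clearml_records(text):
    """Return ClearML log records found anywhere in each line.

    search() is used instead of match() so that records fused onto the
    end of a progress-bar line are still picked up.
    """
    records = []
    for line in text.splitlines():
        found = RECORD.search(line)
        if found:
            records.append(found.groupdict())
    return records


def first_reset(records):
    """Return the earliest record whose message reports a task reset, or None."""
    for record in records:
        if "was reset" in record["msg"]:
            return record
    return None


# Shortened excerpt of the log above (progress-bar fields trimmed).
sample = (
    "Epoch 250: 27%| 154/571 [03:47<10:16, 1.48s/it, loss_objectnes"
    "2023-10-06 08:05:02,397 - clearml.Task - WARNING - "
    "Task a2f1af5b23f6438a92680c64821d01e4 was reset! "
    "if state is consistent we shall terminate.\n"
    "Epoch 250: 28%| 158/571 [03:53<10:10, 1.48s/it, loss_objectnes"
    "2023-10-06 08:05:08,477 - clearml.Task - WARNING - "
    "### TASK STOPPED - USER ABORTED - RESET ###\n"
)

records = extract_clearml_records(sample)
reset = first_reset(records)
print(reset["ts"], reset["logger"], reset["level"])
```

On the excerpt this picks out the 08:05:02,397 warning as the first sign of the reset, a few seconds before the "USER ABORTED - RESET" message.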

  
  
Posted 6 months ago