Answered
Hi all, I'm running DL experiments on top of mmdetection. The experiments are deployed remotely on a dedicated EC2 instance through ...

Hi all,
I'm running DL experiments on top of mmdetection. The experiments are deployed remotely on a dedicated EC2 instance through `clearml-task --queue ...`, and reporting is done through a logging hook ( https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/logger/clearml.py ) with some additional custom reporting (e.g. debug samples every n epochs and similar).
This usually works quite well, but I've run into a strange error that has occurred twice and stopped the training. The log reports `User aborted`, but in fact the task was aborted automatically. Below is a snippet of the log from the first error to the end of training. The omitted parts are regular iteration reports.
Do you have an idea what might have happened?
` 2022-08-11 09:46:02,972 - mmdet - INFO - Epoch [12][58/157] lr: 8.983e-04, eta: 12:35:17, time: 3.010, data_time: 1.878, memory: 6927, loss_cls: 0.4392, loss_bbox: 0.5426, loss: 0.9818
2022-08-11 09:46:05,729 - mmdet - INFO - Epoch [12][59/157] lr: 8.983e-04, eta: 12:35:09, time: 2.746, data_time: 1.592, memory: 6927, loss_cls: 0.4325, loss_bbox: 0.5338, loss: 0.9663
2022-08-11 09:46:06,084 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=20)>
2022-08-11 11:46:11
2022-08-11 09:46:08,487 - mmdet - INFO - Epoch [12][60/157] lr: 8.983e-04, eta: 12:35:02, time: 2.788, data_time: 1.616, memory: 6927, loss_cls: 0.4490, loss_bbox: 0.5404, loss: 0.9894
2022-08-11 11:46:16
2022-08-11 09:46:12,713 - mmdet - INFO - Epoch [12][61/157] lr: 8.983e-04, eta: 12:35:07, time: 4.224, data_time: 3.062, memory: 6927, loss_cls: 0.4334, loss_bbox: 0.5523, loss: 0.9857

[...]

2022-08-11 09:46:53,836 - mmdet - INFO - Epoch [12][74/157] lr: 8.980e-04, eta: 12:34:15, time: 3.056, data_time: 1.935, memory: 6927, loss_cls: 0.4384, loss_bbox: 0.5379, loss: 0.9763
2022-08-11 09:46:56,613 - mmdet - INFO - Epoch [12][75/157] lr: 8.980e-04, eta: 12:34:08, time: 2.775, data_time: 1.579, memory: 6927, loss_cls: 0.4363, loss_bbox: 0.5306, loss: 0.9669
2022-08-11 11:47:02
2022-08-11 09:46:57,390 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=10)>
2022-08-11 09:46:59,348 - mmdet - INFO - Epoch [12][76/157] lr: 8.980e-04, eta: 12:34:01, time: 2.730, data_time: 1.591, memory: 6927, loss_cls: 0.4576, loss_bbox: 0.5532, loss: 1.0107
2022-08-11 11:47:07
2022-08-11 09:47:03,532 - mmdet - INFO - Epoch [12][77/157] lr: 8.980e-04, eta: 12:34:05, time: 4.214, data_time: 3.053, memory: 6927, loss_cls: 0.4571, loss_bbox: 0.5552, loss: 1.0124
2022-08-11 11:47:12

[...]

2022-08-11 09:47:41,446 - mmdet - INFO - Epoch [12][89/157] lr: 8.978e-04, eta: 12:33:16, time: 2.769, data_time: 1.659, memory: 6927, loss_cls: 0.4359, loss_bbox: 0.5391, loss: 0.9750
2022-08-11 11:47:47
2022-08-11 09:47:46,445 - mmdet - INFO - Epoch [12][90/157] lr: 8.978e-04, eta: 12:33:27, time: 5.015, data_time: 3.867, memory: 6927, loss_cls: 0.4289, loss_bbox: 0.5402, loss: 0.9692
2022-08-11 11:47:47
User aborted: stopping task (5) `
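To see how the reporting failures line up with the abort, the log can be scanned for the `clearml.Metrics` error lines. The following is only a minimal sketch: the function name `find_reporting_errors` and the regular expression are assumptions derived from the format of the lines above, not part of ClearML or mmdetection.

```python
import re

# Matches lines such as:
# 2022-08-11 09:46:06,084 - clearml.Metrics - ERROR - Action failed <400/131: ... (...)>
# The pattern is an assumption based on the log excerpt, not an official format.
FAILED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ - clearml\.Metrics - ERROR - "
    r"Action failed <(?P<code>\d+)/(?P<subcode>\d+): (?P<endpoint>\S+) \((?P<msg>.*)\)>$"
)


def find_reporting_errors(log_text):
    """Return one dict per 'Action failed' line found in the log text."""
    errors = []
    for line in log_text.splitlines():
        match = FAILED.match(line.strip())
        if match:
            errors.append(match.groupdict())
    return errors


# Sample lines copied from the log excerpt above.
sample = """\
2022-08-11 09:46:05,729 - mmdet - INFO - Epoch [12][59/157] lr: 8.983e-04, eta: 12:35:09, time: 2.746, data_time: 1.592, memory: 6927, loss_cls: 0.4325, loss_bbox: 0.5338, loss: 0.9663
2022-08-11 09:46:06,084 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=20)>
2022-08-11 09:46:57,390 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=10)>
User aborted: stopping task (5)
"""

errors = find_reporting_errors(sample)
for err in errors:
    print(err["ts"], err["endpoint"], err["msg"])
```

Run on the full log, this lists every failed `events.add_batch` call with its timestamp, which makes it easier to compare against the time of the `User aborted` message.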

  
  
Posted 2 years ago

Answers 9


Ok good idea thanks, will do in the next run

  
  
Posted 2 years ago

For example:
task 613b77be5dac4f6f9eaea7962bf4e034 pulled from eb1c9d9c680d4bdea2dbf5cf90e54af2 by worker worker-bruce:3
Running task '613b77be5dac4f6f9eaea7962bf4e034'
Storing stdout and stderr log to '/tmp/.clearml_agent_out._sox_04u.txt', '/tmp/.clearml_agent_out._sox_04u.txt'
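If you capture the agent's startup output, the log location can be pulled out of it programmatically. This is a rough sketch only: the helper name `agent_log_path` is made up for illustration, and the pattern assumes the message looks exactly like the example above.

```python
import re

# Assumed message format, taken from the example output above.
STORING = re.compile(r"Storing stdout and stderr log to '([^']+)'")


def agent_log_path(agent_output):
    """Return the first log path mentioned in the agent output, or None."""
    match = STORING.search(agent_output)
    return match.group(1) if match else None


startup = (
    "task 613b77be5dac4f6f9eaea7962bf4e034 pulled from "
    "eb1c9d9c680d4bdea2dbf5cf90e54af2 by worker worker-bruce:3 "
    "Running task '613b77be5dac4f6f9eaea7962bf4e034' "
    "Storing stdout and stderr log to "
    "'/tmp/.clearml_agent_out._sox_04u.txt', '/tmp/.clearml_agent_out._sox_04u.txt'"
)

print(agent_log_path(startup))  # prints /tmp/.clearml_agent_out._sox_04u.txt
```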

  
  
Posted 2 years ago

Also, in the Scalars section you can see the machine statistics, which may give you an idea. If the memory usage is high, this might be the issue. If not, we can (probably) rule out this hypothesis.

  
  
Posted 2 years ago

Hi CostlyOstrich36, is there a default location for the agent's local log?

  
  
Posted 2 years ago

Hi ResponsiveHedgehong88, I was trying to do the same thing, but the logger hook doesn't seem to work. The console log and scalar logs didn't show up when I registered the hook via `__init__.py` and loaded it via `log_config`. Are you able to share how you configured it?

  
  
Posted 2 years ago

Unfortunately the experiment is run in Docker and the container is already down... I don't know if this happened at the same time. So you're saying it might be a memory issue? Any other hints on what I might check while running a new experiment?

  
  
Posted 2 years ago

ResponsiveHedgehong88, are you able to log into the machine and check its state, or whether there were any errors? Is there any chance it's running out of memory? The agent also keeps a local log; can you take a look there to see if there is any discrepancy?

  
  
Posted 2 years ago

ResponsiveHedgehong88, you can try mapping the /tmp/ folder inside the Docker container to a folder outside it, so the data isn't lost and can be inspected later. This could give us a better idea of what's happening.
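As a rough illustration of that mapping, the snippet below builds the bind-mount arguments in the form `docker run` accepts (`-v host_dir:container_dir`). The helper name `tmp_mount_args` and the host directory are hypothetical, and how the arguments get passed to the container depends on your own setup.

```python
import os


def tmp_mount_args(host_dir):
    """Build docker bind-mount arguments mapping host_dir to the container's /tmp."""
    host_path = os.path.abspath(os.path.expanduser(host_dir))
    return ["-v", f"{host_path}:/tmp"]


# Hypothetical host directory; pick one that exists and has enough free space.
args = tmp_mount_args("/var/log/clearml_tmp")
print(" ".join(args))
```

Keep in mind that with this mount everything the container writes to /tmp ends up in the host directory, not only the log files.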

  
  
Posted 2 years ago

When the agent starts running a task, it prints out where the logs are being saved.

  
  
Posted 2 years ago