Hi all, I'm running DL experiments on top of mmdetection. The experiments are deployed remotely on a dedicated EC2 instance through


Hi all,
I'm running DL experiments on top of mmdetection. The experiments are deployed remotely on a dedicated EC2 instance through `clearml-task --queue ...`, and reporting is done through a logging hook ( https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/logger/clearml.py ) with some additional custom reporting (e.g. debug samples every n epochs and similar).
This usually works quite well, but I've run into a strange error that has occurred twice and stopped the training. The log reports "User aborted", but in fact the task was aborted automatically, not by me. Below is a snippet of the log from the first error to the end of training. The parts marked [...] are regular iteration reports that I left out.
Do you have an idea what might have happened?
` 2022-08-11 09:46:02,972 - mmdet - INFO - Epoch [12][58/157] lr: 8.983e-04, eta: 12:35:17, time: 3.010, data_time: 1.878, memory: 6927, loss_cls: 0.4392, loss_bbox: 0.5426, loss: 0.9818
2022-08-11 09:46:05,729 - mmdet - INFO - Epoch [12][59/157] lr: 8.983e-04, eta: 12:35:09, time: 2.746, data_time: 1.592, memory: 6927, loss_cls: 0.4325, loss_bbox: 0.5338, loss: 0.9663
2022-08-11 09:46:06,084 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=20)>
2022-08-11 11:46:11
2022-08-11 09:46:08,487 - mmdet - INFO - Epoch [12][60/157] lr: 8.983e-04, eta: 12:35:02, time: 2.788, data_time: 1.616, memory: 6927, loss_cls: 0.4490, loss_bbox: 0.5404, loss: 0.9894
2022-08-11 11:46:16
2022-08-11 09:46:12,713 - mmdet - INFO - Epoch [12][61/157] lr: 8.983e-04, eta: 12:35:07, time: 4.224, data_time: 3.062, memory: 6927, loss_cls: 0.4334, loss_bbox: 0.5523, loss: 0.9857

[...]

2022-08-11 09:46:53,836 - mmdet - INFO - Epoch [12][74/157] lr: 8.980e-04, eta: 12:34:15, time: 3.056, data_time: 1.935, memory: 6927, loss_cls: 0.4384, loss_bbox: 0.5379, loss: 0.9763
2022-08-11 09:46:56,613 - mmdet - INFO - Epoch [12][75/157] lr: 8.980e-04, eta: 12:34:08, time: 2.775, data_time: 1.579, memory: 6927, loss_cls: 0.4363, loss_bbox: 0.5306, loss: 0.9669
2022-08-11 11:47:02
2022-08-11 09:46:57,390 - clearml.Metrics - ERROR - Action failed <400/131: events.add_batch/v1.0 (Events not added: Invalid task id=10)>
2022-08-11 09:46:59,348 - mmdet - INFO - Epoch [12][76/157] lr: 8.980e-04, eta: 12:34:01, time: 2.730, data_time: 1.591, memory: 6927, loss_cls: 0.4576, loss_bbox: 0.5532, loss: 1.0107
2022-08-11 11:47:07
2022-08-11 09:47:03,532 - mmdet - INFO - Epoch [12][77/157] lr: 8.980e-04, eta: 12:34:05, time: 4.214, data_time: 3.053, memory: 6927, loss_cls: 0.4571, loss_bbox: 0.5552, loss: 1.0124
2022-08-11 11:47:12

[...]

2022-08-11 09:47:41,446 - mmdet - INFO - Epoch [12][89/157] lr: 8.978e-04, eta: 12:33:16, time: 2.769, data_time: 1.659, memory: 6927, loss_cls: 0.4359, loss_bbox: 0.5391, loss: 0.9750
2022-08-11 11:47:47
2022-08-11 09:47:46,445 - mmdet - INFO - Epoch [12][90/157] lr: 8.978e-04, eta: 12:33:27, time: 5.015, data_time: 3.867, memory: 6927, loss_cls: 0.4289, loss_bbox: 0.5402, loss: 0.9692
2022-08-11 11:47:47
User aborted: stopping task (5) `

  				
Posted 3 years ago by ResponsiveHedgehong88


Answers 9

ResponsiveHedgehong88 you can try mounting the /tmp/ folder inside the Docker container to a folder on the host, so the data isn't lost and can be inspected later. This could give us a better idea of what's happening.

  				
Posted 3 years ago by CostlyOstrich36

Unfortunately the experiment runs in Docker and the container is already down... I don't know if this happened at the same time. So you're saying it might be a memory issue? Any other hints I could check while running a new experiment?

  				
Posted 3 years ago by ResponsiveHedgehong88

For example:
task 613b77be5dac4f6f9eaea7962bf4e034 pulled from eb1c9d9c680d4bdea2dbf5cf90e54af2 by worker worker-bruce:3
Running task '613b77be5dac4f6f9eaea7962bf4e034'
Storing stdout and stderr log to '/tmp/.clearml_agent_out._sox_04u.txt', '/tmp/.clearml_agent_out._sox_04u.txt'
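
Once those files are reachable, something like the following could be used to skim them for suspicious lines. This is only a sketch: it assumes the log files follow the `/tmp/.clearml_agent_out.*.txt` naming seen in the example above, and the function name `find_suspicious_lines` and the keyword list are made up for illustration; they are not part of clearml-agent.

```python
import glob
import os

# Hypothetical keyword list; adjust it to whatever you are looking for.
KEYWORDS = ("error", "abort", "killed", "out of memory")


def find_suspicious_lines(log_dir="/tmp", pattern=".clearml_agent_out.*.txt"):
    """Return (path, line_number, line) for every log line containing a keyword."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # errors="replace" keeps the scan going if the log has odd bytes in it.
        with open(path, errors="replace") as fh:
            for number, line in enumerate(fh, start=1):
                if any(word in line.lower() for word in KEYWORDS):
                    hits.append((path, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for path, number, line in find_suspicious_lines():
        print(f"{path}:{number}: {line}")
```

The match is a plain case-insensitive substring check, so expect some false positives (for example regular lines that mention "error" in a metric name).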

  				
Posted 3 years ago by CostlyOstrich36

Also, in the Scalars section you can see the machine statistics, which may give you an idea. If the memory usage is high, this might be the issue. If not, we can (probably) rule out this hypothesis.
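
If the machine statistics are not enough, another option is to print memory usage from inside the training process so that it ends up in the console log even after the container is gone. A rough sketch, assuming a Linux machine where /proc/meminfo is readable; the helper names are hypothetical. Note that inside Docker this file typically reports the host's memory, not the container's own limit.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def used_memory_fraction(info):
    """Fraction of RAM in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    return (total - info["MemAvailable"]) / total


def read_used_memory_fraction(path="/proc/meminfo"):
    """Read the given meminfo file and return the fraction of RAM in use."""
    with open(path) as fh:
        return used_memory_fraction(parse_meminfo(fh.read()))
```

Calling `read_used_memory_fraction()` every few iterations and printing the result would show whether usage climbs towards 1.0 before the task stops.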

  				
Posted 3 years ago by CostlyOstrich36

When the agent starts running a task, it prints out where the logs are being saved.

  				
Posted 3 years ago by CostlyOstrich36

Hi CostlyOstrich36, is there a default location for the agent's local log?

  				
Posted 3 years ago by ResponsiveHedgehong88

OK, good idea, thanks. Will do in the next run.

  				
Posted 3 years ago by ResponsiveHedgehong88

Hi ResponsiveHedgehong88, I was trying to do the same thing, but the logger hook doesn't seem to work. The console log and scalar logs didn't show up when I registered the hook via __init__.py and loaded it via log_config. Are you able to share how you configured it?
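
For readers with the same question, this is roughly what enabling the hook through log_config can look like. It is a sketch, not the original poster's configuration: it assumes an mmdetection 2.x style config and an mmcv version that ships `ClearMLLoggerHook` (the file linked in the question), and the project and task names are placeholders.

```python
# mmdetection config fragment (configs are plain Python files).
log_config = dict(
    interval=50,  # log every 50 iterations
    hooks=[
        dict(type="TextLoggerHook"),
        dict(
            type="ClearMLLoggerHook",
            # init_kwargs is forwarded to clearml.Task.init();
            # the names below are placeholders.
            init_kwargs=dict(
                project_name="my-project",
                task_name="my-experiment",
            ),
        ),
    ],
)
```

If the hook is a custom class rather than the one from mmcv, its module also has to be imported before the runner builds the hooks (for example through `custom_imports` in the config); otherwise the registry will not know the type.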

  				
Posted 3 years ago by SubstantialElk6

ResponsiveHedgehong88, can you log in to the machine and check its state, or whether there were any errors? Is there any chance it's running out of memory? The agent also keeps a local log; can you take a look there to see if there is any discrepancy?

  				
Posted 3 years ago by CostlyOstrich36
