What could be the reason that a task running on an agent just stopped updating?

Answered

Hi,
What could be the reason that a task running on an agent just stopped updating? The status is still "Running", but it doesn't seem to be.
The agent is running in a Docker container on a GPU. It completed 92 epochs and started epoch 93. The run started at 18:37 on Feb 27; the last update was at 03:21 on Feb 28.
I checked the server log and the agent log: no visible errors or anything going wrong.
I attached an image taken today at 9am.

  				
Posted 2 years ago by EnchantingHippopotamus83


Answers 13

Are you using a self hosted server or the community server?

  				
Posted 2 years ago by CostlyOstrich36

Self-hosted server.

  				
Posted 2 years ago by EnchantingHippopotamus83

Can you check whether anything happened in the API server logs during this time? Is the agent still reporting?

  				
Posted 2 years ago by CostlyOstrich36

I looked in clearml_server/logs/apiserver.log:
The last report was at 2023-02-28 08:39:27,981. Nothing wrong.
Looking for the last update message at 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 5ms
[2023-02-28 03:21:25,145] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 10ms
[2023-02-28 03:21:26,454] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 61ms
[2023-02-28 03:21:30,142] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms

Looks fine - these lines repeat throughout the entire log.

Looking in /tmp/.clearml_agent_daemon_outsw6p97f4.txt:
The file was last modified at 3:18. The last lines are from the middle of epoch 92, exactly as reported on the web server.

Looking in /tmp/.clearml_agent_out.t3g81c0n.txt:
The file was last modified at 3:21, and the last line is exactly as reported on the web server.

I can't see anything abnormal.
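The file checks above can be repeated with a small script. This is only a sketch: the glob pattern is an assumption based on the two file names in this post, and `last_modified` and `tail` are hypothetical helper names, not part of ClearML.

```python
import glob
import os
from datetime import datetime


def last_modified(pattern):
    """Return (path, modification time) pairs for matching files, newest first."""
    entries = [
        (path, datetime.fromtimestamp(os.path.getmtime(path)))
        for path in glob.glob(pattern)
    ]
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def tail(path, n=5):
    """Return the last n lines of a text file (fine for moderately sized logs)."""
    with open(path, "r", errors="replace") as fh:
        return fh.readlines()[-n:]
```

For example, `last_modified("/tmp/.clearml_agent_*.txt")` would list the agent output files with their timestamps, and `tail(path)` shows where each one stopped. The pattern starts with a literal dot, so `glob` matches these hidden files.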

  				
Posted 2 years ago by EnchantingHippopotamus83

What about the worker that was running the experiment?

  				
Posted 2 years ago by CostlyOstrich36

Isn't /tmp/.clearml_agent_out.t3g81c0n.txt the worker output?
I'm kind of new to ClearML, so forgive me for mixing up terms.

  				
Posted 2 years ago by EnchantingHippopotamus83

@CostlyOstrich36

  				
Posted 2 years ago by EnchantingHippopotamus83

@EnchantingHippopotamus83 are you still seeing tasks.ping in the server log while the task seems to be stuck?

  				
Posted 2 years ago by SuccessfulKoala55

Yes. The task's last update was at 3:21 on Feb 28.
Here are some lines from the log:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms

Well after the task's last update I can see a couple of WARNINGs in the log. To be honest, I'm not sure whether they relate to the same task or to a new one; nevertheless, I'll add them in case they help (I replaced the company value with <xxx>):
[2023-02-28 09:01:19,988] [9] [WARNING] [clearml.service_repo] Returned 404 for workers.get_runtime_properties in 0ms, msg=Unable to find endpoint for name workers.get_runtime_properties and version 2.23
[2023-02-28 09:01:52,913] [9] [WARNING] [clearml.service_repo] Returned 400 for queues.get_by_id in 4ms, msg=Invalid queue id: id=gpu1_queue, company=<xxx>
[2023-02-28 09:02:07,534] [9] [WARNING] [clearml.service_repo] Returned 400 for tasks.dequeue in 11ms, msg=Invalid task id: status=in_progress, expected=queued
@SuccessfulKoala55
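To compare when each endpoint was last called, the apiserver.log lines can be parsed as in the sketch below. The regular expression is an assumption derived only from the log lines quoted in this thread, and `last_seen` is a hypothetical helper name. The log lines do not name the task, so the result is only meaningful when a single task is running.

```python
import re
from datetime import datetime

# Assumed line format, taken from the log excerpts quoted in this thread.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] "
    r"\[\d+\] \[(?P<level>\w+)\] \[clearml\.service_repo\] "
    r"Returned (?P<code>\d{3}) for (?P<endpoint>[\w.]+) in \d+ms"
)


def last_seen(lines):
    """Map each endpoint name to the latest timestamp at which it was logged."""
    seen = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        endpoint = match.group("endpoint")
        if endpoint not in seen or ts > seen[endpoint]:
            seen[endpoint] = ts
    return seen
```

Calling `last_seen(open("clearml_server/logs/apiserver.log"))` and comparing the entries for `events.add_batch` and `tasks.ping` would show whether metric reporting stopped while pings continued.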

  				
Posted 2 years ago by EnchantingHippopotamus83

And are there any other tasks running at this time?

  				
Posted 2 years ago by SuccessfulKoala55

It was the only task @SuccessfulKoala55
Have you encountered something like this?
Just to recap: the task status was Running, but it seemed to be stuck. nvidia-smi showed the GPU still had memory allocated, which rules out the scenario where the web server lost contact with the agent and the agent had already finished. If someone had used the GPU outside ClearML, I would expect some sort of CUDA crash in the agent's run.
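The nvidia-smi check mentioned above can be scripted roughly as follows. This is a sketch that assumes `nvidia-smi` is on the PATH of the machine where it runs. `parse_compute_apps` and `gpu_processes` are hypothetical helper names. Inside a container the process list may be empty or show different PIDs, so running it on the host is probably more informative.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory, process_name' CSV rows into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split only twice so a comma inside the process name is preserved.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        pid, mem, name = parts
        rows.append({
            "pid": int(pid),
            # Memory can be reported as a non-numeric placeholder.
            "used_mib": int(mem) if mem.isdigit() else None,
            "name": name,
        })
    return rows


def gpu_processes():
    """Ask nvidia-smi which compute processes currently hold GPU memory."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,used_memory,process_name",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

If `gpu_processes()` still lists the training process with memory allocated, the process is alive and holding the GPU, which matches the observation in this post.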

  				
Posted 2 years ago by EnchantingHippopotamus83

Well, if the server was still receiving tasks.ping, it means the ClearML SDK code was still running. The only remaining explanation is that perhaps your code was stuck somewhere?
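One way to find out where the code is stuck is Python's standard `faulthandler` module, enabled at the top of the training script. This is a sketch, not something ClearML requires. `enable_hang_diagnostics` and the 30-minute interval are assumptions chosen for illustration.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(stream=sys.stderr, interval_s=1800):
    """Make a possibly stuck process report where each thread is executing.

    `stream` must stay open and have a real file descriptor.
    """
    # Dump the stack of every thread each `interval_s` seconds until cancelled.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)
    # Where supported (not on Windows), also dump on demand: kill -USR1 <pid>
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
```

If the same frame, for example a data loader wait or a blocking I/O call, shows up in several consecutive dumps, that is likely where the run is hanging. The periodic dumps are written on every interval whether or not the process is stuck, so a long interval keeps the log readable.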

  				
Posted 2 years ago by SuccessfulKoala55

Well, if that is the case, it's the first time out of many experiments on almost the same code. Let's hope I won't see this issue again.
@SuccessfulKoala55 @CostlyOstrich36 - thank you for your time and help

  				
Posted 2 years ago by EnchantingHippopotamus83
