Answered
Hi, What Could Be The Reason That A Task Ran On An Agent Just Stopped Updating? The Status Is Still "Running" But It Doesn't Seems Like It. The Agent Is Running On A Docker On A Gpu. It Completed 92 Epochs And Started 93. Run Started At 18:37 Feb 27, Last

Hi,
What could be the reason that a Task running on an agent just stopped updating? The status is still "Running", but it doesn't seem to be making progress.
The agent is running in a Docker container on a GPU. It completed 92 epochs and started 93. The run started at 18:37 Feb 27; the last update was at 03:21 Feb 28.
I checked the server log and the agent log and found no visible errors or anything going wrong.
Attached is an image taken today at 9am.

  
  
Posted one year ago

Answers 13


Well, if the server was still receiving tasks.ping, it means the ClearML SDK code was still running. The only explanation left is that perhaps your own code was stuck somewhere?
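One way to find out where training code is stuck is to have the process periodically dump the stack of every thread into its own output, which the agent log then captures. The sketch below uses only Python's standard `faulthandler` module. The helper names `start_hang_watchdog` and `stop_hang_watchdog` and the 10-minute interval are illustrative assumptions, not part of ClearML.

```python
import faulthandler
import sys


def start_hang_watchdog(interval_sec=600.0, stream=sys.stderr):
    """Write the traceback of every thread to `stream` every `interval_sec` seconds.

    Hypothetical helper: call it once at the top of the training script.
    `stream` must be a real file (it needs a file descriptor).
    """
    faulthandler.dump_traceback_later(interval_sec, repeat=True, file=stream)


def stop_hang_watchdog():
    """Cancel the periodic dump, e.g. right before the script exits normally."""
    faulthandler.cancel_dump_traceback_later()
```

If the task hangs again, the last few dumps in the log should show the same line repeating (for example a data loader wait or a blocked I/O call), which narrows down where it is stuck.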

  
  
Posted one year ago

Well, if that is the case, this is the first time it has happened across many experiments on almost the same code. Let's hope I don't see this issue again.
@<1523701087100473344:profile|SuccessfulKoala55> @<1523701070390366208:profile|CostlyOstrich36> - thank you for your time and help

  
  
Posted one year ago

And are there any other tasks running at this time?

  
  
Posted one year ago

self hosted server

  
  
Posted one year ago

I looked in clearml_server/logs/apiserver.log:
The last report is at 2023-02-28 08:39:27,981. Nothing wrong there.
Looking at the messages around the last update at 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 5ms
[2023-02-28 03:21:25,145] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 10ms
[2023-02-28 03:21:26,454] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 61ms
[2023-02-28 03:21:30,142] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms

Looks fine. These lines repeat throughout the entire log.

Looking in /tmp/.clearml_agent_daemon_outsw6p97f4.txt:
the file was last modified at 3:18. The last lines are from the middle of epoch 92, exactly as reported on the webserver.

Looking in /tmp/.clearml_agent_out.t3g81c0n.txt:
the file was last modified at 3:21 and the last line is exactly as reported on the webserver.

can't see anything abnormal
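Since these lines repeat through the whole log, it can be easier to compare the last time each endpoint was called than to scan by eye. Below is a minimal sketch that assumes the line format shown in the excerpt above. The function name `last_seen_per_endpoint` is made up for illustration, and the log mixes calls from all tasks and workers, so the result is only a hint.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] "
    r"\[\d+\] \[\w+\] \[[^\]]+\] "
    r"Returned \d{3} for (?P<endpoint>[\w.]+) in \d+ms"
)


def last_seen_per_endpoint(lines):
    """Return {endpoint: datetime of its most recent call} for matching log lines."""
    last = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines in any other format
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        endpoint = m["endpoint"]
        if endpoint not in last or ts > last[endpoint]:
            last[endpoint] = ts
    return last


# Lines copied from the log excerpts in this thread.
sample_lines = [
    "[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms",
    "[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms",
    "[2023-02-28 03:21:26,454] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 61ms",
    "[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms",
]

last = last_seen_per_endpoint(sample_lines)
for endpoint, ts in sorted(last.items()):
    print(endpoint, ts)
```

If tasks.ping keeps appearing long after the last events.add_batch, the process is alive but no longer reporting, which points at the training code rather than the server. To run it on the real file, pass an open file handle for apiserver.log instead of `sample_lines`.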

  
  
Posted one year ago

Yes. The task's last update was at 3:21 Feb 28.
Here are some lines from the log after that time:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms

Well after the task's last update I can see a couple of WARNINGs in the log. To be honest, I'm not sure whether they relate to the same task or to a new one; nevertheless I'll add them, maybe they will help (replaced the company value with <xxx>):
[2023-02-28 09:01:19,988] [9] [WARNING] [clearml.service_repo] Returned 404 for workers.get_runtime_properties in 0ms, msg=Unable to find endpoint for name workers.get_runtime_properties and version 2.23
[2023-02-28 09:01:52,913] [9] [WARNING] [clearml.service_repo] Returned 400 for queues.get_by_id in 4ms, msg=Invalid queue id: id=gpu1_queue, company=<xxx>
[2023-02-28 09:02:07,534] [9] [WARNING] [clearml.service_repo] Returned 400 for tasks.dequeue in 11ms, msg=Invalid task id: status=in_progress, expected=queued
@<1523701087100473344:profile|SuccessfulKoala55>

  
  
Posted one year ago

Isn't /tmp/.clearml_agent_out.t3g81c0n.txt the worker output?
I'm kind of new to ClearML, so forgive me for mixing up terms.

  
  
Posted one year ago

It was the only task @<1523701087100473344:profile|SuccessfulKoala55>
Did you encounter something like this?
Just to recap: the task status was "Running", but it seemed to be stuck. nvidia-smi showed the GPU still had memory allocated, which rules out the web server having lost contact with the agent after the agent had already finished. If someone had used the GPU outside ClearML, I would expect some sort of CUDA crash in the agent's run.

  
  
Posted one year ago

What about the worker that was running the experiment?

  
  
Posted one year ago

@<1541592227107573760:profile|EnchantingHippopotamus83> are you still seeing tasks.ping in the server log while the task seems to be stuck?

  
  
Posted one year ago

@<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted one year ago

Can you see if in the APIserver logs something happened during this time? Is the agent still reporting?

  
  
Posted one year ago

Are you using a self hosted server or the community server?

  
  
Posted one year ago