What about the worker that was running the experiment?
Are you using a self hosted server or the community server?
Well, if that is the case, this is the first time out of many experiments on almost the same code. Let's hope I won't see this issue again.
@<1523701087100473344:profile|SuccessfulKoala55> @<1523701070390366208:profile|CostlyOstrich36> - thank you for your time and help
Isn't this the worker output: /tmp/.clearml_agent_out.t3g81c0n.txt ?
I'm kinda new to ClearML, so forgive me for mixing up terms
@<1523701070390366208:profile|CostlyOstrich36>
Well, if the server still received tasks.ping, it means the ClearML SDK code was still running - the only thing left is that perhaps your code was stuck somewhere?
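If the training code itself is what hangs, one generic way to see where is Python's standard `faulthandler` module. This is only a sketch, not something ClearML does for you: the helper name `enable_stack_dumps` and its parameters are made up for illustration, and it assumes you can add a call to it near the top of your training script.

```python
import faulthandler
import signal
import sys


def enable_stack_dumps(log_file=None, hang_timeout_sec=None):
    """Hypothetical helper: make a stuck Python process report where it is.

    log_file must be a real file object (it needs a fileno()); defaults to stderr.
    """
    if log_file is None:
        log_file = sys.stderr

    # Dump Python tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=log_file, all_threads=True)

    # On Unix, `kill -USR1 <pid>` then dumps the stack of every thread on demand.
    # faulthandler.register is not available on Windows, hence the checks.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)

    # Optionally dump all stacks every hang_timeout_sec seconds, so a silent
    # hang still leaves a trace in the log.
    if hang_timeout_sec is not None:
        faulthandler.dump_traceback_later(
            hang_timeout_sec, repeat=True, file=log_file
        )
```

With this in place, sending `kill -USR1 <pid>` to the stuck training process should print the current stack of each thread, which usually shows whether it is waiting on a data loader, a lock, or a CUDA call.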
it was the only task @<1523701087100473344:profile|SuccessfulKoala55>
did you encounter something like this?
Just a recap: the task status was running, but it seems to be stuck. nvidia-smi showed the GPU still has memory allocated, which rules out the case where the web server lost contact with the agent and the agent had actually finished. If someone had used the GPU outside ClearML, I would expect some sort of CUDA crash in the agent's run.
looked in clearml_server/logs/apiserver.log:
last report on 2023-02-28 08:39:27,981. nothing wrong.
looking for the last update message on 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 5ms
[2023-02-28 03:21:25,145] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 10ms
[2023-02-28 03:21:26,454] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 61ms
[2023-02-28 03:21:30,142] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
looks fine - these lines repeat throughout the entire log.
looking in /tmp/.clearml_agent_daemon_outsw6p97f4.txt
file last modified at 3:18. last lines are in the middle of epoch 92, exactly like reported on webserver.
looking in /tmp/.clearml_agent_out.t3g81c0n.txt
file last modified 3:21 and the last line is exactly like reported on webserver
can't see anything abnormal
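Instead of eyeballing the log, the check above can be sketched as a small script that records when each endpoint was last seen in apiserver.log. This assumes every line follows the format of the excerpts in this thread; the function name `last_seen_per_endpoint` and the sample list are illustrative only.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] \[\d+\] "
    r"\[(?P<level>\w+)\] \[clearml\.service_repo\] "
    r"Returned (?P<status>\d{3}) for (?P<endpoint>[\w.]+) in \d+ms"
)


def last_seen_per_endpoint(lines):
    """Return {endpoint: datetime of the latest log line for that endpoint}."""
    last_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that are not in the assumed format
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        endpoint = m["endpoint"]
        if endpoint not in last_seen or ts > last_seen[endpoint]:
            last_seen[endpoint] = ts
    return last_seen


# A few lines copied from the excerpts in this thread.
SAMPLE_LINES = [
    "[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms",
    "[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms",
    "[2023-02-28 03:21:26,454] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 61ms",
    "[2023-02-28 03:21:30,142] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms",
    "[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms",
    "[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms",
]

for endpoint, ts in sorted(last_seen_per_endpoint(SAMPLE_LINES).items()):
    print(endpoint, ts)
```

On the real file you would pass `open("clearml_server/logs/apiserver.log")` instead of the sample list. If events.add_batch stops around 03:21 while tasks.ping keeps arriving, that would fit the picture of a task that is still alive but no longer reporting anything new.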
Can you see if in the APIserver logs something happened during this time? Is the agent still reporting?
yes. task's last update was on 3:21 Feb 28.
here are some lines from the log:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
way after the task's last update I can see a couple of WARNINGS in the log. To be honest, I'm not sure if they relate to the same task or a new one; nevertheless I'll add them, maybe they will help (replaced company value with <xxx> ):
[2023-02-28 09:01:19,988] [9] [WARNING] [clearml.service_repo] Returned 404 for workers.get_runtime_properties in 0ms, msg=Unable to find endpoint for name workers.get_runtime_properties and version 2.23
[2023-02-28 09:01:52,913] [9] [WARNING] [clearml.service_repo] Returned 400 for queues.get_by_id in 4ms, msg=Invalid queue id: id=gpu1_queue, company=<xxx>
[2023-02-28 09:02:07,534] [9] [WARNING] [clearml.service_repo] Returned 400 for tasks.dequeue in 11ms, msg=Invalid task id: status=in_progress, expected=queued
@<1523701087100473344:profile|SuccessfulKoala55>
@<1541592227107573760:profile|EnchantingHippopotamus83> are you still seeing tasks.ping in the server log while the task seems to be stuck?
And are there any other tasks running at this time?