Well, if that's the case, this was the first out of many experiments on almost the same code. Let's hope I won't see this issue again.
@<1523701087100473344:profile|SuccessfulKoala55> @<1523701070390366208:profile|CostlyOstrich36> - thank you for your time and help
Well, if the server still received tasks.ping, it means the ClearML SDK code was still running - the only thing left was that perhaps your code was stuck somewhere?
@<1541592227107573760:profile|EnchantingHippopotamus83> are you still seeing tasks.ping in the server log while the task seems to be stuck?
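You can also double-check from the SDK side what the server last heard from the task - a minimal sketch, where <task_id> is a placeholder for the ID copied from the experiment page in the UI:
from clearml import Task

# <task_id> is a placeholder - copy the real ID from the experiment's page in the UI
task = Task.get_task(task_id="<task_id>")
print(task.get_status())      # e.g. "in_progress" if the server still considers it running
print(task.data.last_update)  # server-side timestamp of the last update received for this task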
Looked in clearml_server/logs/apiserver.log:
Last report at 2023-02-28 08:39:27,981 - nothing wrong there.
Looking for the last update messages around 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 5ms
[2023-02-28 03:21:25,145] [9] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 10ms
[2023-02-28 03:21:26,454] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 61ms
[2023-02-28 03:21:30,142] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
Looks fine - these lines repeat throughout the entire log.
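(For reference, a minimal sketch of that kind of log filtering, assuming the same apiserver.log path as above - adjust the path to your install:)
from pathlib import Path

# path assumed from the messages above - adjust to wherever your server writes its logs
log_path = Path("clearml_server/logs/apiserver.log")
pings = [line for line in log_path.read_text().splitlines() if "tasks.ping" in line]
print("\n".join(pings[-5:]))  # timestamps of the last few pings show whether the SDK was still reporting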
Looking in /tmp/.clearml_agent_daemon_outsw6p97f4.txt:
File last modified at 03:18; the last lines are from the middle of epoch 92, exactly as reported in the web UI.
Looking in /tmp/.clearml_agent_out.t3g81c0n.txt:
File last modified at 03:21, and the last line is exactly what's reported in the web UI.
Can't see anything abnormal.
Isn't /tmp/.clearml_agent_out.t3g81c0n.txt the worker output?
I'm kinda new to ClearML, so forgive me for mixing up terms.
Yes. The task's last update was at 03:21 on Feb 28.
Here are some lines from the log:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
Way after the task's last update I can see a couple of WARNINGs in the log. To be honest, I'm not sure whether they relate to the same task or a new one; nevertheless I'll add them, maybe they will help (I replaced the company value with <xxx>):
[2023-02-28 09:01:19,988] [9] [WARNING] [clearml.service_repo] Returned 404 for workers.get_runtime_properties in 0ms, msg=Unable to find endpoint for name workers.get_runtime_properties and version 2.23
[2023-02-28 09:01:52,913] [9] [WARNING] [clearml.service_repo] Returned 400 for queues.get_by_id in 4ms, msg=Invalid queue id: id=gpu1_queue, company=<xxx>
[2023-02-28 09:02:07,534] [9] [WARNING] [clearml.service_repo] Returned 400 for tasks.dequeue in 11ms, msg=Invalid task id: status=in_progress, expected=queued
@<1523701087100473344:profile|SuccessfulKoala55>
And are there any other tasks running at this time?
What about the worker that was running the experiment?
Are you using a self hosted server or the community server?
Can you see in the API server logs whether anything happened during this time? Is the agent still reporting?
@<1523701070390366208:profile|CostlyOstrich36>
It was the only task @<1523701087100473344:profile|SuccessfulKoala55>
Did you encounter something like this?
Just a recap: the task status was "running", but it seemed to be stuck. nvidia-smi showed the GPU still had memory allocated, which rules out the web server disconnecting from the agent while the agent had actually finished. If someone had used the GPU outside of ClearML, I would expect some sort of CUDA crash in the agent's run.
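(Side note: if a task stays stuck in the running state like this, it can be aborted manually from the SDK - a minimal sketch, assuming the stuck experiment's ID is copied from the UI:)
from clearml import Task

# <task_id> is a placeholder - use the stuck experiment's ID from the UI
task = Task.get_task(task_id="<task_id>")
task.mark_stopped(force=True)  # force=True to stop it even though it is still marked in_progress (assumption - check the SDK docs)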