Well, if that is the case, it's the first out of many experiments on almost the same code. Let's hope I will not see this issue again.
@<1523701087100473344:profile|SuccessfulKoala55> @<1523701070390366208:profile|CostlyOstrich36> - thank you for your time and help
Yes, the task's last update was at 3:21 on Feb 28.
Here are some lines from the log:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
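(For reference, this is roughly how the same info can be read from the SDK; just a sketch, assuming the task ID is known and that the backend task record exposes a last_update field:)
from clearml import Task

# Hypothetical task ID -- replace with the stuck task's ID.
task = Task.get_task(task_id="<task_id>")

# Status as currently stored on the server (e.g. "in_progress").
print("status:", task.get_status())

# Assumption: the raw backend task record in task.data carries a last_update timestamp.
print("last update:", task.data.last_update)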
Way after the task's last update I can see a couple of WARNINGs in the log. To be honest, I'm not sure whether they relate to the same task or a new one; nevertheless I'll add them, maybe they will help (I replaced the company value with <xxx>):
...
I looked in clearml_server/logs/apiserver.log:
The last report is at 2023-02-28 08:39:27,981, nothing wrong.
Looking for the last update message at 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_re...
Is this not the worker output, /tmp/.clearml_agent_out.t3g81c0n.txt?
I'm kinda new to ClearML, so forgive me for mixing up terms.
It was the only task, @<1523701087100473344:profile|SuccessfulKoala55>
Did you encounter something like this?
Just a recap: the task status was "running", but it seemed to be stuck. nvidia-smi showed the GPU still had memory allocated, which rules out the scenario where the web server disconnected from the agent while the agent actually finished. If someone had used the GPU outside ClearML, I would expect some sort of CUDA crash in the agent's run.
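(In case someone hits the same thing: when a task reports "running" but is clearly stuck, something like this can force its status from the SDK. A minimal sketch, untested here, assuming a recent clearml SDK and a known task ID:)
from clearml import Task

# Hypothetical task ID -- replace with the stuck task's ID.
task = Task.get_task(task_id="<task_id>")

# Force the status to "stopped" even though the server still thinks the task is running.
task.mark_stopped(force=True, status_message="stopped manually, worker appeared stuck")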
@<1523701070390366208:profile|CostlyOstrich36>
Yes, the lines above were from the task log. Let me add more info from the task log:
task yyy pulled from zzz by worker www # first line
Running Task xxx inside default docker: <my docker name> arguments: [] # second line on the task log
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '--shm-size', '8G', ...] # beginning of the third line
agent.extra_docker_arguments.0 = --shm-size # later on
agent.extra_docker_arguments.1 = 8G # later on
As for agent.default_docker.arguments:
Should I add this to the conf?
default_docker: {
arguments: ["--shm-size=8G",]
}
I was hoping there is a way to keep the artifacts but "clean" the reported metrics, plots and debug samples.
Thanks anyway.
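(For completeness, this is the direction I had in mind; I have not verified it, so treat it as a sketch. The assumption is that the apiserver exposes an events.delete_for_task call that drops all reported events for a task (scalars, plots, debug samples, console output) while leaving its artifacts untouched:)
from clearml.backend_api.session.client import APIClient

client = APIClient()

# Hypothetical task ID -- replace with the task whose reported metrics should be cleared.
# Assumption: events.delete_for_task removes the task's events only; artifacts stay in place.
client.events.delete_for_task(task="<task_id>")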
@<1523701070390366208:profile|CostlyOstrich36> @<1577830978284425216:profile|ContemplativeButterfly4>
default_docker: {
arguments: ["--shm-size", 8G]
}
The above seems to do the trick.
Second line of the web console output:
Running Task xxx inside default docker: <my docker name> arguments: ['--shm-size', '8G']
later on:
agent.default_docker.arguments.0 = --shm-size
agent.default_docker.arguments.1 = 8G
later on:
docker_cmd = <my docker name> --shm-size 8G
thank you for your help @<1523701070390366208:profile|CostlyOstrich36> :)
Hi,
I recently migrated my ClearML server to a different machine. I copied the entire data folder as recommended above. On the new ClearML server I can see all my old experiments and datasets. Unfortunately, when running a task with a dataset from the previous machine, the task fails and prints the old server IP.
2023-03-12 12:55:59,934 - clearml.storage - ERROR - Could not download None .............
I replaced the old IP with the new one everywhere I could find it.
i...
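(A quick sketch of how to check where the stored links actually point, assuming the experiment or dataset is backed by a task and the SDK exposes the artifact URLs:)
from clearml import Task

# Hypothetical ID -- replace with one of the migrated experiments or datasets.
task = Task.get_task(task_id="<task_id>")

# Artifact URLs are stored verbatim in the backend, so after a migration they may
# still contain the old server's IP even if clearml.conf points at the new one.
for name, artifact in task.artifacts.items():
    print(name, "->", artifact.url)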