
as for agent.default_docker.arguments:
add to the conf?
default_docker: {
    arguments: ["--shm-size=8G"]
}
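For reference, a minimal sketch of where that block would sit in the agent's clearml.conf, assuming the standard layout (the image value below is just a placeholder):
agent {
    default_docker: {
        # placeholder; keep whatever default image your agent already uses
        image: "<your-default-image>"
        # arguments appended to the `docker run` command for the default image
        arguments: ["--shm-size=8G"]
    }
}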
yes. the task's last update was at 03:21 on Feb 28.
here are some lines from the log:
[2023-02-28 03:41:29,212] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 04:53:02,019] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
way after the task's last update I can see a couple of WARNINGs in the log. To be honest, I'm not sure if they relate to the same task or a new one; nevertheless I'll add them, maybe they will help (I replaced the company value with <xxx>):
...
@<1523701070390366208:profile|CostlyOstrich36>
it was the only task @<1523701087100473344:profile|SuccessfulKoala55>
did you encounter something like this?
just a recap: the task status was running, but it seemed to be stuck. nvidia-smi showed the GPU still had memory allocated, which rules out the web server disconnecting from the agent while the agent actually finished. If someone had used the GPU outside ClearML, I would expect some sort of CUDA crash in the agent's run.
well, if that's the case, this is the first out of many experiments on almost the same code. Let's hope I won't see this issue again.
@<1523701087100473344:profile|SuccessfulKoala55> @<1523701070390366208:profile|CostlyOstrich36> - thank you for your time and help
isn't this the worker output: /tmp/.clearml_agent_out.t3g81c0n.txt ?
I'm kinda new to ClearML, so forgive me for mixing up terms.
Hi,
I recently migrated my ClearML server to a different machine. I copied the whole data folder as recommended above. On the new ClearML server I can see all my old experiments and datasets. Unfortunately, when running a task with a dataset from the previous machine, the task fails and prints the old server IP:
2023-03-12 12:55:59,934 - clearml.storage - ERROR - Could not download None .............
I replaced the old IP with the new one everywhere I could find it.
i...
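For reference, the client-side pointers to the server live in the api section of clearml.conf; a minimal sketch assuming the default ports, with <new-ip> and the credentials as placeholders:
api {
    # point all three endpoints at the new machine (default ports shown)
    web_server: "http://<new-ip>:8080"
    api_server: "http://<new-ip>:8008"
    files_server: "http://<new-ip>:8081"
    credentials {
        # placeholder credentials generated from the new server's web UI
        access_key: "<access-key>"
        secret_key: "<secret-key>"
    }
}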
I was hoping there is a way to keep the artifacts but "clean" the reported metrics, plots, and debug samples.
thanks anyway
@<1523701070390366208:profile|CostlyOstrich36> @<1577830978284425216:profile|ContemplativeButterfly4>
yes, the lines above were from the task log. Let me add more info from it:
task yyy pulled from zzz by worker www # first line
Running Task xxx inside default docker: <my docker name> arguments: [] # second line on the task log
Executing: ['docker', 'run', '-t', '--gpus', '"device=1"', '--shm-size', '8G', ...] # beginning of the third line
agent.extra_docker_arguments.0 = --shm-size # later on
agent.extra_docker_arguments.1 = 8G # later on
default_docker: {
    arguments: ["--shm-size", "8G"]
}
the above seems to do the trick.
second line of the web console output:
Running Task xxx inside default docker: <my docker name> arguments: ['--shm-size', '8G']
later on:
agent.default_docker.arguments.0 = --shm-size
agent.default_docker.arguments.1 = 8G
later on:
docker_cmd = <my docker name> --shm-size 8G
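So both routes shown in this thread inject the flag. A consolidated sketch of the two options in the agent section of clearml.conf (as I understand it, extra_docker_arguments is added to every container the agent launches, while default_docker.arguments only applies to the default image; pick one):
agent {
    # option 1: arguments only for the default docker image
    default_docker: {
        arguments: ["--shm-size", "8G"]
    }
    # option 2: extra arguments added to every docker run issued by the agent
    extra_docker_arguments: ["--shm-size", "8G"]
}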
thank you for your help @<1523701070390366208:profile|CostlyOstrich36> :)
I looked in clearml_server/logs/apiserver.log:
the last report was at 2023-02-28 08:39:27,981, nothing wrong there.
Looking for the last update message at 03:21:
[2023-02-28 03:21:21,380] [9] [INFO] [clearml.service_repo] Returned 200 for events.add_batch in 46ms
[2023-02-28 03:21:25,103] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.ping in 8ms
[2023-02-28 03:21:25,119] [9] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all in 7ms
[2023-02-28 03:21:25,128] [9] [INFO] [clearml.service_re...