I think the main issue is that, for some reason, the running container changed one of the files inside the temp folder. The host machine is then "stuck" with a file that the root user owns or changed, and it can no longer reuse or delete the temp folder.
I think the fix is to make sure the container deletes the temp folder when it is done.
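To check whether the temp folder really contains files the host user no longer owns, something like the following could be run on the host. This is only a diagnostic sketch: the helper name find_foreign_owned is hypothetical, and the path of the temp folder is not given in this thread, so it has to be passed in.

```python
import os
from pathlib import Path


def find_foreign_owned(root, uid=None):
    """Return paths under `root` whose owner differs from `uid` (default: current user)."""
    uid = os.getuid() if uid is None else uid
    foreign = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # lstat inspects a symlink itself rather than the file it points to
            if path.lstat().st_uid != uid:
                foreign.append(str(path))
    return sorted(foreign)
```

An empty result would suggest that ownership is not what blocks the cleanup; a non-empty one lists the entries the host user would not be able to delete without elevated permissions.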
I think I have narrowed the problem down to whether ssh agent forwarding is used. When I connected without a password using my ssh config, the config had the option ForwardAgent yes, and with this enabled, when I started the agent on the remote machine it didn't mount the .ssh folder by default until I added "-v", "/home/{user}/.ssh:/root/.ssh" to the arguments. So, without ssh agent forwarding everything works as expected.
So the only difference is how I log in to the machine to start clear-ml.
The only difference that I can think of is the OS environment in the two login types. Can you run export in the two cases and check the diff between them?
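One way to compare the two environments is to save the output of export from each login type and diff the parsed variables. The sketch below assumes single-line values in the `declare -x NAME="value"` or `export NAME='value'` form; the function names parse_export and diff_env are made up for illustration.

```python
def parse_export(text):
    """Parse the output of `export` (or `env`) into a dict of NAME -> value."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # bash prints `declare -x NAME="value"`; other shells print `export NAME='value'`
        for prefix in ("declare -x ", "export "):
            if line.startswith(prefix):
                line = line[len(prefix):]
                break
        if "=" not in line:
            continue  # variable exported without a value, or an unrelated line
        name, value = line.split("=", 1)
        env[name] = value.strip("'\"")
    return env


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for every variable that differs."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

With agent forwarding enabled, sshd normally sets SSH_AUTH_SOCK in the session, so that variable is a likely candidate to show up in the diff. Whether it is what changes the agent's behaviour here is an assumption to verify, not something this thread confirms.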
So the only difference is how I log in to the machine to start clear-ml (it somehow messes up the usage of the .ssh folder by the training container).
And I see a different error if I log in without the password (using authorized keys) and remove this extra .ssh volume argument from the docker command:
fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. Repository cloning failed
So it's not using the .ssh folder from the host user's home folder until I add "-v", "/home/{user}/.ssh:/root/.ssh" to the docker arguments.
AgitatedDove14 Actually, it happens on the same machine, where clearml-agent was started with clearml-agent daemon --detached --queue training-rig --gpus 1 --docker. The only difference is how I log in to the machine to start the agent (as described in the message above).
When I log in over ssh using a password, start the agent with the command above, add the extra "-v", "/home/{user}/.ssh:/root/.ssh" to the docker arguments, and send a task for execution on this agent, I see:
2022-07-28 16:31:34 latest: Pulling from {image_name}
Status: Image is up to date for {image_name}:latest
2022-07-28 16:31:39 docker: Error response from daemon: Duplicate mount point: /root/.ssh.
See 'docker run --help'.
2022-07-28 16:31:39 Process failed, exit code 125
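The "Duplicate mount point" error means the final docker run command contained two volume arguments with the same container target. A small check like the one below can flag that before a task is sent. It is a sketch only: duplicate_mount_targets is a hypothetical helper, and it assumes the simple host:container[:options] form of -v / --volume.

```python
def duplicate_mount_targets(docker_args):
    """Return container paths targeted by more than one -v/--volume argument."""
    seen, duplicates = set(), []
    args = iter(docker_args)
    for arg in args:
        if arg in ("-v", "--volume"):
            spec = next(args, "")
        elif arg.startswith("--volume="):
            spec = arg.split("=", 1)[1]
        else:
            continue
        parts = spec.split(":")
        if len(parts) < 2:
            continue  # anonymous volume or malformed spec, nothing to compare
        target = parts[1].rstrip("/") or "/"
        if target in seen and target not in duplicates:
            duplicates.append(target)
        seen.add(target)
    return duplicates
```

For example, if the agent already mounts its own copy of .ssh (shown here with a made-up host path), adding the manual argument produces the clash:

```python
args = ["-v", "/tmp/ssh_copy:/root/.ssh", "-v", "/home/user/.ssh:/root/.ssh"]
duplicate_mount_targets(args)  # ['/root/.ssh']
```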
But if I do exactly the same on the same machine but log in to it without the password (by adding my public ssh key to its authorized keys) and start the agent with the identical command, I don't see this error and everything works fine.
BurlyRaccoon64 by default, if .ssh exists in the host user folder, the agent should mount it into the container (actually, mount a copy of it). Do you have logs of two tasks from two different machines, one failing and one passing? This is quite odd (assuming the setup itself is identical).