Hello! Thank You All For Your Work! I Have A Question (Which Is Probably Not Clearml Related At All). I Am Using Clearml-Agent Running In Docker Mode On Several Machines With Gpu In Our Local Network And Get Different Behaviour Depending On How I Logged I

Answered

Hello! Thank you all for your work! I have a question (which is probably not ClearML related at all). I am using clearml-agent in docker mode on several GPU machines in our local network, and I get different behaviour depending on how I log in to the machine to start the agent. More precisely, if I log in with a registered ssh key (added to authorized keys) and bring the worker up, then when it tries to set up a container for training it says it doesn't have sufficient permissions to download the git repo, unless I explicitly add sharing of the .ssh volume to the docker arguments. Therefore, I have the following config file on the agent machine:

    default_docker: {
        ...
        arguments: [... "-v", "/home/{user}/.ssh:/root/.ssh" ...]
        ...
    },

    docker_internal_mounts {
        ...
        ssh_folder: "/root/.ssh"
        ...
    }

However, when I log in with an ssh password and start the worker, then when it tries to set up a container for training it complains about the .ssh folder being a duplicated shared volume (which can be fixed by removing the explicit volume sharing for the .ssh folder from the docker arguments). Can anybody guess what causes this behaviour?

  				
Posted 2 years ago by BurlyRaccoon64


Answers 7

I think the main issue is that, for some reason, the running container changed one of the files inside the temp folder. The host machine is then "stuck" with a file that the root user owns or changed, so it can no longer reuse or delete the temp folder.
I think the fix is to make sure the container deletes the temp folder when it is done.

  				
Posted 2 years ago by AgitatedDove14

So the only difference is how I log in into machine to start clear-ml

The only difference I can think of is the OS environment in the two login types.
Can you run export in both cases and check the diff between them?

    export
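To make the comparison suggested above less error-prone, here is a minimal sketch that parses two saved `export` dumps and reports only the variables that differ. It assumes bash-style output (`declare -x NAME="value"`); the function names `parse_export_dump` and `diff_env` and the sample values are hypothetical, not part of ClearML.

```python
import re

# Matches lines printed by bash's `export` builtin, e.g.:
#   declare -x HOME="/home/user"
# Other shells print `export NAME=value`, which is accepted as well.
_EXPORT_LINE = re.compile(
    r'^(?:declare -x|export)\s+([A-Za-z_][A-Za-z0-9_]*)(?:=(.*))?$'
)


def parse_export_dump(text):
    """Turn the saved output of `export` into a {name: value} dict."""
    env = {}
    for line in text.splitlines():
        match = _EXPORT_LINE.match(line.strip())
        if not match:
            continue  # e.g. continuation lines of multi-line values
        name, value = match.group(1), match.group(2)
        env[name] = (value or "").strip('"')
    return env


def diff_env(first, second):
    """Return {name: (value_in_first, value_in_second)} for differing variables."""
    names = set(first) | set(second)
    return {
        name: (first.get(name), second.get(name))
        for name in sorted(names)
        if first.get(name) != second.get(name)
    }


if __name__ == "__main__":
    # Made-up sample dumps; in practice read the two files saved
    # with `export > some_file` in each login session.
    key_login = parse_export_dump(
        'declare -x HOME="/home/user"\n'
        'declare -x SSH_AUTH_SOCK="/tmp/ssh-abc/agent.123"\n'
    )
    password_login = parse_export_dump('declare -x HOME="/home/user"\n')
    for name, (a, b) in diff_env(key_login, password_login).items():
        print(name, "key login:", a, "password login:", b)
```

A variable that exists in only one session shows up with `None` on the other side, which is usually the interesting case here.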

  				
Posted 2 years ago by AgitatedDove14

And I see another error if I log in without the password (using authorized keys) and remove this extra .ssh volume argument from the docker command:

    fatal: Could not read from remote repository.
    Please make sure you have the correct access rights and the repository exists.
    Repository cloning failed

So it's not using the .ssh folder in the host user folder until I add "-v", "/home/{user}/.ssh:/root/.ssh" to the docker arguments.

  				
Posted 2 years ago by BurlyRaccoon64

BurlyRaccoon64 by default, if .ssh exists in the host user folder, it should be mounted into the container (actually, a copy of it is mounted). Do you have logs of two tasks from two different machines, one failing and one passing? This is quite odd (assuming the setup itself is identical).

  				
Posted 2 years ago by AgitatedDove14

So the only difference is how I log in to the machine to start clear-ml (it somehow messes up the usage of the .ssh folder by the training container).

  				
Posted 2 years ago by BurlyRaccoon64

AgitatedDove14 Actually, it happens on the same machine, where clearml-agent is started with:

    clearml-agent daemon --detached --queue training-rig --gpus 1 --docker

The only difference is how I log in to the machine to start the agent (as described in the message above).
When I log in over ssh using a password, start the agent with the command above, add the extra "-v", "/home/{user}/.ssh:/root/.ssh" to the docker arguments, and send a task for execution on this agent, I see:

    2022-07-28 16:31:34 latest: Pulling from {image_name} Status: Image is up to date for {image_name}:latest
    2022-07-28 16:31:39 docker: Error response from daemon: Duplicate mount point: /root/.ssh. See 'docker run --help'.
    2022-07-28 16:31:39 Process failed, exit code 125

But if I do exactly the same on the same machine but log in without the password (by adding my public ssh key to its authorized keys) and start the agent with the identical command, I don't see this error and everything works fine.
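The "Duplicate mount point" error means that two volume flags in the final `docker run` command target the same path inside the container. As a rough sketch, assuming you can get hold of the full argument list (your own extra arguments plus whatever the agent adds), the hypothetical helper below lists container paths that are mounted more than once. It is not part of clearml-agent, and the `/tmp/ssh_copy` path in the usage is made up for illustration.

```python
from collections import defaultdict


def container_mount_targets(docker_args):
    """Map each container-side path to the volume specs that target it.

    Understands `-v SPEC`, `--volume SPEC` and `--volume=SPEC`, where SPEC
    is `host:container[:options]` (or a bare container path).
    """
    targets = defaultdict(list)
    args = iter(docker_args)
    for arg in args:
        if arg in ("-v", "--volume"):
            spec = next(args, None)
            if spec is None:
                break  # dangling flag without a value
        elif arg.startswith("--volume="):
            spec = arg.split("=", 1)[1]
        else:
            continue
        parts = spec.split(":")
        # A single-part spec is an anonymous volume at that container path.
        target = parts[1] if len(parts) >= 2 else parts[0]
        targets[target].append(spec)
    return targets


def duplicate_mounts(docker_args):
    """Return only the container paths that are mounted more than once."""
    return {
        target: specs
        for target, specs in container_mount_targets(docker_args).items()
        if len(specs) > 1
    }


if __name__ == "__main__":
    combined = [
        "-v", "/home/user/.ssh:/root/.ssh",   # explicit extra argument
        "-v", "/tmp/ssh_copy:/root/.ssh",     # hypothetical automatic mount
        "-v", "/data:/data:ro",
    ]
    print(duplicate_mounts(combined))
```

If `/root/.ssh` shows up in the result, one of the two mounts has to go, which matches the workaround of dropping the explicit `-v` argument.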

  				
Posted 2 years ago by BurlyRaccoon64

I think I narrowed the problem down to whether ssh agent forwarding is used. When I connected without a password through my ssh config, the config had the option ForwardAgent yes. With it enabled, the agent I started on the remote machine didn't mount the .ssh folder by default until I added "-v", "/home/{user}/.ssh:/root/.ssh" to the arguments. So, without ssh agent forwarding everything works as expected.
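A quick way to tell which situation a given login session is in: OpenSSH exposes a forwarded agent to the remote session through the `SSH_AUTH_SOCK` environment variable. The sketch below only checks that variable; the helper name `ssh_agent_socket` is hypothetical, and whether clearml-agent bases its mounting decision on this variable is an assumption that the thread does not confirm. Note that a locally started ssh-agent sets the same variable, so a non-empty value is a hint rather than proof of forwarding.

```python
import os


def ssh_agent_socket(environ=None):
    """Return the ssh-agent socket path visible to this session, or None.

    `environ` defaults to the current process environment; passing a dict
    makes the check easy to try against a saved environment.
    """
    environ = os.environ if environ is None else environ
    sock = environ.get("SSH_AUTH_SOCK")
    return sock or None  # treat an empty value as "not set"


if __name__ == "__main__":
    sock = ssh_agent_socket()
    if sock:
        print("ssh-agent socket visible in this session:", sock)
    else:
        print("no ssh-agent socket in this session")
```

Running this in the shell right before starting the agent shows which of the two behaviours described above to expect.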

  				
Posted 2 years ago by BurlyRaccoon64
