My Autoscaled Instance Fails When Running "Git Clone" On A Private Repo

Answered

My autoscaled instance fails when running "git clone" on a private repo.

I do have the SSH key placed at /root/.ssh/id_rsa on the machine, and when I SSH into the machine and run sudo su; git clone <the repo>, it succeeds.

Also, in the extra_vm_bash_script field, I added a whoami command, which prints root, so it seems the user running git clone during task execution is in fact root.

For context, here's the startup command that the autoscaler runs:

python -m clearml_agent --config-file /root/clearml.conf daemon --queue aws_4gpu_machines --docker python:3.9

Full log included...

  				
Posted one year ago by BattyCrocodile47


Answers 34

Well wow, I figured it out. You equipped me with a solid debugging tool, namely running bash commands within the Docker container.

I had to pre-add GitHub and Bitbucket to known_hosts by adding ssh-keyscan commands to extra_vm_bash_script:

configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa) || echo "failed"
    source /clearml_agent_venv/bin/activate
    echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_public_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa.pub && chmod 600 ~/.ssh/id_rsa.pub) || echo "failed"
    source /clearml_agent_venv/bin/activate

    # I added these new lines:
    ssh-keyscan github.com >> ~/.ssh/known_hosts
    ssh-keyscan bitbucket.org >> ~/.ssh/known_hosts
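
One caveat with the two appended lines: each run of the script adds the same host keys to known_hosts again, and `>>` fails if ~/.ssh does not exist yet. A minimal sketch of an idempotent variant follows, assuming a hypothetical helper named append_unique_lines (it is not part of ClearML or OpenSSH). Note also that ssh-keyscan trusts whatever key the network returns, so comparing the result against the fingerprints that GitHub and Bitbucket publish is the safer route.

```shell
# Hypothetical helper: append lines from stdin to a file, skipping lines
# that are already present, so repeated runs do not pile up duplicates.
append_unique_lines() {
  target="$1"
  # Create the parent directory and the file if they are missing.
  mkdir -p "$(dirname "$target")"
  touch "$target"
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip blank lines.
    [ -n "$line" ] || continue
    # Append only when no identical line exists in the target file.
    grep -qxF -e "$line" "$target" || printf '%s\n' "$line" >> "$target"
  done
}

# Intended use in the startup script (needs network access):
#   ssh-keyscan github.com 2>/dev/null | append_unique_lines ~/.ssh/known_hosts
#   ssh-keyscan bitbucket.org 2>/dev/null | append_unique_lines ~/.ssh/known_hosts
```

The helper only deduplicates exact lines, which matches the unhashed output of plain ssh-keyscan; hashed entries (ssh-keyscan -H) would not be recognized as duplicates.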

  				
Posted one year ago by BattyCrocodile47

I don't see it as an argument in Task.init or Task.execute_remotely

  				
Posted one year ago by BattyCrocodile47

I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh but instead it's -v /root.ssh:/.ssh

  				
Posted one year ago by BattyCrocodile47

Actually, dumb question: how do I set the setup script for a task?

  				
Posted one year ago by BattyCrocodile47

I'm not seeing an extra_docker_shell_script entry in the clearml.conf generated by clearml-agent init, like the one in this guide

  				
Posted one year ago by BattyCrocodile47

Here's a screenshot of a session where I first try to clone as ssm-user, which fails, then switch to root, where it succeeds

  				
Posted one year ago by BattyCrocodile47

Let's see. The task log? I think this is it.

  				
Posted one year ago by BattyCrocodile47

Actually that's wrong: really this is the current volume mount

'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',

Could changing these values to /root/.ssh work? Do you know which user ClearML runs as inside the Docker image?

  				
Posted one year ago by BattyCrocodile47

So I get output with this one, but the console only shows me the output from my machine. For example, the SSH key is present, and whoami results in ericriddoch

  				
Posted one year ago by BattyCrocodile47

Let's see. The screenshots above are me running on the host, not attaching to a running container. So I believe I do want the keys to be mounted into the running containers.

  				
Posted one year ago by BattyCrocodile47

Remove this from your startup script:

#!/bin/bash

There is no need for it; it actually "marks out" the entire thing

  				
Posted one year ago by AgitatedDove14

Haha, that was a total gotcha for me. Yeah, a lot just wasn't even getting run due to the #!/bin/bash part.

Anyway, wow! I finally got the precious console logs you were hoping to find. Here they are:

2023-05-06 00:19:21
User aborted: stopping task (3)
2023-05-06 00:19:21
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.5.0 psutil-5.9.5 pyjwt-2.6.0 pyparsing-3.0.9 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.28.2 six-1.16.0 urllib3-1.26.15 virtualenv-20.23.0
WARNING: You are using pip version 20.1.1; however, version 23.1.2 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
+ ls -la /.ssh
total 12
drwx------ 2 root root   61 May  6 06:18 .
drwxr-xr-x 1 root root  123 May  6 06:18 ..
-rw------- 1 root root  722 May  6 06:15 authorized_keys
-rw------- 1 root root 2603 May  6 06:18 id_rsa
-rw------- 1 root root  568 May  6 06:18 id_rsa.pub
+ ls -la /root/.ssh
total 12
drwx------ 2 root root   61 May  6 06:19 .
drwx------ 1 root root   48 May  6 06:19 ..
-rw------- 1 root root  722 May  6 06:19 authorized_keys
-rw------- 1 root root 2603 May  6 06:19 id_rsa
-rw------- 1 root root  568 May  6 06:19 id_rsa.pub
+ whoami
root
+ cat /root/.ssh/id_rsa
+ head -n 3
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAABlwAAAAdzc2gtcn
NhAAAAAwEAAQAAAYEA8IluYkpM1l7TK/O1JnhEzeLJKa7+aWO+Gn20R4Ql59FlxQsTq/UE
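
One detail in the two listings above: /.ssh and /root/.ssh both contain the key pair and authorized_keys, but neither has a known_hosts file. In a non-interactive session, an SSH-based git clone can fail host key verification in that situation even when the key itself is fine. A minimal sketch of a check that could be dropped into a setup script, assuming a hypothetical helper named check_ssh_dir and unhashed known_hosts entries (the format plain ssh-keyscan produces):

```shell
# Hypothetical helper: report whether an SSH directory looks ready for a
# non-interactive git clone from the given host.
# Usage: check_ssh_dir <ssh_dir> <host>
check_ssh_dir() {
  ssh_dir="$1"
  host="$2"
  rc=0

  # The private key must exist in the directory.
  if [ ! -f "$ssh_dir/id_rsa" ]; then
    echo "missing private key: $ssh_dir/id_rsa"
    rc=1
  fi

  # known_hosts must list the host in the first (comma-separated) field.
  if ! awk -v h="$host" '
        { n = split($1, names, ","); for (i = 1; i <= n; i++) if (names[i] == h) found = 1 }
        END { exit !found }
      ' "$ssh_dir/known_hosts" 2>/dev/null; then
    echo "no known_hosts entry for $host in $ssh_dir"
    rc=1
  fi

  return "$rc"
}

# Example: check_ssh_dir /root/.ssh github.com
```

The function returns non-zero when either check fails, so it can gate a later git clone step.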

  				
Posted one year ago by BattyCrocodile47

The key seems to be placed in the expected location

  				
Posted one year ago by BattyCrocodile47

'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',

My bad. After that, inside the container, it runs cp -Rf /.ssh ~/.ssh
The reason is that you cannot know the user's home folder before spinning up the container.
Anyhow, the point is: are you sure that ~/.ssh is configured on the host machine?
And if it is, are you saying it is part of your AMI? If not, how did you put it there?
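
A rough way to picture the staging step described here, and why anything placed in the host's ~/.ssh (a known_hosts file included) ends up in the container user's home. This is only an illustrative sketch: install_staged_ssh is a hypothetical helper, not the agent's actual code, and it copies the directory contents rather than running the exact cp -Rf /.ssh ~/.ssh command quoted above.

```shell
# Hypothetical helper: copy a staged SSH directory into a user's home.
# Usage: install_staged_ssh <staged_dir> <home_dir>
install_staged_ssh() {
  staged="$1"      # e.g. /.ssh, the fixed mount target inside the container
  home_dir="$2"    # e.g. "$HOME", only known once the container is running
  mkdir -p "$home_dir/.ssh"
  # Copy the contents (keys, known_hosts, ...) rather than the directory itself.
  cp -Rf "$staged"/. "$home_dir/.ssh/"
  # SSH refuses keys kept in a directory that other users can access.
  chmod 700 "$home_dir/.ssh"
}

# Example: install_staged_ssh /.ssh "$HOME"
```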

  				
Posted one year ago by AgitatedDove14

Or the log of the init script?

  				
Posted one year ago by BattyCrocodile47

It doesn't seem to want to show me stdout

  				
Posted one year ago by BattyCrocodile47

That's with the key at /root/.ssh/id_rsa

  				
Posted one year ago by BattyCrocodile47

Hi @BattyCrocodile47

> I do have the SSH key placed at /root/.ssh/id_rsa on the machine,

Notice that the .ssh folder is mounted from the host (EC2 / GCP) into the container,

'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh'

This is odd, why is it mounting it to /.ssh and not /root/.ssh ?

  				
Posted one year ago by AgitatedDove14

> That's with the key at /root/.ssh/id_rsa

You mean inside the container that the autoscaler spun up?
Notice that the agent by default mounts the host's .ssh over the existing .ssh inside the container. If you do not want this behavior, you need to set agent.disable_ssh_mount: true in clearml.conf

  				
Posted one year ago by AgitatedDove14

On it

  				
Posted one year ago by BattyCrocodile47

So here's a snippet from my aws_autoscaler.yaml file

  				
Posted one year ago by BattyCrocodile47

Here we go. Trying with this

  				
Posted one year ago by BattyCrocodile47

Wow, it really does not want to show the output of those print statements in stdout. Here's the output of the task from the console after cloning it. Confirmed that the setup script and all code changes are present:

  				
Posted one year ago by BattyCrocodile47

So, we've been able to run sudo su and then git clone with our private repos a few times now

  				
Posted one year ago by BattyCrocodile47

It's an Amazon Linux AMI with the AWS CLI pre-installed on it. It uses the AWS CLI to fetch the key from AWS SSM Parameter Store. It's granted read access to that SSM Parameter via the instance role.

  				
Posted one year ago by BattyCrocodile47

DM me the entire log, I would assume this is something with the configuration

  				
Posted one year ago by AgitatedDove14

I can't think of any changes we might have made on our side to cause that 🤔

  				
Posted one year ago by BattyCrocodile47

> I do have the SSH key placed at /root/.ssh/id_rsa on the machine,

@BattyCrocodile47 is the SSH key part of the containers, or are you saying it is on the EC2 instance?

  				
Posted one year ago by AgitatedDove14

> Actually, dumb question: how do I set the setup script for a task?

When you clone/edit the Task in the UI, you should find it under Execution / Container.
After you edit it, just push the Task into execution with the autoscaler and wait 🙂

  				
Posted one year ago by AgitatedDove14

I get the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script()

  				
Posted one year ago by BattyCrocodile47
