Haha, that was a total gotcha for me. Yeah, a lot just wasn't even getting run due to the #!/bin/bash part.
Anyway, wow! I finally got the precious console logs you were hoping to find. Here they are:
2023-05-06 00:19:21
User aborted: stopping task (3)
2023-05-06 00:19:21
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.5.0 psutil-5.9.5 pyjwt-2.6.0 pyparsing-3.0.9 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.28.2 six-1.16.0 urllib3-1.26.15 virtualenv-20.23.0
WARNING: You are using pip version 20.1.1; however, version 23.1.2 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
+ ls -la /.ssh
total 12
drwx------ 2 root root 61 May 6 06:18 .
drwxr-xr-x 1 root root 123 May 6 06:18 ..
-rw------- 1 root root 722 May 6 06:15 authorized_keys
-rw------- 1 root root 2603 May 6 06:18 id_rsa
-rw------- 1 root root 568 May 6 06:18 id_rsa.pub
+ ls -la /root/.ssh
total 12
drwx------ 2 root root 61 May 6 06:19 .
drwx------ 1 root root 48 May 6 06:19 ..
-rw------- 1 root root 722 May 6 06:19 authorized_keys
-rw------- 1 root root 2603 May 6 06:19 id_rsa
-rw------- 1 root root 568 May 6 06:19 id_rsa.pub
+ whoami
root
+ cat /root/.ssh/id_rsa
+ head -n 3
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAABlwAAAAdzc2gtcn
NhAAAAAwEAAQAAAYEA8IluYkpM1l7TK/O1JnhEzeLJKa7+aWO+Gn20R4Ql59FlxQsTq/UE
Actually that's wrong: really this is the current volume mount
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
Could changing these values to /root/.ssh work? Do you know which user ClearML is using within the docker image?
Let's see. The task log? I think this is it.
I can't think of any changes we might have made on our side to cause that 🤔
Here's a screenshot of a session where I first try to clone as ssm-user, but it fails; then I change to root and it succeeds
DM me the entire log, I would assume this is something with the configuration
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
It's my bad, after that inside the container it does cp -Rf /.ssh ~/.ssh
The reason is that you cannot know the user home folder before spinning the container
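To make the step above concrete, here is a minimal sketch of "mount at /.ssh, then copy into the home folder once the container user is known". It is not the agent's actual code: the function name `copy_mounted_ssh` and the permission tightening are assumptions added for illustration, and it copies the directory contents (`src/.`) rather than running the literal `cp -Rf /.ssh ~/.ssh`.

```bash
# Hypothetical helper, not part of clearml-agent.
# Copies a mounted SSH directory into the current user's home,
# which is only known once the container is running.
copy_mounted_ssh() {
    local src=${1:-/.ssh}
    local dest=${2:-"$HOME/.ssh"}

    # Nothing to copy if the mount is missing.
    [ -d "$src" ] || { echo "no mounted SSH dir at $src" >&2; return 1; }

    mkdir -p "$dest" || return 1
    # "src/." copies the contents (including dotfiles) into dest,
    # instead of nesting a second .ssh directory inside it.
    cp -Rf "$src/." "$dest/" || return 1

    # ssh refuses keys that are readable by other users.
    chmod 700 "$dest"
    find "$dest" -type f -exec chmod 600 {} +
}
```

Inside a container this would be called with no arguments, so it falls back to /.ssh and $HOME/.ssh.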
Anyhow the point is, are you sure that you have ~/.ssh on the Host machine configured?
And if you do, are you saying this is part of your AMI? if not how did you put it there?
I do have the SSH key placed at /root/.ssh/id_rsa on the machine,
@<1541954607595393024:profile|BattyCrocodile47> is the SSH key part of the containers, or are you saying it is on the EC2 instance?
That's with the key at /root/.ssh/id_rsa
It doesn't seem to want to show me stdout
So, we've been able to run sudo su and then git clone with our private repos a few times now
I'm not seeing an extra_docker_shell_script in my clearml.conf generated by clearml-agent init like in this guide
I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh but instead it's -v /root.ssh:/.ssh
I get the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script()
cc: @<1565509803839590400:profile|MoodyBear54>
The key seems to be placed in the expected location
oh that makes sense.
I would add to your Task's docker startup script the following:
ls -la /.ssh
ls -la ~/.ssh
cat ~/.ssh/id_rsa
Let's see what you get
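A variation on the commands above, as a sketch only: it reports the same information but prints just the first line of the key, so the full private key does not end up in the task console. The function name `ssh_debug_report` is made up for this example.

```bash
# Hypothetical debugging helper for a container startup script.
# Lists each directory and shows only the header line of id_rsa,
# so the private key body is never written to the log.
ssh_debug_report() {
    local dir
    echo "user: $(whoami)"
    for dir in "$@"; do
        echo "== $dir"
        if [ -d "$dir" ]; then
            ls -la "$dir"
            if [ -f "$dir/id_rsa" ]; then
                head -n 1 "$dir/id_rsa"
            fi
        else
            echo "missing"
        fi
    done
}
```

In the startup script it could be called as `ssh_debug_report /.ssh ~/.ssh`.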
configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa
    source /clearml_agent_venv/bin/activate
hyper_params:
  iam_arn: arn:aws:iam::<my account id>:instance-profile/clearml-2-AutoscaledInstanceProfileAutoScaledEC2InstanceProfile56A5348F-90fmf6H5OUBx
So I get output with this one, but the console only shows me the output from my machine. For example, the SSH key is present, and whoami results in ericriddoch
I don't see it as an argument in Task.init or Task.execute_remotely
That's with the key at /root/.ssh/id_rsa
You mean inside the container that the autoscaler spun up?
Notice that the agent by default mounts the Host .ssh over the existing .ssh inside the container. If you do not want this behavior, you need to set agent.disable_ssh_mount: true in clearml.conf
Remove this from your startup script:
#!/bin/bash
There is no need for that; it actually "marks out" the entire thing
Actually, dumb question: how do I set the setup script for a task?
When you clone/edit the Task in the UI, under Execution / Container you should have it
After you edit it, just push it into the execution with the autoscaler and wait 🙂
Well wow, I figured it out. You equipped me with a solid debugging tool, AKA running bash commands within the docker container.
I had to pre-add GitHub and Bitbucket to known hosts by adding keyscan commands
configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa) || echo "failed"
    source /clearml_agent_venv/bin/activate
    echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_public_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa.pub && chmod 600 ~/.ssh/id_rsa.pub) || echo "failed"
    source /clearml_agent_venv/bin/activate
    # I added these new lines:
    ssh-keyscan github.com >> ~/.ssh/known_hosts
    ssh-keyscan bitbucket.org >> ~/.ssh/known_hosts
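One caveat with plain `>>` is that every run appends the host keys again. Below is a sketch of an idempotent variant, assuming unhashed known_hosts entries (what `ssh-keyscan` writes without `-H`). The function name `add_known_host` and the `KEYSCAN_CMD` override are hypothetical and exist only so the logic can be exercised without network access.

```bash
# Hypothetical helper: add a host's keys to known_hosts only if the
# host is not listed yet. Assumes unhashed entries.
add_known_host() {
    local host=$1
    local file=${2:-"$HOME/.ssh/known_hosts"}
    local scan=${KEYSCAN_CMD:-ssh-keyscan}

    mkdir -p "$(dirname "$file")" || return 1
    touch "$file" || return 1

    # The first field of a known_hosts line is a comma-separated host list.
    if awk -v h="$host" '
        { n = split($1, names, ","); for (i = 1; i <= n; i++) if (names[i] == h) found = 1 }
        END { exit !found }' "$file"; then
        return 0   # already present, nothing to do
    fi

    "$scan" "$host" >> "$file"
}
```

Usage would mirror the lines above: `add_known_host github.com` and `add_known_host bitbucket.org`.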
Let's see. The screenshots above are me running on the host, not attaching to a running container. So I believe I do want the keys to be mounted into the running containers.
It's an Amazon Linux AMI with the AWS CLI pre-installed on it. It uses the AWS CLI to fetch the key from AWS SSM Parameter Store. It's granted read access to that SSM Parameter via the instance role.
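For reference, the fetch described above could be wrapped roughly like this. It is a sketch under assumptions: the function name `install_key_from_ssm` is made up, the `aws ssm get-parameter` flags are the ones already used in this thread, and creating ~/.ssh first plus writing through a temp file are additions, so a failed fetch does not leave an empty id_rsa behind.

```bash
# Hypothetical wrapper around the aws ssm call used in this thread.
# Usage: install_key_from_ssm <parameter-name> <destination> [region]
install_key_from_ssm() {
    local param=$1
    local dest=$2
    local region=${3:-us-west-2}
    local dir tmp

    dir=$(dirname "$dest")
    # The redirect in the one-liner fails if ~/.ssh does not exist yet.
    mkdir -p "$dir" && chmod 700 "$dir" || return 1

    # mktemp creates the file readable by the owner only.
    tmp=$(mktemp "${dest}.XXXXXX") || return 1

    if aws ssm get-parameter --region "$region" --name "$param" \
            --with-decryption --query Parameter.Value --output text > "$tmp"; then
        chmod 600 "$tmp" && mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "failed to fetch $param" >&2
        return 1
    fi
}
```

With the parameter name from this thread, the call would be `install_key_from_ssm /clearml/github_ssh_private_key ~/.ssh/id_rsa`.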
