My Autoscaled Instance Fails When Running "Git Clone" On A Private Repo

My autoscaled instance fails when running "git clone" on a private repo.

I do have the SSH key placed at /root/.ssh/id_rsa on the machine, and when I SSH into the machine and run sudo su; git clone <the repo> it succeeds.

Also, in the extra_vm_bash_script field I added a whoami command, which prints root, so it seems the user that runs git clone during task execution is in fact root.

For context, here's the startup command that the autoscaler runs:

python -m clearml_agent --config-file /root/clearml.conf daemon --queue aws_4gpu_machines --docker python:3.9

Full log included...

  
  
Posted 2 years ago

Answers 34


Hi @<1541954607595393024:profile|BattyCrocodile47>

I do have the SSH key placed at /root/.ssh/id_rsa on the machine,

Notice that the .ssh folder is mounted from the host (EC2 / GCP) into the container,

'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh'

This is odd. Why is it mounted to /.ssh and not /root/.ssh?

  
  
Posted 2 years ago

I can't think of any changes we might have made on our side to cause that 🤔

  
  
Posted 2 years ago

DM me the entire log. I would assume this is something in the configuration.

  
  
Posted 2 years ago

Let's see. The task log? I think this is it.

  
  
Posted 2 years ago

Or the log of the init script?

  
  
Posted 2 years ago

cc: @<1565509803839590400:profile|MoodyBear54>

  
  
Posted 2 years ago

I do have the SSH key placed at /root/.ssh/id_rsa on the machine,

@<1541954607595393024:profile|BattyCrocodile47> is the SSH key part of the container image, or are you saying it is on the EC2 instance?

  
  
Posted 2 years ago

So, we've been able to run sudo su and then git clone with our private repos a few times now

  
  
Posted 2 years ago

That's with the key at /root/.ssh/id_rsa

  
  
Posted 2 years ago

Here's a screenshot of a session where I first try to clone as ssm-user, which fails, and then switch to root, where it succeeds

  
  
Posted 2 years ago

The key seems to be placed in the expected location

  
  
Posted 2 years ago

That's with the key at /root/.ssh/id_rsa

You mean inside the container that the autoscaler spun up?
Note that the agent by default mounts the host's .ssh folder over the existing .ssh inside the container. If you do not want this behavior, set agent.disable_ssh_mount: true in clearml.conf

  
  
Posted 2 years ago

Let's see. The screenshots above are me running on the host, not attaching to a running container. So I believe I do want the keys to be mounted into the running containers.

  
  
Posted 2 years ago

I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh but instead it's -v /root.ssh:/.ssh

  
  
Posted 2 years ago

Actually that's wrong: really this is the current volume mount

'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',

Could changing these values to /root/.ssh work? Do you know which user ClearML runs as inside the docker image?

  
  
Posted 2 years ago

'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',

My bad. After that, inside the container it runs cp -Rf /.ssh ~/.ssh
The reason is that you cannot know the user's home folder before spinning up the container.
Anyhow, the point is: are you sure that ~/.ssh is configured on the host machine?
And if it is, is it part of your AMI? If not, how did you put it there?
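
The hand-off described above can be imitated locally without Docker. This is only a sketch: temporary directories stand in for the host's ~/.ssh, the /.ssh mount point, and the container user's home. The helper name simulate_ssh_handoff is hypothetical, and the idea that the agent first stages a copy of the host folder under /tmp is an assumption based on the mount path shown in the log.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: imitate the mount-then-copy hand-off with plain directories.
# $1 = stand-in for the host's ~/.ssh, $2 = stand-in for the container user's home.
simulate_ssh_handoff() {
    local host_ssh="$1" container_home="$2"
    local staging
    staging="$(mktemp -d)"                      # stands in for the /.ssh mount point
    cp -Rf "$host_ssh/." "$staging/"            # assumption: host keys are staged, then mounted
    rm -rf "$container_home/.ssh"
    cp -Rf "$staging" "$container_home/.ssh"    # mirrors: cp -Rf /.ssh ~/.ssh
    rm -rf "$staging"
}

host_ssh="$(mktemp -d)"
container_home="$(mktemp -d)"
printf 'dummy-key\n' > "$host_ssh/id_rsa"       # placeholder, not a real key
chmod 600 "$host_ssh/id_rsa"

simulate_ssh_handoff "$host_ssh" "$container_home"
ls -la "$container_home/.ssh"
```

If the host folder is empty when the copy happens, the container ends up with an empty ~/.ssh as well, which is why the question about how the key reaches the host matters.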

  
  
Posted 2 years ago

So here's a snippet from my aws_autoscaler.yaml file

  
  
Posted 2 years ago

configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa
    source /clearml_agent_venv/bin/activate

hyper_params:
  iam_arn: arn:aws:iam::<my account id>:instance-profile/clearml-2-AutoscaledInstanceProfileAutoScaledEC2InstanceProfile56A5348F-90fmf6H5OUBx
  
  
Posted 2 years ago

It's an Amazon Linux AMI with the AWS CLI pre-installed on it. It uses the AWS CLI to fetch the key from AWS SSM Parameter Store. It's granted read access to that SSM Parameter via the instance role.
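
One thing worth ruling out with the one-liner in the snippet above: a plain `>` redirect fails if ~/.ssh does not exist yet, and it leaves an empty id_rsa behind if the AWS call fails. A slightly more defensive variant might look like the sketch below. The helper name install_private_key is hypothetical, and the demo feeds it a dummy string instead of calling AWS.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read a private key from stdin and install it under
# <home>/.ssh with the permissions OpenSSH expects.
install_private_key() {
    local home_dir="$1"
    local ssh_dir="$home_dir/.ssh"
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    local tmp_key
    tmp_key="$(mktemp "$ssh_dir/id_rsa.XXXXXX")"   # write to a temp file first
    cat > "$tmp_key"
    if [ ! -s "$tmp_key" ]; then                   # refuse to install an empty key
        rm -f "$tmp_key"
        echo "no key data received" >&2
        return 1
    fi
    chmod 600 "$tmp_key"
    mv -f "$tmp_key" "$ssh_dir/id_rsa"
}

# Demo with a placeholder instead of a real key.
demo_home="$(mktemp -d)"
printf 'dummy-key\n' | install_private_key "$demo_home"
ls -la "$demo_home/.ssh"
```

In extra_vm_bash_script, the output of the existing aws ssm get-parameter command would be piped into install_private_key "$HOME" instead of being redirected to the file directly.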

  
  
Posted 2 years ago

oh that makes sense.
I would add to your Task's docker startup script the following:

ls -la /.ssh
ls -la ~/.ssh
cat ~/.ssh/id_rsa

Let's see what you get
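
Note that cat ~/.ssh/id_rsa prints the private key into the task's console log. If that is a concern, a variant along these lines reports the same information without the key material. It is a sketch only: describe_ssh_dir is a hypothetical helper, and it assumes ssh-keygen may or may not be available in the container.

```shell
#!/usr/bin/env bash

# Hypothetical helper: describe an SSH directory without printing key material.
describe_ssh_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        return 0
    fi
    ls -la "$dir"
    if [ -s "$dir/id_rsa" ]; then
        echo "$dir/id_rsa: present, $(wc -c < "$dir/id_rsa" | tr -d ' ') bytes"
        # Print the fingerprint only; fall back to a note if that is not possible.
        ssh-keygen -l -f "$dir/id_rsa" 2>/dev/null || echo "$dir/id_rsa: could not read a fingerprint"
    else
        echo "$dir/id_rsa: missing or empty"
    fi
}

whoami
echo "HOME=$HOME"
describe_ssh_dir /.ssh
describe_ssh_dir "$HOME/.ssh"
```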

  
  
Posted 2 years ago

On it

  
  
Posted 2 years ago

Actually, dumb question: how do I set the setup script for a task?

  
  
Posted 2 years ago

I don't see it as an argument in Task.init or Task.execute_remotely

  
  
Posted 2 years ago

I'm not seeing an extra_docker_shell_script entry in the clearml.conf generated by clearml-agent init, unlike in this guide

  
  
Posted 2 years ago

Here we go. Trying with this

  
  
Posted 2 years ago

It doesn't seem to want to show me stdout

  
  
Posted 2 years ago

Trying as a python subprocess...

  
  
Posted 2 years ago

So I get output with this one, but the console only shows me the output from my machine. For example, the SSH key is present, and whoami results in ericriddoch

  
  
Posted 2 years ago

I get the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script()

  
  
Posted 2 years ago

Actually, dumb question: how do I set the setup script for a task?

When you clone/edit the Task in the UI, you should find it under Execution / Container.
After you edit it, just enqueue the Task for execution with the autoscaler and wait 🙂

  
  
Posted 2 years ago
178K Views